Automated image captioning with deep neural networks

The ImageNet Challenge every year starting from 2010 and the MSCOCO Image Captioning Challenge in 2015 has given rise to new state-of-the-art performances in the image classification task. Through these competitions, we have introduced to Krizhevsky et al. [1], their network “AlexNet” is used in the computer vision problems. Following this in 2014, we were introduced to even deeper and wider networks such as VGGNet [2] and GoogLeNet [3]. The GoogLeNet introduced a new architecture called Inception, which started a new era in the field of image classification, which is also a fundamental block in this image captioning model. The most significant impact of GoogLeNet is its cheap use of computational power than its predecessor networks. On top of that, it has made Transfer Learning possible for large scale image classification given a very limited time, computation capacity, and cost-constrained environment. ARTICLE INFO ABSTRACT


Introduction
The ImageNet Challenge every year starting from 2010 and the MSCOCO Image Captioning Challenge in 2015 has given rise to new state-of-the-art performances in the image classification task. Through these competitions, we have introduced to Krizhevsky et al. [1], their network "AlexNet" is used in the computer vision problems. Following this in 2014, we were introduced to even deeper and wider networks such as VGGNet [2] and GoogLeNet [3]. The GoogLeNet introduced a new architecture called Inception, which started a new era in the field of image classification, which is also a fundamental block in this image captioning model. The most significant impact of GoogLeNet is its cheap use of computational power than its predecessor networks. On top of that, it has made Transfer Learning possible for large scale image classification given a very limited time, computation capacity, and cost-constrained environment.

Article history
Received 01 June 2019 Revised 11 June 2019 Accepted 10 January 2020 Generating natural language descriptions of the content of an image automatically is a complex task. Though it comes naturally to humans, it is not the same when making a machine do the same. But undoubtedly, achieving this feature would remarkably change how machines interact with us. Recent advancement in object recognition from images has led to the model of captioning images based on the relation between the objects in it. In this research project, we are demonstrating the latest technology and algorithms for automated caption generation of images using deep neural networks. This model of generating a caption follows an encoder-decoder strategy inspired by the language-translation model based on Recurrent Neural Networks (RNN). The language translation model uses RNN for both encoding and decoding, whereas this model uses a Convolutional Neural Networks (CNN) for encoding and an RNN for decoding. This combination of neural networks is more suited for generating a caption from an image. The model takes in an image as input and produces an ordered sequence of words, which is the caption.

18
Science in Information Technology Letters ISSN 2722-4139 Vol. 1., No. 1, May 2020, pp. 17-23 After the introduction of the first Inception model, a lot of research started based on it, which led to a more robust and efficient version of the Inception model, commonly known as InceptionV2, InceptionV3, etc. This InceptionV3 model does the heavy lifting of recognizing different objects in an image, but for generating caption, a model must also explain how the objectivity between objects and their attributes and activities. In short, the relationship between these objects needs to be explained using natural language (like English), which makes the presence of language models a must-have requirement.
The architecture, followed by the winner of the 2015 MSCOCO Image Captioning Challenge, uses a single joint model that takes image I as input. The input is trained to maximize the order of the corresponding target words, which are each word from a particular dictionary that describes the image. The machine translation becomes the inspiration of this model, which converts source language sentences (S) into maximum target language translations (T). Translation using Recurrent Neural Networks (RNNs) is a simple method [4]- [6] and can achieve sophisticated performance. An 'encoder' RNN reads an S. The next step is to convert it into a fixed-length vector representation. It is used in the initial hidden state of the RNN decoder that produces T. In this study, and the RNN encoder is changed with a deep Convolutional Neural Network (CNN). CNN performed the same task as the previous RNN [7]. That is the reason for using CNN as an 'encoder' of images. The process of using pre-training to classify images and use the last hidden layer as input to the RNN decoder that generates sentences. This model is called the 'Neural Image Caption' (NIC) by the winners of the challenge.
The MNIST [8] was first introduced in the year 1998, which started the whole new era of image classification. Though in the beginning, it took a week-long to train even this dataset, we have come a long way from that in almost two decades. Now MNIST is used to teach different Machine Learning approaches to an image classification problem in a classroom environment. It can be trained in less than an hour. Following the success of data-driven approaches, newer and more detailed datasets were being produced. Introduction of CIFAR-10 [9] in 2009 and ImageNet [3] in 2014 was a big milestone in pushing forward the research on image classification. ImageNet consists of 1.2 million images distributed into 1000 categories. These image datasets are mostly meant for single object recognition in the given image. Soon more detailed dataset of images was available such as the MSCOCO [10] dataset, which has general caption text associated with the images. A significant thing about some of these datasets are that they are active and still growing.
Compared to image datasets, text data were and are more abundant. Text data has an inbuilt grammar in it, which helps to extract and mine specific text data. Which is why natural language processing and other related fields have advanced so far, credit goes to all the available text corpuses. The task of object recognition with a fully-connected network is that it ignores any spatial information about the image. It gives the same importance to all the pixels present in the image regardless of the relation with any neighboring pixels. Because of this, if the object in the image is not centralized, it fails to classify. This issue is resolved with convolutional neural networks [11] as it takes spatial information of the given image into consideration. This is done through three key concepts, which are local receptive fields, shared weights & bias, and pooling.
Among the latest architecture in object recognition, GoogLeNet, with its inception design, outperforms all the previous architectures. There have many optimizations on the Inception Model [12] that know its relative version number. In this paper, Inception V3 has been used to perform the task of classifying several objects, where natural language descriptions can enhance the generation of images from visual data in currents times. The Recognition, object detection, and advances in language generation systems are the triggers for all these successes. The detection of Farhadi et al. [13] able to convert a triplet of scene elements into text. The process uses powerful language models based on language parsing. These can be used for pictures 'in the wild'. However, it is designed by hand and is not too general for text creation.
Many works that discuss the completion of the image given from the problem of ranking descriptions [14]- [16]. Embedding images and text in the same vector space becomes the approach used. Later for the description of an image, it fetches the text that is closely positioned in the given vector space. These methods fail to properly describe unseen images. The NIC model creates a single network to describe images. This network is used to classify images using deep convolutional networks [17]. Besides, it is modeled in sequence using a combination of recurrent neural networks [18]. The training process uses RBB with a single "end-to-end" network. This mode is adopted by machine translators [4]- [6]. The difference is that convolutional networks process images from the starting of a sentence.

Method
The proposed NIC is to describe images with a neural and probabilistic framework. The development of statistical machine translation provided a robust sequence model. It can maximize the probability of correct translation by providing input sentences in "end to end" fashion, both the training process and its predictions. That is the reason the NIC model as the model chosen, given the picture (not the input sentence in the source language), the same principle of "translating" into the applied description.
Calculations process the probability of correct descriptions as in (1). * = arg max ∑ log ( | ; ) Where is the model parameter, I is an image, and S is correct caption. S represents any sentence has no length restriction. It is why the chain rule is used to model a joint probability of more than 0 , 1 , … , where N is the length of this particle example as (2).
For f, a Long-Short Term Memory (LSTM) net was used, For f a Long-Short Term Memory (LSTM) net was used, which has sophisticated work in translating sequentially.
Representation of the input image is done with a CNN. They are widely used and currently are state-of-the-art for object detection and recognition. The version of CNN used in this demonstration is the one that performed best on the ILSVRC 2014 classification competition [18]. It is based on the Inception V3 architecture. This model can generalize other tasks by means of transfer learning [19]. The associated words in this demonstration are represented in an embedding model [20].  The blue color in LSTM memory indicates connections unrolled, and they correspond to the recurrent connections in Fig. 1. All LSTMs share the same parameters.

Datasets
The training dataset consisted of images and sentences in English describing those images. A total amount of 82783 samples from the MSCOCO dataset were used. Another 40775 samples were separated for the test.
The validation dataset consists of 40504 samples from the COCO-4k dataset, and the result is based on this.

Evaluation Metrics
Not all translation processes match the correct sentence from the source language to the target language, as there can be many correct answers to that. In this scenario, evaluation is asking the raters to judge the usefulness of each description of the image subjectively, but obviously, this is time-consuming. For evaluation of the NIC model, both subjective and automatic metrics were reinforced and was shown that there is indeed some correlation between these two scorings. The process is following the guidelines in [21]. The graders assess each sentence produced with a range of 1 to 4 (shown in Table 1). The matrix used in the image description is the BLUE score [22]. A BLUE score is a precise form of the word n-grams between the sentence produced and the reference. Besides, the confusion model calculates the geometric mean of the inverse probability of each word prediction. More recently, CIDER [23] was introduced in the use of the MSCOCO image writing challenge organization There are Metrics the validation was run on ( Table 2).

Conclusion
There were many limitations in carrying out this work due to a lack of resources. Because of the advancement in transfer learning, it was possible only to train the last layer of the NIC model and do the demonstration. The scale in which machine learning research is going on right now it is fundamental to have a strong resource background to carry out tests after tuning different hyperparameters of any model. The results current models are producing are very promising, and it knows no indication of slowing down. This is just the beginning of the coming automation. NIC models so far can generate general descriptions of images. For future work, it can be made more capable in targeted descriptions.