Automated image captioning with deep neural networks

(1) Abdullah Ahmad Zarir (Department of Computer Science, International Islamic University Malaysia, Malaysia)
(2) Saad Bashar (Department of Computer Science, International Islamic University Malaysia, Malaysia)
(3) * Amelia Ritahani Ismail (Department of Computer Science, International Islamic University Malaysia, Malaysia)
*corresponding author

Abstract


Automatically generating a natural language description of an image is a complex task. It comes naturally to humans but remains difficult for a machine. Achieving it, however, would substantially change how machines interact with us. Recent advances in object recognition have led to models that caption images based on the relations between the objects in them. In this research project, we demonstrate recent techniques and algorithms for automated image caption generation using deep neural networks. The captioning model follows an encoder-decoder strategy inspired by language-translation models based on Recurrent Neural Networks (RNNs). The language-translation model uses an RNN for both encoding and decoding, whereas this model uses a Convolutional Neural Network (CNN) for encoding and an RNN for decoding. This combination of neural networks is better suited to generating a caption from an image. The model takes an image as input and produces an ordered sequence of words, which is the caption.
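
The encoder-decoder flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a toy in plain Python with random, untrained weights, so the words it emits are meaningless. The vocabulary, the layer sizes, and the function names (`cnn_encode`, `rnn_decode`, `caption_image`) are all assumptions made for illustration, and a simple Elman-style RNN stands in for whatever recurrent unit the paper actually uses.

```python
import math
import random

# Hypothetical toy vocabulary; the paper's real vocabulary is not given here.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
HIDDEN = 8  # assumed size of the image feature vector and the RNN state


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def cnn_encode(image, kernels):
    """Encoder: one 3x3 convolution per kernel, ReLU, then global max pooling."""
    h, w = len(image), len(image[0])
    features = []
    for k in kernels:
        best = 0.0  # starting at zero clips negative activations (ReLU)
        for i in range(h - 2):
            for j in range(w - 2):
                act = sum(k[a][b] * image[i + a][j + b]
                          for a in range(3) for b in range(3))
                best = max(best, act)
        features.append(best)
    return features


def rnn_decode(features, params, max_len=10):
    """Decoder: RNN state seeded with the image features, greedy word choice."""
    w_xh, w_hh, w_hy, embed = params
    state = [math.tanh(f) for f in features]
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        x = matvec(w_xh, embed[word])   # contribution of the previous word
        r = matvec(w_hh, state)         # contribution of the previous state
        state = [math.tanh(a + b) for a, b in zip(x, r)]
        scores = matvec(w_hy, state)
        # Greedy pick over every token except "<start>" (index 0).
        word = max(range(1, len(VOCAB)), key=scores.__getitem__)
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption


def caption_image(image, seed=0):
    """Image in, ordered word sequence out (untrained, for illustration only)."""
    rng = random.Random(seed)
    kernels = [rand_matrix(3, 3, rng) for _ in range(HIDDEN)]
    params = (
        rand_matrix(HIDDEN, HIDDEN, rng),      # input-to-hidden weights
        rand_matrix(HIDDEN, HIDDEN, rng),      # hidden-to-hidden weights
        rand_matrix(len(VOCAB), HIDDEN, rng),  # hidden-to-vocabulary scores
        rand_matrix(len(VOCAB), HIDDEN, rng),  # word embeddings
    )
    return rnn_decode(cnn_encode(image, kernels), params)


if __name__ == "__main__":
    rng = random.Random(1)
    image = [[rng.random() for _ in range(8)] for _ in range(8)]
    print(caption_image(image))
```

The point of the sketch is the data flow: the convolutional encoder reduces the image to a fixed-length feature vector, that vector initialises the recurrent state, and the decoder then emits one word per step until it produces the end token or reaches a length limit. A trained system would learn all of these weights from captioned images.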

Keywords


Recurrent Neural Networks; Convolutional Neural Networks; Image Captioning

   

DOI

https://doi.org/10.31763/sitech.v1i1.31
      





Copyright (c) 2020 Abdullah Ahmad Zarir, Saad Bashar, Amelia Ritahani Ismail

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
Science in Information Technology Letters
ISSN 2722-4139
Published by Association for Scientific Computing Electrical and Engineering (ASCEE)
W : http://pubs2.ascee.org/index.php/sitech
E : andri@ascee.org, andri.pranolo.id@ieee.org
