Discriminatory Image Caption Generation Based on Recurrent Neural Networks and Ranking Objective

Authors

  • Geetika, Dept. of CSE, National Institute of Technology, Kurukshetra, India
  • T. Jain, Dept. of CSE, Indian Institute of Technology, Delhi, India

DOI:

https://doi.org/10.26438/ijcse/v5i10.260265

Keywords:

Visual Geometry Group, Long Short Term Memory, Ranking Objective, Image Captioning

Abstract

This paper proposes a novel approach to image caption generation. Describing the content of an image in natural-language sentences is a challenging task, but it can have great impact, as substantial resources are required to meet the demands created by the vast availability of image data. The growing importance of image captioning is commensurate with requirements such as image-based search and image understanding for visually impaired people. In this paper, we develop a model based on a deep recurrent neural network that generates a brief statement describing a given image. Our model uses a convolutional neural network (CNN) to extract features from the image. We use a ranking objective to attend to subtle differences between similar images and thereby generate discriminatory captions. The MS COCO dataset is used: roughly half of the dataset for training and one quarter each for validation and testing. Five captions are provided per image to train the model. Our model with the ranking objective consistently outperforms the other models. We evaluated our model using BLEU, METEOR and CIDEr scores.
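The ranking objective described in the abstract can be illustrated with a minimal sketch. The hinge form, the margin value, and the toy embeddings below are illustrative assumptions, not details taken from the paper: the idea is simply to push an image's similarity to its own caption above its similarity to a caption of a visually similar image by at least a margin.

```python
import numpy as np

def margin_ranking_loss(sim_pos, sim_neg, margin=0.1):
    """Hinge-style ranking term (illustrative): penalize the model when
    the similarity to the ground-truth caption (sim_pos) does not exceed
    the similarity to a similar image's caption (sim_neg) by `margin`."""
    return np.maximum(0.0, margin - sim_pos + sim_neg)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-D embeddings standing in for CNN image features and
# LSTM caption features (values are made up for illustration).
img = np.array([1.0, 0.0])
own_caption = np.array([0.9, 0.1])      # caption of this image
similar_caption = np.array([0.6, 0.8])  # caption of a look-alike image

loss = margin_ranking_loss(cosine(img, own_caption),
                           cosine(img, similar_caption), margin=0.1)
# Here the own-caption similarity already beats the distractor by more
# than the margin, so the hinge is inactive and the loss is zero.
```

In training, a term of this kind would be added to the usual cross-entropy caption loss so that gradients also discourage captions that fit look-alike images equally well.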

References

Andrej Karpathy and Li Fei-Fei, “Deep visual-semantic alignments for generating image descriptions”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, arXiv preprint arXiv:1406.1078, 2014.

Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel, “Multimodal neural language models”. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 595–603, 2014.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn)”, arXiv preprint arXiv:1412.6632, 2014.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel, “Unifying visual-semantic embeddings with multimodal neural language models”, arXiv preprint arXiv:1411.2539, 2014.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, “Show and tell: A neural image caption generator”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.

Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory”, Neural Computation, 9(8): 1735–1780, 1997.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft COCO: Common Objects in Context”. In Computer Vision – ECCV 2014, Lecture Notes in Computer Science, pages 740–755, 2014.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “BLEU: a method for automatic evaluation of machine translation”. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002.

Xinlei Chen and C. Lawrence Zitnick, “Learning a Recurrent Visual Representation for Image Caption Generation”, arXiv preprint arXiv:1411.5654, 2014.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Published

2025-11-12

How to Cite

[1]
Geetika and T. Jain, “Discriminatory Image Caption Generation Based on Recurrent Neural Networks and Ranking Objective”, Int. J. Comp. Sci. Eng., vol. 5, no. 10, pp. 260–265, Nov. 2025.

Section

Research Article