Recent Advances in Computer Science and Communications

ISSN (Print): 2666-2558
ISSN (Online): 2666-2566

Research Article

Enhancing Image Captioning Using Deep Convolutional Generative Adversarial Networks

Author(s): Tarun Jaiswal*, Manju Pandey and Priyanka Tripathi

Volume 17, Issue 5, 2024

Published on: 17 January, 2024

Article ID: e170124225745
Pages: 11

DOI: 10.2174/0126662558282389231229063607

Abstract

Introduction: Image caption generation has long been a fundamental challenge at the intersection of computer vision (CV) and natural language processing (NLP). In this research, we present an approach that uses Deep Convolutional Generative Adversarial Networks (DCGANs) and adversarial training to generate natural, contextually relevant image captions.
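
For a concrete picture of such a setup, the sketch below shows one plausible wiring: an LSTM caption generator conditioned on a global CNN image feature, and a DCGAN-style strided-convolution discriminator that scores (image, caption) pairs. This is a minimal PyTorch sketch with assumed class names, layer sizes, and hyperparameters, not the authors' implementation.

```python
# Minimal sketch (assumed PyTorch; illustrative names and sizes, not the
# paper's code): an LSTM caption generator conditioned on a global CNN image
# feature, plus a DCGAN-style 1-D convolutional discriminator that scores
# how plausible a caption is for a given image.
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # image feature acts as step 0
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, prefix):
        # Input sequence: projected image feature, then embeddings of the
        # words generated so far; output position t predicts caption word t.
        # For maximum-likelihood pretraining, pass prefix=captions[:, :-1]
        # and take cross-entropy of the logits against the full captions.
        inputs = torch.cat([self.img_proj(img_feats).unsqueeze(1),
                            self.embed(prefix)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, len(prefix)+1, vocab)

class CaptionDiscriminator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # DCGAN recipe transplanted to token sequences: strided convolutions
        # instead of pooling, BatchNorm, and LeakyReLU activations.
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(256), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 512, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(512), nn.LeakyReLU(0.2),
        )
        self.score = nn.Linear(512 + feat_dim, 1)       # joint caption+image score

    def forward(self, img_feats, captions):
        h = self.convs(self.embed(captions).transpose(1, 2)).mean(dim=2)
        return torch.sigmoid(self.score(torch.cat([h, img_feats], dim=1)))
```

Because sampled caption tokens are discrete, the discriminator's score cannot be backpropagated into the generator directly; that gap is what the reward-based fine-tuning described in the Method section fills.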

Method: Our method significantly improves the fluency, coherence, and contextual relevance of generated captions and demonstrates the effectiveness of reinforcement learning (RL) reward-based fine-tuning. In a comprehensive evaluation on the COCO dataset, our model outperforms baseline and current state-of-the-art (SOTA) methods across all metrics, achieving BLEU-4 (0.327), METEOR (0.249), ROUGE (0.525), and CIDEr (1.155) scores.
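
The RL reward-based fine-tuning can be read as a REINFORCE-style policy-gradient step: sample a caption from the generator, obtain a sequence-level reward, and weight the sampled tokens' log-probabilities by the advantage. The sketch below reuses the assumed generator and discriminator from the previous block and takes the discriminator probability as the reward; a CIDEr-based reward would plug in the same way. It is an illustrative sketch, not the paper's code.

```python
# Hypothetical REINFORCE-style fine-tuning step, reusing the CaptionGenerator
# and CaptionDiscriminator sketched above. The discriminator's probability
# serves as a sequence-level reward; a metric reward (e.g. CIDEr) could be
# substituted without changing the update.
import torch

def reinforce_step(generator, discriminator, optimizer, img_feats, max_len=16):
    B = img_feats.size(0)
    tokens = torch.empty(B, 0, dtype=torch.long, device=img_feats.device)
    log_probs = []
    for _ in range(max_len):
        logits = generator(img_feats, tokens)        # re-encode prefix each step
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        next_tok = dist.sample()                     # sample rather than argmax
        log_probs.append(dist.log_prob(next_tok))
        tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)
    with torch.no_grad():                            # reward is not differentiated
        reward = discriminator(img_feats, tokens).squeeze(1)
    advantage = reward - reward.mean()               # batch-mean baseline cuts variance
    loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```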

Result: The integration of DCGANs and adversarial training opens new possibilities in image captioning, with applications ranging from automated content generation to enhanced accessibility solutions.

Conclusion: This research paves the way for more intelligent, context-aware image understanding systems and offers promising directions for future exploration and innovation.

Graphical Abstract

Rights & Permissions Print Cite
© 2024 Bentham Science Publishers | Privacy Policy