
Visual preferences prediction for a photo gallery based on image captioning methods
A.S. Kharchevnikova 1, A.V. Savchenko 1

National Research University Higher School of Economics, Nizhny Novgorod, Russia


DOI: 10.18287/2412-6179-CO-678

Pages: 618-626.

Full text of the article: in Russian.

Abstract:
The paper considers the problem of extracting user preferences from a photo gallery. We propose a novel approach based on image captioning, i.e., automatic generation of textual descriptions of photos, followed by their classification. Known image captioning methods based on convolutional and recurrent (long short-term memory, LSTM) neural networks are analyzed. We train several models that combine the visual features of a photograph with the outputs of an LSTM block, using Google's Conceptual Captions dataset. We examine the application of natural language processing algorithms that transform the obtained textual annotations into user preferences. Experimental studies are carried out on Microsoft COCO Captions, Flickr8k and a specially collected dataset reflecting the user's interests. It is demonstrated that the best preference prediction quality is achieved with keyword search and text summarization methods from the Watson API, which are 8 % more accurate than traditional latent Dirichlet allocation. Moreover, descriptions generated by the trained neural models are classified 1 – 7 % more accurately than those produced by known image captioning models.
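To make the described pipeline concrete, below is a minimal sketch of how captions generated for a photo gallery could be mapped to user preferences by keyword search. The generate_caption stub, the category names and the keyword lists are illustrative assumptions and are not taken from the paper, which trains its own CNN+LSTM captioning models and relies on the Watson API for keyword extraction and text summarization.

```python
# Minimal sketch of the caption-to-preferences pipeline outlined in the abstract.
# Assumptions (not from the paper): generate_caption() stands in for any trained
# CNN+LSTM captioning model; categories and keyword lists are illustrative only.
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Illustrative preference vocabulary: category -> keywords that may appear in captions.
PREFERENCE_KEYWORDS: Dict[str, List[str]] = {
    "sport":  ["ball", "tennis", "skis", "surfboard", "bicycle"],
    "food":   ["pizza", "sandwich", "cake", "dinner", "plate"],
    "travel": ["beach", "mountain", "airplane", "train", "street"],
    "pets":   ["dog", "cat", "horse", "bird"],
}

def caption_to_categories(caption: str) -> List[str]:
    """Keyword search over one generated caption: return every matching category."""
    tokens = set(caption.lower().split())
    return [cat for cat, words in PREFERENCE_KEYWORDS.items() if tokens & set(words)]

def predict_preferences(photos: Iterable[str],
                        generate_caption: Callable[[str], str],
                        top_k: int = 3) -> List[str]:
    """Caption every photo in the gallery and rank preference categories by frequency."""
    counts: Counter = Counter()
    for path in photos:
        caption = generate_caption(path)  # e.g. "a dog runs on the beach"
        counts.update(caption_to_categories(caption))
    return [cat for cat, _ in counts.most_common(top_k)]

if __name__ == "__main__":
    # Stub captioner for demonstration; a real CNN+LSTM captioning model would be used instead.
    fake_captions = {"1.jpg": "a dog catching a ball on the beach",
                     "2.jpg": "a plate with pizza and a sandwich"}
    print(predict_preferences(fake_captions, lambda p: fake_captions[p]))
```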

Keywords:
user modeling, image processing, image captioning, convolutional neural networks.

Citation:
Kharchevnikova AS, Savchenko AV. Visual preferences prediction for a photo gallery based on image captioning methods. Computer Optics 2020; 44(4): 618-626. DOI: 10.18287/2412-6179-CO-678.

Acknowledgements:
The work was partly funded within the Academic Fund Program at the National Research University Higher School of Economics (HSE University) in 2019 (grant No 19-04-004) and by the Russian Academic Excellence Project "5-100".

References:

  1. Singhal A, Sinha P, Pant R. Use of deep learning in modern recommendation system: A summary of recent works. Source: <https://arxiv.org/abs/1712.07525>.
  2. Demochkin KV, Savchenko AV. Visual product recommendation using neural aggregation network and context gating. J Phys Conf Ser 2019; 1368(3): 032016.
  3. Kharchevnikova AS, Savchenko AV. Neural networks in video-based age and gender recognition on mobile platforms. Opt Mem Neural Network 2018; 27(4): 246-259.
  4. Grechikhin I, Savchenko AV. User modeling on mobile device based on facial clustering and object detection in photos and videos. In Book: Morales A, Fierrez J, Sánchez J, Ribeiro B, eds. Proceedings of the iberian conference on pattern recognition and image analysis (IbPRIA). Cham: Springer; 2019: 429-440.
  5. Rassadin AG, Savchenko AV. Scene recognition in user preference prediction based on classification of deep embeddings and object detection. In Book: Lu H, et al, eds. Proceedings of international symposium on neural networks (ISNN). Springer Nature Switzerland AG; 2019: 422-430.
  6. Szegedy C. Going deeper with convolutions. Proc CVPR 2015: 1-9.
  7. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. MobileNets: Efficient convolutional neural networks for mobile vision applications. Source: <https://arxiv.org/abs/1704.04861>.
  8. Wang R. Covariance discriminative learning: A natural and efficient approach to image set classification. IEEE CVPR 2012: 2496-2503.
  9. Wang L, Wang Z, Qiao Y, Van Gool L. Transferring deep object and scene representations for event recognition in still images. Int J Comput Vis 2018; 126(2-4): 390-409.
  10. Xiong Y, Zhu K, Lin D, Tang X. Recognize complex events from static images by fusing deep channels. Proc CVPR 2015: 1600-1609.
  11. Furman YaA, ed. Point fields and group objects [In Russian]. Moscow: "Fizmatlit" Publisher; 2014. ISBN: 978-5-9221-1604-6.
  12. Vorontsov K, Potapenko A. Additive regularization of topic models. Mach Learn 2015; 101: 303-323.
  13. Rosen-Zvi M. The author-topic model for authors and documents. Proc 20th UAI 2004: 487-494.
  14. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993-1022.
  15. Ferrucci DA. Introduction to “this is Watson”. IBM J Res Dev 2012; 56(3.4): 1.
  16. Lally A, Prager J, McCord M, Boguraev B, Patwardhan S, Chu-Carroll J. Question analysis: How Watson reads a clue. IBM J Res Dev 2012; 56(3.4): 2.
  17. Fan J, Kalyanpur A, Gondek D, Ferrucci D. Automatic knowledge extraction from documents. IBM J Res Dev 2012; 56(3.4): 5.
  18. Savchenko AV. Trigonometric series in orthogonal expansions for density estimates of deep image features. Computer Optics 2018; 42(1): 149-158. DOI: 10.18287/2412-6179-2018-42-1-149-158.
  19. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Source: <https://arxiv.org/abs/1409.1556>.
  20. Tanti M, Gatt A, Camilleri KP. Where to put the image in an image caption generator. Nat Lang Eng 2018; 24(3): 467-489.
  21. Wang M, Song L, Yang X, Luo C. A parallel-fusion RNN-LSTM architecture for image caption generation. Proc IEEE ICIP 2016: 4448-4452.
  22. Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. Proc IEEE CVPR 2015: 3156-3164.
  23. Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models. Proc ICML 2014: 595-603.
  24. Vijayakumar AK, Cogswell M, Selvaraju R, Sun Q, Lee S, Crandall D, Batra D. Diverse beam search: Decoding diverse solutions from neural sequence models. Source: <https://arxiv.org/abs/1610.02424>.
  25. Bernardi R, Cakici R, Elliott D, Erdem A, Erdem E, Ikizler-Cinbis N, Plank B. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J Artif Intell Res 2016; 55: 409-442.
  26. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Zitnick C. Microsoft COCO: Common objects in context. Proc ECCV 2014: 740-755.
  27. Chen X, Fang H, Lin T, Vedantam R, Gupta S, Dollar P. Microsoft COCO captions: Data collection and evaluation server. Source: <https://arxiv.org/abs/1504.00325>.
  28. Sharma P, Ding N, Goodman S, Soricut R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics 2018; 1: 2556-2565.
  29. Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics 2002: 311-318.
  30. Denkowski M, Lavie A. Meteor universal: Language specific translation evaluation for any target language. Proc 9th Workshop on Statistical Machine Translation 2014: 376-380.
  31. Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. Proc IEEE CVPR 2015: 4566-4575.
  32. Goldberg Y, Levy O. Word2Vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. Source: <https://arxiv.org/abs/1402.3722>.
  33. Manning CD, Schütze H. Foundations of statistical natural language processing. MIT Press; 1999.
  34. Kharchevnikova AS, Savchenko AV. Convolutional Neural Networks in age/gender video-based recognition. Proceedings of the IV International Conference "Information Technologies and Nanotechnologies" (ITNT 2018). Samara: "Novaja Tehnika" Publisher; 2018: 916-924.
