Image Caption Generator using ResNet50 and LSTM model

Written by Rupam Goyal, Tushar Gupta, Phalak Sharma

In this project, we have developed a model that takes an image as input and outputs a sentence describing the things in that picture. The model is trained on the Flickr8k dataset. The components of our method are: a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and sentence generation. The image is captioned by recognizing the objects that appear in it, using automatic feature extraction. Image captioning has many important uses today; the motivation behind this project comes from several real-life scenarios, such as self-driving cars and aids for the blind.

1) What is Image Caption Generator?

The model is trained on the Flickr8k dataset. The components of our method are: a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and sentence generation. In this project, the CNN is used to create a dense feature vector, also called an embedding. For an image caption model, this embedding becomes a dense representation of the input image and is used as the initial state of the LSTM. At each time step, the LSTM considers the previous cell state and outputs a prediction for the most likely next value in the sequence.

2) Dataset Used

The Flickr8k dataset contains 8,000 images in JPEG format, of different shapes and sizes, and each image has 5 different captions. The images were chosen from 6 different Flickr groups and do not contain any well-known people or locations; they were manually selected to depict a variety of scenes and situations.

The images are split as follows in the code:

• Training Set — 6000 images

• Dev Set — 1000 images

• Test Set — 1000 images

There are also some text files related to the images. One of them is “Flickr8k.token.txt”, which lists each image along with its 5 captions. Every line contains <image name>#i <caption>, where 0 ≤ i ≤ 4, i.e. the name of the image, the caption number (0 to 4), and the actual caption.
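A minimal sketch of parsing this file into a per-image caption list (assuming the fields are tab-separated, as in the standard Flickr8k distribution; the function name is illustrative):

```python
from collections import defaultdict

def load_captions(token_text):
    """Parse Flickr8k.token.txt-style text into {image_name: [captions]}.

    Each line has the form <image name>#i<TAB><caption>, with 0 <= i <= 4.
    """
    captions = defaultdict(list)
    for line in token_text.strip().splitlines():
        image_and_index, caption = line.split("\t", 1)
        image_name, _, _index = image_and_index.partition("#")
        captions[image_name].append(caption.strip())
    return dict(captions)
```

Called on the contents of “Flickr8k.token.txt”, this yields the five captions for each image name.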

An example from the dataset

3) Prerequisites

4) Model generation

  • Image feature extractor
  • Text Preprocessor
  • Output predictor
  • Fitting the Model
  • Caption Generation

a) Image Feature Extractor

The feature extractor requires an input image of size 224×224×3. The model uses ResNet50 pretrained on the ImageNet dataset, with the image features extracted just before the final classification layer. Another dense layer is then added to convert these features into a vector of length 2048.
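The extractor described above can be sketched in Keras roughly as follows. The 224×224×3 input and the length-2048 output follow the text; whether pretrained weights are loaded is parameterized here so the sketch can be run without downloading ImageNet weights:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

def build_feature_extractor(weights="imagenet"):
    # ResNet50 without its classification head; global average pooling
    # turns the last convolutional feature map into one vector per image.
    base = ResNet50(weights=weights, include_top=False,
                    input_shape=(224, 224, 3), pooling="avg")
    # Extra dense layer converting the pooled features into a
    # length-2048 vector, as described above.
    features = Dense(2048, activation="relu")(base.output)
    return Model(inputs=base.input, outputs=features)

def extract_features(model, image_batch):
    # image_batch: float array of shape (n, 224, 224, 3), RGB, 0-255 range.
    return model.predict(preprocess_input(image_batch), verbose=0)
```

In the real pipeline `weights="imagenet"` is used, so the extractor benefits from pretraining.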

b) Text Preprocessor

To define the vocabulary, 8,253 unique words are tokenized from the training dataset. Since computers do not work with English words directly, each word of the vocabulary is mapped to a unique index value and encoded as a fixed-size vector. For each caption we also maintain a list that stores the next word at each sub-iteration, and one-hot encoding is applied to this next-word list. Finally, both the partial sequences and the one-hot encoded next words are converted into arrays.

c) Output Predictor

To train the model, we first applied a Sequential model for the images, containing a Dense layer with ‘relu’ as its element-wise activation function, followed by a RepeatVector layer with argument 40, which repeats the input 40 times (the maximum caption length). We then applied another Sequential model for the text, with an Embedding layer as its first layer.

The output vectors from the image feature extractor and the text processor are of the same length (128), and a decoder merges the two vectors using an addition operation. The result is then fed into two dense layers: the first is of length 128, and the second predicts the most probable next word in the caption using a softmax activation function over the vocabulary.
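A sketch of this two-branch decoder in Keras. The sizes follow the text (2048-d image features, 128-d branches merged by addition, softmax over the vocabulary); where the RepeatVector step suggests a sequence-level merge, this sketch uses the simpler vector-level merge that matches the decoder description, so the exact topology is an assumption:

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, max_len=40, dim=128):
    # Image branch: the 2048-d ResNet feature vector projected to `dim`.
    image_input = Input(shape=(2048,))
    image_vec = Dense(dim, activation="relu")(image_input)

    # Text branch: partial caption -> embedding -> LSTM summary vector.
    text_input = Input(shape=(max_len,))
    embedded = Embedding(vocab_size, dim, mask_zero=True)(text_input)
    text_vec = LSTM(dim)(embedded)

    # Decoder: merge both 128-d vectors by addition, pass through a
    # 128-d dense layer, then predict the next word with softmax.
    merged = add([image_vec, text_vec])
    hidden = Dense(dim, activation="relu")(merged)
    output = Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[image_input, text_input], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

The model takes two inputs per example, the image feature vector and the padded partial caption, and outputs one probability distribution over the vocabulary.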

d) Fitting the Model

After building the model, it is fit on the training dataset. The model is run for 210 epochs, and the best model among them is chosen by computing the loss on the Flickr8k development set; the model with the lowest loss is used for generating captions.
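Selecting the best epoch by development-set loss is typically done with a checkpoint callback; a sketch (the file name, callback settings, and helper name are illustrative, not taken from the project code):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

def fit_with_checkpoint(model, train_data, dev_data, epochs=210,
                        path="best_model.h5"):
    # Save only the model with the lowest loss on the dev set, so the
    # best of the 210 epochs survives training.
    checkpoint = ModelCheckpoint(path, monitor="val_loss",
                                 save_best_only=True, mode="min")
    X_train, y_train = train_data
    X_dev, y_dev = dev_data
    return model.fit(X_train, y_train,
                     validation_data=(X_dev, y_dev),
                     epochs=epochs, callbacks=[checkpoint], verbose=0)
```

After training, the checkpoint file holds the weights used for caption generation.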

e) Caption Generation

To test the trained model, we input an image to it. The image is resized and fed into the feature extractor to recognize the objects and scenes it depicts. Caption generation is then performed using the trained RNN model: the caption is generated sequentially, word by word, by selecting the word with the maximum probability for the image at each iteration. Each predicted index is converted back to its word and appended to the final caption. When the end-of-sequence tag is detected, or the caption reaches a length of 40 words, the final caption is printed along with the input image.
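The word-by-word loop described above amounts to greedy decoding. A framework-free sketch, where `predict_next` stands in for the trained model's argmax prediction and the "<end>" tag name is illustrative:

```python
def greedy_decode(predict_next, index_to_word, start_index,
                  end_word="<end>", max_len=40):
    """Generate a caption one word at a time.

    predict_next: callable taking the partial index sequence and
    returning the index of the most probable next word (the argmax
    of the model's softmax output).
    """
    sequence = [start_index]
    words = []
    while len(words) < max_len:
        next_index = predict_next(sequence)
        word = index_to_word[next_index]
        if word == end_word:   # end-of-sequence tag detected
            break
        words.append(word)
        sequence.append(next_index)
    return " ".join(words)
```

At each step the chosen word is appended to the partial sequence, which is fed back into the model for the next prediction; the loop stops at the end tag or after 40 words.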

5) Performance measure

Graph of loss vs epochs (Loss: 0.241)
Graph of accuracy vs epochs (Accuracy: 90.29%)

6) Results and outputs

Some sample images from within and outside the dataset were tested, and we obtained the following results.

7) Conclusion and future scope

8) References

  2. A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
  3. Coursera