In this project, we developed a model that takes an image as input and outputs a sentence describing its contents. The model is trained on the Flickr8k dataset. The method has three components: a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and sentence generation. The image is captioned by recognizing the objects that appear in it, using automatically learned features rather than hand-engineered ones. Image captioning has many important applications today, and the motivation behind this project comes from several real-life scenarios such as self-driving cars and aids for the blind.
1) What is Image Caption Generator?
Image caption generation is a task that combines computer vision and natural language processing to recognize the context of an image and describe it in a natural language such as English. For a machine to automatically describe the objects in an image, their relationships, and the activity being performed using a learned language model is a daunting task, but it plays an essential role in many areas. For example, it can help visually impaired people better understand visual input, thereby acting as a facilitator or a guide. The generated caption should contain not only the names of the objects in the image but also their properties, relationships, and functions, and it must be expressed in a natural language such as English.
The model is trained on the Flickr8k dataset. The components of the method are: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and sentence generation. In this paper, the CNN is used to create a dense feature vector, also called an embedding. For an image caption model, this embedding is a dense representation of the image and is used as the initial state of the LSTM. At each time-step, the LSTM considers the previous cell state and outputs a prediction for the most likely next value in the sequence.
2) Dataset Used
In our project, we used the Flickr8k image dataset to train the model to discover the relationship between images and words for generating captions.
It contains 8000 images in JPEG format with different shapes and sizes, and each image has 5 different captions. The images were chosen from 6 different Flickr groups and do not contain any well-known people or locations; they were manually selected to depict a variety of scenes and situations.
In the code, the images are split as follows:
• Training Set — 6000 images
• Dev Set — 1000 images
• Test Set — 1000 images
There are also some text files related to the images. One of them is “Flickr8k.token.txt”, which lists each image along with its 5 captions. Each line contains <image name>#i <caption>, where 0 ≤ i ≤ 4, i.e., the name of the image, the caption number (0 to 4), and the actual caption.
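The caption file can be parsed with a short helper. The sketch below assumes the tab-separated layout described above; the file path and image name shown are only illustrative.

```python
from collections import defaultdict

def load_captions(token_path):
    """Parse Flickr8k.token.txt into {image_name: [caption, ...]}."""
    captions = defaultdict(list)
    with open(token_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line looks like: "<image name>#<i>\t<caption>"
            image_id, caption = line.split("\t")
            image_name, _ = image_id.split("#")  # drop the caption index (0-4)
            captions[image_name].append(caption)
    return captions

# captions = load_captions("Flickr8k.token.txt")
# print(captions["1000268201_693b08cb0e.jpg"])  # the image's 5 captions
```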
3) Prerequisites
Before reading further, the reader should know some Deep Learning concepts such as Convolutional Neural Networks, Recurrent Neural Networks, Transfer Learning, and Natural Language Processing. The reader should also be familiar with basic Python syntax and data structures, the Keras library, and working in Google Colab.
4) Model generation
A merge-model architecture is used in this project to create the image caption generator. In this model, the encoded features of an image are combined with the encoded text data to generate the next word in the caption. In this approach, the RNN encodes only the text data and does not depend on the features of the image. After the captions have been encoded, they are merged with the image vector in a multimodal layer that comes after the RNN encoding layer. This architecture has the advantage of feeding preprocessed text data to the model instead of raw data. The model development has the following blocks:
- Image feature extractor
- Text Preprocessor
- Output predictor
- Fitting the Model
- Caption Generation
a) Image Feature Extractor
The feature extractor requires an input image of size 224x224x3. The model uses ResNet50 pretrained on the ImageNet dataset, with the features of the image extracted just before the final classification layer. An additional dense layer is added to obtain a feature vector of length 2048.
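A minimal sketch of this step, assuming the TensorFlow Keras implementation of ResNet50; here global average pooling yields the 2048-dimensional vector directly, standing in for the dense layer described above.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image

# ResNet50 pretrained on ImageNet, cut before the final classification layer;
# global average pooling gives a 2048-dimensional feature vector per image.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path, model=base):
    """Resize to 224x224x3, preprocess, and return a (2048,) feature vector."""
    img = keras_image.load_img(img_path, target_size=(224, 224))
    x = keras_image.img_to_array(img)            # shape (224, 224, 3)
    x = preprocess_input(np.expand_dims(x, 0))   # shape (1, 224, 224, 3)
    return model.predict(x, verbose=0)[0]        # shape (2048,)
```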
b) Text Preprocessor
To define the vocabulary, 8253 unique words are tokenized from the training dataset. Since computers do not understand English words, each word in the vocabulary is mapped to a unique index value and encoded into a fixed-size vector. For each caption we also maintain a list that stores the next word at each sub-iteration. One-hot encoding is then applied to this next-word list, and finally both the partial sequences and the one-hot encoded next words are converted into arrays.
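The sketch below illustrates this preprocessing with the Keras Tokenizer. The placeholder captions and the startseq/endseq wrapper tokens are assumptions added for illustration; in the project the tokenizer would be fitted on all training captions.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

MAX_LEN = 40  # maximum caption length used by the model

# Placeholder captions so the sketch runs end to end (assumption).
train_captions = ["startseq a dog runs in the grass endseq",
                  "startseq a child plays on a swing endseq"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_captions)       # builds the word -> index map
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved

def build_pairs(caption):
    """Split one caption into (partial sequence, one-hot next word) pairs."""
    seq = tokenizer.texts_to_sequences([caption])[0]
    X, y = [], []
    for i in range(1, len(seq)):
        X.append(seq[:i])                    # the words seen so far
        y.append(to_categorical(seq[i], num_classes=vocab_size))
    X = pad_sequences(X, maxlen=MAX_LEN)     # fixed-size input arrays
    return np.array(X), np.array(y)
```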
c) Output Predictor
For training the model, we first applied a Sequential model for the images, which contains a Dense layer with ‘relu’ as the element-wise activation function, followed by a RepeatVector layer with argument 40, which repeats its input 40 times. We then applied another Sequential model for the text, in which an Embedding layer is the first layer.
The output vectors from the image feature extractor and the text processor are of the same length (128), and a decoder merges the two using an addition operation. The result is then fed into two dense layers: the first is of length 128, and the second predicts the most probable next word in the caption using a softmax activation function over the vocabulary.
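One plausible way to assemble these pieces with the Keras functional API is sketched below. The layer sizes (128-dimensional vectors, RepeatVector of 40, softmax over the vocabulary of 8253 + 1 words) follow the text; the LSTM layers that encode the text branch and collapse the merged sequence are assumptions about details the text leaves open.

```python
from tensorflow.keras.layers import (Input, Dense, RepeatVector, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

MAX_LEN = 40       # caption length used by the RepeatVector layer
EMBED_DIM = 128    # shared length of the image and text feature vectors
VOCAB_SIZE = 8254  # 8253 unique words + 1 reserved index

# Image branch: 2048-d ResNet feature -> 128-d vector, repeated per time-step.
img_in = Input(shape=(2048,))
img_vec = Dense(EMBED_DIM, activation="relu")(img_in)
img_seq = RepeatVector(MAX_LEN)(img_vec)                   # (40, 128)

# Text branch: partial caption -> embedding -> LSTM-encoded sequence.
txt_in = Input(shape=(MAX_LEN,))
txt_emb = Embedding(VOCAB_SIZE, EMBED_DIM)(txt_in)
txt_seq = LSTM(EMBED_DIM, return_sequences=True)(txt_emb)  # (40, 128)

# Decoder: merge by element-wise addition, then two dense layers.
merged = add([img_seq, txt_seq])
decoded = LSTM(EMBED_DIM)(merged)        # collapses the sequence (assumption)
hidden = Dense(EMBED_DIM, activation="relu")(decoded)
output = Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[img_in, txt_in], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```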
d) Fitting the Model
After building the model, it is fit on the training dataset. The model is run for 210 epochs, and the best model is chosen by computing the loss on the Flickr8k development dataset after each epoch; the model with the lowest loss is used for generating captions.
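A hedged sketch of the fitting step, using a Keras ModelCheckpoint to keep the weights with the lowest development-set loss; the array names, checkpoint path, and batch size are placeholders standing in for the outputs of the preprocessing step above.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# X_img, X_txt, y and their dev-set counterparts are assumed to come from
# the feature-extraction and text-preprocessing steps sketched earlier.
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss",
                             save_best_only=True, verbose=1)

history = model.fit([X_img, X_txt], y,
                    validation_data=([X_img_dev, X_txt_dev], y_dev),
                    epochs=210,
                    batch_size=128,   # batch size is an assumption
                    callbacks=[checkpoint])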
e) Caption Generation
To test our trained model, we input an image. After resizing, the image is fed into the feature extractor to recognize the objects and scenes it depicts. Caption generation is then performed with the trained RNN model: the caption is built sequentially, word by word, by selecting the word with the maximum probability for the image at each iteration. Each predicted index is converted back to a word and appended to the caption. When the end-of-sequence tag is detected or the caption reaches 40 words, the final caption is printed along with the input image.
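A greedy decoding loop along these lines could look as follows, reusing the model, tokenizer, and feature extractor from the earlier sketches; the startseq/endseq token names are assumptions about how the captions were wrapped.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(photo_features, max_len=40):
    """Greedy decoding: pick the most probable next word at each step."""
    in_text = "startseq"  # start token; its exact name is an assumption
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features.reshape(1, -1), seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":  # end tag stops decoding
            break
        in_text += " " + word
    return in_text.replace("startseq", "").strip()

# caption = generate_caption(extract_features("test.jpg"))
```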
5) Performance measure
We plot the loss and accuracy at every epoch in two respective graphs. The loss decreases and the accuracy increases as the number of epochs grows; moreover, the more epochs, the smoother these curves become.
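Assuming the `history` object returned by `model.fit` in the fitting step above, the two graphs can be produced with matplotlib:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Loss per epoch on the training and development sets.
ax1.plot(history.history["loss"], label="train")
ax1.plot(history.history["val_loss"], label="dev")
ax1.set_xlabel("epoch"); ax1.set_ylabel("loss"); ax1.legend()
# Accuracy per epoch on the training set.
ax2.plot(history.history["accuracy"], label="train")
ax2.set_xlabel("epoch"); ax2.set_ylabel("accuracy"); ax2.legend()
plt.tight_layout()
plt.show()
```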
6) Results and outputs
During the testing phase of our implemented model, we found that it can generate sensible descriptions of images in valid English sentences. The generated captions are informative enough to describe the objects and elements in the image. Such image captioning can be helpful for visually impaired people, image search, autonomous driving systems, and more. The loss and accuracy achieved by the system support these results.
Some sample images from within and outside the dataset have been tested and we have got the following results.
7) Conclusion and future scope
Our described model is based on a CNN that encodes an image into a compact representation, followed by an RNN that generates corresponding sentences from the learned image features. It worked quite well when tested on several images, and the captions it generated were quite accurate. However, the source of the input image also plays an important role in feature extraction and hence in caption generation. Certain images are not recognized well, so there is still scope for improvement. Several additions could make this model even better: a larger dataset, using beam search to generate captions, implementing the BLEU score for performance measurement, and adding text-to-speech technology.