
Image Captioning

An encoder-decoder network trained on the COCO-2015 dataset.

Co-Developers: Sohail Lokhandwala, Shuyu Wang.

Encoder

The encoder is either a custom convolutional network or a ResNet-50 pre-trained on ImageNet.

For the custom model, we add a dropout layer that randomly disables half of the hidden units to improve generalization. We also increase the embedding size from 300 to 512. The larger embedding space gives the model more capacity to represent distinct words in its vocabulary, so the network can generate captions with a wider range of words.
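The dropout step can be sketched as follows (a minimal NumPy illustration of inverted dropout with p = 0.5, not our actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Randomly disable a fraction p of hidden units (inverted dropout)."""
    if not train:
        return h                          # dropout is only applied at training time
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale so the expected activation is unchanged

h = np.ones(512)                          # a 512-dimensional embedding, as in the encoder
out = dropout(h, p=0.5)                   # roughly half of the units are zeroed out
```

The surviving units are scaled by 1 / (1 - p), so no rescaling is needed at inference time.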

Decoder

The decoder is a custom Long Short Term Memory network.

It is trained with teacher forcing: at each step, we feed the network the ground-truth character of the label caption rather than its own previous prediction.
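The idea can be illustrated with a toy decoder (the step function below is a stand-in, not our LSTM): under teacher forcing the input at each step comes from the ground-truth caption, while in free-running mode the decoder consumes its own outputs.

```python
def decode_step(prev_token):
    """Stand-in for one LSTM step: a real decoder returns a distribution
    over the vocabulary; this toy version just echoes its input."""
    return prev_token

def run_decoder(target, teacher_forcing=True):
    """Unroll the decoder over a caption and record what it is fed at each step."""
    inputs, prev = [], target[0]          # target[0] is the <start> token
    for t in range(1, len(target)):
        inputs.append(prev)
        pred = decode_step(prev)
        prev = target[t] if teacher_forcing else pred
    return inputs

caption = [0, 4, 7, 9, 1]                 # toy token ids: <start>, w1, w2, w3, <end>
print(run_decoder(caption, teacher_forcing=True))   # fed the ground truth: [0, 4, 7, 9]
print(run_decoder(caption, teacher_forcing=False))  # fed its own outputs:  [0, 0, 0, 0]
```

Feeding the ground truth keeps early mistakes from compounding during training, which usually speeds up convergence.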

It generates captions either deterministically or stochastically, by sampling from a temperature-scaled probability distribution over the vocabulary. A very small temperature is nearly deterministic, while a large temperature makes low-frequency words more likely to be sampled.
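A sketch of the temperature-scaled softmax (the function name and example logits are illustrative): dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is small and flattens it toward uniform when it is large.

```python
import numpy as np

def temperature_probs(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for three words
p_low = temperature_probs(logits, 0.1)    # nearly one-hot: the top word dominates
p_high = temperature_probs(logits, 100.0) # nearly uniform: rare words get sampled too
```

Sampling a word is then `np.random.default_rng().choice(len(p), p=p)`; as the temperature approaches zero this reduces to greedy (argmax) decoding.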

Check Out Our Report

For more details regarding the structure of the network and the model performance, take a look at our report using the link below. 
