
Image Captioning

An encoder-decoder network trained on the COCO-2015 dataset.

Co-Developers: Sohail Lokhandwala, Shuyu Wang.

Encoder

The encoder is either a custom convolutional network or a ResNet-50 pre-trained on ImageNet.

For the custom model, we add a dropout layer that randomly disables half of the hidden units to improve generalization. We also increase the embedding size from 300 to 512. The larger embedding space gives the model more capacity to represent distinct words in its vocabulary, so the network can generate captions with a wider range of words.
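The dropout step can be sketched as follows (a minimal NumPy illustration of inverted dropout with p = 0.5, not our actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Randomly disable a fraction p of hidden units (inverted dropout)."""
    if not train:
        return h                          # dropout is only applied at training time
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale so the expected activation is unchanged

h = np.ones(512)                          # a 512-dimensional embedding, as in the encoder
out = dropout(h, p=0.5)                   # roughly half of the units are zeroed out
```

The surviving units are scaled by 1 / (1 - p), so no rescaling is needed at inference time.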

Decoder

The decoder is a custom Long Short Term Memory network.

It is trained with teacher forcing: at each step, we feed the network the ground-truth character of the label caption rather than its own previous prediction.
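The idea can be illustrated with a toy decoder (the step function below is a stand-in, not our LSTM): under teacher forcing the input at each step comes from the ground-truth caption, while in free-running mode the decoder consumes its own outputs.

```python
def decode_step(prev_token):
    """Stand-in for one LSTM step: a real decoder returns a distribution
    over the vocabulary; this toy version just echoes its input."""
    return prev_token

def run_decoder(target, teacher_forcing=True):
    """Unroll the decoder over a caption and record what it is fed at each step."""
    inputs, prev = [], target[0]          # target[0] is the <start> token
    for t in range(1, len(target)):
        inputs.append(prev)
        pred = decode_step(prev)
        prev = target[t] if teacher_forcing else pred
    return inputs

caption = [0, 4, 7, 9, 1]                 # toy token ids: <start>, w1, w2, w3, <end>
print(run_decoder(caption, teacher_forcing=True))   # fed the ground truth: [0, 4, 7, 9]
print(run_decoder(caption, teacher_forcing=False))  # fed its own outputs:  [0, 0, 0, 0]
```

Feeding the ground truth keeps early mistakes from compounding during training, which usually speeds up convergence.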

It generates captions either deterministically or stochastically, by sampling from a temperature-scaled probability distribution over the vocabulary. A very small temperature is nearly deterministic, while a large temperature makes low-frequency words more likely to be sampled.
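A sketch of the temperature-scaled softmax (the function name and example logits are illustrative): dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is small and flattens it toward uniform when it is large.

```python
import numpy as np

def temperature_probs(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for three words
p_low = temperature_probs(logits, 0.1)    # nearly one-hot: the top word dominates
p_high = temperature_probs(logits, 100.0) # nearly uniform: rare words get sampled too
```

Sampling a word is then `np.random.default_rng().choice(len(p), p=p)`; as the temperature approaches zero this reduces to greedy (argmax) decoding.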

Check Out Our Report

For more details regarding the structure of the network and the model performance, take a look at our report using the link below. 
