Bottom-up top-down attention for image captioning and VQA

Image captioning can be regarded as a translation task from a 2D array of pixels (the image) to a sequence of words (the caption). In the deep learning era this is typically done by training, end to end, a convolutional neural network (CNN) that encodes the image into a latent representation and a recurrent neural network (RNN) that decodes that representation into a caption. However, a single monolithic vector encoding the entire image fails to capture the salient features we may wish to describe in words. The bottom-up top-down (BUTD) model instead encodes the image as a set of bottom-up features produced by a region detector, and uses a top-down attention mechanism to focus on different regions as each word of the caption is generated. The resulting model produced state-of-the-art results on the COCO captioning challenge and remains the baseline against which many captioning models are measured.
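For concreteness, below is a minimal PyTorch sketch of the top-down attention step: the decoder's hidden state is used to weight the bottom-up region features before the next word is generated. The class name, dimensions, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft attention over a set of bottom-up region features,
    conditioned on the decoder's hidden state (illustrative sketch)."""

    def __init__(self, feat_dim=2048, hidden_dim=1000, attn_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)      # projects region features
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar attention score

    def forward(self, region_feats, hidden):
        # region_feats: (batch, k, feat_dim) -- k detected regions per image
        # hidden:       (batch, hidden_dim)  -- current decoder hidden state
        scores = self.score(torch.tanh(
            self.proj_feat(region_feats) + self.proj_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, k) unnormalized scores
        alpha = F.softmax(scores, dim=1)        # attention weights over regions
        attended = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)  # weighted sum
        return attended, alpha


# Illustrative usage: a batch of 4 images, each with 36 region features.
if __name__ == "__main__":
    attn = TopDownAttention()
    feats = torch.randn(4, 36, 2048)   # bottom-up features from a region detector
    h = torch.randn(4, 1000)           # decoder hidden state at the current step
    context, weights = attn(feats, h)
    print(context.shape, weights.shape)  # torch.Size([4, 2048]) torch.Size([4, 36])
```

At each decoding step the attended feature vector is fed back into the language model, so the caption generator can attend to different image regions for different words.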


