Bottom-up top-down attention for image captioning and VQA

Image captioning can be regarded as a translation task from a 2D array of pixels (the image) to a sequence of words (the caption). In the deep learning era this is typically done by training, end to end, a convolutional neural network (CNN) that encodes the image into a latent representation and a recurrent neural network (RNN) that decodes that representation into a caption. However, a single monolithic vector encoding the entire image fails to capture the salient features we may wish to describe in words. The bottom-up top-down (BUTD) model instead encodes the image as a set of bottom-up features produced by a region detector, and uses a top-down attention mechanism to focus on different regions as each word of the caption is generated. The resulting model produced state-of-the-art results on the COCO captioning challenge and remains the baseline against which many captioning models are measured.
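For concreteness, below is a minimal PyTorch sketch of the top-down attention step: the decoder's hidden state is used to weight the bottom-up region features before the next word is generated. The class name, dimensions, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft attention over a set of bottom-up region features,
    conditioned on the decoder's hidden state (illustrative sketch)."""

    def __init__(self, feat_dim=2048, hidden_dim=1000, attn_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)      # projects region features
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar attention score

    def forward(self, region_feats, hidden):
        # region_feats: (batch, k, feat_dim) -- k detected regions per image
        # hidden:       (batch, hidden_dim)  -- current decoder hidden state
        scores = self.score(torch.tanh(
            self.proj_feat(region_feats) + self.proj_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, k) unnormalized scores
        alpha = F.softmax(scores, dim=1)        # attention weights over regions
        attended = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)  # weighted sum
        return attended, alpha


# Illustrative usage: a batch of 4 images, each with 36 region features.
if __name__ == "__main__":
    attn = TopDownAttention()
    feats = torch.randn(4, 36, 2048)   # bottom-up features from a region detector
    h = torch.randn(4, 1000)           # decoder hidden state at the current step
    context, weights = attn(feats, h)
    print(context.shape, weights.shape)  # torch.Size([4, 2048]) torch.Size([4, 36])
```

At each decoding step the attended feature vector is fed back into the language model, so the caption generator can attend to different image regions for different words.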


