Connecting What to Say With Where to Look by Modeling Human Attention Traces
- URL: http://arxiv.org/abs/2105.05964v1
- Date: Wed, 12 May 2021 20:53:30 GMT
- Title: Connecting What to Say With Where to Look by Modeling Human Attention Traces
- Authors: Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman
- Abstract summary: We introduce a unified framework to jointly model images, text, and human attention traces.
We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image.
- Score: 30.8226861256742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a unified framework to jointly model images, text, and human
attention traces. Our work is built on top of the recent Localized Narratives
annotation framework [30], where each word of a given caption is paired with a
mouse trace segment. We propose two novel tasks: (1) predict a trace given an
image and caption (i.e., visual grounding), and (2) predict a caption and a
trace given only an image. Learning the grounding of each word is challenging,
due to noise in the human-provided traces and the presence of words that cannot
be meaningfully visually grounded. We present a novel model architecture that
is jointly trained on dual tasks (controlled trace generation and controlled
caption generation). To evaluate the quality of the generated traces, we
propose a local bipartite matching (LBM) distance metric which allows the
comparison of two traces of different lengths. Extensive experiments show our
model is robust to the imperfect training data and outperforms the baselines by
a clear margin. Moreover, we demonstrate that our model pre-trained on the
proposed tasks can be also beneficial to the downstream task of COCO's guided
image captioning. Our code and project page are publicly available.
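To make the local bipartite matching (LBM) idea mentioned in the abstract concrete, below is a minimal sketch of one plausible way to compare two traces of different lengths: each trace is split into a fixed number of segments, segment centroids are matched by a bipartite assignment restricted to a local temporal window, and the mean matching cost is reported. The function names, centroid-based segmentation, window size, and Euclidean cost are illustrative assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch of an LBM-style distance between two mouse traces of
# different lengths. Assumes each trace has at least `num_segments` points.
import numpy as np
from scipy.optimize import linear_sum_assignment


def segment_trace(trace, num_segments):
    """Split a trace (N x 2 array of (x, y) points) into roughly equal
    temporal segments and represent each segment by its centroid."""
    chunks = np.array_split(np.asarray(trace, dtype=float), num_segments)
    return np.stack([c.mean(axis=0) for c in chunks])


def lbm_distance(trace_a, trace_b, num_segments=16, window=3):
    """Compare two traces of possibly different lengths by matching their
    segment centroids with a bipartite assignment; matches between segments
    far apart in time are forbidden so the alignment stays locally ordered."""
    a = segment_trace(trace_a, num_segments)
    b = segment_trace(trace_b, num_segments)
    # Pairwise Euclidean cost between segment centroids.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Disallow matches whose temporal indices differ by more than `window`.
    idx = np.arange(num_segments)
    cost[np.abs(idx[:, None] - idx[None, :]) > window] = 1e6
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()


# Usage example with two random traces of different lengths.
rng = np.random.default_rng(0)
trace_1 = rng.random((120, 2))  # 120 (x, y) points
trace_2 = rng.random((95, 2))   # 95 (x, y) points
print(lbm_distance(trace_1, trace_2))
```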
Related papers
- A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models [17.144311122664508]
A large-scale vision and language model pretrained on massive data encodes visual and linguistic priors.
We propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images.
arXiv Detail & Related papers (2025-02-19T18:35:43Z) - Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z) - CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z) - Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines those entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z) - On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Neural Twins Talk [0.0]
We introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model.
Visual grounding ensures that the words in the caption sentence are grounded in particular regions of the input image.
We report the results of our experiments in three image captioning tasks on the COCO dataset.
arXiv Detail & Related papers (2020-09-26T06:58:58Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.