Connecting What to Say With Where to Look by Modeling Human Attention Traces
- URL: http://arxiv.org/abs/2105.05964v1
- Date: Wed, 12 May 2021 20:53:30 GMT
- Title: Connecting What to Say With Where to Look by Modeling Human Attention Traces
- Authors: Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi,
Vikas Singh, Amy Bearman
- Abstract summary: We introduce a unified framework to jointly model images, text, and human attention traces.
We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image.
- Score: 30.8226861256742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a unified framework to jointly model images, text, and human
attention traces. Our work is built on top of the recent Localized Narratives
annotation framework [30], where each word of a given caption is paired with a
mouse trace segment. We propose two novel tasks: (1) predict a trace given an
image and caption (i.e., visual grounding), and (2) predict a caption and a
trace given only an image. Learning the grounding of each word is challenging,
due to noise in the human-provided traces and the presence of words that cannot
be meaningfully visually grounded. We present a novel model architecture that
is jointly trained on dual tasks (controlled trace generation and controlled
caption generation). To evaluate the quality of the generated traces, we
propose a local bipartite matching (LBM) distance metric which allows the
comparison of two traces of different lengths. Extensive experiments show our
model is robust to the imperfect training data and outperforms the baselines by
a clear margin. Moreover, we demonstrate that our model pre-trained on the
proposed tasks can be also beneficial to the downstream task of COCO's guided
image captioning. Our code and project page are publicly available.
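The abstract introduces a local bipartite matching (LBM) distance for comparing two traces of different lengths. The paper does not give the formulation here, so the following is only a minimal illustrative sketch of one plausible local bipartite matching distance: both traces are resampled into a fixed number of segments, and a minimum-cost one-to-one assignment is computed between segment centroids, with matches restricted to a local window. The segment count, window size, and cost function are assumptions for illustration, not the authors' exact metric.

```python
import itertools
import math

def resample(trace, k):
    """Split a trace (a list of (x, y) points) into k consecutive bins
    and return the centroid of each bin."""
    n = len(trace)
    bins = [[] for _ in range(k)]
    for i, p in enumerate(trace):
        bins[min(i * k // n, k - 1)].append(p)
    centroids = []
    for b in bins:
        if b:
            centroids.append((sum(x for x, _ in b) / len(b),
                              sum(y for _, y in b) / len(b)))
        else:
            # Empty bin (trace shorter than k): repeat the previous centroid.
            centroids.append(centroids[-1] if centroids else (0.0, 0.0))
    return centroids

def lbm_distance(trace_a, trace_b, k=5, window=1):
    """Illustrative local bipartite matching distance: resample both
    traces into k segments, then find the minimum-cost one-to-one
    assignment between segment centroids, where segment i may match
    only segments within `window` positions of i (locality constraint).
    Brute-force over permutations, so keep k small."""
    a = resample(trace_a, k)
    b = resample(trace_b, k)

    def cost(i, j):
        if abs(i - j) > window:
            return math.inf  # match violates the locality constraint
        return math.dist(a[i], b[j])

    best = math.inf
    for perm in itertools.permutations(range(k)):
        best = min(best, sum(cost(i, j) for i, j in enumerate(perm)))
    return best / k  # average per-segment matching cost
```

Because both traces are resampled to the same number of segments before matching, the distance is well defined for traces of different lengths; identical traces yield a distance of zero.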
Related papers
- Dual Modalities of Text: Visual and Textual Generative Pre-training [35.82610192457444]
We introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-trained on a corpus of over 400 million documents rendered as RGB images.
Our approach is characterized by a dual-modality training regimen, engaging both visual data through next-patch prediction with a regression head and textual data via next-token prediction with a classification head.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanations of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines those entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task as well as recent open-vocabulary segmentation models on all benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modeling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Neural Twins Talk [0.0]
We introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model.
Visual grounding ensures that words in the generated caption are grounded in particular regions of the input image.
We report the results of our experiments on three image captioning tasks on the COCO dataset.
arXiv Detail & Related papers (2020-09-26T06:58:58Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our model first constructs caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.