Analogical Reasoning for Visually Grounded Language Acquisition
- URL: http://arxiv.org/abs/2007.11668v1
- Date: Wed, 22 Jul 2020 20:51:58 GMT
- Title: Analogical Reasoning for Visually Grounded Language Acquisition
- Authors: Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang
- Abstract summary: Children acquire language subconsciously by observing the surrounding world and listening to descriptions.
In this paper, we bring this ability to AI, by studying the task of Visually grounded Language Acquisition.
We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning.
- Score: 55.14286413675306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Children acquire language subconsciously by observing the surrounding world
and listening to descriptions. They can discover the meaning of words even
without explicit language knowledge, and generalize to novel compositions
effortlessly. In this paper, we bring this ability to AI, by studying the task
of Visually grounded Language Acquisition (VLA). We propose a multimodal
transformer model augmented with a novel mechanism for analogical reasoning,
which approximates novel compositions by learning semantic mapping and
reasoning operations from previously seen compositions. Our proposed method,
Analogical Reasoning Transformer Networks (ARTNet), is trained on raw
multimedia data (video frames and transcripts), and after observing a set of
compositions such as "washing apple" or "cutting carrot", it can generalize and
recognize new compositions in new video frames, such as "washing carrot" or
"cutting apple". To this end, ARTNet refers to relevant instances in the
training data and uses their visual features and captions to establish
analogies with the query image. Then it chooses the suitable verb and noun to
create a new composition that describes the new image best. Extensive
experiments on an instructional video dataset demonstrate that the proposed
method achieves significantly better generalization capability and recognition
accuracy compared to state-of-the-art transformer models.
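The retrieval-and-composition idea in the abstract lends itself to a short sketch. The Python snippet below is a minimal, schematic illustration on toy features, not the authors' ARTNet code: the function names (retrieve_neighbors, score_composition), the cosine-similarity retrieval, and the hand-written scoring rule are stand-ins for the semantic mapping and reasoning operations that ARTNet learns end-to-end inside a multimodal transformer.

```python
import numpy as np

def retrieve_neighbors(query_feat, memory_feats, k=3):
    """Return indices of the k training clips whose visual features are
    most similar (cosine) to the query frame, plus all similarities."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k], sims

def score_composition(verb, noun, neighbor_captions, neighbor_sims):
    """Score a candidate (verb, noun) pair by how strongly each word is
    supported by the retrieved analogies, weighted by visual similarity."""
    score = 0.0
    for (v, n), s in zip(neighbor_captions, neighbor_sims):
        score += s * ((v == verb) + (n == noun))
    return score

# Toy memory of previously seen compositions: visual features + (verb, noun) captions.
rng = np.random.default_rng(0)
memory_feats = rng.normal(size=(4, 8))
memory_caps = [("washing", "apple"), ("cutting", "carrot"),
               ("washing", "tomato"), ("cutting", "apple")]

query = rng.normal(size=8)          # feature of a new frame, e.g. one showing "washing carrot"
idx, sims = retrieve_neighbors(query, memory_feats, k=3)
neighbor_caps = [memory_caps[i] for i in idx]

verbs = {v for v, _ in memory_caps}
nouns = {n for _, n in memory_caps}
best = max(((v, n) for v in verbs for n in nouns),
           key=lambda vn: score_composition(*vn, neighbor_caps, sims[idx]))
print("predicted composition:", best)
```

Even this crude version captures the flow the abstract describes: find seen compositions whose frames resemble the query, then transfer their verbs and nouns to form a new composition such as "washing carrot".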
Related papers
- Unveiling the Invisible: Captioning Videos with Metaphors [43.53477124719281]
In this work, we introduce a new Vision-Language (VL) task of describing the metaphors present in videos.
To facilitate this novel task, we construct and release a dataset with 705 videos and 2115 human-written captions.
We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task.
arXiv Detail & Related papers (2024-06-07T12:32:44Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods when both videos and texts are used as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory that retrieves knowledge from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
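The kNN-augmented attention layer described in the Retrieval-Augmented Transformer entry above is straightforward to sketch. The snippet below is an illustrative approximation rather than the paper's implementation: the module name, dimensions, retrieval keyed on the mean visual feature, and the plain concatenation of retrieved entries into the attention context are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNAugmentedAttention(nn.Module):
    """Cross-attention whose keys/values are the image features
    concatenated with the top-k entries retrieved from an external memory."""
    def __init__(self, d_model=256, k=8):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, tokens, image_feats, memory_keys, memory_values):
        # Retrieve the k memory entries most similar to the mean image feature.
        query_vec = image_feats.mean(dim=1)                                          # (B, D)
        sims = F.normalize(query_vec, dim=-1) @ F.normalize(memory_keys, dim=-1).T   # (B, M)
        topk = sims.topk(self.k, dim=-1).indices                                     # (B, k)
        retrieved = memory_values[topk]                                              # (B, k, D)

        # Attend over visual tokens and retrieved knowledge jointly.
        context = torch.cat([image_feats, retrieved], dim=1)                         # (B, N + k, D)
        out, _ = self.attn(tokens, context, context)
        return out

# Toy usage: 2 captions of 5 tokens, 10 image regions, an external memory of 100 entries.
layer = KNNAugmentedAttention(d_model=256, k=8)
tokens = torch.randn(2, 5, 256)
image_feats = torch.randn(2, 10, 256)
memory_keys = torch.randn(100, 256)
memory_values = torch.randn(100, 256)
print(layer(tokens, image_feats, memory_keys, memory_values).shape)  # torch.Size([2, 5, 256])
```

The design choice being illustrated is that the decoder does not only look at the current image: retrieved exemplars sit in the same attention context, so the token predictor can copy knowledge from visually similar training instances.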
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
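The training signal described for COBE above, a detector predicting a contextualized word embedding of the narrated object, can be summarized schematically. The sketch below is an assumption-laden illustration, not the paper's model: the projection head, the batch-wise contrastive loss, and the use of randomly generated targets in place of embeddings from a language model run on the ASR narration are all stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualEmbeddingHead(nn.Module):
    """Maps a detected region's visual feature into the space of
    contextualized word embeddings of the narrated object."""
    def __init__(self, visual_dim=1024, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(visual_dim, embed_dim)

    def forward(self, region_feats):
        return F.normalize(self.proj(region_feats), dim=-1)

def contrastive_loss(pred, target, temperature=0.07):
    """Pull each region's prediction toward the contextualized embedding of its
    narrated object and away from the other objects in the batch."""
    target = F.normalize(target, dim=-1)
    logits = pred @ target.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(pred.size(0))             # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: 4 detected regions paired with contextualized embeddings of the
# object word as it appears in the narration (in practice, e.g. from BERT).
head = ContextualEmbeddingHead()
regions = torch.randn(4, 1024)
narration_embeds = torch.randn(4, 768)
loss = contrastive_loss(head(regions), narration_embeds)
loss.backward()
```

Because the target embedding depends on the surrounding narration, the same object word gets different targets in different contexts, which is what makes the learned object representations "contextualized" and useful for few-shot and zero-shot recognition.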