Vision and Language: from Visual Perception to Content Creation
- URL: http://arxiv.org/abs/1912.11872v1
- Date: Thu, 26 Dec 2019 14:07:20 GMT
- Title: Vision and Language: from Visual Perception to Content Creation
- Authors: Tao Mei, Wei Zhang, Ting Yao
- Abstract summary: "vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision."
- Score: 100.36776435627962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and language are two fundamental capabilities of human intelligence.
Humans routinely perform tasks through the interaction between vision and
language, which supports the uniquely human capacity to talk about what they see
or to hallucinate a picture from a natural-language description. The question of
how language interacts with vision motivates researchers to expand the
horizons of the computer vision field. In particular, "vision to language" is
probably one of the most popular topics in the past five years, with
significant growth in both the volume of publications and the range of applications,
e.g., captioning, visual question answering, visual dialog, language
navigation, etc. Such tasks boost visual perception with more comprehensive
understanding and diverse linguistic representations. Going beyond the
progress made in "vision to language," language can also contribute to visual
understanding and open new possibilities for visual content creation, i.e.,
"language to vision." This process acts as a prism through which visual
content is created conditioned on language inputs. This paper reviews the
recent advances along these two dimensions: "vision to language" and "language
to vision." More concretely, the former mainly focuses on the development of
image/video captioning, as well as typical encoder-decoder structures and
benchmarks, while the latter summarizes technologies for visual content
creation. Real-world deployments and services of vision and language are
discussed as well.
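The "typical encoder-decoder structures" for image/video captioning mentioned above follow a common recipe: a visual encoder (e.g., a CNN) maps the image to a feature vector, and a language decoder (e.g., an LSTM) generates the caption word by word conditioned on that vector. The sketch below is a minimal, illustrative PyTorch version under those assumptions; the module names, backbone choice, and dimensions are ours, not those of any specific model surveyed in the paper.

```python
# Minimal sketch of an encoder-decoder captioning model: CNN encoder + LSTM decoder.
# All names, sizes, and the backbone choice are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a ResNet backbone."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights optional
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.proj = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)   # (B, 2048)
        return self.proj(feats)                    # (B, embed_dim)


class CaptionDecoder(nn.Module):
    """LSTM decoder that conditions word generation on the image embedding."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_embed: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first "token" of the sequence.
        word_embeds = self.embed(captions)                             # (B, T, E)
        inputs = torch.cat([image_embed.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)                                  # (B, T+1, H)
        return self.out(hidden)                                        # vocabulary logits


# Usage sketch: random inputs standing in for a real image-caption batch.
encoder, decoder = CNNEncoder(), CaptionDecoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)   # (2, 13, 10000)
```

In practice such models are trained with teacher forcing (cross-entropy on the next ground-truth word) and decoded with greedy or beam search at test time; attention over spatial or temporal features is the usual refinement on top of this skeleton.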
Related papers
- Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives [38.758137801255714]
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us.
There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics.
arXiv Detail & Related papers (2024-06-09T02:36:28Z)
- Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- Using Left and Right Brains Together: Towards Vision and Language Planning [95.47128850991815]
We introduce a novel vision-language planning framework to perform concurrent visual and language planning for tasks with inputs of any form.
We evaluate the effectiveness of our framework across vision-language tasks, vision-only tasks, and language-only tasks.
arXiv Detail & Related papers (2024-02-16T09:46:20Z)
- Analyzing the Roles of Language and Vision in Learning from Limited Data [31.895396236504993]
We study the contributions that language and vision make to learning about the world.
We find that a language model leveraging all components recovers a majority of a Vision-Language Model's performance.
arXiv Detail & Related papers (2024-02-15T22:19:41Z)
- Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
arXiv Detail & Related papers (2022-04-18T19:39:36Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
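A minimal sketch of this style of cross-modal contrastive alignment is included after this list.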
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- Can machines learn to see without visual databases? [93.73109506642112]
This paper focuses on developing machines that learn to see without needing to handle visual databases.
This might open the doors to a truly competitive track concerning deep learning technologies for vision.
arXiv Detail & Related papers (2021-10-12T13:03:54Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
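The grounding-by-contrastive-learning entry above ("Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning") aligns visual and language representations on paired image-caption data such as MS COCO. The sketch below shows a minimal symmetric contrastive (InfoNCE-style) alignment objective of that general kind; the embedding size, temperature, and random stand-in features are illustrative assumptions, not the paper's exact two-stream setup.

```python
# Minimal sketch of a cross-modal contrastive alignment objective over a batch of
# matched image-caption pairs. Sizes and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_embeds: torch.Tensor,
                               text_embeds: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: pair i is the matching image-caption pair."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Cosine-similarity logits between every image and every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matching pairs sit on the diagonal; pull them together, push others apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Usage sketch with random features standing in for the vision and language streams.
img = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(contrastive_alignment_loss(img, txt).item())
```

The symmetric cross-entropy over in-batch pairs pulls matched image-caption embeddings together and pushes mismatched ones apart, which is the general mechanism behind learning a visually grounded semantic space.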
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.