Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
- URL: http://arxiv.org/abs/2406.05615v2
- Date: Mon, 1 Jul 2024 16:05:01 GMT
- Title: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
- Authors: Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
- Abstract summary: Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital faculties, since they allow us to easily communicate our thoughts and perceive the world around us.
There has been considerable interest in creating video-language understanding systems with human-like capabilities, since a video-language pair can mimic both our linguistic medium and our visual environment with its temporal dynamics.
- Score: 38.758137801255714
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital faculties, since they allow us to easily communicate our thoughts and perceive the world around us. There has been considerable interest in creating video-language understanding systems with human-like capabilities, since a video-language pair can mimic both our linguistic medium and our visual environment with its temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on these challenges, we summarize existing methods from the model architecture, model training, and data perspectives. We also compare the performance of these methods and discuss promising directions for future research.
Related papers
- Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Vision-Language Models in Remote Sensing: Current Progress and Future Trends [25.017685538386548]
Vision-language models enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics.
Vision-language models can go beyond visual recognition of remote sensing (RS) images to model semantic relationships and generate natural language descriptions of the images.
This paper provides a comprehensive review of the research on vision-language models in remote sensing.
arXiv Detail & Related papers (2023-05-09T19:17:07Z)
- EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods for both video and text task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z)
- Deep Neural Networks for Visual Reasoning [12.411844611718958]
It is crucial for machines to have the capacity to reason using visual perception and language understanding.
Recent advances in deep learning have built separate sophisticated representations of both visual scenes and languages.
This thesis advances the understanding of how to exploit and use pivotal aspects of vision-and-language tasks with neural networks to support reasoning.
arXiv Detail & Related papers (2022-09-24T12:11:00Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- Vision and Language: from Visual Perception to Content Creation [100.36776435627962]
"vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision"
arXiv Detail & Related papers (2019-12-26T14:07:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.