Dilated Context Integrated Network with Cross-Modal Consensus for
Temporal Emotion Localization in Videos
- URL: http://arxiv.org/abs/2208.01954v1
- Date: Wed, 3 Aug 2022 10:00:49 GMT
- Title: Dilated Context Integrated Network with Cross-Modal Consensus for
Temporal Emotion Localization in Videos
- Authors: Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao
Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang
- Abstract summary: TEL presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
- Score: 128.70585652795637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding human emotions is a crucial ability for intelligent robots to
provide better human-robot interactions. The existing works are limited to
trimmed video-level emotion classification, failing to locate the temporal
window corresponding to the emotion. In this paper, we introduce a new task,
named Temporal Emotion Localization in videos~(TEL), which aims to detect human
emotions and localize their corresponding temporal boundaries in untrimmed
videos with aligned subtitles. TEL presents three unique challenges compared to
temporal action localization: 1) The emotions have extremely varied temporal
dynamics; 2) The emotion cues are embedded in both appearances and complex
plots; 3) The fine-grained temporal annotations are complicated and
labor-intensive. To address the first two challenges, we propose a novel
dilated context integrated network with a coarse-fine two-stream architecture.
The coarse stream captures varied temporal dynamics by modeling
multi-granularity temporal contexts. The fine stream achieves complex plot
understanding by reasoning about the dependencies between the
multi-granularity temporal contexts from the coarse stream and adaptively
integrating them into fine-grained video segment features. To address the
third challenge, we
introduce a cross-modal consensus learning paradigm, which leverages the
inherent semantic consensus between the aligned video and subtitle to achieve
weakly-supervised learning. We contribute a new testing set with 3,000
manually-annotated temporal boundaries so that future research on the TEL
problem can be quantitatively evaluated. Extensive experiments show the
effectiveness of our approach on temporal emotion localization. The repository
of this work is at
https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
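To make the abstract's two main ideas concrete, below is a minimal PyTorch-style sketch of (a) a coarse-fine two-stream module that builds multi-granularity temporal context with dilated 1D convolutions and adaptively folds it back into segment features, and (b) a video-subtitle consensus loss for weak supervision. All class names, layer sizes, and the InfoNCE-style loss are assumptions for exposition only, not the authors' released implementation; see the repository above for the actual architecture and objective.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseStream(nn.Module):
    """Coarse stream: multi-granularity temporal context via dilated 1D convolutions."""

    def __init__(self, dim=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )

    def forward(self, seg_feats):                       # seg_feats: (B, T, D) video segment features
        x = seg_feats.transpose(1, 2)                   # (B, D, T) for Conv1d
        contexts = [branch(x).transpose(1, 2) for branch in self.branches]
        return torch.stack(contexts, dim=1)             # (B, G, T, D), one context per granularity


class FineStream(nn.Module):
    """Fine stream: each segment attends over the granularities, and the result is
    adaptively integrated back into the fine-grained segment features."""

    def __init__(self, dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, seg_feats, contexts):             # contexts: (B, G, T, D)
        q = self.query(seg_feats).unsqueeze(1)          # (B, 1, T, D)
        k = self.key(contexts)                          # (B, G, T, D)
        scores = (q * k).sum(-1) / q.size(-1) ** 0.5    # (B, G, T)
        weights = F.softmax(scores, dim=1)              # attention over granularities
        fused = (weights.unsqueeze(-1) * contexts).sum(dim=1)   # (B, T, D)
        return seg_feats + fused


def consensus_loss(video_emb, subtitle_emb, temperature=0.07):
    """Stand-in for the cross-modal consensus objective: aligned video/subtitle
    pairs should agree semantically, so matched pairs are pulled together with a
    symmetric InfoNCE loss (the paper's exact formulation may differ)."""
    v = F.normalize(video_emb, dim=-1)                  # (B, D) pooled video features
    s = F.normalize(subtitle_emb, dim=-1)               # (B, D) pooled subtitle features
    logits = v @ s.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

In this sketch, the dilated branches give the coarse stream receptive fields at several temporal scales, which is one simple way to cover emotions with very different durations; the contrastive loss only illustrates how aligned subtitles could supervise video features without fine-grained temporal labels.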
Related papers
- Dual-path Collaborative Generation Network for Emotional Video Captioning [33.230028098522254]
Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos.
Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with the video features to guide emotional caption generation.
We propose a dual-path collaborative generation network that dynamically perceives the evolution of visual emotional cues while generating emotional captions.
arXiv Detail & Related papers (2024-08-06T07:30:53Z)
- MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues [0.0]
We propose a time-sensitive Multimodal Large Language Model (MLLM) aimed at directing attention to the local facial micro-expression dynamics.
Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level timestamp-bound image features with local facial features capturing the temporal dynamics of micro-expressions; and (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, and then combining them.
arXiv Detail & Related papers (2024-07-23T15:05:55Z)
- HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach for ego video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances within a single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)