MTGA: Multi-View Temporal Granularity Aligned Aggregation for Event-Based Lip-Reading
- URL: http://arxiv.org/abs/2404.11979v2
- Date: Fri, 31 Jan 2025 15:51:46 GMT
- Title: MTGA: Multi-View Temporal Granularity Aligned Aggregation for Event-Based Lip-Reading
- Authors: Wenhao Zhang, Jun Wang, Yong Luo, Lei Yu, Wei Yu, Zheng He, Jialie Shen
- Abstract summary: Lip-reading utilizes the visual information of a speaker's lip movements to recognize words and sentences. We propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Our method outperforms both event-based and video-based lip-reading counterparts.
- Score: 23.071296603664656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip-reading utilizes the visual information of a speaker's lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate different frame rate branches to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably leads to the loss of fine-grained temporal information within frames. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely the time-segmented voxel graph list, where the most significant local voxels are temporally connected into a graph list. Then we design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames and the local relative spatial and temporal features contained in the voxel graph list are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, which enables the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both event-based and video-based lip-reading counterparts.
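
As a rough, non-authoritative illustration of the pipeline described in the abstract, the NumPy sketch below builds event frames (coarse granularity), a time-segmented list of the most significant voxels (fine granularity), a toy granularity-aligned fusion, and a temporal aggregation with a sinusoidal positional encoding. All tensor shapes, the voxel-selection rule, the fusion operator, and the encoding are illustrative assumptions, not the paper's actual modules (which use a voxel graph list and learned networks).

```python
# Minimal NumPy sketch of the MTGA data flow described in the abstract.
# Shapes, the voxel-selection rule, and the fusion operator are illustrative
# assumptions, not the paper's actual architecture.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic event stream: (x, y, t, polarity), t in [0, 1).
num_events = 5000
events = np.stack([
    rng.integers(0, 64, num_events),      # x
    rng.integers(0, 64, num_events),      # y
    rng.random(num_events),               # t
    rng.choice([-1, 1], num_events),      # polarity
], axis=1)

H = W = 64
num_frames = 8          # coarse temporal granularity (event frames)
bins_per_frame = 4      # fine temporal granularity inside each frame
top_k = 32              # "most significant" voxels kept per time segment

# 1) Coarse view: aggregate events into event frames (polarity histograms).
frame_idx = np.minimum((events[:, 2] * num_frames).astype(int), num_frames - 1)
event_frames = np.zeros((num_frames, H, W))
np.add.at(event_frames,
          (frame_idx, events[:, 1].astype(int), events[:, 0].astype(int)),
          events[:, 3])

# 2) Fine view: time-segmented voxel list. Within each frame interval, split
#    time into finer bins and keep the top-k most active voxels; in the paper
#    these voxels are further connected into a graph list.
voxel_lists = []
for f in range(num_frames):
    in_frame = events[frame_idx == f]
    bin_idx = np.minimum(((in_frame[:, 2] * num_frames - f) * bins_per_frame).astype(int),
                         bins_per_frame - 1)
    voxels = np.zeros((bins_per_frame, H, W))
    np.add.at(voxels,
              (bin_idx, in_frame[:, 1].astype(int), in_frame[:, 0].astype(int)),
              np.abs(in_frame[:, 3]))
    flat = voxels.reshape(-1)
    keep = np.argsort(flat)[-top_k:]                  # most significant voxels
    t_bin, ys, xs = np.unravel_index(keep, voxels.shape)
    voxel_lists.append(np.stack([t_bin, ys, xs, flat[keep]], axis=1))

# 3) Temporal-granularity-aligned fusion (toy version): both views are reduced
#    to one feature per frame interval, so they can be concatenated.
global_feat = event_frames.reshape(num_frames, -1).mean(axis=1, keepdims=True)
local_feat = np.stack([v[:, 3].mean(keepdims=True) for v in voxel_lists])
fused = np.concatenate([global_feat, local_feat], axis=1)     # (num_frames, 2)

# 4) Temporal aggregation with a sinusoidal positional encoding.
pos = np.arange(num_frames)[:, None]
pe = np.concatenate([np.sin(pos / 10.0), np.cos(pos / 10.0)], axis=1)
sequence_feature = (fused + pe).mean(axis=0)                  # pooled clip descriptor
print(sequence_feature.shape)                                 # (2,)
```

The point of the alignment step is that both views are brought to the same temporal resolution (one feature per frame interval) before being combined, which is what allows the coarse event-frame granularity and the fine voxel granularity to be fused.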
Related papers
- Salient Temporal Encoding for Dynamic Scene Graph Generation [0.765514655133894]
We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs.
The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to 4.4% in Scene Graph Detection.
arXiv Detail & Related papers (2025-03-15T08:01:36Z)
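
As a hedged sketch of the sparse temporal connections described in the entry above, the code below links object detections across adjacent frames only when their feature similarity passes a threshold; the cosine-similarity criterion and the threshold are assumptions for illustration, not the paper's relevance measure.

```python
# Hedged sketch: sparse temporal edges between object detections in
# consecutive frames, keeping only pairs whose cosine similarity is high.
# The relevance measure and threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
frames = [rng.normal(size=(4, 16)) for _ in range(3)]   # 3 frames, 4 object features each

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

temporal_edges = []   # (frame_t, obj_i, frame_t+1, obj_j)
for t in range(len(frames) - 1):
    sim = cosine(frames[t], frames[t + 1])
    for i, j in zip(*np.where(sim > 0.2)):              # keep only relevant pairs
        temporal_edges.append((t, int(i), t + 1, int(j)))

print(f"{len(temporal_edges)} sparse temporal edges instead of "
      f"{(len(frames) - 1) * 4 * 4} dense ones")
```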
- LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs [55.81291976637705]
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations.
Existing methods map textual positions and durations into the visual space from frame-based videos, but suffer from temporal sparsity that limits temporal coordination.
We introduce LLaFEA to leverage event cameras for temporally dense perception and frame-event fusion.
arXiv Detail & Related papers (2025-03-10T05:30:30Z)
- EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching [23.54420911697424]
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching.
We reformulate event-based flow estimation and stereo matching as a unified dense correspondence matching problem.
Our model can effectively handle both optical flow and stereo estimation, achieving state-of-the-art performance on both tasks.
arXiv Detail & Related papers (2024-07-31T16:43:20Z)
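
A generic sketch of the unified dense-correspondence view mentioned above: once every location in one feature map is matched to a location in the other, optical flow is the full 2D displacement and stereo disparity is just its horizontal component. The correlate-and-argmax matcher below is an assumption-laden toy, not EMatch's network.

```python
# Generic sketch: dense correspondence by feature correlation; flow and
# disparity are different readouts of the same matching result.
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 64
feat_a = rng.normal(size=(H, W, C))
feat_b = np.roll(feat_a, shift=(0, 2), axis=(0, 1))     # target shifted 2 px to the right

flow = np.zeros((H, W, 2))
for y in range(H):
    for x in range(W):
        corr = np.einsum("c,hwc->hw", feat_a[y, x], feat_b)   # correlation volume slice
        by, bx = np.unravel_index(np.argmax(corr), corr.shape)
        flow[y, x] = (bx - x, by - y)                    # dense 2D correspondence

disparity = flow[..., 0]                                 # stereo reduces to the horizontal part
print(flow[4, 3], disparity[4, 3])                       # mostly recovers the 2-pixel shift
```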
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z)
- ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization [62.751303924391564]
How to effectively explore spatial-temporal features is important for video colorization.
We develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames.
We develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood.
arXiv Detail & Related papers (2024-04-09T12:23:30Z)
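
A minimal sketch of the two modules mentioned above, assuming simple dot-product attention: a memory bank built from far-apart frames that the current frame attends to, plus an average over a small window of adjacent frames. Feature sizes, the memory-selection rule, and the fusion are illustrative assumptions.

```python
# Hedged sketch: (a) attention over a memory bank of far-apart frame features,
# (b) simple aggregation over a local window of adjacent frames.
# Feature sizes and the attention form are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, D = 20, 32
frame_feats = rng.normal(size=(T, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

memory = frame_feats[::5]                         # keep every 5th frame as long-term memory

def colorize_step(t, window=2):
    query = frame_feats[t]
    # (a) memory read: attention over far-apart reference frames
    weights = softmax(memory @ query / np.sqrt(D))
    long_range = weights @ memory
    # (b) local aggregation: average the adjacent frames in a small window
    lo, hi = max(0, t - window), min(T, t + window + 1)
    local = frame_feats[lo:hi].mean(axis=0)
    return long_range + local                     # fused feature for frame t

print(colorize_step(13).shape)                    # (D,)
```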
- Local Compressed Video Stream Learning for Generic Event Boundary Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before being fed into the network.
We propose a novel event boundary detection method that is fully end-to-end, leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
- Local-Global Temporal Difference Learning for Satellite Video Super-Resolution [55.69322525367221]
We propose to exploit the well-defined temporal difference for efficient and effective temporal compensation.
To fully utilize the local and global temporal information within frames, we systematically model the short-term and long-term temporal discrepancies.
Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2023-04-10T07:04:40Z)
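
The short-term and long-term temporal discrepancies mentioned above can be written down directly; the sketch below takes differences between adjacent frames and between each frame and a distant reference frame. How these differences drive compensation in the actual method is not modeled here and is an assumption-free placeholder.

```python
# Hedged sketch: short-term temporal difference (adjacent frames) and
# long-term temporal difference (each frame vs. a distant reference frame).
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 32, 32
frames = rng.random((T, H, W))

short_term = frames[1:] - frames[:-1]        # local motion cues between neighbors
reference = frames[0]                        # distant anchor frame
long_term = frames - reference               # global drift relative to the anchor

print(short_term.shape, long_term.shape)     # (9, 32, 32) (10, 32, 32)
```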
- Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order [78.46986998674181]
We propose a Unified Timeline Summarizer (UTS) that can generate abstractive and extractive timeline summaries in time order.
We augment the previous Chinese large-scale timeline summarization dataset and collect a new English timeline dataset.
UTS achieves state-of-the-art performance in terms of both automatic and human evaluations.
arXiv Detail & Related papers (2023-01-02T20:29:40Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
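
As a rough reading of the graph construction described above, the sketch below treats frames as nodes of a fully connected temporal graph and assigns each neighbor of a node to one of several temporal regions by its signed temporal offset; the number of regions and their boundaries are assumptions for illustration.

```python
# Hedged sketch: a video as a fully connected temporal graph, with each node's
# neighbors partitioned into coarse temporal regions. The number of regions
# and their boundaries are illustrative assumptions.
import numpy as np

T = 8                                    # frames = graph nodes
num_regions = 4

def temporal_region(src, dst):
    """Assign neighbor dst of node src to one of num_regions buckets by offset."""
    offset = dst - src                   # signed temporal distance
    bins = np.linspace(-T, T, num_regions + 1)
    return int(np.digitize(offset, bins)) - 1

adjacency = np.zeros((T, T), dtype=int) - 1      # -1 marks the node itself (no self-edge)
for src in range(T):
    for dst in range(T):
        if src != dst:
            adjacency[src, dst] = temporal_region(src, dst)

# Each row lists, per neighbor, which temporal region it falls in; a structured
# graph module would aggregate neighbor features region by region.
print(adjacency[3])
```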
- Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign Language Recognition [4.059599144668737]
Continuous sign language recognition is a socially significant task that transcribes a sign language video into an ordered gloss sequence.
One promising way is to adopt a one-dimensional convolutional network (1D-CNN) to temporally fuse the sequential frames.
We propose to adaptively fuse local features via temporal similarity for this task.
arXiv Detail & Related papers (2021-07-27T12:06:56Z)
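
The adaptive, similarity-driven fusion mentioned above can be sketched directly: cosine similarities between a centre frame and its local temporal window are turned into softmax weights and used to pool the window, in place of fixed 1D-CNN weights. The window size and similarity measure are assumptions, not the paper's exact design.

```python
# Hedged sketch: fuse frames in a local temporal window with weights derived
# from their similarity to the centre frame (instead of fixed 1D-CNN weights).
import numpy as np

rng = np.random.default_rng(0)
T, D = 12, 64
feats = rng.normal(size=(T, D))

def fuse_window(t, radius=2):
    lo, hi = max(0, t - radius), min(T, t + radius + 1)
    window = feats[lo:hi]
    centre = feats[t]
    sims = window @ centre / (np.linalg.norm(window, axis=1) * np.linalg.norm(centre))
    weights = np.exp(sims) / np.exp(sims).sum()        # adaptive, similarity-driven weights
    return weights @ window

fused = np.stack([fuse_window(t) for t in range(T)])   # (T, D) temporally smoothed features
print(fused.shape)
```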
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
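
The biaffine scoring of all start-end pairs can be made concrete: given per-snippet start and end representations, a bilinear term plus linear terms produces a score for every (start, end) pair, and the best upper-triangular cell gives the predicted segment. The feature sizes and exact parameterisation below are illustrative assumptions.

```python
# Hedged sketch of a biaffine scorer over all (start, end) index pairs:
# score[i, j] = h_s[i]^T W h_e[j] + u . h_s[i] + v . h_e[j] + b
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 32                          # video snippets, feature dim
h_start = rng.normal(size=(T, D))      # start-role representations
h_end = rng.normal(size=(T, D))        # end-role representations

W = rng.normal(size=(D, D)) * 0.1
u = rng.normal(size=D) * 0.1
v = rng.normal(size=D) * 0.1
b = 0.0

scores = h_start @ W @ h_end.T + (h_start @ u)[:, None] + (h_end @ v)[None, :] + b
scores = np.where(np.triu(np.ones((T, T)), k=0) > 0, scores, -np.inf)   # enforce end >= start

start_idx, end_idx = np.unravel_index(np.argmax(scores), scores.shape)
print(f"predicted segment: snippets {start_idx}..{end_idx}")
```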