Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- URL: http://arxiv.org/abs/2404.03645v1
- Date: Thu, 4 Apr 2024 17:58:21 GMT
- Title: Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- Authors: Shuting He, Henghui Ding
- Abstract summary: We propose to decouple video-level referring expression understanding into static and motion perception.
We employ contrastive learning to distinguish the motions of visually similar objects.
These contributions yield state-of-the-art performance across five datasets.
- Score: 32.11635464720755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion cues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot adequately comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J}\&\mathcal{F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
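The abstract does not spell out the implementation, but the two temporal components can be illustrated with a minimal, hypothetical PyTorch sketch: multi-timescale pooling stands in for hierarchical motion perception, and an InfoNCE-style loss stands in for contrastively separating the motions of visually similar objects. The module and function names below are placeholders, not the released DsHmp code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMotionPooling(nn.Module):
    """Toy stand-in for hierarchical motion perception: average per-frame object
    features over windows of several lengths, then fuse the summaries."""
    def __init__(self, dim, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.fuse = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, feats):                        # feats: (N, T, D) per-frame features of N objects
        pooled = []
        for w in self.window_sizes:
            x = F.avg_pool1d(feats.transpose(1, 2), kernel_size=w, stride=w, ceil_mode=True)
            pooled.append(x.mean(dim=-1))            # (N, D) summary at this timescale
        return self.fuse(torch.cat(pooled, dim=-1))  # (N, D) fused motion embedding

def motion_contrastive_loss(obj_motion, text_motion, positive_idx, temperature=0.07):
    """InfoNCE-style loss: the referred object's motion embedding should match the
    motion-cue text embedding better than visually similar distractors do."""
    logits = F.normalize(obj_motion, dim=-1) @ F.normalize(text_motion, dim=-1) / temperature
    target = torch.tensor([positive_idx], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)

# Example: 5 candidate objects, 16 frames, 256-d features; object 3 is the referred one.
motion_emb = HierarchicalMotionPooling(256)(torch.randn(5, 16, 256))   # (5, 256)
loss = motion_contrastive_loss(motion_emb, torch.randn(256), positive_idx=3)
```

Pooling at several window sizes is only one way to expose multiple timescales and is used here purely for illustration; the paper's module operates on the decoupled motion-cue embeddings described above.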
Related papers
- LocoMotion: Learning Motion-Focused Video-Language Representations [45.33444862034461]
We propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions.
We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions.
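As a rough, hypothetical illustration of that idea (the helper below is far simpler than the paper's pipeline and is not its released code), one can overlay a parameterised translation on a clip and derive the caption directly from the motion parameters:

```python
import numpy as np

def add_synthetic_motion(frames, dx_per_frame=4, dy_per_frame=0):
    """Shift each frame by a growing offset so a clip gains a known, parameterised
    motion, and build a caption from those parameters.
    frames: (T, H, W, C) array; returns (shifted clip, caption string)."""
    moved = np.empty_like(frames)
    for t, frame in enumerate(frames):
        moved[t] = np.roll(frame, shift=(t * dy_per_frame, t * dx_per_frame), axis=(0, 1))
    direction = "right" if dx_per_frame >= 0 else "left"
    caption = f"the object moves {direction} at {abs(dx_per_frame)} pixels per frame"
    return moved, caption

clip = np.zeros((8, 128, 128, 3), dtype=np.uint8)      # placeholder video
video, caption = add_synthetic_motion(clip)            # "the object moves right at 4 pixels per frame"
```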
arXiv Detail & Related papers (2024-10-15T19:33:57Z)
- Motion meets Attention: Video Motion Prompts [34.429192862783054]
We propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps.
This approach generates a sequence of attention maps that enhance the processing of motion-related video content.
We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models like SlowFast, X3D, and TimeSformer.
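A minimal sketch of that mechanism, with an assumed layer name and tensor shapes rather than the paper's actual code: frame differencing followed by a sigmoid with learnable slope and shift, used as an attention map over the frames.

```python
import torch
import torch.nn as nn

class MotionPromptLayer(nn.Module):
    """Sketch of a motion-prompt layer: frame differencing, then a sigmoid with
    learnable slope and shift produces attention maps that modulate the frames."""
    def __init__(self, init_slope=1.0, init_shift=0.0):
        super().__init__()
        self.slope = nn.Parameter(torch.tensor(init_slope))
        self.shift = nn.Parameter(torch.tensor(init_shift))

    def forward(self, video):                                   # video: (B, T, C, H, W)
        diff = (video[:, 1:] - video[:, :-1]).abs().mean(dim=2, keepdim=True)  # frame-differencing maps
        attn = torch.sigmoid(self.slope * (diff - self.shift))                 # learnable sigmoid attention
        return video[:, 1:] * attn                              # motion-modulated frames, (B, T-1, C, H, W)

prompted = MotionPromptLayer()(torch.rand(2, 8, 3, 64, 64))     # -> (2, 7, 3, 64, 64)
```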
arXiv Detail & Related papers (2024-07-03T14:59:46Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition [50.345327516891615]
We develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components: a long-short contrastive objective and a motion autodecoder.
MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching.
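The long-short contrastive objective can be sketched roughly as follows, assuming the "long" view is a global clip embedding and the "short" view a sub-clip embedding; the paper's exact formulation and the motion autodecoder are omitted.

```python
import torch
import torch.nn.functional as F

def long_short_contrastive(long_emb, short_emb, temperature=0.1):
    """Pull each video's global (long) embedding toward its own sub-clip (short)
    embedding and away from the other videos' sub-clips in the batch.
    long_emb, short_emb: (B, D)."""
    long_emb = F.normalize(long_emb, dim=-1)
    short_emb = F.normalize(short_emb, dim=-1)
    logits = long_emb @ short_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(long_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

loss = long_short_contrastive(torch.randn(8, 512), torch.randn(8, 512))
```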
arXiv Detail & Related papers (2023-04-03T13:09:39Z)
- InstMove: Instance Motion for Object-centric Video Segmentation [70.16915119724757]
In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video.
In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings.
With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks.
arXiv Detail & Related papers (2023-03-14T17:58:44Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently moving objects captured with an event-based camera.
The method performs on par with or better than the state of the art, without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce the object-oriented video captioning network (OVC-Net), built on a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)