Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2203.02966v1
- Date: Sun, 6 Mar 2022 13:57:09 GMT
- Title: Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding
- Authors: Daizong Liu, Xiang Fang, Wei Hu, Pan Zhou
- Abstract summary: Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
- Score: 61.57847727651068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding aims to localize a target segment in an untrimmed
video semantically according to a given sentence query. Most previous works
focus on learning frame-level features of each whole frame in the entire video,
and directly match them with the textual information. Such frame-level feature
extraction makes it difficult for these methods to distinguish ambiguous video
frames with complicated contents and subtle appearance differences, thus
limiting their performance. In order to differentiate fine-grained appearance
similarities among consecutive frames, some state-of-the-art methods
additionally employ a detection model like Faster R-CNN to obtain detailed
object-level features in each frame for filtering out the redundant background
contents. However, these methods lack motion analysis, since the object
detection module in Faster R-CNN performs no temporal modeling. To alleviate
the above limitations, in this paper, we propose a novel Motion- and
Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates
optical-flow-guided motion-aware, detection-based appearance-aware, and
3D-aware object-level features to better reason about the spatial-temporal
object relations and accurately model the activity across consecutive frames.
Specifically, we first develop three individual branches for motion, appearance,
and 3D encoding to learn fine-grained motion-guided, appearance-guided, and
3D-aware object features, respectively. Then, the motion and appearance
information from the corresponding branches is associated to enhance the
3D-aware features for the final precise grounding. Extensive experiments on
three challenging datasets (ActivityNet Captions, Charades-STA and TACoS)
demonstrate that the proposed MA3SRN model achieves new state-of-the-art
performance.
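Below is a minimal PyTorch sketch of the three-branch fusion idea described in the abstract: separate motion-aware (optical-flow-guided), appearance-aware (detection-based), and 3D-aware object-level features, with the motion and appearance cues used to enhance the 3D-aware features before grounding. The layer choices, feature dimensions, cross-attention fusion, and all names (e.g. MotionAppearance3DFusion) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of motion/appearance-enhanced 3D-aware object features.
# Everything below is an assumption-laden illustration of the idea, not MA3SRN itself.
import torch
import torch.nn as nn


class MotionAppearance3DFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # One lightweight encoder per branch (stand-ins for the paper's motion,
        # appearance, and 3D branches).
        self.motion_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.appear_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.obj3d_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Cross-attention: the 3D-aware object features query motion / appearance cues.
        self.motion_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.appear_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, motion_feat, appear_feat, obj3d_feat):
        """
        motion_feat: (B, T*K, D) optical-flow-guided object features
        appear_feat: (B, T*K, D) detection-based object features
        obj3d_feat:  (B, T*K, D) 3D-aware object features (e.g. from a 3D CNN)
        T frames with K objects each, flattened here for simplicity.
        """
        m = self.motion_enc(motion_feat)
        a = self.appear_enc(appear_feat)
        v = self.obj3d_enc(obj3d_feat)
        # Associate motion and appearance information with the 3D-aware features.
        m_ctx, _ = self.motion_attn(query=v, key=m, value=m)
        a_ctx, _ = self.appear_attn(query=v, key=a, value=a)
        # Fused features would then feed a grounding head (not shown).
        return self.fuse(torch.cat([v, m_ctx, a_ctx], dim=-1))


if __name__ == "__main__":
    B, T, K, D = 2, 8, 5, 256
    fusion = MotionAppearance3DFusion(dim=D)
    feats = [torch.randn(B, T * K, D) for _ in range(3)]
    out = fusion(*feats)   # enhanced 3D-aware object features
    print(out.shape)       # torch.Size([2, 40, 256])
```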
Related papers
- Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences [25.74000325019015]
We introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information.
We have conducted experiments on the aggregation and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance.
arXiv Detail & Related papers (2024-09-06T16:29:04Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - Delving into Motion-Aware Matching for Monocular 3D Object Tracking [81.68608983602581]
We find that the motion cue of objects along different time frames is critical in 3D multi-object tracking.
We propose MoMA-M3T, a framework that mainly consists of three motion-aware components.
We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate our MoMA-M3T achieves competitive performance against state-of-the-art methods.
arXiv Detail & Related papers (2023-08-22T17:53:58Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking
from View Aggregation [8.854112907350624]
3D multi-object tracking plays a vital role in autonomous navigation.
Many approaches detect objects in 2D RGB sequences for tracking, which lacks reliability when localizing objects in 3D space.
We propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in the adjacent frames.
arXiv Detail & Related papers (2020-11-25T16:14:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.