Video Relation Detection with Trajectory-aware Multi-modal Features
- URL: http://arxiv.org/abs/2101.08165v1
- Date: Wed, 20 Jan 2021 14:49:02 GMT
- Title: Video Relation Detection with Trajectory-aware Multi-modal Features
- Authors: Wentao Xie, Guanghui Ren, Si Liu
- Abstract summary: We present a video relation detection method based on trajectory-aware multi-modal features.
Our method won first place on the video relation detection task of the Video Relation Understanding Grand Challenge at ACM Multimedia 2020 with 11.74% mAP.
- Score: 13.358584829993193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The video relation detection problem refers to detecting the relationships between different objects in videos, such as spatial relationships and action relationships. In this paper, we present video relation detection with trajectory-aware multi-modal features to solve this task. Considering the complexity of visual relation detection in videos, we decompose the task into three sub-tasks: object detection, trajectory proposal, and relation prediction. We use a state-of-the-art object detection method to ensure accurate object trajectory detection, and multi-modal feature representations to support the prediction of relations between objects. Our method won first place on the video relation detection task of the Video Relation Understanding Grand Challenge at ACM Multimedia 2020 with 11.74% mAP, surpassing other methods by a large margin.
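The abstract names the three sub-tasks but not their mechanics. As a rough, generic illustration of the middle stage, trajectory proposal can be done by greedily linking per-frame detections with IoU; the sketch below is a minimal stand-in under assumed box formats and thresholds, not the authors' implementation.
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed box format: (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def propose_trajectories(per_frame_dets: List[List[Box]], thr: float = 0.5):
    """Greedily link per-frame detections into trajectories by IoU overlap."""
    tracks: List[List[Tuple[int, Box]]] = []  # each track: [(frame_idx, box), ...]
    for t, dets in enumerate(per_frame_dets):
        live = [tr for tr in tracks if tr[-1][0] == t - 1]  # tracks alive at t-1
        for box in dets:
            best = max(live, key=lambda tr: iou(tr[-1][1], box), default=None)
            if best is not None and iou(best[-1][1], box) >= thr:
                best.append((t, box))   # extend the best-overlapping live track
                live.remove(best)       # at most one detection per track per frame
            else:
                tracks.append([(t, box)])  # otherwise start a new trajectory
    return tracks
```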
Related papers
- End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories.
Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories.
We propose an open-vocabulary relationship detection method that leverages the rich semantic knowledge of CLIP to discover novel relationships.
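As a hedged illustration of using CLIP's semantic knowledge for open-vocabulary predicates (the prompt template, predicate list, and temporal mean pooling below are assumptions for the sketch, not the paper's actual design):
```python
from typing import List
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

PREDICATES = ["riding", "chasing", "sitting next to"]  # illustrative candidates

def score_predicates(subject: str, obj: str, union_crops: List[Image.Image]):
    """Rank candidate predicates for one trajectory pair by CLIP similarity."""
    prompts = clip.tokenize([f"a {subject} {p} a {obj}" for p in PREDICATES]).to(device)
    frames = torch.stack([preprocess(im) for im in union_crops]).to(device)
    with torch.no_grad():
        text = model.encode_text(prompts)
        vis = model.encode_image(frames).mean(dim=0, keepdim=True)  # temporal mean
    text = text / text.norm(dim=-1, keepdim=True)
    vis = vis / vis.norm(dim=-1, keepdim=True)
    return (vis @ text.T).squeeze(0)  # one cosine score per candidate predicate
```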
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
- EGTR: Extracting Graph from Transformer for Scene Graph Generation [5.935927309154952]
Scene Graph Generation (SGG) is the challenging task of detecting objects and predicting relationships between them.
We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder.
We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets.
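A hedged sketch of the stated idea: read the pairwise self-attention values between DETR object queries as relation features and map them to predicate logits with a shallow head. Shapes and the head design here are illustrative assumptions, not EGTR's exact architecture.
```python
import torch
import torch.nn as nn

class AttnRelationHead(nn.Module):
    """Classify predicates from the self-attention between object queries."""
    def __init__(self, num_layers: int, num_heads: int, num_predicates: int):
        super().__init__()
        # One attention value per (layer, head) for each ordered query pair.
        self.classifier = nn.Linear(num_layers * num_heads, num_predicates)

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: [layers, heads, N, N] self-attention over N object queries
        L, H, N, _ = attn_maps.shape
        pair_feats = attn_maps.permute(2, 3, 0, 1).reshape(N, N, L * H)
        return self.classifier(pair_feats)  # [N, N, num_predicates] relation logits
```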
arXiv Detail & Related papers (2024-04-02T16:20:02Z)
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
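One plausible reading of "models their relationships implicitly" is a pairwise score computed directly from the object tokens; the sketch below is an assumed, simplified form, not the paper's exact relationship layer.
```python
import torch
import torch.nn as nn

class PairwiseRelScore(nn.Module):
    """Score subject->object pairs from encoder tokens, without a decoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.subj = nn.Linear(dim, dim)  # subject-role projection
        self.obj = nn.Linear(dim, dim)   # object-role projection

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [N, dim] object tokens from the image encoder
        s, o = self.subj(tokens), self.obj(tokens)
        return s @ o.T / tokens.shape[-1] ** 0.5  # [N, N] scaled pair scores
```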
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Multi-Task Learning based Video Anomaly Detection with Attention [1.2944868613449219]
We propose a novel multi-task learning based method that combines complementary proxy tasks to better consider the motion and appearance features.
We combine the semantic segmentation and future frame prediction tasks in a single branch to learn object classes and consistent motion patterns.
In the second branch, we add several attention mechanisms to detect motion anomalies with attention to object parts, the direction of motion, and the distance of the objects from the camera.
arXiv Detail & Related papers (2022-10-14T10:40:20Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatio-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey [71.10448142010422]
Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories.
Embedding methods play an essential role in object location estimation and temporal identity association in MOT.
We first conduct a comprehensive overview with in-depth analysis of embedding methods in MOT from seven different perspectives.
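As a concrete, generic example of embedding-based temporal identity association (not taken from any single surveyed method), detections can be matched to existing tracks by Hungarian assignment over cosine distances, with an assumed gating threshold:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_emb: np.ndarray, det_emb: np.ndarray, max_dist: float = 0.4):
    """Match track embeddings [T, D] to detection embeddings [N, D]."""
    # Normalize rows, then cosine distance = 1 - cosine similarity.
    t = track_emb / np.linalg.norm(track_emb, axis=1, keepdims=True)
    d = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T
    rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
    # Keep only sufficiently similar pairs; unmatched detections start new tracks.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```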
arXiv Detail & Related papers (2022-05-22T06:54:33Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
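A minimal sketch of such a two-branch design, with separate self-attention over motion and appearance features followed by a learned fusion; the layer choices and sizes are assumptions, not MARN's actual architecture.
```python
import torch
import torch.nn as nn

class TwoBranchRelations(nn.Module):
    """Separate motion/appearance relation branches with late fusion."""
    def __init__(self, dim: int):
        super().__init__()  # dim assumed divisible by num_heads
        self.motion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.appearance = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, motion_feats: torch.Tensor, app_feats: torch.Tensor):
        # both inputs: [batch, num_objects, dim]
        m, _ = self.motion(motion_feats, motion_feats, motion_feats)
        a, _ = self.appearance(app_feats, app_feats, app_feats)
        return self.fuse(torch.cat([m, a], dim=-1))  # fused object relations
```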
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
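The co-attention formulation is only named, not specified, here; below is one simple mutual-gating variant, assuming the low- and high-level features have already been resized to a common shape (the paper's actual formulation may differ).
```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Fuse low- and high-level features by letting each stream gate the other."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj_low = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_high = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: [B, C, H, W], assumed already aligned in shape
        gate_low = torch.sigmoid(self.proj_high(high))   # high-level guides low-level
        gate_high = torch.sigmoid(self.proj_low(low))    # low-level guides high-level
        return low * gate_low + high * gate_high
```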
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Relational Visual-Linguistic BERT (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
- Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding [90.12181414070496]
We propose a novel object-aware multi-branch relation network for relation discovery.
We then propose multi-branch reasoning to capture critical object relationships between the main branch and auxiliary branches.
arXiv Detail & Related papers (2020-08-16T15:39:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.