Siamese Tracking with Lingual Object Constraints
- URL: http://arxiv.org/abs/2011.11721v1
- Date: Mon, 23 Nov 2020 20:55:08 GMT
- Title: Siamese Tracking with Lingual Object Constraints
- Authors: Maximilian Filtenborg, Efstratios Gavves, Deepak Gupta
- Abstract summary: This paper explores tracking visual objects subject to additional lingual constraints.
Unlike Li et al., we impose additional lingual constraints on tracking, which enables new applications of tracking.
Our method enables the selective compression of videos, based on the validity of the constraint.
- Score: 28.04334832366449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classically, visual object tracking involves following a target object
throughout a given video, and it provides us the motion trajectory of the
object. However, for many practical applications, this output is often
insufficient since additional semantic information is required to act on the
video material. Example applications of this are surveillance and
target-specific video summarization, where the target needs to be monitored
with respect to certain predefined constraints, e.g., 'when standing near a
yellow car'. This paper explores tracking visual objects subject to
additional lingual constraints. Unlike Li et al., whose goal is to improve
and extend tracking itself, we impose additional lingual constraints on
tracking, which enables new applications of
tracking. To perform benchmarks and experiments, we contribute two datasets:
c-MOT16 and c-LaSOT, curated through appending additional constraints to the
frames of the original LaSOT and MOT16 datasets. We also experiment with two
deep models, SiamCT-DFG and SiamCT-CA, obtained by extending a recent
state-of-the-art Siamese tracking method with modules inspired by the
fields of natural language processing and visual question answering. Through
experimental results, we show that the proposed model SiamCT-CA can
significantly outperform its counterparts. Furthermore, our method enables the
selective compression of videos, based on the validity of the constraint.
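The selective-compression idea can be illustrated with a minimal sketch: given per-frame constraint-validity scores produced by a tracker, keep only the frames where the lingual constraint is judged valid. The function name, score format, and threshold below are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of constraint-based selective compression.
# `validity_scores` is assumed to hold one score per frame in [0, 1],
# e.g. the model's confidence that the constraint ("standing near a
# yellow car") holds in that frame. The threshold is an assumption.

def select_frames(validity_scores, threshold=0.5):
    """Return indices of frames whose constraint-validity score meets
    the threshold; the remaining frames could be dropped or stored at
    lower quality to compress the video."""
    return [i for i, score in enumerate(validity_scores) if score >= threshold]

# Example: six frames, of which only frames 1, 2, and 5 satisfy the constraint.
kept = select_frames([0.1, 0.8, 0.9, 0.3, 0.2, 0.7])
```

In a real pipeline, the kept indices would drive a video encoder (e.g. writing only the selected frames, or re-encoding the rest at a lower bitrate); this sketch covers only the frame-selection step.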
Related papers
- Appearance-based Refinement for Object-Centric Motion Segmentation [95.80420062679104]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a simple selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
- Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multi-modal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- OmniTracker: Unifying Object Tracking by Tracking-with-Detection [119.51012668709502]
OmniTracker resolves all tracking tasks with a fully shared network architecture, model weights, and inference pipeline.
Experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.