EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset
- URL: http://arxiv.org/abs/2301.03213v5
- Date: Sun, 1 Oct 2023 22:54:53 GMT
- Title: EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset
- Authors: Hao Tang, Kevin Liang, Matt Feiszli, Weiyao Wang
- Abstract summary: Embodied tracking is a key component to many egocentric vision problems.
EgoTracks is a new dataset for long-term egocentric visual object tracking.
We show improvements that can be made to a STARK tracker to significantly increase its performance on egocentric data.
- Score: 19.496721051685135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual object tracking is a key component to many egocentric vision problems.
However, the full spectrum of challenges of egocentric tracking faced by an
embodied AI is underrepresented in many existing datasets; these tend to focus
on relatively short, third-person videos. Egocentric video has several
distinguishing characteristics from those commonly found in past datasets:
frequent large camera motions and hand interactions with objects commonly lead
to occlusions or objects exiting the frame, and object appearance can change
rapidly due to widely different points of view, scale, or object states.
Embodied tracking is also naturally long-term, and being able to consistently
(re-)associate objects to their appearances and disappearances over as long as
a lifetime is critical. Previous datasets under-emphasize this re-detection
problem, and their "framed" nature has led to adoption of various
spatiotemporal priors that we find do not necessarily generalize to egocentric
video. We thus introduce EgoTracks, a new dataset for long-term egocentric
visual object tracking. Sourced from the Ego4D dataset, this new dataset
presents a significant challenge to recent state-of-the-art single-object
tracking models, which we find score poorly on traditional tracking metrics for
our new dataset, compared to popular benchmarks. We further show improvements
that can be made to a STARK tracker to significantly increase its performance
on egocentric data, resulting in a baseline model we call EgoSTARK. We publicly
release our annotations and benchmark, hoping our dataset leads to further
advancements in tracking.
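As a rough illustration of the kind of "traditional tracking metrics" the abstract refers to, the sketch below computes the IoU-based success rate averaged over overlap thresholds (the AUC of the success plot) commonly reported on single-object tracking benchmarks. This is a generic illustration, not code from the paper; function names and the box format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2 = min(ax + aw, bx + bw)
    iy2 = min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=None):
    """Fraction of frames whose IoU exceeds each threshold,
    averaged over thresholds -- the AUC of the success plot."""
    if thresholds is None:
        thresholds = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)
```

A tracker that loses the target (IoU near zero on many frames, as is common under the occlusions and frame exits described above) is penalized at every threshold, which is why re-detection failures depress this score so strongly.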
Related papers
- Tracking Reflected Objects: A Benchmark [12.770787846444406]
We introduce TRO, a benchmark specifically for Tracking Reflected Objects.
TRO includes 200 sequences with around 70,000 frames, each carefully annotated with bounding boxes.
To provide a stronger baseline, we propose a new tracker, HiP-HaTrack, which uses hierarchical features to improve performance.
arXiv Detail & Related papers (2024-07-07T02:22:45Z)
- EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding [11.9023437362986]
EgoObjects is a large-scale egocentric dataset for fine-grained object understanding.
The pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices.
EgoObjects also annotates each object with an instance-level identifier.
arXiv Detail & Related papers (2023-09-15T23:55:43Z)
- DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion [56.1428110894411]
We propose a large-scale dataset for multi-human tracking, where humans have similar appearance, diverse motion and extreme articulation.
As the dataset contains mostly group dancing videos, we name it "DanceTrack".
We benchmark several state-of-the-art trackers on our dataset and observe a significant performance drop on DanceTrack when compared against existing benchmarks.
arXiv Detail & Related papers (2021-11-29T16:49:06Z)
- Real Time Egocentric Object Segmentation: THU-READ Labeling and Benchmarking Results [0.0]
Egocentric segmentation has attracted recent interest in the computer vision community due to its potential in Mixed Reality (MR) applications.
We contribute with a semantic-wise labeling of a subset of 2124 images from the RGB-D THU-READ dataset.
We also report benchmarking results using Thundernet, a real-time semantic segmentation network, which could allow future integration with end-to-end MR applications.
arXiv Detail & Related papers (2021-06-09T10:10:02Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
- Learning Target Candidate Association to Keep Track of What Not to Track [100.80610986625693]
We propose to keep track of distractor objects in order to continue tracking the target.
To tackle the problem of lacking ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision.
Our tracker sets a new state-of-the-art on six benchmarks, achieving an AUC score of 67.2% on LaSOT and a +6.1% absolute gain on the OxUvA long-term dataset.
arXiv Detail & Related papers (2021-03-30T17:58:02Z)
- Learning to Track with Object Permanence [61.36492084090744]
We introduce an end-to-end trainable approach for joint object detection and tracking.
Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets.
arXiv Detail & Related papers (2021-03-26T04:43:04Z)
- SoDA: Multi-Object Tracking with Soft Data Association [75.39833486073597]
Multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars.
We propose a novel approach to MOT that uses attention to compute track embeddings that encode dependencies between observed objects.
arXiv Detail & Related papers (2020-08-18T03:40:25Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.