Negative Frames Matter in Egocentric Visual Query 2D Localization
- URL: http://arxiv.org/abs/2208.01949v1
- Date: Wed, 3 Aug 2022 09:54:51 GMT
- Title: Negative Frames Matter in Egocentric Visual Query 2D Localization
- Authors: Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel
Perez-Rua, Tao Xiang
- Abstract summary: The recently released Ego4D dataset and benchmark significantly scale and diversify first-person visual perception data.
The Visual Queries 2D Localization task aims to retrieve objects that appeared in the past from a first-person recording.
Our study builds on the three-stage baseline introduced in the Episodic Memory benchmark.
- Score: 119.23191388798921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently released Ego4D dataset and benchmark significantly scale
and diversify first-person visual perception data. In Ego4D, the Visual
Queries 2D Localization task aims to retrieve objects that appeared in the past
from a first-person recording. This task requires a system to
spatially and temporally localize the most recent appearance of a given object
query, where the query is registered as a single tight visual crop of the object
taken in a different scene.
Our study builds on the three-stage baseline introduced in the Episodic
Memory benchmark. The baseline solves the problem by detection and tracking:
it detects objects similar to the query in all frames, then runs a tracker from
the most confident detection. In the VQ2D challenge, we identified two
limitations of this baseline. (1) The training configuration involves
redundant computation. Although the training set has millions of instances,
most of them are repetitive, and the number of unique objects is only around
14.6k. Repeatedly computing gradients on the same objects makes training
inefficient. (2) The false positive rate is high on background frames.
This is due to the distribution gap between training and evaluation: during
training, the model only sees clean, stable, and labeled frames, whereas
egocentric videos also contain noisy, blurry, or unlabeled background
frames. To this end, we developed a more efficient and effective solution.
Concretely, we bring the training loop from ~15 days down to less than 24 hours,
and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline.
Our solution ranked first on the public leaderboard. Our code is
publicly available at https://github.com/facebookresearch/vq2d_cvpr.
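To make the detect-then-track flow summarized in the abstract concrete, here is a minimal Python sketch. The `Detection` dataclass and the `score_frame` / `run_tracker` callables are hypothetical placeholders for illustration, not the actual API of the vq2d_cvpr repository.

```python
# Minimal sketch of the three-stage detect-then-track baseline summarized in
# the abstract above. All names here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    frame_idx: int
    box: Box
    score: float  # similarity between the detected box and the visual query crop


def detect_then_track(
    frames: Sequence[np.ndarray],
    query_crop: np.ndarray,
    score_frame: Callable[[np.ndarray, np.ndarray], Optional[Detection]],
    run_tracker: Callable[[Sequence[np.ndarray], Detection], List[Box]],
    score_threshold: float = 0.5,
) -> List[Box]:
    """Stage 1: score every frame against the query crop.
    Stage 2: keep the most confident detection as the anchor.
    Stage 3: run a tracker from the anchor to recover the response track."""
    # Stage 1: per-frame similarity detection. Unlabeled background frames are
    # exactly where the baseline produces its false positives.
    detections: List[Detection] = []
    for idx, frame in enumerate(frames):
        det = score_frame(frame, query_crop)
        if det is not None and det.score >= score_threshold:
            det.frame_idx = idx
            detections.append(det)

    if not detections:
        return []  # the query object was never confidently detected

    # Stage 2: anchor on the most confident detection.
    anchor = max(detections, key=lambda d: d.score)

    # Stage 3: a single-object tracker expands the anchor into a full track.
    return run_tracker(frames, anchor)
```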
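The second limitation above (false positives on background frames) motivates the idea behind the paper's title. The sketch below illustrates one plausible form of it, assuming unlabeled background frames are mixed into each training batch as pure negatives; the sampling ratio, batch size, and field names are illustrative assumptions, not the authors' exact recipe.

```python
# Minimal sketch of "negative frames matter": mix unlabeled background frames
# into each training batch as pure negatives, so the detector also sees the
# noisy, blurry frames it will face at evaluation time. Ratios and field names
# are illustrative assumptions only.
import random
from typing import Any, Dict, List


def sample_training_batch(
    labeled_frames: List[Dict[str, Any]],     # frames annotated with the query-object box
    background_frames: List[Dict[str, Any]],  # unlabeled frames without the query object
    batch_size: int = 16,
    negative_ratio: float = 0.5,              # assumed fraction of negatives per batch
) -> List[Dict[str, Any]]:
    """Build one training batch containing both positive and negative frames."""
    num_neg = int(batch_size * negative_ratio)
    num_pos = batch_size - num_neg

    batch = random.sample(labeled_frames, num_pos)
    # Negative frames contribute only "background" targets: every proposal on
    # them should be scored as not matching the visual query.
    batch += [dict(frame, is_negative=True)
              for frame in random.sample(background_frames, num_neg)]
    random.shuffle(batch)
    return batch
```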
Related papers
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with
Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z)
- Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual
  Query Localization [119.23191388798921]
This paper deals with the problem of localizing objects in image and video datasets from visual exemplars.
We first identify grave implicit biases in current query-conditioned model design and visual query datasets.
We propose a novel transformer-based module that allows for object-proposal set context to be considered.
arXiv Detail & Related papers (2022-11-18T22:50:50Z)
- Single Object Tracking through a Fast and Effective Single-Multiple
  Model Convolutional Neural Network [0.0]
Recent state-of-the-art (SOTA) approaches rely on a matching network with a heavy structure to distinguish the target from other objects in the scene.
In this article, a special architecture is proposed that, in contrast to previous approaches, makes it possible to identify the object location in a single shot.
The presented tracker performs comparably to the SOTA in challenging situations while running far faster (up to 120 FPS on a 1080 Ti).
arXiv Detail & Related papers (2021-03-28T11:02:14Z)
- Detecting Invisible People [58.49425715635312]
We re-purpose tracking benchmarks and propose new metrics for the task of detecting invisible objects.
We demonstrate that current detection and tracking systems perform dramatically worse on this task.
We also build dynamic models that explicitly reason in 3D, making use of observations produced by state-of-the-art monocular depth estimation networks.
arXiv Detail & Related papers (2020-12-15T16:54:45Z)
- Factor Graph based 3D Multi-Object Tracking in Point Clouds [8.411514688735183]
We propose a novel optimization-based approach that does not rely on explicit and fixed assignments.
We demonstrate its performance on the real world KITTI tracking dataset and achieve better results than many state-of-the-art algorithms.
arXiv Detail & Related papers (2020-08-12T13:34:46Z)
- Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance
  Disparity Estimation [51.17232267143098]
We propose a novel system named Disp R-CNN for 3D object detection from stereo images.
We use a statistical shape model to generate dense disparity pseudo-ground-truth without the need for LiDAR point clouds.
Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision.
arXiv Detail & Related papers (2020-04-07T17:48:45Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
arXiv Detail & Related papers (2020-04-07T17:15:37Z)