Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual
Query Localization
- URL: http://arxiv.org/abs/2211.10528v2
- Date: Thu, 6 Apr 2023 09:21:18 GMT
- Title: Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual
Query Localization
- Authors: Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang,
Juan-Manuel Perez-Rua
- Abstract summary: This paper deals with the problem of localizing objects in image and video datasets from visual exemplars.
We first identify grave implicit biases in current query-conditioned model design and visual query datasets.
We propose a novel transformer-based module that allows for object-proposal set context to be considered.
- Score: 119.23191388798921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper deals with the problem of localizing objects in image and video
datasets from visual exemplars. In particular, we focus on the challenging
problem of egocentric visual query localization. We first identify grave
implicit biases in current query-conditioned model design and visual query
datasets. Then, we directly tackle such biases at both frame and object set
levels. Concretely, our method solves these issues by expanding limited
annotations and dynamically dropping object proposals during training.
Additionally, we propose a novel transformer-based module that allows for
object-proposal set context to be considered while incorporating query
information. We name our module Conditioned Contextual Transformer or
CocoFormer. Our experiments show the proposed adaptations improve egocentric
query detection, leading to a better visual query localization system in both
2D and 3D configurations. Thus, we are able to improve frame-level detection
performance from 26.28% to 31.26% AP, which correspondingly improves the VQ2D
and VQ3D localization scores by significant margins. Our improved context-aware
query object detector ranked first and second in the VQ2D and VQ3D tasks in the
2nd Ego4D challenge. In addition to this, we showcase the relevance of our
proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA
results. Our code is available at
https://github.com/facebookresearch/vq2d_cvpr.
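The abstract describes two concrete ingredients: dynamically dropping object proposals during training and a transformer-based module (CocoFormer) that reasons over the proposal set conditioned on the visual query. The snippet below is a minimal PyTorch sketch of those two ideas only, not the authors' implementation from the linked repository; the module names, feature dimensions, scoring head, and the query-as-extra-token conditioning are assumptions made purely for illustration.

# Hedged sketch (not the authors' code): (i) random proposal dropping at train time,
# (ii) a query-conditioned transformer over the proposal set.
import torch
import torch.nn as nn


def drop_proposals(proposal_feats: torch.Tensor, keep_prob: float = 0.7) -> torch.Tensor:
    """Randomly drop rows of an (N, D) per-image proposal feature set during training."""
    if keep_prob >= 1.0:
        return proposal_feats
    keep_mask = torch.rand(proposal_feats.size(0), device=proposal_feats.device) < keep_prob
    if keep_mask.sum() == 0:  # always keep at least one proposal
        keep_mask[torch.randint(0, proposal_feats.size(0), (1,))] = True
    return proposal_feats[keep_mask]


class ConditionedSetHead(nn.Module):
    """Toy query-conditioned set head: self-attention over the proposal set,
    with the visual-query embedding injected as an extra token (an assumption,
    one of several possible ways to condition on the query)."""

    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.score = nn.Linear(dim, 1)  # per-proposal match score w.r.t. the query

    def forward(self, proposals: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, D) proposal features; query: (B, D) visual-query embedding
        tokens = torch.cat([query.unsqueeze(1), proposals], dim=1)  # prepend query token
        ctx = self.encoder(tokens)[:, 1:]  # context-aware proposal features
        return self.score(ctx).squeeze(-1)  # (B, N) match logits


# Usage sketch: per-image proposals would pass through drop_proposals before being
# padded/batched into the set head at training time.
head = ConditionedSetHead()
scores = head(torch.randn(2, 32, 256), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 32])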
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information of RGB images and LiDAR point clouds at the points of interest (PoIs)
Our approach maintains the view of each modality and obtains multi-modal features via computation-friendly projection and fusion.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hampers the generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
arXiv Detail & Related papers (2024-02-24T23:31:34Z) - Video Referring Expression Comprehension via Transformer with
Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a natural-language query.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z) - V-DETR: DETR with Vertex Relative Position Encoding for 3D Object
Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z) - EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with
Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z) - Towards Explainable 3D Grounded Visual Question Answering: A New
Benchmark and Strong Baseline [35.717047755880536]
The 3D visual question answering (VQA) task is less explored and more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
arXiv Detail & Related papers (2022-09-24T15:09:02Z)
- Negative Frames Matter in Egocentric Visual Query 2D Localization [119.23191388798921]
The recently released Ego4D dataset and benchmark significantly scale and diversify first-person visual perception data.
The Visual Queries 2D Localization task aims to retrieve objects that appeared in the past from first-person recordings.
Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark.
arXiv Detail & Related papers (2022-08-03T09:54:51Z)
- Deformable PV-RCNN: Improving 3D Object Detection with Learned Deformations [11.462554246732683]
We present Deformable PV-RCNN, a high-performing point-cloud based 3D object detector.
We present a proposal refinement module inspired by 2D deformable convolution networks.
We show state-of-the-art results on the KITTI dataset.
arXiv Detail & Related papers (2020-08-20T04:11:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.