RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
- URL: http://arxiv.org/abs/2412.01826v1
- Date: Mon, 02 Dec 2024 18:59:53 GMT
- Title: RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
- Authors: Savya Khosla, Sethuraman T V, Alexander Schwing, Derek Hoiem,
- Abstract summary: RELOCATE is a training-free baseline designed to perform the challenging task of visual query localization in long videos.
To eliminate the need for task-specific training, RELOCATE leverages a region-based representation derived from pretrained vision models.
- Score: 55.74675012171316
- License:
- Abstract: We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - Semantic Object-level Modeling for Robust Visual Camera Relocalization [14.998133272060695]
We propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects.
All of these modules are entirely intergrated into visual SLAM system.
arXiv Detail & Related papers (2024-02-10T13:39:44Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual
Query Localization [119.23191388798921]
This paper deals with the problem of localizing objects in image and video datasets from visual exemplars.
We first identify grave implicit biases in current query-conditioned model design and visual query datasets.
We propose a novel transformer-based module that allows for object-proposal set context to be considered.
arXiv Detail & Related papers (2022-11-18T22:50:50Z) - A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic
Search [71.14527779661181]
Visual room rearrangement evaluates an agent's ability to rearrange objects based solely on visual input.
We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete.
arXiv Detail & Related papers (2022-06-21T02:33:57Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for
Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.