A Simple and Better Baseline for Visual Grounding
- URL: http://arxiv.org/abs/2510.10587v1
- Date: Sun, 12 Oct 2025 13:06:59 GMT
- Title: A Simple and Better Baseline for Visual Grounding
- Authors: Jingchao Wang, Wenlong Zhang, Dingjiang Huang, Hong Wang, Yefeng Zheng
- Abstract summary: We propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures. We introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction.
- Score: 41.76403278559263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, there is a latest research line that focuses on only selecting the linguistic-relevant visual regions for object localization to reduce the computational overhead. Albeit achieving impressive performance, it is iteratively performed on different image scales, and at every iteration, linguistic features and visual features need to be stored in a cache, incurring extra overhead. To facilitate the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between linguistic modal and visual modal for extracting effective visual features. Furthermore, to reduce the computational cost, during the visual feature learning, we introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency beyond the current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
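The similarity-based feature selection mechanism described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the authors' actual implementation: it assumes visual features are a set of patch tokens, the language guidance is a pooled text embedding, and "language-related" tokens are the top-k by cosine similarity. Function and parameter names (`select_language_relevant_tokens`, `keep_ratio`) are hypothetical.

```python
import numpy as np

def select_language_relevant_tokens(visual_tokens, lang_embedding, keep_ratio=0.5):
    """Hypothetical sketch of similarity-based visual token selection.

    visual_tokens: (N, D) array of visual patch features
    lang_embedding: (D,) pooled language feature
    Returns the keep_ratio fraction of tokens most similar to the language,
    plus their indices, so downstream layers process fewer tokens.
    """
    # Cosine similarity between each visual token and the language embedding
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    l = lang_embedding / np.linalg.norm(lang_embedding)
    sims = v @ l  # (N,) similarity scores

    # Keep only the top-k most language-relevant visual tokens
    k = max(1, int(len(sims) * keep_ratio))
    keep_idx = np.argsort(-sims)[:k]
    return visual_tokens[keep_idx], keep_idx

# Usage: 196 patch tokens of dim 256; keep the most language-relevant half
vis = np.random.randn(196, 256)
txt = np.random.randn(256)
kept, idx = select_language_relevant_tokens(vis, txt, keep_ratio=0.5)
print(kept.shape)  # (98, 256)
```

Dropping the less relevant half of the visual tokens before prediction is what gives a selection scheme like this its speedup; the paper's single-pass design additionally avoids caching features across iterations.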
Related papers
- Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z) - More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO)
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z) - Multi-Granularity Language-Guided Training for Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z) - Exploring Part-Informed Visual-Language Learning for Person Re-Identification [52.92511980835272]
We propose Part-Informed Visual-language Learning (π-VL) to enhance fine-grained visual features with part-informed language supervisions for ReID tasks. π-VL introduces a human parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. As a plug-and-play and inference-free solution, our π-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z) - Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z) - From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.