Language-Guided Diffusion Model for Visual Grounding
- URL: http://arxiv.org/abs/2308.09599v3
- Date: Tue, 25 Feb 2025 14:41:29 GMT
- Title: Language-Guided Diffusion Model for Visual Grounding
- Authors: Sijia Chen, Baochun Li
- Abstract summary: Existing approaches complete such visual-text reasoning in a single-step manner. We propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason about queried object boxes. Experiments on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way.
- Score: 33.714789952452094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding (VG) tasks involve explicit cross-modal alignment, as the image regions that semantically correspond to the given language phrases must be located. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance, however, depends heavily on large-scale anchors and over-designed multi-modal fusion modules built on human priors, leading to complicated frameworks that can be difficult to train and prone to overfitting to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of continuously refining boxes to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process via denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason about queried object boxes by denoising a set of noisy boxes under language guidance. To achieve this, LG-DVG gradually perturbs query-aligned ground-truth boxes into noisy ones and reverses this process step by step, conditioned on query semantics. Extensive experiments with our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source code is available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.
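The training loop implied by the abstract, perturb query-aligned ground-truth boxes and learn to denoise them conditioned on the query, is compact enough to sketch. The PyTorch fragment below is a minimal, hypothetical illustration only: the linear noise schedule, the names (`q_sample`, `DenoisingHead`), and the plain L1 regression loss are assumptions, image features are omitted for brevity, and the authors' actual implementation is in the linked repository.

```python
# Hypothetical sketch of diffusion-based box denoising for visual grounding.
# Not the official LG-DVG code; see the repository linked above for that.
import torch

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # standard linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(boxes, t, noise):
    """Forward process: perturb ground-truth boxes (normalized cx, cy, w, h)."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1)
    b = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1)
    return a * boxes + b * noise            # (B, N, 4)

class DenoisingHead(torch.nn.Module):
    """Predicts clean boxes from noisy ones, conditioned on query semantics.
    In the real model, visual features would also condition this step."""
    def __init__(self, d_model=256):
        super().__init__()
        self.box_proj = torch.nn.Linear(4, d_model)
        self.fuse = torch.nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.out = torch.nn.Linear(d_model, 4)

    def forward(self, noisy_boxes, text_feats):
        h = self.box_proj(noisy_boxes)                 # (B, N, d_model)
        h, _ = self.fuse(h, text_feats, text_feats)    # language-guided fusion
        return self.out(h).sigmoid()                   # predicted clean boxes

def training_step(model, gt_boxes, text_feats):
    """One step: sample t, noise the GT boxes, regress them back."""
    B = gt_boxes.size(0)
    t = torch.randint(0, T, (B,))
    noisy = q_sample(gt_boxes, t, torch.randn_like(gt_boxes))
    pred = model(noisy, text_feats)
    return torch.nn.functional.l1_loss(pred, gt_boxes)
```

At inference, the same head would be applied iteratively, starting from random boxes and refining them over several denoising steps, which is what enables the continuous box refinement the abstract contrasts with single-step methods.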
Related papers
- RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation [71.2136732268131]
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions. Existing RGBT trackers rely solely on initial-frame visual information for target modeling. We propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking.
arXiv Detail & Related papers (2026-03-04T01:02:04Z) - Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z) - Multimodal Latent Reasoning via Hierarchical Visual Cues Injection [16.779425236020433]
This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. We show that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
arXiv Detail & Related papers (2026-02-05T06:31:12Z) - Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge [16.958159611661813]
Latent Denoising Diffusion Bridge Model (LDDBM) is a general-purpose framework for modality translation (MT). By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis.
arXiv Detail & Related papers (2025-10-23T17:59:54Z) - VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [22.907814548315468]
We propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception.
By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations.
Our method consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
arXiv Detail & Related papers (2025-03-10T16:49:35Z) - NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning [22.60247555240363]
This paper explores the challenges facing methods that require reasoning akin to human cognition.
We propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning.
Our results show that NAVER achieves SoTA performance compared with recent end-to-end and compositional baselines.
arXiv Detail & Related papers (2025-02-01T09:19:08Z) - Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces.
We directly leverage natural language prompts and image captions to map latent directions.
Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z) - DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution [54.05367433562495]
DynRefer pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition.
During training, DynRefer aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region.
Experiments show that DynRefer brings mutual improvements across a broad set of tasks, including region-level captioning, open-vocabulary region recognition, and detection.
arXiv Detail & Related papers (2024-05-25T05:44:55Z) - HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding [80.85164509232261]
HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm.
HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner.
arXiv Detail & Related papers (2024-04-20T14:57:31Z) - Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation [2.104191333263349]
Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain visual features and linguistic features.
This modality gap stems from the domain gap between the single-modal pre-trained backbones used in current visual grounding methods.
We propose an Empowering Pre-trained Model for Visual Grounding framework, which distills a multimodal pre-trained model to guide the visual grounding task.
arXiv Detail & Related papers (2023-12-29T15:32:11Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features of natural language.
Specifically, we design a linguistically informed forward process that corrupts the text through strategic soft-masking to better noise the textual data (a minimal sketch of this idea appears after this list).
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z) - Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z) - Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG), which guides all the words to incorporate valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)
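One concrete mechanism from the list above, the soft-masked forward process of Masked-Diffuse LM, lends itself to a short illustration. The sketch below is hypothetical: the importance weighting, the linear schedule, and the choice to corrupt lower-importance tokens earlier are all assumptions made for illustration, not details taken from that paper.

```python
# Hypothetical sketch of a soft-masked forward (noising) process for text
# diffusion. All design choices here (linear schedule, importance weighting,
# corruption order) are illustrative assumptions, not the paper's exact method.
import torch

def soft_mask_noise(token_embs: torch.Tensor,  # (B, L, D) token embeddings
                    mask_emb: torch.Tensor,    # (D,) embedding of a [MASK] token
                    importance: torch.Tensor,  # (B, L) word importance in [0, 1]
                    t: int, T: int) -> torch.Tensor:
    """Softly interpolate token embeddings toward the [MASK] embedding.

    Corruption strength grows with the step t; in this sketch,
    lower-importance tokens reach full masking earlier than
    high-importance ones.
    """
    strength = torch.clamp(t / T + (1.0 - importance) * 0.5, 0.0, 1.0)
    strength = strength.unsqueeze(-1)          # (B, L, 1), broadcasts over D
    return (1.0 - strength) * token_embs + strength * mask_emb
```

A denoising network would then be trained to recover the original tokens from these partially masked inputs, as in standard diffusion training but with soft-masking playing the role of Gaussian noise.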
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.