CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
- URL: http://arxiv.org/abs/2510.06243v1
- Date: Fri, 03 Oct 2025 08:50:21 GMT
- Title: CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
- Authors: Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu,
- Abstract summary: CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure.<n>We restructure the training data to enforce a new output form, providing new annotations for existing datasets.<n>We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
- Score: 67.18702329644526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for Multimodal Large Language Models (MLLMs) capabilities. To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data structure. Our approach systematically parses textual structures to a sequential referring step, where in each step it identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable increase of 2.5%+ over baseline models.
Related papers
- REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation [10.151027538362259]
REVEALER is a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning.<n>Our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments.
arXiv Detail & Related papers (2025-12-29T03:24:09Z) - STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment [24.80531387685099]
We propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance.<n>First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned.<n>Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations.
arXiv Detail & Related papers (2025-08-28T16:04:39Z) - AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.<n>AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z) - UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning [8.095763327154335]
We present our approach to the PerAnsSumm Shared Task, which involves perspective span identification and perspective-aware summarization.<n>For span identification, we adopt ensemble learning that integrates three transformer models through averaging to exploit individual model strengths.<n>For summarization, we design a suite of Chain-of-Thought (CoT) prompting strategies that incorporate keyphrases and guide information to structure summary generation into manageable steps.
arXiv Detail & Related papers (2025-03-14T06:29:51Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.<n>It serves as an essential testing ground for Multimodal Large Language Models (MLLMs)
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - Contextualizing Search Queries In-Context Learning for Conversational Rewriting with LLMs [0.0]
This paper introduces Prompt-Guided In-Context Learning, a novel approach for few-shot conversational query rewriting.<n>Our method employs carefully designed prompts, incorporating task descriptions, input/output format specifications, and a small set of illustrative examples.<n>Experiments on benchmark datasets, TREC and Taskmaster-1, demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-02-20T20:02:42Z) - Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.<n>We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.<n>We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness [3.2925222641796554]
"pointer-guided segment ordering" (SO) is a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations.
Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures.
arXiv Detail & Related papers (2024-06-06T15:17:51Z) - Contextualization Distillation from Large Language Model for Knowledge
Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - Exploiting Contextual Target Attributes for Target Sentiment
Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Improving Meta-learning for Low-resource Text Classification and
Generation via Memory Imitation [87.98063273826702]
We propose a memory imitation meta-learning (MemIML) method that enhances the model's reliance on support sets for task adaptation.
A theoretical analysis is provided to prove the effectiveness of our method.
arXiv Detail & Related papers (2022-03-22T12:41:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.