Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2506.04034v1
- Date: Wed, 04 Jun 2025 14:56:57 GMT
- Title: Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
- Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
- Abstract summary: Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. We propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task.
- Score: 30.76576641002295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase that teaches the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating an improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
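The abstract outlines a candidate-then-verify pipeline: detect every instance of the referred category, reason over each candidate, and abstain when nothing matches. The sketch below restates that control flow in Python; the helper names (detect_candidates, cot_verify, refer) are hypothetical placeholders for illustration, not the released Rex-Thinker interface.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    box: tuple        # (x1, y1, x2, y2) bounding box of one detected instance
    category: str     # coarse category of the referred object, e.g. "person"

def detect_candidates(image, category: str) -> list[Candidate]:
    """Step 1: propose every instance of the referred object category,
    e.g. with an open-vocabulary detector (placeholder)."""
    raise NotImplementedError

def cot_verify(image, candidate: Candidate, expression: str) -> tuple[bool, str]:
    """Step 2: step-by-step (planning -> action -> summarization) reasoning
    over one candidate; returns (matches, reasoning_trace) (placeholder)."""
    raise NotImplementedError

def refer(image, expression: str, category: str) -> list[Candidate]:
    """Step 3: keep only verified candidates. An empty result means the
    model abstains because nothing in the image satisfies the expression."""
    matched = []
    for cand in detect_candidates(image, category):
        ok, trace = cot_verify(image, cand, expression)
        if ok:
            matched.append(cand)   # trace is the interpretable justification
    return matched                 # [] = trustworthy rejection, not a forced guess
```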
Related papers
- Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation [61.37076111486196]
Ref-AVS aims to segment target objects in audible videos based on given reference expressions. We propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process. Ref-Thinker is a multimodal language model capable of reasoning over textual, visual, and auditory cues.
arXiv Detail & Related papers (2025-08-06T13:05:09Z)
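A minimal sketch of the Think-Ground-Segment decomposition this entry describes; each stage function is a hypothetical placeholder standing in for components such as Ref-Thinker, not the TGS-Agent code.

```python
def think(video, audio, expression: str) -> str:
    """Reason over textual, visual, and auditory cues (the Ref-Thinker role)
    and return an explicit description of the target object (placeholder)."""
    raise NotImplementedError

def ground(video, description: str):
    """Localize the described object, e.g. as per-frame boxes (placeholder)."""
    raise NotImplementedError

def segment(video, regions):
    """Turn the grounded regions into pixel-level masks (placeholder)."""
    raise NotImplementedError

def ref_avs(video, audio, expression: str):
    """Run Think -> Ground -> Segment as an explicit three-stage pipeline."""
    description = think(video, audio, expression)
    regions = ground(video, description)
    return segment(video, regions)
```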
- Fixing confirmation bias in feature attribution methods via semantic match [4.733072355085082]
We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions.
This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations.
arXiv Detail & Related papers (2023-07-03T09:50:08Z)
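The summary leaves the "semantic match" test abstract, so here is one way such a check could look: score how well a feature attribution overlaps a human concept annotation. The IoU-style metric and all names are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def semantic_match_score(attribution: np.ndarray, concept_mask: np.ndarray,
                         threshold: float = 0.5) -> float:
    """Toy stand-in for a semantic-match test: how much of the model's
    feature attribution falls on features annotated with the human concept.
    The IoU formulation is an illustrative assumption, not the paper's metric."""
    salient = attribution >= threshold * attribution.max()     # binarize attribution
    inter = np.logical_and(salient, concept_mask).sum()
    union = np.logical_or(salient, concept_mask).sum()
    return float(inter) / max(int(union), 1)                   # guard empty union
```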
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
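A toy sketch of the dual-representing idea: one pre-trained encoder's token states feed both a dense-vector head and a lexicon-weight head. The dimensions, pooling choices, and layer names are assumptions for illustration, not UnifieR's actual architecture.

```python
import torch
import torch.nn as nn

class DualRepresentationHead(nn.Module):
    """Toy dual-representing retriever head: one PLM backbone's token states
    yield both a dense vector and sparse per-term lexicon weights."""
    def __init__(self, hidden: int = 768, vocab: int = 30522):
        super().__init__()
        self.dense_proj = nn.Linear(hidden, hidden)    # dense-vector paradigm
        self.lexicon_proj = nn.Linear(hidden, vocab)   # lexicon-based paradigm

    def forward(self, token_states: torch.Tensor):
        # token_states: (batch, seq_len, hidden) from a pre-trained LM
        dense = self.dense_proj(token_states[:, 0])    # [CLS]-style pooling
        lexicon = self.lexicon_proj(token_states).relu().amax(dim=1)  # max-pooled term weights
        return dense, lexicon  # queries and documents can match in either space

# usage: dense, lexicon = DualRepresentationHead()(torch.randn(2, 16, 768))
```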
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
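Since this entry describes recasting extraction and polarity prediction as one sequence generation task, here is a tiny illustration of the data formatting step; the template and delimiters are assumptions, not the paper's exact scheme.

```python
# Toy illustration of casting aspect-based sentiment analysis as sequence
# generation; the target template below is an assumed format for illustration.
def to_generation_example(sentence: str, triples: list[tuple[str, str, str]]) -> dict:
    """Each triple is (aspect_term, category, polarity); a generative LM is
    fine-tuned to produce the flattened target from the raw sentence."""
    target = " ; ".join(
        f"aspect: {term} | category: {cat} | polarity: {pol}"
        for term, cat, pol in triples
    )
    return {"input": sentence, "target": target}

example = to_generation_example(
    "The pasta was great but the service was slow.",
    [("pasta", "food quality", "positive"),
     ("service", "service general", "negative")],
)
# Extraction and polarity prediction thus become a single generation task
# for a language model with unidirectional attention.
```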
- Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
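A generic InfoNCE-style sketch of learning contrastive features that tie an object instance to the (synonymous) expressions describing it; this is a common stand-in formulation, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(obj_feats: torch.Tensor, expr_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE-style sketch: row i of expr_feats describes object i,
    so matched pairs are pulled together and mismatched pairs pushed apart.
    An illustrative stand-in, not the paper's exact loss."""
    obj = F.normalize(obj_feats, dim=-1)
    expr = F.normalize(expr_feats, dim=-1)
    logits = obj @ expr.t() / temperature      # (N, N) similarity matrix
    labels = torch.arange(obj.size(0))         # diagonal = positive pairs
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```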
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
- Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension [39.40351938417889]
Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression.
Some popular referring expression datasets fail to provide an ideal test bed for evaluating the reasoning ability of the models.
We propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
arXiv Detail & Related papers (2020-03-01T04:59:38Z)