Related papers: TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

URL: http://arxiv.org/abs/2603.02929v2
Date: Wed, 04 Mar 2026 02:21:55 GMT
Title: TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval
Authors: Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, Jinqiao Wang
Abstract summary: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents. We introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning.
Score: 35.86480813138274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
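Illustrative sketch: the abstract describes compressing an optional reasoning trace into one embedding via a dedicated token. The toy Python below shows that idea under loud assumptions: it is not the authors' implementation, the token name `<emb>` and the helper names are hypothetical, and the "hidden states" are hand-written lists standing in for model outputs. The embedding is read at the dedicated token's position, whether or not reasoning tokens precede it.

```python
import math

EMB_TOKEN = "<emb>"  # hypothetical name for the dedicated embedding token


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compress_to_embedding(tokens, hidden_states):
    """Return the normalized hidden state at the last EMB_TOKEN position.

    tokens: list of token strings (query, optional reasoning, then EMB_TOKEN).
    hidden_states: one vector per token (assumed to come from some model).
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have equal length")
    # Position of the last EMB_TOKEN; raises ValueError if it is absent.
    idx = len(tokens) - 1 - tokens[::-1].index(EMB_TOKEN)
    return l2_normalize(hidden_states[idx])


def rank(query_emb, candidates):
    """Order candidate ids by dot product with the query embedding.

    With unit-length vectors the dot product equals cosine similarity.
    """
    scores = {
        cid: sum(q * c for q, c in zip(query_emb, emb))
        for cid, emb in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# A query with a reasoning trace and a query without one use the same readout.
with_reasoning = ["query", "step1", "step2", EMB_TOKEN]
without_reasoning = ["query", EMB_TOKEN]
emb_a = compress_to_embedding(with_reasoning, [[1, 0], [0, 1], [1, 1], [3, 4]])
emb_b = compress_to_embedding(without_reasoning, [[1, 0], [2, 0]])
```

The same readout function serves both inputs, which is one simple way to picture how a model could skip reasoning for easy queries without changing the retrieval interface.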

Related papers

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts. We propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z)
Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision [11.159231524113764]
Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). In this paper, we propose the Guided Verifier framework to address these structural limitations. We develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier.
arXiv Detail & Related papers (2026-02-04T07:38:42Z)
CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction [50.67483317563736]
This paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction.
arXiv Detail & Related papers (2026-01-24T11:41:54Z)
CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
Fast Thinking for Large Language Models [67.7238685892317]
We introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking switches distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens.
arXiv Detail & Related papers (2025-09-28T04:19:48Z)
RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow [19.502882116487005]
Remote sensing imagery presents vast, inherently unstructured spatial data. We propose RemoteReasoner, a unified workflow for geospatial reasoning. RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks.
arXiv Detail & Related papers (2025-07-25T13:58:11Z)
In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly [25.47694115798524]
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones.
arXiv Detail & Related papers (2025-06-24T06:33:00Z)
Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router [9.580226379350737]
Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models. We propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs.
arXiv Detail & Related papers (2025-06-06T09:18:56Z)
A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts [33.284445296875916]
We introduce a formal framework demonstrating that transformer models, when provided with carefully designed prompts, can act as a computational system. We establish an approximation theory for $\beta$-times differentiable functions, proving that transformers can approximate such functions with arbitrary precision when guided by appropriately structured prompts. Our findings underscore their potential for autonomous reasoning and problem-solving, paving the way for more robust and theoretically grounded advancements in prompt engineering and AI agent design.
arXiv Detail & Related papers (2025-03-26T13:58:02Z)
On the Diagram of Thought [20.805936414171892]
Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a new framework that enables a single LLM to build and navigate a mental map of its reasoning.
arXiv Detail & Related papers (2024-09-16T07:01:41Z)
Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning [89.89857766491475]
We propose a curriculum-based logical-aware instruction tuning framework, named LACT. Specifically, we augment the arbitrary first-order logical queries via binary tree decomposition. Experiments across widely used datasets demonstrate that LACT yields substantial improvements (an average +5.5% MRR score) over advanced methods, achieving the new state-of-the-art.
arXiv Detail & Related papers (2024-05-02T18:12:08Z)
Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning [74.90592233107712]
We propose a Direct-Indirect Reasoning (DIR) method, which considers Direct Reasoning (DR) and Indirect Reasoning (IR) as multiple parallel reasoning paths that are merged to derive the final answer. Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods.
arXiv Detail & Related papers (2024-02-06T03:41:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.