Related papers: Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

URL: http://arxiv.org/abs/2602.03595v1
Date: Tue, 03 Feb 2026 14:48:12 GMT
Title: Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
Authors: Haichao Jiang, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu,
Abstract summary: Referring Video Object (RVOS) aims to segment objects in videos based on textual queries.<n>Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.<n>It significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches.
Score: 50.22481337087162
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.

Related papers

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions.<n> MLLMs exhibit $textbfmodality bias$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts.<n>We propose Modality-aware Consistency Reasoning ($bfMCR$), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z)
Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning.<n>We present a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding.<n>Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z)
VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning [49.35834435935727]
VideoZoomer is a novel agentic framework that enables MLLMs to control their visual focus during reasoning.<n>Our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks.<n>These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks.
arXiv Detail & Related papers (2025-12-26T11:43:21Z)
Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval [12.701443847087164]
We propose an adaptive multi-agent retrieval framework that orchestrates specialized agents over multiple reasoning iterations.<n>Our framework achieves a twofold improvement over CLIP4Clip and significantly outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-12-02T09:52:51Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models [39.85828629779943]
Multi-objective optimization is fundamental in complex decision-making tasks.<n>Recent advances in Large Language Models (LLMs) offer enhanced explainability, adaptability, and reasoning.<n>This work proposes Reflective Evolution of Multi-objective Heuristics (REMoH), a novel framework integrating NSGA-II with LLM-based generation.
arXiv Detail & Related papers (2025-06-09T13:38:28Z)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework.<n>It integrates efficient frame selection with real-time reward generation during inference.<n>Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z)
TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems [42.50432360919637]
Large language models (LLMs)-driven root cause analysis (RCA) in cloud-native systems has become a key topic of modern software operations and maintenance.<n>Existing LLM-based approaches face three key challenges: multi-modality input constraint, context window limitation, and dynamic dependence graph.<n>We propose a tool-assisted LLM agent with multi-modality observation data for fine-grained RCA, namely TAMO.
arXiv Detail & Related papers (2025-04-29T06:50:48Z)
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and. Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting. LLMs to downstream tasks. We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model [22.870512676002463]
This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators. Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs. Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT.
arXiv Detail & Related papers (2023-10-24T03:08:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.