Towards Agentic AI for Multimodal-Guided Video Object Segmentation
- URL: http://arxiv.org/abs/2508.10572v1
- Date: Thu, 14 Aug 2025 12:11:15 GMT
- Title: Towards Agentic AI for Multimodal-Guided Video Object Segmentation
- Authors: Tuyen Tran, Thao Minh Le, Truyen Tran
- Abstract summary: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. We propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner.
- Score: 14.877182670778284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.
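The abstract describes the agent only at a workflow level. As a rough illustration of how such an LLM-driven tool loop could be wired together, here is a minimal Python sketch; the planner interface (`llm.plan`), the tool names, and their signatures are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable

# Hypothetical low-level tools, one per modality (names and signatures are
# assumptions, not the paper's actual API).
TOOLS: dict[str, Callable[..., dict]] = {
    "ground_expression": lambda frames, text: {"boxes": []},    # text cue -> candidate boxes
    "segment_box":       lambda frames, box: {"mask": None},    # box -> fine-grained mask
    "localize_sound":    lambda frames, audio: {"region": None},  # audio cue -> region
}

def run_agent(llm, video_frames, cue) -> list:
    """Iteratively ask the LLM to plan the next tool call until it finishes."""
    history = [{"role": "user", "content": f"Segment the object described by: {cue}"}]
    masks = []
    for _ in range(8):  # bound the number of reasoning steps
        step = llm.plan(history, tool_specs=list(TOOLS))  # hypothetical planner call
        if step.action == "finish":
            break
        result = TOOLS[step.action](video_frames, *step.args)  # execute the chosen tool
        history.append({"role": "tool", "content": str(result)})  # feed result back to the LLM
        if result.get("mask") is not None:
            masks.append(result["mask"])
    return masks
```

Bounding the loop (here at 8 steps) is one simple way to keep a dynamically generated workflow from iterating indefinitely when no tool returns a usable mask.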
Related papers
- AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z)
- Large Language Model as Meta-Surrogate for Data-Driven Many-Task Optimization: A Proof-of-Principle Study [11.452011929848844]
This study proposes a novel meta-surrogate framework to assist many-task optimization. We formulate a unified framework for many-task fitness prediction by defining a universal model with metadata to fit a group of problems. Our framework supports dual-level knowledge transfer -- at both the surrogate and individual levels -- enhancing optimization efficiency and robustness.
arXiv Detail & Related papers (2025-03-11T11:13:11Z)
- Model Evolution Framework with Genetic Algorithm for Multi-Task Reinforcement Learning [85.91908329457081]
Multi-task reinforcement learning employs a single policy to complete various tasks, aiming to develop an agent with generalizability across different scenarios. Existing approaches typically use a routing network to generate specific routes for each task and reconstruct a set of modules into diverse models to complete multiple tasks simultaneously. We propose a Model Evolution framework with Genetic Algorithm (MEGA), which enables the model to evolve during training according to the difficulty of the tasks.
arXiv Detail & Related papers (2025-02-19T09:22:34Z)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [23.241807997843427]
We propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. We present a new benchmark, MMUD, which includes samples annotated with multiple task labels. We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
arXiv Detail & Related papers (2024-08-06T07:19:51Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach for generating visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
arXiv Detail & Related papers (2024-02-28T16:35:52Z)
- Towards Unified Task Embeddings Across Multiple Models: Bridging the Gap for Prompt-Based Large Language Models and Beyond [16.913115978881866]
We propose a framework for unified task embeddings (FUTE), which harmonizes task embeddings from various models, including smaller language models and Large Language Models with varied prompts, within a single vector space.
Such uniformity enables comparison and analysis of similarities amongst different models, broadening the scope and utility of existing task embedding methods in multi-model scenarios.
arXiv Detail & Related papers (2024-02-22T13:13:31Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Towards Robust Multi-Modal Reasoning via Model Selection [7.6621866737827045]
The LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving.
We propose the $\textit{M}^3$ framework as a plug-in with negligible runtime overhead at test time.
Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies.
arXiv Detail & Related papers (2023-10-12T16:06:18Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performance is sub-optimal or even lags far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.