Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2602.03892v1
- Date: Tue, 03 Feb 2026 07:47:59 GMT
- Title: Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
- Authors: Jinxing Zhou, Yanghao Zhou, Yaoting Wang, Zongyan Han, Jiaqi Ma, Henghui Ding, Rao Muhammad Anwer, Hisham Cholakkal,
- Abstract summary: We introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations. We propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information.
- Score: 79.13636675697096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at https://github.com/jasongief/MQA-RefAVS.
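For intuition about the task the abstract describes, the sketch below (not the paper's code; function names and the decision thresholds are hypothetical) shows the quantity an auditor is trained to predict: at benchmark-construction time the target is the ordinary IoU between a candidate mask and the ground truth, while at inference the auditor must estimate that IoU without seeing the ground truth, then map it to a quality-control action.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks (the auditor's regression target)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define IoU as 1
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def quality_decision(estimated_iou: float, accept_thr: float = 0.75) -> str:
    """Hypothetical quality-control rule mapping an estimated IoU to an action."""
    if estimated_iou >= accept_thr:
        return "accept"
    if estimated_iou >= 0.5:
        return "refine"
    return "resegment"

# Toy example: a 4-pixel prediction against an 8-pixel ground truth.
pred = np.zeros((4, 4), dtype=np.uint8)
gt = np.zeros((4, 4), dtype=np.uint8)
pred[:2, :2] = 1          # overlap = 4 pixels
gt[:2, :] = 1             # union = 8 pixels, so IoU = 0.5
iou = mask_iou(pred, gt)
print(iou, quality_decision(iou))  # 0.5 refine
```

The paper's auditor additionally predicts an error type (geometric vs. semantic) alongside the IoU estimate; the threshold rule above stands in for its recommended quality-control decision.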
Related papers
- Segment and Matte Anything in a Unified Model [5.8874968768571625]
Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting. We introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting.
arXiv Detail & Related papers (2026-01-17T19:43:10Z) - Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations. We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z) - SimToken: A Simple Baseline for Referring Audio-Visual Segmentation [29.88252418748085]
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. We propose a framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM).
arXiv Detail & Related papers (2025-09-22T08:55:04Z) - Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence [22.45673628231233]
Action-based video object segmentation links segmentation with action semantics. We take the first step by studying action-based video object segmentation under label noise. We adapt six label-noise learning strategies to this setting and establish protocols for evaluating them.
arXiv Detail & Related papers (2025-09-20T13:03:43Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - Refer to Any Segmentation Mask Group With Vision-Language Prompts [79.43440775648824]
"Refer to Any Mask Group" (RAS) augments segmentation models with complex multimodal interactions and comprehension. We demonstrate superior performance of RAS on our new ORES task, as well as on the classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
arXiv Detail & Related papers (2025-06-05T17:59:51Z) - Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z) - Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes [11.575313825919205]
We introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS).
Ref-AVS seeks to segment objects based on expressions containing multimodal cues.
We propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance.
arXiv Detail & Related papers (2024-07-15T17:54:45Z) - GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt.
We propose Generalized Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z) - MASR: Multi-label Aware Speech Representation [36.2978180342839]
We propose MASR, a Multi-label Aware Speech Representation learning framework.
MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information.
We show significant performance improvements for the MASR over other established benchmarks.
arXiv Detail & Related papers (2023-07-20T16:09:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.