New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
- URL: http://arxiv.org/abs/2502.20104v2
- Date: Fri, 28 Feb 2025 07:36:32 GMT
- Title: New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
- Authors: Xuzheng Yang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen
- Abstract summary: Referring Expression Comprehension (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. We introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships. Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets.
- Score: 49.180693704510006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. To advance this field, we introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships. Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets, an often-overlooked yet critical challenge in existing datasets. To address fine-grained compositional REC, we propose novel methods based on a Specialist-MLLM collaboration framework, leveraging their complementary strengths: Specialist Models handle simpler tasks efficiently, while MLLMs are better suited for complex reasoning. Building on this synergy, we introduce two collaborative strategies. The first, Slow-Fast Adaptation (SFA), employs a routing mechanism to adaptively delegate simple tasks to Specialist Models and complex tasks to MLLMs; in addition, common error patterns in both models are mitigated through a target-refocus strategy. The second, Candidate Region Selection (CRS), generates multiple bounding-box candidates with a Specialist Model and uses the advanced reasoning capabilities of MLLMs to identify the correct target. Extensive experiments on our dataset and other challenging compositional benchmarks validate the effectiveness of our approaches. The SFA strategy achieves a favorable trade-off between localization accuracy and efficiency, while the CRS strategy greatly boosts the performance of both Specialist Models and MLLMs. We hope this work offers valuable insights into solving complex real-world tasks by strategically combining existing tools for maximum effectiveness, rather than reinventing them.
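To make the control flow of the two collaboration strategies concrete, below is a minimal Python sketch based only on the abstract's description. All model-call functions (`estimate_difficulty`, `specialist_predict`, `mllm_ground`, `mllm_select`) are hypothetical stubs, not the authors' implementation or API; they merely mark where a real Specialist Model or MLLM would be plugged in.

```python
# Minimal sketch of the SFA and CRS strategies described in the abstract.
# Every model call below is a placeholder stub with an illustrative name.

from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


# --- hypothetical model interfaces (stubs) ---------------------------------
def estimate_difficulty(expression: str) -> float:
    """Routing signal for SFA; here a toy proxy based on expression length."""
    return min(len(expression.split()) / 20.0, 1.0)


def specialist_predict(image, expression: str, k: int = 1) -> List[Box]:
    """Stub: a Specialist REC model returning its top-k box candidates."""
    raise NotImplementedError("plug in a real Specialist Model here")


def mllm_ground(image, expression: str) -> Optional[Box]:
    """Stub: an MLLM that grounds the expression or rejects it (None)."""
    raise NotImplementedError("plug in a real MLLM here")


def mllm_select(image, expression: str, candidates: List[Box]) -> Optional[int]:
    """Stub: an MLLM choosing the matching candidate index, or None."""
    raise NotImplementedError("plug in a real MLLM here")


# --- Slow-Fast Adaptation (SFA) ---------------------------------------------
def slow_fast_adaptation(image, expression: str, threshold: float = 0.5):
    # Simple expressions take the fast path (Specialist Model);
    # complex ones take the slow path (MLLM reasoning).
    if estimate_difficulty(expression) < threshold:
        return specialist_predict(image, expression, k=1)[0]
    return mllm_ground(image, expression)


# --- Candidate Region Selection (CRS) ----------------------------------------
def candidate_region_selection(image, expression: str, k: int = 5):
    # The Specialist Model proposes candidate boxes; the MLLM picks the one
    # matching the expression, or rejects all of them (non-existent target).
    candidates = specialist_predict(image, expression, k=k)
    idx = mllm_select(image, expression, candidates)
    return candidates[idx] if idx is not None else None
```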
Related papers
- Magneto: Combining Small and Large Language Models for Schema Matching [8.387623375871055]
Small language models (SLMs) require training data, and large language models (LLMs) often incur high computational costs. We present Magneto, a cost-effective and accurate solution for schema matching.
arXiv Detail & Related papers (2024-12-11T08:35:56Z) - KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs [14.376057807754668]
Large language models (LLMs) suffer from hallucinations and confusion about task instructions. This study presents the Knowledge-Compliant Matching Framework (KcMF) that addresses these issues without the need for domain-specific fine-tuning.
arXiv Detail & Related papers (2024-10-16T11:50:02Z) - FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. We have established a new REC dataset characterized by two key features. It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z) - ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z) - MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic [6.46176287368784]
We propose Model Exclusive Task Arithmetic for merging GPT-scale models.
Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.
arXiv Detail & Related papers (2024-06-17T10:12:45Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering [25.57202500348071]
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models interact with a database.
The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize the results into a comprehensive analytical narrative.
We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction.
arXiv Detail & Related papers (2023-11-16T09:55:07Z) - Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)