See, Say, and Segment: Teaching LMMs to Overcome False Premises
- URL: http://arxiv.org/abs/2312.08366v1
- Date: Wed, 13 Dec 2023 18:58:04 GMT
- Title: See, Say, and Segment: Teaching LMMs to Overcome False Premises
- Authors: Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta,
Xudong Wang, Joseph E. Gonzalez, Trevor Darrell
- Abstract summary: We propose a cascading and joint training approach that teaches LMMs to handle false-premise queries.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
- Score: 67.36381001664635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current open-source Large Multimodal Models (LMMs) excel at tasks such as
open-vocabulary language grounding and segmentation but can suffer under false
premises when queries imply the existence of something that is not actually
present in the image. We observe that existing methods that fine-tune an LMM to
segment images significantly degrade their ability to reliably determine
("see") if an object is present and to interact naturally with humans ("say"),
a form of catastrophic forgetting. In this work, we propose a cascading and
joint training approach for LMMs to solve this task, avoiding catastrophic
forgetting of previous skills. Our resulting model can "see" by detecting
whether objects are present in an image, "say" by telling the user if they are
not, proposing alternative queries or correcting semantic errors in the query,
and finally "segment" by outputting the mask of the desired objects if they
exist. Additionally, we introduce a novel False Premise Correction benchmark
dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets
(which we call FP-RefCOCO(+/g)). The results show that our method not only
detects false premises up to 55% better than existing approaches, but under
false premise conditions produces relative cIOU improvements of more than 31%
over baselines, and produces natural language feedback judged helpful up to 67%
of the time.
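As a reading aid, the following is a minimal sketch of the cascaded "see, say, segment" inference flow the abstract describes, assuming a hypothetical `lmm` object with `detect_presence`, `correct_query`, and `segment` methods; the interface names are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of the cascaded "see, say, segment" flow from the abstract.
# The `lmm` methods used here are assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class GroundingResult:
    present: bool                      # "see": is the referred object in the image?
    feedback: str                      # "say": natural-language response to the user
    mask: Optional[np.ndarray] = None  # "segment": binary mask if the object exists


def see_say_segment(lmm, image: np.ndarray, query: str) -> GroundingResult:
    # 1. "See": ask the model whether the query's premise holds for this image.
    present = lmm.detect_presence(image, query)

    if not present:
        # 2. "Say": report the false premise and propose a corrected or
        #    alternative query instead of hallucinating a mask.
        suggestion = lmm.correct_query(image, query)
        return GroundingResult(
            present=False,
            feedback=f"I don't see '{query}' in the image. Did you mean '{suggestion}'?",
        )

    # 3. "Segment": only when the premise is true, output the requested mask.
    mask = lmm.segment(image, query)
    return GroundingResult(
        present=True,
        feedback=f"Here is the mask for '{query}'.",
        mask=mask,
    )
```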
Related papers
- Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese [3.724862061593193]
The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE).
We propose Face4RAG, the first comprehensive FCE benchmark for RAG that is independent of the underlying Large Language Models (LLMs).
On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference.
arXiv Detail & Related papers (2024-07-01T08:35:04Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - Reflection Invariance Learning for Few-shot Semantic Segmentation [53.20466630330429]
Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images.
This paper proposes a fresh few-shot segmentation framework that mines reflection invariance in a multi-view matching manner.
Experiments on both PASCAL-$5^i$ and COCO-$20^i$ datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-01T15:14:58Z) - Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model [14.98695074168234]
We propose a new method to detect machine-generated text, especially text from large language models (LLMs).
We use a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency.
Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget.
arXiv Detail & Related papers (2023-05-26T04:23:10Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - ReWOO: Decoupling Reasoning from Observations for Efficient Augmented
Language Models [32.95155349925248]
We propose a modular paradigm ReWOO that detaches the reasoning process from external observations, thus significantly reducing token consumption.
We show that ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark.
Our illustrative work offloads reasoning ability from a 175B GPT-3.5 into a 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.
arXiv Detail & Related papers (2023-05-23T00:16:48Z)