Related papers: Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

URL: http://arxiv.org/abs/2512.02791v1
Date: Tue, 02 Dec 2025 14:08:47 GMT
Title: Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Authors: Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio,
Abstract summary: Dialogue-Based Generalized Referring Expressions (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context.<n>Existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data.<n>We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding.
Score: 3.898807437481249
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

Related papers

On Mitigating Data Sparsity in Conversational Recommender Systems [69.70761335240738]
Conversational recommender systems (CRSs) capture user preference through textual information in dialogues.<n>They suffer from data sparsity on two fronts: the dialogue space is vast and linguistically diverse, while the item space exhibits long-tail and sparse distributions.<n>Existing methods struggle with (1) generalizing to varied dialogue expressions due to underutilization of rich textual cues, and (2) learning informative item representations under severe sparsity.
arXiv Detail & Related papers (2025-07-01T06:54:51Z)
Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction [14.520176577205754]
We introduce a model-agnostic two-stage Consistency Reflection and Correction framework.<n>In the consistency reflection stage, the model is prompted to reflect on the discrepancies between generated responses and dialogue contexts.<n>In the consistency correction stage, the model generates responses that are more consistent with the dialogue context.
arXiv Detail & Related papers (2025-06-16T11:15:21Z)
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions [8.717610965852037]
We propose a novel training paradigm to generate diverse responses of a given proficiency level.<n>We convert responses into synthesized speech via speaker-aware text-to-speech synthesis.<n>A multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly.
arXiv Detail & Related papers (2025-06-04T15:42:53Z)
Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues [66.69453609603875]
Sociocultural norms serve as guiding principles for personal conduct in social interactions. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase.
arXiv Detail & Related papers (2024-10-04T00:08:46Z)
Instructive Dialogue Summarization with Query Aggregations [41.89962538701501]
We introduce instruction-finetuned language models to expand the capability set of dialogue summarization models. We propose a three-step approach to synthesize high-quality query-based summarization triples. By training a unified model called InstructDS on three summarization datasets with multi-purpose instructive triples, we expand the capability of dialogue summarization models.
arXiv Detail & Related papers (2023-10-17T04:03:00Z)
Grounding Description-Driven Dialogue State Trackers with Knowledge-Seeking Turns [54.56871462068126]
Augmenting the training set with human or synthetic schema paraphrases improves the model robustness to these variations but can be either costly or difficult to control. We propose to circumvent these issues by grounding the state tracking model in knowledge-seeking turns collected from the dialogue corpus as well as the schema.
arXiv Detail & Related papers (2023-09-23T18:33:02Z)
SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies. We use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
DialogueCSE: Dialogue-based Contrastive Learning of Sentence Embeddings [33.89889949577356]
We propose DialogueCSE, a dialogue-based contrastive learning approach to tackle this issue. We evaluate our model on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus, the Jing Dong Dialogue Corpus, and the E-commerce Dialogue Corpus.
arXiv Detail & Related papers (2021-09-26T13:25:41Z)
I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z)
Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually with reasoning over dialogue turns with the help of the back-end data. Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.