CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
- URL: http://arxiv.org/abs/2303.09713v2
- Date: Wed, 16 Aug 2023 08:17:02 GMT
- Title: CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
- Authors: Seungju Han, Jack Hessel, Nouha Dziri, Yejin Choi, Youngjae Yu
- Abstract summary: We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts.
To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues.
- Score: 75.37313546008639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual information is central to conversation: body gestures and physical
behaviour, for example, contribute to meaning that transcends words alone. To
date, however, most neural conversational models are limited to just text. We
introduce CHAMPAGNE, a generative model of conversations that can account for
visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a
large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from
web videos: crucial to our data collection pipeline is a pretrained language
model that converts error-prone automatic transcripts to a cleaner dialogue
format while maintaining meaning. Human evaluation reveals that YTD-18M is more
sensible and specific than prior resources (MMDialog, 1M dialogues), while
maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE
learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it
achieves state-of-the-art results on four vision-language tasks focused on
real-world conversations. We release data, models, and code.
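The abstract's key data-collection step is a pretrained language model that rewrites noisy automatic transcripts into clean, speaker-segmented dialogue while preserving meaning. The sketch below illustrates the shape of such a step only; the model choice (google/flan-t5-base), the prompt wording, and the function name are assumptions for illustration and are not the authors' released pipeline.

```python
# Hypothetical sketch of a transcript-to-dialogue cleanup step.
# The model, prompt, and example transcript below are illustrative
# assumptions, NOT the authors' released YTD-18M pipeline.
from transformers import pipeline

# Any instruction-tuned text-to-text model can stand in for the pretrained LM.
cleaner = pipeline("text2text-generation", model="google/flan-t5-base")

def transcript_to_dialogue(asr_text: str, max_new_tokens: int = 256) -> str:
    """Ask the LM to rewrite a noisy ASR transcript as clean dialogue turns."""
    prompt = (
        "Rewrite this automatic speech transcript as a dialogue between speakers. "
        "Split it into turns, fix transcription errors, and keep the meaning:\n\n"
        + asr_text
    )
    return cleaner(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]

if __name__ == "__main__":
    noisy = ("so um i was thinking we could uh go hiking this weekend "
             "you know if the weathers okay sure that sounds good to me")
    print(transcript_to_dialogue(noisy))
```

The point of the sketch is only the structure of the step described in the abstract: a text-to-text model maps an error-prone transcript to speaker-segmented dialogue turns while keeping the original meaning.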
Related papers
- Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules.
An additional dynamic gating mechanism enables the model to switch more easily between discussing the visual input and unrelated conversation topics.
We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z) - REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.
We compare emotional intelligence (EI) attributes and persona consistency to understand the challenges posed by real-world dialogues.
Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z) - Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented
Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD).
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented
Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications that is instruction-finetuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems on both automatic and human evaluation.
We deploy ChatPLUG to real-world applications such as smart speaker and instant messaging applications with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z) - TANet: Thread-Aware Pretraining for Abstractive Conversational
Summarization [27.185068253347257]
We build a large-scale (11M) pretraining dataset called RCS based on the multi-person discussions in the Reddit community.
We then present TANet, a thread-aware Transformer-based network.
Unlike the existing pre-trained models that treat a conversation as a sequence of sentences, we argue that the inherent contextual dependency plays an essential role in understanding the entire conversation.
arXiv Detail & Related papers (2022-04-09T16:08:46Z) - OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-modal dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.