Video-Grounded Dialogues with Pretrained Generation Language Models
- URL: http://arxiv.org/abs/2006.15319v1
- Date: Sat, 27 Jun 2020 08:24:26 GMT
- Title: Video-Grounded Dialogues with Pretrained Generation Language Models
- Authors: Hung Le, Steven C.H. Hoi
- Abstract summary: We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework by formulating video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
- Score: 88.15419265622748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have shown remarkable success in improving
various downstream NLP tasks due to their ability to capture dependencies in
textual data and generate natural responses. In this paper, we leverage the
power of pre-trained language models for improving video-grounded dialogue,
which is very challenging and involves complex features of different dynamics:
(1) Video features which can extend across both spatial and temporal
dimensions; and (2) Dialogue features which involve semantic dependencies over
multiple dialogue turns. We propose a framework by extending GPT-2 models to
tackle these challenges by formulating video-grounded dialogue tasks as a
sequence-to-sequence task, combining both visual and textual representation
into a structured sequence, and fine-tuning a large pre-trained GPT-2 network.
Our framework allows fine-tuning language models to capture dependencies across
multiple modalities over different levels of information: spatio-temporal level
in video and token-sentence level in dialogue context. We achieve promising
improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from
DSTC7, which supports a potential direction in this line of research.
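The core idea of the framework is to combine visual and textual representations into one structured sequence that a GPT-2 model can attend over. A minimal sketch of that input construction is shown below; the dimensions, the projection matrix, and the segment-id convention are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical sizes for illustration (not taken from the paper).
VIDEO_FEAT_DIM = 2048   # assumed per-segment visual feature size
EMBED_DIM = 768         # GPT-2 small hidden size
rng = np.random.default_rng(0)

# A learned linear projection would map video features into the
# word-embedding space so both modalities share one sequence.
W_proj = rng.normal(scale=0.02, size=(VIDEO_FEAT_DIM, EMBED_DIM))

def build_input_sequence(video_feats, text_embeds):
    """Concatenate projected video segments and text token embeddings
    into a single structured sequence; segment ids mark the modality
    (0 = video, 1 = text) so the model can distinguish the two levels."""
    video_embeds = video_feats @ W_proj                       # (T_v, EMBED_DIM)
    seq = np.concatenate([video_embeds, text_embeds], axis=0)  # (T_v+T_t, EMBED_DIM)
    segment_ids = np.array([0] * len(video_embeds) + [1] * len(text_embeds))
    return seq, segment_ids

# Toy example: 4 video segments followed by 6 dialogue tokens.
video_feats = rng.normal(size=(4, VIDEO_FEAT_DIM))
text_embeds = rng.normal(size=(6, EMBED_DIM))
seq, segment_ids = build_input_sequence(video_feats, text_embeds)
print(seq.shape)            # (10, 768)
print(segment_ids.tolist())
```

Fine-tuning then proceeds as ordinary autoregressive language modeling over this combined sequence, which is what lets a pre-trained text model capture dependencies across both modalities.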
Related papers
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [24.68804661538364]
Full spoken dialogue systems closely mirror human-human interactions, but achieving low latency and natural interaction remains a significant challenge.
End-to-end spoken dialogue systems are a promising direction for developing efficient and natural conversational models.
Audio samples of dialogues generated by OmniFlatten can be found at this web site.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
- OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST).
arXiv Detail & Related papers (2024-02-20T17:00:59Z)
- SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization [48.284512017469524]
Multi-turn dialogues are characterized by their extended length and the turn-taking between multiple speakers.
Traditional language models often overlook the distinct features of these dialogues by treating them as regular text.
We propose a speaker-enhanced pre-training method for long dialogue summarization.
arXiv Detail & Related papers (2024-01-31T04:50:00Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task lies in the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Back to the Future: Bidirectional Information Decoupling Network for Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.