Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation
- URL: http://arxiv.org/abs/2211.03940v1
- Date: Tue, 8 Nov 2022 01:23:59 GMT
- Title: Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation
- Authors: Satwik Kottur, Seungwhan Moon, Aram H. Markosyan, Hardik Shah, Babak
Damavandi, Alborz Geramifard
- Abstract summary: We propose task-oriented dialogs for montage creation as a novel interactive tool to seamlessly search, compile, and edit montages from a media collection.
We collect a new dataset C3 (Conversational Content Creation), comprising 10k dialogs conditioned on media montages simulated from a large media collection.
Our analysis and benchmarking of state-of-the-art language models showcase the multimodal challenges present in the dataset.
- Score: 11.538915414185022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People capture photos and videos to relive and share memories of personal
significance. Recently, media montages (stories) have become a popular mode of
sharing these memories due to their intuitive and powerful storytelling
capabilities. However, creating such montages usually involves a lot of manual
searches, clicks, and selections that are time-consuming and cumbersome,
adversely affecting user experiences.
To alleviate this, we propose task-oriented dialogs for montage creation as a
novel interactive tool to seamlessly search, compile, and edit montages from a
media collection. To the best of our knowledge, our work is the first to
leverage multi-turn conversations for such a challenging application, extending
the previous literature studying simple media retrieval tasks. We collect a new
dataset C3 (Conversational Content Creation), comprising 10k dialogs
conditioned on media montages simulated from a large media collection.
We take a simulate-and-paraphrase approach to collect these dialogs, keeping
collection both cost- and time-efficient while still drawing from a natural
language distribution (a minimal sketch of this idea follows the abstract).
Our analysis and benchmarking of state-of-the-art language models showcase the
multimodal challenges present in the dataset. Lastly, we present a real-world
mobile demo application that shows the feasibility of the proposed work in
real-world applications. Our code and data will be made publicly available.
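The simulate-and-paraphrase recipe can be illustrated with a small, self-contained sketch. Everything below is an assumption for illustration only (the toy media fields, the intent names SEARCH/ADD, and the template table standing in for human or model paraphrasing); it is not the authors' released pipeline, but it shows the shape of the resulting pairs of structured turns and natural-language utterances.

```python
import random

# Toy media collection standing in for a personal photo/video library.
# All field names here are illustrative assumptions, not the C3 schema.
MEDIA = [
    {"id": 1, "type": "photo", "location": "Paris", "people": ["Ana"]},
    {"id": 2, "type": "video", "location": "Paris", "people": ["Ana", "Bo"]},
    {"id": 3, "type": "photo", "location": "Rome", "people": ["Bo"]},
]

# Hand-written templates stand in for the paraphrase step that rewrites
# simulated, templated utterances into more natural language.
PARAPHRASES = {
    "SEARCH": [
        "Find my {type}s from {location}.",
        "Can you pull up the {type}s I took in {location}?",
    ],
    "ADD": [
        "Add {ids} to the montage.",
        "Put {ids} into my story, please.",
    ],
}

def simulate_turn(intent, slots):
    """Return a structured user turn plus a paraphrased utterance."""
    template = random.choice(PARAPHRASES[intent])
    return {"intent": intent, "slots": slots}, template.format(**slots)

def simulate_dialog():
    """Simulate one short montage-creation dialog over the toy collection."""
    dialog, montage = [], []

    # Turn 1: the user searches the collection with simple slot filters.
    slots = {"type": "photo", "location": "Paris"}
    turn, utterance = simulate_turn("SEARCH", slots)
    results = [m["id"] for m in MEDIA
               if m["type"] == slots["type"] and m["location"] == slots["location"]]
    dialog.append({"user": utterance, "structured": turn, "results": results})

    # Turn 2: the user compiles the returned items into the montage.
    turn, utterance = simulate_turn("ADD", {"ids": ", ".join(map(str, results))})
    montage.extend(results)
    dialog.append({"user": utterance, "structured": turn, "results": list(montage)})
    return dialog, montage

if __name__ == "__main__":
    for t in simulate_dialog()[0]:
        print(t["user"], "->", t["results"])
```

In the paper's actual setup, dialogs are conditioned on montages simulated from a large media collection and paraphrasing is performed at scale, which is what keeps collection cheap while staying close to natural language.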
Related papers
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within an untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z)
- OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the outputs of an Object State Tracker (OST) and a Language State Tracker (LST).
arXiv Detail & Related papers (2024-02-20T17:00:59Z)
- SOVC: Subject-Oriented Video Captioning [59.04029220586337]
We propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which aims to allow users to specify the describing target via a bounding box.
To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image captions dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Navigating Connected Memories with a Task-oriented Dialog System [13.117491508194242]
We propose dialogs for connected memories as a powerful tool to empower users to search their media collection through a multi-turn, interactive conversation.
We use a new task-oriented dialog dataset, COMET, which contains 11.5k user↔assistant dialogs (totaling 103k utterances) grounded in simulated personal memory graphs.
We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines.
arXiv Detail & Related papers (2022-11-15T19:31:57Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task is the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization [19.918194137007653]
We present a pre-training framework for long dialogue understanding and summarization.
Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training (a generic sketch appears after this list).
We conduct extensive experiments on five datasets of long dialogues, covering tasks of dialogue summarization, abstractive question answering and topic segmentation.
arXiv Detail & Related papers (2021-09-06T13:55:03Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)
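As a side note on the DialogLM entry above, window-based denoising for generative pre-training can be illustrated generically: mask a contiguous window of utterances in a long dialogue and train a sequence-to-sequence model to regenerate the original window from the corrupted context. The snippet below is only a sketch under assumed choices (a single masked window and a `<mask>` sentinel); DialogLM's actual corruption strategies differ in detail.

```python
import random

def window_denoising_pair(utterances, window_size=2, mask_token="<mask>"):
    """Mask a contiguous window of utterances and return a (source, target)
    pair: the model learns to regenerate the original window from the
    corrupted dialogue context."""
    start = random.randrange(0, len(utterances) - window_size + 1)
    window = utterances[start:start + window_size]
    corrupted = (utterances[:start]
                 + [mask_token] * window_size
                 + utterances[start + window_size:])
    return " ".join(corrupted), " ".join(window)

dialogue = [
    "A: Did you finish the report?",
    "B: Almost, I still need the Q3 numbers.",
    "A: I can send them after lunch.",
    "B: Great, then I'll submit it tonight.",
]
source, target = window_denoising_pair(dialogue)
print("SOURCE:", source)
print("TARGET:", target)
```

A pre-training loop would feed `source` to the encoder and use `target` as the decoder's reconstruction objective.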
This list is automatically generated from the titles and abstracts of the papers on this site.