MONAH: Multi-Modal Narratives for Humans to analyze conversations
- URL: http://arxiv.org/abs/2101.07339v2
- Date: Wed, 20 Jan 2021 02:25:23 GMT
- Title: MONAH: Multi-Modal Narratives for Humans to analyze conversations
- Authors: Joshua Y. Kim, Greyson Y. Kim, Chunfeng Liu, Rafael A. Calvo, Silas
C.R. Taylor, Kalina Yacef
- Abstract summary: We introduce a system that automatically expands the verbatim transcripts of video-recorded conversations using multimodal data streams.
This system uses a set of preprocessing rules to weave multimodal annotations into the verbatim transcripts and promote interpretability.
- Score: 9.178828168133206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In conversational analyses, humans manually weave multimodal information into
the transcripts, which is significantly time-consuming. We introduce a system
that automatically expands the verbatim transcripts of video-recorded
conversations using multimodal data streams. This system uses a set of
preprocessing rules to weave multimodal annotations into the verbatim
transcripts and promote interpretability. Our feature engineering contributions
are two-fold: firstly, we identify the range of multimodal features relevant to
detecting rapport-building; secondly, we expand the range of multimodal
annotations and show that the expansion leads to statistically significant
improvements in detecting rapport-building.
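To illustrate the kind of preprocessing rules described above, here is a minimal, hypothetical sketch of weaving multimodal annotations into a verbatim transcript. It is not the authors' implementation; the feature names (loudness, smile ratio, pause length), thresholds, and annotation wordings are all assumed for the example.

```python
# Hypothetical sketch only: threshold-based rules that weave multimodal cues
# into a verbatim transcript. Features, thresholds, and wordings are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker: str
    text: str            # verbatim transcript text
    loudness_db: float   # assumed feature: mean vocal intensity of the utterance
    smile_ratio: float   # assumed feature: fraction of frames with a detected smile
    delay_sec: float     # assumed feature: pause before the utterance starts

def annotate(utt: Utterance) -> str:
    """Turn numeric multimodal features into human-readable cues and weave
    them into the verbatim line (illustrative thresholds only)."""
    cues: List[str] = []
    if utt.delay_sec > 2.0:
        cues.append("after a long pause")
    if utt.loudness_db > 65.0:
        cues.append("loudly")
    if utt.smile_ratio > 0.5:
        cues.append("smiling")
    lead = f"{utt.speaker} says {', '.join(cues)}: " if cues else f"{utt.speaker} says: "
    return lead + utt.text

transcript = [
    Utterance("Counsellor", "How have you been feeling this week?", 58.0, 0.7, 0.4),
    Utterance("Client", "Honestly, a bit better than last time.", 67.0, 0.2, 2.6),
]

for utt in transcript:
    print(annotate(utt))
# Counsellor says smiling: How have you been feeling this week?
# Client says after a long pause, loudly: Honestly, a bit better than last time.
```

In practice, rules of this kind would be calibrated against the recorded multimodal data streams (e.g., prosodic and facial features) rather than hand-picked thresholds.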
Related papers
- Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques.
We show that speech-based multi-modal retrieval outperforms text based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
arXiv Detail & Related papers (2024-06-13T22:55:22Z)
- Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF)
MoPE-BAF is a novel multi-modal soft prompt framework based on the unified vision-language model (VLM)
arXiv Detail & Related papers (2024-03-17T19:12:26Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning [120.95150400119705]
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD)
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
arXiv Detail & Related papers (2023-11-29T08:27:00Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MM-AU:Towards Multimodal Understanding of Advertisement Videos [38.117243603403175]
We introduce a multimodal multilingual benchmark called MM-AU composed of over 8.4K videos (147 hours) curated from multiple web sources.
We explore multiple zero-shot reasoning baselines through the application of large language models on the ads transcripts.
arXiv Detail & Related papers (2023-08-27T09:11:46Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
We also show that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.