MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
- URL: http://arxiv.org/abs/2409.14801v1
- Date: Mon, 23 Sep 2024 08:26:08 GMT
- Title: MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
- Authors: Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine
- Abstract summary: Critical moments, such as emotional outbursts or changes in decisions during conversations, are crucial for understanding shifts in human behavior and their consequences.
Our work introduces a novel problem setting focusing on these moments as turning points (TPs).
We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points.
We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points.
- Score: 30.9157728847139
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
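The abstract outlines a two-stage design: a vision-language model first turns the video into a textual narrative, and a language model then classifies and detects turning points in that narrative. Below is a minimal, hypothetical sketch of such a pipeline; the `vlm_describe` and `llm_label` helpers are placeholders standing in for the actual models, not the paper's API.

```python
# Hypothetical sketch of a "narrative then classify" pipeline in the spirit of
# TPMaven: a vision-language model captions video clips into a narrative, and a
# language model labels turning points in that narrative. The helper functions
# are placeholders, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class TurningPoint:
    timestamp: float          # seconds into the conversation
    description: str          # what changed (emotion, decision, perspective, ...)

def vlm_describe(clip_path: str) -> str:
    """Placeholder: a vision-language model would caption this clip."""
    return f"[caption of {clip_path}]"

def llm_label(narrative: str) -> list[TurningPoint]:
    """Placeholder: an LLM would classify/detect turning points in the narrative."""
    return []

def detect_turning_points(clip_paths: list[str]) -> list[TurningPoint]:
    # Stage 1: build a textual narrative from the video, clip by clip.
    narrative = "\n".join(vlm_describe(p) for p in clip_paths)
    # Stage 2: ask a language model to locate and describe turning points.
    return llm_label(narrative)

if __name__ == "__main__":
    print(detect_turning_points(["clip_000.mp4", "clip_001.mp4"]))
```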
Related papers
- KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension [31.283133365170052]
We introduce the task of Semantic Keypoint comprehension, which aims to comprehend keypoints across different task scenarios.
We also introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy.
KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations.
arXiv Detail & Related papers (2024-11-04T06:42:24Z)
- KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [44.879418596312554]
We present a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos.
Trained on the MIMIC-IT dataset, the Otter model demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning.
We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- HIINT: Historical, Intra- and Inter-personal Dynamics Modeling with Cross-person Memory Transformer [38.92436852096451]
The Cross-person Memory Transformer (CPM-T) framework is able to explicitly model affective dynamics.
The CPM-T framework maintains memory modules to store and update the contexts within the conversation window.
We evaluate the effectiveness and generalizability of our approach on three publicly available datasets for joint engagement, rapport, and human beliefs prediction tasks.
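As a rough illustration of per-person memory maintained over a conversation window (an assumption about the mechanism, not the CPM-T implementation), a minimal sketch might look like this:

```python
# Toy illustration (not the CPM-T implementation): a per-speaker memory that
# stores and updates context representations within a sliding conversation window.
from collections import defaultdict, deque

class ConversationMemory:
    def __init__(self, window_size: int = 8):
        self.window_size = window_size
        # Each speaker gets a bounded buffer; old contexts fall out automatically.
        self.memory = defaultdict(lambda: deque(maxlen=window_size))

    def update(self, speaker: str, context_vector: list[float]) -> None:
        # Keep only the most recent `window_size` contexts for this speaker.
        self.memory[speaker].append(context_vector)

    def read(self, speaker: str) -> list[list[float]]:
        # Return the stored contexts for downstream cross-person attention.
        return list(self.memory[speaker])
```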
arXiv Detail & Related papers (2023-05-21T06:43:35Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
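One common way to estimate how much a model relies on a given modality is to compare its accuracy on intact inputs with its accuracy after that modality is shuffled across examples; the sketch below illustrates this idea and is not the paper's exact formulation of the perceptual score.

```python
# Illustration of a permutation-based modality-reliance score, in the spirit of
# the perceptual score (not the paper's exact definition): compare accuracy on
# real inputs against accuracy when one modality is shuffled across examples.
import random

def accuracy(model, images, texts, labels) -> float:
    preds = [model(img, txt) for img, txt in zip(images, texts)]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def modality_reliance(model, images, texts, labels, modality: str = "image") -> float:
    base = accuracy(model, images, texts, labels)
    if modality == "image":
        degraded = accuracy(model, random.sample(images, len(images)), texts, labels)
    else:
        degraded = accuracy(model, images, random.sample(texts, len(texts)), labels)
    # A large drop means the model genuinely "perceives" this modality.
    return base - degraded
```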
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task.
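A toy illustration of the kind of per-reference variance reporting argued for here (an assumed workflow, not the paper's evaluation code): score an output against each human reference separately and report the spread rather than a single number.

```python
# Toy illustration (not the paper's evaluation code): score a generated caption
# against each human reference separately, then report mean and standard deviation.
import statistics

def token_f1(candidate: str, reference: str) -> float:
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def score_with_variance(candidate: str, references: list[str]) -> tuple[float, float]:
    scores = [token_f1(candidate, ref) for ref in references]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread
```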
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- MOPT: Multi-Object Panoptic Tracking [33.77171216778909]
We introduce a novel perception task denoted as multi-object panoptic tracking (MOPT).
MOPT allows for exploiting pixel-level semantic information of 'thing' and 'stuff' classes, temporal coherence, and pixel-level associations over time.
We present extensive quantitative and qualitative evaluations of both vision-based and LiDAR-based MOPT that demonstrate encouraging results.
arXiv Detail & Related papers (2020-04-17T11:45:28Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.