J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
- URL: http://arxiv.org/abs/2403.19259v1
- Date: Thu, 28 Mar 2024 09:32:43 GMT
- Title: J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
- Authors: Nobuhiro Ueda, Hideko Habe, Yoko Matsui, Akishige Yuguchi, Seiya Kawano, Yasutomo Kawanishi, Sadao Kurohashi, Koichiro Yoshino
- Abstract summary: In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views.
We propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3).
Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home.
- Score: 22.911318874589448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding expressions that refer to the physical world is crucial for human-assisting systems in the real world, such as robots that must perform actions expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include not only direct reference relations but also indirect ones, such as predicate-argument structures and bridging references. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.
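To make the annotation scheme concrete, here is a minimal sketch of how one such crossmodal tag might be represented in code. The field names, relation labels, and values below are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BoundingBox:
    frame_id: int                    # video frame the box belongs to
    xyxy: Tuple[int, int, int, int]  # pixel coordinates (x1, y1, x2, y2)
    label: str                       # object class, e.g. "cup"

@dataclass
class CrossmodalTag:
    utterance_id: int  # which utterance the phrase occurs in
    phrase: str        # referring phrase in the utterance
    relation: str      # "direct", "predicate-argument", or "bridging"
    box: BoundingBox   # grounded object region in the egocentric video

# A direct reference: the phrase "that cup" grounds to a box in frame 120.
tag = CrossmodalTag(
    utterance_id=3,
    phrase="that cup",
    relation="direct",
    box=BoundingBox(frame_id=120, xyxy=(310, 180, 380, 260), label="cup"),
)
```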
Related papers
- 'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges [65.03196674816772]
Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee.
Addressees usually detect such ambiguities immediately and work with the speaker to repair them using meta-communicative Clarification Exchanges (CEs): a Clarification Request (CR) and a response.
Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models.
arXiv Detail & Related papers (2023-07-28T13:44:33Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
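As a rough illustration of the encoding step, the sketch below obtains dialogue-context features from pre-trained DialoGPT hidden states via the Hugging Face transformers library; it is not the paper's actual hierarchical architecture, and the model checkpoint used here is an assumption.

```python
# Minimal sketch: pre-trained DialoGPT hidden states as dialogue-context
# features (assumes the `transformers` library and downloadable weights).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModel.from_pretrained("microsoft/DialoGPT-small")

# Join dialogue turns with the end-of-sequence token, as is customary
# for DialoGPT inputs.
turns = ["I want to book a table for two.", "Sure, which restaurant?"]
text = tokenizer.eos_token.join(turns) + tokenizer.eos_token

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# A downstream slot tagger or response decoder could consume `hidden`
# as its shared context representation.
print(hidden.shape)
```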
arXiv Detail & Related papers (2023-05-27T10:06:03Z)
- Reference Resolution and Context Change in Multimodal Situated Dialogue for Exploring Data Visualizations [3.5813777917429515]
We focus on resolving references to visualizations on a large screen display in multimodal dialogue.
We describe our annotations for user references to visualizations appearing on a large screen via language and hand gesture.
We report results on detecting and resolving references, the effectiveness of contextual information for the model, and under-specified requests for creating visualizations.
arXiv Detail & Related papers (2022-09-06T04:43:28Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose RoME, a novel mixture-of-experts transformer that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
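The sketch below illustrates the general idea of transformer-based cross-modal attention at global and local levels; it is not RoME's actual implementation, and all dimensions and pooling choices are placeholders.

```python
# Illustrative sketch (not RoME's code): cross-attention between text
# and video embeddings, plus a global-level similarity for retrieval.
import torch
import torch.nn as nn

d = 256
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

text = torch.randn(1, 12, d)   # 12 token embeddings (local text level)
video = torch.randn(1, 30, d)  # 30 frame embeddings (local video level)

# Local level: text tokens attend over video frames to build
# video-aware text features.
fused_local, _ = cross_attn(query=text, key=video, value=video)

# Global level: pooled sentence/clip vectors compared directly,
# e.g. with cosine similarity as a retrieval score.
text_global = text.mean(dim=1)
video_global = video.mean(dim=1)
score = torch.cosine_similarity(text_global, video_global)
print(fused_local.shape, score.item())
```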
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
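As a schematic illustration of such a decoupled design, the sketch below scores entities, attributes, and spatial relations with separate modules and combines their scores per candidate object; the module internals are toy placeholders, not the paper's components.

```python
# Toy sketch of modular, compositional grounding: each module scores one
# aspect of the referring expression, and the product selects the object
# that satisfies every constraint.
from typing import Dict, List

def entity_score(obj: Dict, noun: str) -> float:
    return 1.0 if obj["category"] == noun else 0.0

def attribute_score(obj: Dict, attr: str) -> float:
    return 1.0 if attr in obj["attributes"] else 0.0

def relation_score(obj: Dict, anchor: Dict, rel: str) -> float:
    if rel == "left of":
        return 1.0 if obj["x"] < anchor["x"] else 0.0
    return 0.5  # unknown relation: uninformative

def ground(objects: List[Dict], anchor: Dict, noun: str, attr: str, rel: str) -> Dict:
    # Multiply module scores so every constraint must be satisfied.
    return max(
        objects,
        key=lambda o: entity_score(o, noun)
        * attribute_score(o, attr)
        * relation_score(o, anchor, rel),
    )

scene = [
    {"category": "mug", "attributes": ["red"], "x": 40},
    {"category": "mug", "attributes": ["blue"], "x": 200},
]
anchor = {"category": "laptop", "x": 120}
print(ground(scene, anchor, "mug", "red", "left of"))  # selects the red mug
```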
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions.
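As a hypothetical illustration of this decomposition (not an example drawn from HybriDialogue), a complex multihop question might break down into simple turns, each grounded in either a table or a text passage:

```python
# Hypothetical decomposition of a multihop question into multiturn
# dialogue, with each turn grounded in one evidence source.
multihop_question = (
    "Which country is the director of the film that won the award in 2010 from?"
)
decomposed_dialogue = [
    {"turn": "Which film won the award in 2010?", "grounding": "table"},
    {"turn": "Who directed that film?", "grounding": "text"},
    {"turn": "Which country is that director from?", "grounding": "text"},
]
```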
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- Back to the Future: Bidirectional Information Decoupling Network for Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
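A minimal sketch of the underlying idea, decoupling past and future dialogue context with directional attention masks, is shown below; it is illustrative only, not BiDeN's actual architecture.

```python
# Illustrative sketch (not BiDeN's code): build separate past-only and
# future-only views of a dialogue with directional attention masks.
import torch

seq_len = 6
# Boolean masks for MultiheadAttention: True marks positions a token is
# *blocked* from attending to. Self-attention is always allowed, so no
# row is fully masked.
past_only = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
future_only = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool), -1)

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, seq_len, 64)  # one embedding per dialogue turn

past_view, _ = attn(x, x, x, attn_mask=past_only)      # sees only earlier turns
future_view, _ = attn(x, x, x, attn_mask=future_only)  # sees only later turns

# The two directional views can then be fused for downstream tasks.
fused = torch.cat([past_view, future_view], dim=-1)
print(fused.shape)  # (1, 6, 128)
```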
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
- YouRefIt: Embodied Reference Understanding with Language and Gesture [95.93218436323481]
We study the understanding of embodied reference, where one agent uses both language and gesture to refer to an object for another agent in a shared physical environment.
The crowd-sourced YouRefIt dataset contains 4,195 unique reference clips in 432 indoor scenes.
arXiv Detail & Related papers (2021-09-08T03:27:32Z)