J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
- URL: http://arxiv.org/abs/2510.21761v1
- Date: Mon, 13 Oct 2025 04:53:46 GMT
- Title: J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
- Authors: Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino
- Abstract summary: J-ORA is a novel dataset that bridges the gap in robot perception by providing detailed object attribute annotations. It supports three critical perception tasks: object identification, reference resolution, and next-action prediction.
- Score: 55.8311080124569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks (object identification, reference resolution, and next-action prediction) by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared to evaluations without such attributes. Despite this improvement, a performance gap remains between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object functionality and contextual relationships across different VLMs. These findings underscore the importance of rich, context-sensitive attribute annotations in advancing robot perception in dynamic environments. See project page at https://jatuhurrra.github.io/J-ORA/.
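To make the attribute template concrete, the following is a minimal, hypothetical Python sketch of what one J-ORA-style annotation record and its conversion into prompt text could look like. The class names, field names, and the `attributes_to_prompt` helper are illustrative assumptions based only on the attributes named in the abstract (category, color, shape, size, material, spatial relations), not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectAnnotation:
    """One annotated object in a scene (hypothetical schema)."""
    object_id: str                     # unique identifier within the scene
    category: str                      # e.g., "cup"
    color: str                         # e.g., "red"
    shape: str                         # e.g., "cylindrical"
    size: str                          # e.g., "small"
    material: str                      # e.g., "plastic"
    spatial_relations: List[str] = field(default_factory=list)  # e.g., ["left of the plate"]

@dataclass
class DialogueTurnSample:
    """One dialogue turn pairing an image with Japanese dialogue and task labels (hypothetical)."""
    image_path: str                    # scene image for this turn
    utterance_ja: str                  # Japanese user utterance
    objects: List[ObjectAnnotation] = field(default_factory=list)
    referent_id: Optional[str] = None  # label for reference resolution
    next_action: Optional[str] = None  # label for next-action prediction

def attributes_to_prompt(obj: ObjectAnnotation) -> str:
    """Flatten one annotation into text that could be prepended to a VLM prompt,
    mirroring a 'with object attributes' evaluation condition."""
    rels = "; ".join(obj.spatial_relations) or "none"
    return (f"{obj.object_id}: category={obj.category}, color={obj.color}, "
            f"shape={obj.shape}, size={obj.size}, material={obj.material}, "
            f"relations={rels}")
```

Under this assumed schema, the "with attributes" condition would prepend the output of `attributes_to_prompt` for every object in the scene to the VLM prompt, while the "without attributes" condition would provide only the image and the Japanese utterance.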
Related papers
- RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba [86.47790050206306]
RefAVA++ comprises >2.9 million frames and >75.1k annotated persons in total. RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism. Experiments show that RefAtomNet++ establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-10-18T10:41:19Z) - EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing [3.3072144045024396]
EagleVision is an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. We construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks.
arXiv Detail & Related papers (2025-03-30T06:13:13Z) - Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z) - Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
arXiv Detail & Related papers (2025-03-05T19:36:43Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Task Me Anything [72.810309406219]
This paper produces a benchmark tailored to a user's needs. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating perceptual capabilities.
arXiv Detail & Related papers (2024-06-17T17:32:42Z) - Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
arXiv Detail & Related papers (2023-11-29T17:54:22Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves unsupervised object discovery performance competitive with a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)