J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
- URL: http://arxiv.org/abs/2510.21761v1
- Date: Mon, 13 Oct 2025 04:53:46 GMT
- Title: J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
- Authors: Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino
- Abstract summary: J-ORA is a novel dataset that bridges the gap in robot perception by providing detailed object attribute annotations. It supports three critical perception tasks: object identification, reference resolution, and next-action prediction.
- Score: 55.8311080124569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks (object identification, reference resolution, and next-action prediction) by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared to evaluations without such attributes. Despite this improvement, a performance gap remains between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object functionality and contextual relationships across different VLMs. These findings underscore the importance of rich, context-sensitive attribute annotations in advancing robot perception in dynamic environments. See project page at https://jatuhurrra.github.io/J-ORA/.
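To make the attribute template concrete, the following is a minimal, hypothetical Python sketch of what one J-ORA-style annotation record and its conversion into prompt text could look like. The class names, field names, and the `attributes_to_prompt` helper are illustrative assumptions based only on the attributes named in the abstract (category, color, shape, size, material, spatial relations), not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectAnnotation:
    """One annotated object in a scene (hypothetical schema)."""
    object_id: str                     # unique identifier within the scene
    category: str                      # e.g., "cup"
    color: str                         # e.g., "red"
    shape: str                         # e.g., "cylindrical"
    size: str                          # e.g., "small"
    material: str                      # e.g., "plastic"
    spatial_relations: List[str] = field(default_factory=list)  # e.g., ["left of the plate"]

@dataclass
class DialogueTurnSample:
    """One dialogue turn pairing an image with Japanese dialogue and task labels (hypothetical)."""
    image_path: str                    # scene image for this turn
    utterance_ja: str                  # Japanese user utterance
    objects: List[ObjectAnnotation] = field(default_factory=list)
    referent_id: Optional[str] = None  # label for reference resolution
    next_action: Optional[str] = None  # label for next-action prediction

def attributes_to_prompt(obj: ObjectAnnotation) -> str:
    """Flatten one annotation into text that could be prepended to a VLM prompt,
    mirroring a 'with object attributes' evaluation condition."""
    rels = "; ".join(obj.spatial_relations) or "none"
    return (f"{obj.object_id}: category={obj.category}, color={obj.color}, "
            f"shape={obj.shape}, size={obj.size}, material={obj.material}, "
            f"relations={rels}")
```

Under this assumed schema, the "with attributes" condition would prepend the output of `attributes_to_prompt` for every object in the scene to the VLM prompt, while the "without attributes" condition would provide only the image and the Japanese utterance.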
Related papers
- RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba [86.47790050206306]
RefAVA++ comprises >2.9 million frames and >75.1k annotated persons in total. RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism. Experiments show that RefAtomNet++ establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-10-18T10:41:19Z) - EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing [3.3072144045024396]
EagleVision is an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. We construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks.
arXiv Detail & Related papers (2025-03-30T06:13:13Z) - Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z) - Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
arXiv Detail & Related papers (2025-03-05T19:36:43Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Task Me Anything [72.810309406219]
This paper produces a benchmark tailored to a user's needs. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating perceptual capabilities.
arXiv Detail & Related papers (2024-06-17T17:32:42Z) - Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
arXiv Detail & Related papers (2023-11-29T17:54:22Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves unsupervised object discovery performance competitive with a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)