PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
- URL: http://arxiv.org/abs/2505.09990v2
- Date: Fri, 16 May 2025 18:37:26 GMT
- Title: PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
- Authors: Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, Rose Hendrix, Noah A. Smith, Fei Xia, Dieter Fox, Ranjay Krishna
- Abstract summary: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios.
- Score: 79.80132157576978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations between stages, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/
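The abstract does not reproduce the scoring code, but both evaluation modes admit compact illustrations: Point-Bench-style accuracy reduces to checking whether a predicted point lands inside a ground-truth target region, and Point-Battle-style pairwise votes can be aggregated into a ranking with a standard Elo update. The sketch below is a hypothetical illustration under those assumptions, not the authors' implementation; all function names are invented.

```python
import numpy as np

def point_in_mask(point_xy, mask):
    """True if a predicted (x, y) pixel lands inside the target mask.

    mask is a boolean array of shape (H, W); out-of-bounds points miss.
    """
    x, y = int(round(point_xy[0])), int(round(point_xy[1]))
    h, w = mask.shape
    return 0 <= x < w and 0 <= y < h and bool(mask[y, x])

def pointing_accuracy(points, masks):
    """Fraction of tasks whose predicted point hits the target region."""
    hits = [point_in_mask(p, m) for p, m in zip(points, masks)]
    return sum(hits) / len(hits)

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """One Elo rating update from a single blind pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta
```

Under this reading, the cross-stage correlation claim means that models scoring well under `pointing_accuracy` also tend to accumulate higher pairwise ratings from human votes and to succeed more often in the robotic setting.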
Related papers
- MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping [8.558823208942277]
MultiGraspNet is a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline.
arXiv Detail & Related papers (2026-02-06T08:56:21Z) - Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments [11.490236862362801]
We study the ability of models to identify the most important sub-events in a video. We evaluate models on their ability to distinguish between important and non-important sub-events in a game.
arXiv Detail & Related papers (2026-01-22T21:40:08Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding [39.64540328712615]
Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. Existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations. We introduce Point-It-Out, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding.
arXiv Detail & Related papers (2025-09-30T05:05:54Z) - Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [25.31489336119893]
We systematically review the applications of multimodal fusion in key robotic vision tasks. We compare vision-language models (VLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. We identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation.
arXiv Detail & Related papers (2025-04-03T10:53:07Z) - VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
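As a concrete reading of how a process reward model is typically used at inference time, the sketch below shows best-of-N selection: each candidate response is split into reasoning steps, the PRM scores each step's correctness, and the response with the best aggregated score wins. This is a minimal hypothetical sketch; `prm` is a stand-in callable, and min/mean aggregation are common conventions rather than VisualPRM's confirmed recipe.

```python
def best_of_n(candidates, prm, aggregate=min):
    """Select the response whose reasoning steps the PRM rates highest.

    candidates : list of responses, each a list of reasoning-step strings.
    prm        : callable mapping steps -> per-step correctness probabilities.
    aggregate  : reduces per-step scores to one response score (min or mean).
    """
    return max(candidates, key=lambda steps: aggregate(prm(steps)))

# Toy stand-in PRM for demonstration: rate every step as 0.5 except
# steps containing an explicit equation, which score higher.
toy_prm = lambda steps: [0.9 if "=" in s else 0.5 for s in steps]
best = best_of_n(
    [["Guess the answer is 7."], ["2 + 2 = 4", "4 + 3 = 7", "So 7."]],
    toy_prm,
)
```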
arXiv Detail & Related papers (2025-03-13T12:03:37Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - Multimodal CLIP Inference for Meta-Few-Shot Image Classification [0.0]
Multimodal foundation models like CLIP learn a joint (image, text) embedding.
This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks.
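A minimal sketch of the general idea follows: build per-class prototypes from the few labeled support images, fuse them with the text embedding of each class name, and classify queries by cosine similarity in CLIP's joint space. The fusion weight `alpha` and the specific fusion rule are illustrative assumptions, not the paper's exact method, and the embeddings are presumed to come from a frozen CLIP encoder.

```python
import torch
import torch.nn.functional as F

def classify_few_shot(query_emb, support_embs, text_embs, alpha=0.5):
    """Few-shot classification by fusing CLIP image and text modalities.

    query_emb    : (D,) CLIP embedding of the query image.
    support_embs : (C, K, D) embeddings of K support images per class.
    text_embs    : (C, D) embeddings of each class name's text prompt.
    Returns the index of the predicted class.
    """
    img_proto = F.normalize(support_embs.mean(dim=1), dim=-1)  # (C, D)
    txt_proto = F.normalize(text_embs, dim=-1)                 # (C, D)
    # alpha = 1 is a pure image-prototype classifier; alpha = 0 is
    # zero-shot CLIP. Mixing the two is the multimodal combination.
    proto = F.normalize(alpha * img_proto + (1 - alpha) * txt_proto, dim=-1)
    query = F.normalize(query_emb, dim=-1)
    return int((proto @ query).argmax())
```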
arXiv Detail & Related papers (2024-03-26T17:47:54Z) - ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even solving tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
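The pairwise variant has a direct expression with stock PyTorch attention: every point/voxel/BEV feature attends to every other one, and the context-enriched features come back at the same shape so they can be dropped into an existing detector head. The residual-plus-norm wiring and head count below are illustrative assumptions; the deformable-sampling variant is not shown.

```python
import torch
import torch.nn as nn

class PairwiseSelfAttention(nn.Module):
    """Full pairwise self-attention over a set of 3D feature vectors."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (batch, num_elements, dim), e.g. flattened voxel features.
        ctx, _ = self.attn(feats, feats, feats)  # queries = keys = values
        return self.norm(feats + ctx)            # residual + layer norm

# Usage: enrich 256-dim backbone features with global context.
x = torch.randn(2, 1024, 256)
y = PairwiseSelfAttention(dim=256)(x)  # same shape as x
```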
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.