Related papers: CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model

URL: http://arxiv.org/abs/2503.19281v1
Date: Tue, 25 Mar 2025 02:23:47 GMT
Title: CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model
Authors: Feiyang Wang, Xiaomin Yu, Wangyu Wu
Abstract summary: We introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3x3 Rubik's Cubes. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries. In low-level Rubik's Cube restoration tasks, CubeRobot achieved an accuracy rate of 100%, matched by 100% in medium-level tasks, and achieved an accuracy rate of 80% in high-level tasks.
Score: 1.644433638087587
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Proving Rubik's Cube theorems at the high level represents a notable milestone in human-level spatial imagination and logical reasoning. Traditional Rubik's Cube robots, relying on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios. To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3x3 Rubik's Cubes, empowering embodied agents with multimodal understanding and execution capabilities. We used the CubeCoT image dataset, which contains multiple-level tasks (43 subtasks in total) that humans are unable to handle, encompassing various cube states. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, thus enabling CubeRobot to carry out independent planning, decision-making, reflection, and separate management of high- and low-level Rubik's Cube tasks. Furthermore, in low-level Rubik's Cube restoration tasks, CubeRobot achieved an accuracy rate of 100%, matched by 100% in medium-level tasks, and achieved an accuracy rate of 80% in high-level tasks.

Related papers

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z)
Node Classification and Search on the Rubik's Cube Graph with GNNs [55.2480439325792]
This study focuses on the application of deep geometric models to solve the 3x3x3 Rubik's Cube. We begin by discussing the cube's graph representation and defining distance as the model's optimization objective. The distance approximation task is reformulated as a node classification problem, effectively addressed using Graph Neural Networks (GNNs).
arXiv Detail & Related papers (2025-01-30T18:52:43Z)
CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution [55.94314421887744]
Lightweight image super-resolution (SR) methods aim at increasing the resolution and restoring the details of an image using a lightweight neural network. Our analysis reveals that these methods are hindered by constrained feature diversity, which adversely impacts feature representation and detail recovery. We propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness by completing holistic information aggregation.
arXiv Detail & Related papers (2024-12-03T08:02:26Z)
Solving a Rubik's Cube Using its Local Graph Structure [13.219469732742354]
A Rubik's Cube has six faces and twelve possible actions, leading to a small and unconstrained action space. A Rubik's Cube can be represented as a graph, where states of the cube are nodes and actions are edges. Drawing on graph convolutional networks, we design a new search algorithm to find the solution to a scrambled Rubik's Cube.
arXiv Detail & Related papers (2024-08-15T05:39:52Z)
Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
Towards Learning Rubik's Cube with N-tuple-based Reinforcement Learning [0.0]
This work describes in detail how to learn and solve the Rubik's cube game (or puzzle) in the General Board Game (GBG) learning and playing framework. We describe the cube's state representation, how to transform it with twists, whole-cube rotations, and color transformations, and explain the use of symmetries in Rubik's cube.
arXiv Detail & Related papers (2023-01-28T11:38:10Z)
Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning. Experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach [59.89749342550104]
We propose the task of hyper-relational extraction to extract more specific and complete facts from text. Existing models cannot perform hyper-relational extraction as it requires a model to consider the interaction between three entities. We propose CubeRE, a cube-filling model that is inspired by table-filling approaches and explicitly considers the interaction between relation triplets and qualifiers.
arXiv Detail & Related papers (2022-11-18T03:51:28Z)
Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision. Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task. Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
Benchmarking Robot Manipulation with the Rubik's Cube [15.922643222904172]
We propose Rubik's cube manipulation as a benchmark to measure simultaneous performance of precise manipulation and sequential manipulation. We present a protocol for quantitatively measuring both the accuracy and speed of Rubik's cube manipulation. We demonstrate this protocol for two distinct baseline approaches on a PR2 robot.
arXiv Detail & Related papers (2022-02-14T22:34:18Z)
CubeTR: Learning to Solve The Rubiks Cube Using Transformers [0.0]
The Rubiks cube has a single solved state for quintillions of possible configurations which leads to extremely sparse rewards. The proposed model CubeTR attends to longer sequences of actions and addresses the problem of sparse rewards.
arXiv Detail & Related papers (2021-11-11T03:17:28Z)
Self-Supervision is All You Need for Solving Rubik's Cube [0.0]
This work introduces a simple and efficient deep learning method for solving problems with a predefined goal, represented by Rubik's Cube. We demonstrate that, for such problems, training a deep neural network on random scrambles branching from the goal state is sufficient to achieve near-optimal solutions.
arXiv Detail & Related papers (2021-06-06T15:38:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.