OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
- URL: http://arxiv.org/abs/2403.19417v1
- Date: Thu, 28 Mar 2024 13:47:19 GMT
- Title: OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
- Authors: Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, Cewu Lu
- Abstract summary: OAKINK2 is a dataset of bimanual object manipulation tasks for complex daily activities.
It introduces three levels of abstraction to organize the manipulation tasks: Affordance, Primitive Task, and Complex Task.
The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands, and various interacting objects.
- Score: 39.14950571922401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily activities. To organize complex tasks into a structured representation, OAKINK2 introduces three levels of abstraction for the manipulation tasks: Affordance, Primitive Task, and Complex Task. OAKINK2 takes an object-centric perspective on decoding complex tasks, treating them as a sequence of object affordance fulfillments. The first level, Affordance, outlines the functionalities that objects in the scene can afford; the second level, Primitive Task, describes the minimal units of interaction through which humans fulfill an object's affordance; and the third level, Complex Task, illustrates how Primitive Tasks are composed and interdependent. The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands, and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. Based on the 3-level abstraction of OAKINK2, we explore a task-oriented framework for Complex Task Completion (CTC). CTC aims to generate a sequence of bimanual manipulations to achieve task objectives. Within the CTC framework, we employ Large Language Models (LLMs) to decompose the complex task objectives into sequences of Primitive Tasks and have developed a Motion Fulfillment Model that generates bimanual hand motion for each Primitive Task. OAKINK2 datasets and models are available at https://oakink.net/v2.
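The three-level abstraction described in the abstract can be pictured as a small data structure. The following is a minimal sketch, not the OAKINK2 toolkit's actual API: the class names, field names, and the kettle example are all hypothetical, and the dependency ordering uses Python's standard `graphlib` as a stand-in for however the authors represent the interdependence of Primitive Tasks. The LLM-based decomposition and the Motion Fulfillment Model are not modeled here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Affordance:
    """Level 1: a functionality that an object in the scene can afford."""
    object_name: str   # hypothetical field name
    function: str      # hypothetical field name


@dataclass(frozen=True)
class PrimitiveTask:
    """Level 2: a minimal interaction unit that fulfills one affordance."""
    name: str
    affordance: Affordance


@dataclass
class ComplexTask:
    """Level 3: Primitive Tasks plus the dependencies between them."""
    objective: str
    primitives: dict[str, PrimitiveTask] = field(default_factory=dict)
    # Maps a primitive task name to the names that must finish before it.
    depends_on: dict[str, set[str]] = field(default_factory=dict)

    def add(self, task: PrimitiveTask, after: tuple[str, ...] = ()) -> None:
        self.primitives[task.name] = task
        self.depends_on[task.name] = set(after)

    def execution_order(self) -> list[str]:
        # One valid ordering in which every prerequisite comes first.
        return list(TopologicalSorter(self.depends_on).static_order())


def build_example() -> ComplexTask:
    """Hypothetical example task; not taken from the dataset."""
    task = ComplexTask(objective="fill a cup from a kettle")
    task.add(PrimitiveTask("open_lid", Affordance("kettle", "open")))
    task.add(PrimitiveTask("pour_water", Affordance("kettle", "pour")),
             after=("open_lid",))
    task.add(PrimitiveTask("close_lid", Affordance("kettle", "close")),
             after=("pour_water",))
    return task


if __name__ == "__main__":
    print(build_example().execution_order())
```

Under this sketch, a Complex Task Completion pipeline would iterate over `execution_order()` and hand each Primitive Task to a motion generator in turn; that step is left out because the abstract does not specify its interface.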
Related papers
- Leverage Task Context for Object Affordance Ranking [57.59106517732223]
We build the first large-scale task-oriented affordance ranking dataset with 25 common tasks, over 50k images and more than 661k objects.
Results demonstrate the feasibility of the task context based affordance learning paradigm and the superiority of our model over state-of-the-art models in the fields of saliency ranking and multimodal object detection.
arXiv Detail & Related papers (2024-11-25T04:22:33Z)
- DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is the challenging task of processing real-world scene point clouds.
Recent state-of-the-art methods first segment each object and then process them independently with multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding [44.206222326616526]
TACO is an extensive dataset spanning a large variety of tool-action-object compositions for daily human activities.
TACO contains 2.5K motion sequences paired with third-person and egocentric views, precise hand-object 3D meshes, and action labels.
We benchmark three generalizable hand-object-interaction tasks: compositional action recognition, generalizable hand-object motion forecasting, and cooperative grasp synthesis.
arXiv Detail & Related papers (2024-01-16T14:41:42Z)
- Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks.
With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
- ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.