IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants
- URL: http://arxiv.org/abs/2511.19684v1
- Date: Mon, 24 Nov 2025 20:45:17 GMT
- Title: IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants
- Authors: Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, Jörg Krüger
- Abstract summary: IndEgo is a multimodal egocentric and exocentric dataset addressing common industrial tasks. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings. A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks.
- Score: 7.869752673792282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo
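For readers who want the data locally, below is a minimal sketch using the Hugging Face Hub client. The repository id comes from the abstract; the target directory is a hypothetical choice, and the internal file layout of the release is not described here.
```python
# Minimal sketch: fetch the IndEgo release from the Hugging Face Hub.
# repo_id comes from the abstract; local_dir is a hypothetical choice.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="FraunhoferIPK/IndEgo",
    repo_type="dataset",
    local_dir="./IndEgo",
)
print(f"IndEgo files downloaded to: {local_path}")
```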
Related papers
- Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z)
- Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views [5.723697351415207]
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions.
arXiv Detail & Related papers (2025-10-26T13:27:59Z)
- Leverage Task Context for Object Affordance Ranking [57.59106517732223]
We build the first large-scale task-oriented affordance ranking dataset with 25 common tasks, over 50k images, and more than 661k objects.
Results demonstrate the feasibility of the task context based affordance learning paradigm and the superiority of our model over state-of-the-art models in the fields of saliency ranking and multimodal object detection.
arXiv Detail & Related papers (2024-11-25T04:22:33Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful with classification tasks that have little or non-overlapping annotations.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
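The summary leaves the matching objective unspecified; as one plausible instantiation, the sketch below matches feature distributions from two task branches with an RBF-kernel MMD loss, a common choice for distribution matching. The function names and dimensions are illustrative, not the paper's.
```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two feature batches."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)        # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Usage: features produced by two task branches on the same unlabeled inputs;
# minimising the MMD pulls their feature distributions together.
feats_task_a = torch.randn(128, 64)
feats_task_b = torch.randn(128, 64)
loss = rbf_mmd(feats_task_a, feats_task_b)
```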
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Weakly Supervised Multi-Task Representation Learning for Human Activity Analysis Using Wearables [2.398608007786179]
We propose a weakly supervised multi-output siamese network that learns to map the data into multiple representation spaces.
The representations of the data samples are positioned in each space such that samples with the same semantic meaning in that aspect are located close to each other.
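A minimal sketch of what such a multi-output siamese setup could look like, assuming a shared encoder with one normalised projection head per semantic aspect and a standard contrastive pair loss; none of these architectural details are given in the summary.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiOutputSiamese(nn.Module):
    """Shared encoder with one projection head per semantic aspect
    (hypothetical dimensions; the summary gives no architecture details)."""
    def __init__(self, in_dim=64, emb_dim=32, n_aspects=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(128, emb_dim) for _ in range(n_aspects))

    def forward(self, x):
        h = self.encoder(x)
        # One normalised embedding per aspect (one representation space each).
        return [F.normalize(head(h), dim=-1) for head in self.heads]

def pair_loss(za, zb, same, margin=0.5):
    # Pull pairs with the same semantic label in this aspect together,
    # push differing pairs at least `margin` apart.
    d = (za - zb).pow(2).sum(-1)
    return torch.where(same, d, F.relu(margin - d.sqrt()).pow(2)).mean()
```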
arXiv Detail & Related papers (2023-08-06T08:20:07Z)
- ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Do I Have Your Attention: A Large Scale Engagement Prediction Dataset and Baselines [9.896915478880635]
The degree of concentration, enthusiasm, optimism, and passion displayed by individual(s) while interacting with a machine is referred to as 'user engagement'.
To create engagement prediction systems that can work in real-world conditions, it is essential to learn from rich, diverse datasets.
We propose EngageNet, a large-scale, multi-faceted engagement-in-the-wild dataset.
arXiv Detail & Related papers (2023-02-01T13:25:54Z)
- Egocentric Video Task Translation [109.30649877677257]
We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition.
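As a rough illustration of that flipped design, the sketch below keeps several pretrained task-specific backbones frozen and trains only a small shared translator over their output tokens; the transformer translator, feature size, and heads are assumptions, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class EgoT2Sketch(nn.Module):
    """Frozen task-specific backbones plus one shared translator (a rough
    reading of the flipped design; sizes and the transformer are guesses)."""
    def __init__(self, backbones, feat_dim=256, n_tasks=3):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)      # pretrained per task
        for b in self.backbones:
            b.requires_grad_(False)                    # backbones stay frozen
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(feat_dim, feat_dim)
                                   for _ in range(n_tasks))

    def forward(self, video, task_id):
        # Each backbone contributes one feature token; only the translator
        # and heads are trained, so heterogeneous tasks can share synergies.
        tokens = torch.stack([b(video) for b in self.backbones], dim=1)
        fused = self.translator(tokens)                # (B, n_tasks, feat_dim)
        return self.heads[task_id](fused.mean(dim=1))
```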
arXiv Detail & Related papers (2022-12-13T00:47:13Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as a strong foundation backbone for a wide range of tasks but also works as a probing tool for analyzing task relationships.
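The core mechanism, prepending a task prefix so a single shared model can be conditioned on (and probed about) each task, can be sketched in a few lines; the bracketed prefix format below is an illustrative convention, not the paper's exact one.
```python
# Illustrative sketch of task-prefix conditioning: one shared model sees the
# task name as a textual prefix. The "[task]" format is a hypothetical choice.
def with_task_prefix(task_name: str, text: str) -> str:
    return f"[{task_name}] {text}"

batch = [
    with_task_prefix("nli", "premise: A man plays guitar. hypothesis: Music is playing."),
    with_task_prefix("qa", "question: Who plays guitar? context: A man plays guitar."),
]
for example in batch:
    print(example)  # the prefix lets the model route and relate the tasks
```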
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- Taskology: Utilizing Task Relations at Scale [28.09712466727001]
We show that we can leverage the inherent relationships among collections of tasks when they are trained jointly.
Explicitly utilizing the relationships between tasks improves their performance while dramatically reducing the need for labeled data.
We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and ego-motion estimation, and object tracking and 3D detection in point clouds.
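One concrete form of such a task relation is the analytic link between depth and surface normals: normals derived from a predicted depth map should agree with a normal head's predictions, yielding a label-free consistency loss. The finite-difference sketch below is a simplified illustration under that assumption, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    # depth: (B, 1, H, W); finite differences approximate the depth gradient.
    dz_dx = F.pad(depth[:, :, :, 1:] - depth[:, :, :, :-1], (0, 1))
    dz_dy = F.pad(depth[:, :, 1:, :] - depth[:, :, :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)                # unit normals, (B, 3, H, W)

def depth_normal_consistency(pred_depth, pred_normals):
    # Cosine disagreement between normals derived from the depth head and
    # the (unit-length) normal head's output; needs no ground-truth labels.
    cos = (normals_from_depth(pred_depth) * pred_normals).sum(dim=1)
    return (1.0 - cos).mean()
```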
arXiv Detail & Related papers (2020-05-14T22:53:46Z)