Are you Struggling? Dataset and Baselines for Struggle Determination in
Assembly Videos
- URL: http://arxiv.org/abs/2402.11057v2
- Date: Wed, 28 Feb 2024 16:42:12 GMT
- Title: Are you Struggling? Dataset and Baselines for Struggle Determination in
Assembly Videos
- Authors: Shijia Feng, Michael Wray, Brian Sullivan, Youngkyoon Jang, Casimir
Ludwig, Iain Gilchrist, and Walterio Mayol-Cuevas
- Abstract summary: We present a new dataset with three assembly activities and corresponding performance baselines for the determination of struggle from video.
Video segments were scored w.r.t. the level of struggle as perceived by annotators using a forced choice 4-point scale.
The dataset is the first struggle annotation dataset and contains 5.1 hours of video and 725,100 frames from 73 participants in total.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Determining when people are struggling from video enables a finer-grained
understanding of actions and opens opportunities for building intelligent
support visual interfaces. In this paper, we present a new dataset with three
assembly activities and corresponding performance baselines for the
determination of struggle from video. Three real-world problem-solving
activities including assembling plumbing pipes (Pipes-Struggle), pitching
camping tents (Tent-Struggle) and solving the Tower of Hanoi puzzle
(Tower-Struggle) are introduced. Video segments were scored w.r.t. the level of
struggle as perceived by annotators using a forced choice 4-point scale. Each
video segment was annotated by a single expert annotator in addition to
crowd-sourced annotations. The dataset is the first struggle annotation dataset
and contains 5.1 hours of video and 725,100 frames from 73 participants in
total. We evaluate three decision-making tasks: struggle classification,
struggle level regression, and struggle label distribution learning. We provide
baseline results for each of the tasks utilising several mainstream deep neural
networks, along with an ablation study and visualisation of results. Our work
is motivated by assistive systems that analyze struggle, support users
during manual activities, and encourage learning, as well as by other video
understanding competencies.
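The third task, struggle label distribution learning, trains a model to match the full distribution of annotator ratings on the 4-point scale rather than a single label. A minimal sketch of the idea is below; the function names, the example ratings, and the use of a KL-divergence loss are illustrative assumptions, not details taken from the paper.

```python
import math

def label_distribution(scores, num_levels=4):
    """Turn crowd-sourced 4-point struggle ratings for one video
    segment into a normalised histogram over levels 1..4."""
    counts = [0] * num_levels
    for s in scores:
        counts[s - 1] += 1
    total = len(scores)
    return [c / total for c in counts]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), a common training objective in
    label distribution learning; eps guards against log(0)."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
    )

# Example: five hypothetical annotators rated one video segment.
target = label_distribution([2, 2, 3, 2, 4])   # [0.0, 0.6, 0.2, 0.2]
predicted = [0.05, 0.55, 0.25, 0.15]           # hypothetical model output
loss = kl_divergence(target, predicted)        # lower is better
```

Matching the distribution, rather than the majority label, preserves annotator disagreement, which is informative when perceived struggle is genuinely ambiguous.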
Related papers
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- 5th Place Solution for YouTube-VOS Challenge 2022: Video Object Segmentation [4.004851693068654]
Video object segmentation (VOS) has made significant progress with the rise of deep learning.
Similar objects are easily confused and tiny objects are difficult to find.
We propose a simple yet effective solution for this task.
arXiv Detail & Related papers (2022-06-20T06:14:27Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Is this Harmful? Learning to Predict Harmfulness Ratings from Video [15.059547998989537]
We create a dataset of approximately 4000 video clips, annotated by professionals in the field.
We conduct an in-depth study on our modeling choices and find that we greatly benefit from combining the visual and audio modality.
Our dataset will be made available upon publication.
arXiv Detail & Related papers (2021-06-15T17:57:12Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Representation learning from videos in-the-wild: An object-centric approach [40.46013713992305]
We propose a method to learn image representations from uncurated videos.
We combine a supervised loss from off-the-shelf object detectors and self-supervised losses which naturally arise from the video-shot-frame-object hierarchy present in each video.
arXiv Detail & Related papers (2020-10-06T15:17:45Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation [100.68317848808327]
We present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis.
The COIN dataset contains 11,827 videos of 180 tasks in 12 domains related to daily life.
With a newly developed toolbox, all videos are efficiently annotated with a series of step labels and the corresponding temporal boundaries.
arXiv Detail & Related papers (2020-03-20T16:59:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.