NarrationBot and InfoBot: A Hybrid System for Automated Video
Description
- URL: http://arxiv.org/abs/2111.03994v1
- Date: Sun, 7 Nov 2021 04:13:30 GMT
- Title: NarrationBot and InfoBot: A Hybrid System for Automated Video
Description
- Authors: Shasta Ihorn, Yue-Ting Siu, Aditya Bodi, Lothar Narins, Jose M.
Castanon, Yash Kant, Abhishek Das, Ilmi Yoon, Pooyan Fazli
- Abstract summary: We develop a hybrid system of two tools that automatically generates descriptions for videos and answers user queries about them.
We show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem.
Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos.
- Score: 9.59921187620835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video accessibility is crucial for blind and low vision users for equitable
engagements in education, employment, and entertainment. Despite the
availability of professional and amateur services and tools, most
human-generated descriptions are expensive and time-consuming. Moreover, the
rate of human-generated descriptions cannot match the speed of video
production. To overcome the increasing gaps in video accessibility, we
developed a hybrid system of two tools to 1) automatically generate
descriptions for videos and 2) provide answers or additional descriptions in
response to user queries on a video. Results from a mixed-methods study with 26
blind and low vision individuals show that our system significantly improved
user comprehension and enjoyment of selected videos when both tools were used
in tandem. In addition, participants reported no significant difference in
their ability to understand videos when presented with autogenerated
descriptions versus human-revised autogenerated descriptions. Our results
demonstrate user enthusiasm about the developed system and its promise for
providing customized access to videos. We discuss the limitations of the
current work and provide recommendations for the future development of
automated video description tools.
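A rough Python sketch of the two-tool design described in the abstract is shown below. The `caption_model` and `vqa_model` callables are hypothetical placeholders standing in for the system's actual captioning and question-answering backends; this illustrates the division of labor, not the authors' implementation.

```python
# Sketch only: two cooperating tools, assuming placeholder captioning/VQA backends.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Description:
    timestamp: float  # seconds into the video
    text: str

def narration_bot(frames: list, timestamps: list, caption_model: Callable) -> list:
    """Tool 1: automatically generate a description for each sampled frame."""
    return [Description(t, caption_model(f)) for f, t in zip(frames, timestamps)]

def info_bot(frame, question: str, vqa_model: Callable) -> str:
    """Tool 2: answer a user's query about the video at a chosen moment."""
    return vqa_model(frame, question)

# Usage with stub models (real systems would replace the lambdas):
descriptions = narration_bot(["frame_0", "frame_30"], [0.0, 30.0],
                             caption_model=lambda f: f"A description of {f}.")
answer = info_bot("frame_30", "What is the person wearing?",
                  vqa_model=lambda f, q: "Stub answer; no model attached.")
```

Keeping the two tools separate mirrors the study design, in which automatic description and on-demand question answering could be used on their own or in tandem.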
Related papers
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension.
Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents [30.228721661677493]
RealVideoQuest is designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries.
It identifies 7.5K real user queries with video response intents and builds 4.5K high-quality query-video pairs.
Experiments indicate that current T2V models struggle with effectively addressing real user queries.
arXiv Detail & Related papers (2025-06-02T13:52:21Z) - Vid2Coach: Transforming How-To Videos into Task Assistants [51.729869497134885]
We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants.
Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step.
It then uses retrieval-augmented generation to extract relevant non-visual workarounds from BLV-specific resources.
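As a loose illustration of the step structure this summary describes, the sketch below augments a narrated step with a demonstration detail, a completion criterion, and a retrieved non-visual workaround. All names are hypothetical, and the retrieval over BLV-specific resources is reduced to a keyword lookup; this is not Vid2Coach's actual interface.

```python
# Illustrative sketch; field names and the stub retrieval are assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    narration: str               # instruction narrated in the how-to video
    demonstration_detail: str    # extra detail observed in the demonstration
    completion_criterion: str    # how to tell the step is done
    workaround: str              # non-visual workaround retrieved for BLV users

def retrieve_workaround(query: str, resources: dict) -> str:
    """Stub retrieval over BLV-specific resources keyed by task keywords."""
    return next((tip for key, tip in resources.items() if key in query),
                "No specific workaround found.")

resources = {"measure": "Use a tactile measuring cup with raised markings."}
step = Step(
    narration="Measure one cup of flour.",
    demonstration_detail="The demonstrator levels the flour with a knife.",
    completion_criterion="The cup is full and leveled off.",
    workaround=retrieve_workaround("measure one cup of flour", resources))
```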
arXiv Detail & Related papers (2025-05-31T21:28:50Z) - VideoMix: Aggregating How-To Videos for Task-Oriented Learning [36.183779096566276]
VideoMix is a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task.
Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips.
arXiv Detail & Related papers (2025-03-27T03:43:02Z) - GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning [62.775721264492994]
GRADEO is one of the first specifically designed video evaluation models.
It grades AI-generated videos for explainable scores and assessments through multi-step reasoning.
Experiments show that our method aligns better with human evaluations than existing methods.
arXiv Detail & Related papers (2025-03-04T07:04:55Z) - ExpertAF: Expert Actionable Feedback from Video [81.46431188306397]
We introduce a novel method to generate actionable feedback from video of a person doing a physical activity.
Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary.
Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching.
arXiv Detail & Related papers (2024-08-01T16:13:07Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
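To make the perception-planning-execution flow above concrete, here is a minimal Python sketch. The function names, data shapes, and heuristics are assumptions for illustration, not RAVA's actual components.

```python
# Assumed three-stage agent loop; every function below is a stand-in.
from dataclasses import dataclass

@dataclass
class ReframePlan:
    aspect_ratio: str  # e.g. "9:16" for a vertical cut
    strategy: str      # e.g. keep the salient object centered

def perceive(instruction: str, video_path: str) -> dict:
    """Stage 1: interpret the user instruction and (stub) video content."""
    return {"instruction": instruction, "video": video_path,
            "salient_object": "person"}  # placeholder perception result

def plan(context: dict) -> ReframePlan:
    """Stage 2: decide on an aspect ratio and a reframing strategy."""
    ratio = "9:16" if "vertical" in context["instruction"] else "1:1"
    return ReframePlan(ratio, f"keep the {context['salient_object']} centered")

def execute(video_path: str, p: ReframePlan) -> str:
    """Stage 3: invoke (stub) editing tools and return an output path."""
    return f"{video_path}.reframed_{p.aspect_ratio.replace(':', 'x')}.mp4"

output = execute("talk.mp4", plan(perceive("make a vertical clip", "talk.mp4")))
```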
arXiv Detail & Related papers (2024-03-10T03:29:56Z) - Shot2Story20K: A New Benchmark for Comprehensive Understanding of
Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show some challenges in generating a long and comprehensive video summary.
arXiv Detail & Related papers (2023-12-16T03:17:30Z) - Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating
Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprising two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
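The two modules can be pictured as a retrieval step followed by structure-guided generation, as in the sketch below. The retrieval index, depth extraction, and generator are all stubbed placeholders, not the paper's actual pipeline.

```python
# Hypothetical two-module sketch: retrieve motion structure, then generate.
def retrieve_motion_structure(query: str, index: dict) -> list:
    """Module 1 (stub): look up a reference clip and return per-frame depth maps."""
    clip = index.get(query, "default_clip")
    return [f"{clip}_depth_{i:03d}" for i in range(3)]  # stand-in depth sequence

def synthesize_video(prompt: str, structure: list, character: str) -> str:
    """Module 2 (stub): generate video guided by the structure and a character spec."""
    return f"video(prompt={prompt!r}, frames={len(structure)}, character={character})"

index = {"a person walking on a beach": "beach_walk"}
depths = retrieve_motion_structure("a person walking on a beach", index)
result = synthesize_video("a robot walking on a beach", depths, character="robot")
```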
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition [48.456059482589495]
We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition.
Our approach is designed to predict the action of each agent by helping the models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception.
arXiv Detail & Related papers (2023-05-21T11:51:09Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
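A minimal sketch of the verbalize-then-understand idea follows, assuming a hypothetical frame captioner and a zero-shot text classifier as stand-ins for the paper's actual components.

```python
# Sketch: turn a video into a textual story, then run the task on the text alone.
def verbalize(video_frames: list, caption_fn) -> str:
    """Describe each sampled frame and join the captions into one story."""
    return " ".join(caption_fn(f) for f in video_frames)

def classify_story(story: str, labels: list, text_classifier) -> str:
    """Run a zero-shot text task (here, classification) on the generated story."""
    return text_classifier(story, labels)

story = verbalize(["frame_a", "frame_b"],
                  caption_fn=lambda f: f"Something happens in {f}.")
label = classify_story(story, ["uses social proof", "no persuasion strategy"],
                       text_classifier=lambda s, ls: ls[0])  # stub choice
```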
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions [30.650879247687747]
Video captioning, which conveys dynamic scenes from videos in natural language, advances video understanding.
In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehensive video descriptions.
arXiv Detail & Related papers (2023-04-09T12:46:18Z) - CLUE: Contextualised Unified Explainable Learning of User Engagement in
Video Lectures [6.25256391074865]
We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, context information, and the textual emotion of the delivered content.
arXiv Detail & Related papers (2022-01-14T19:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.