V-SAT: Video Subtitle Annotation Tool
- URL: http://arxiv.org/abs/2510.24180v1
- Date: Tue, 28 Oct 2025 08:34:27 GMT
- Title: V-SAT: Video Subtitle Annotation Tool
- Authors: Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman
- Abstract summary: V-SAT (Video Subtitle Annotation Tool) is a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video.
- Score: 0.07388859384645263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, which rely primarily on speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved substantially: the SUBER score dropped from 9.6 to 3.54 after resolving all language-mode issues, and image-mode issues were detected with F1-scores of ~0.80. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
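The sketch below illustrates, under stated assumptions, the kind of rule-based pre-screening a subtitle QA pipeline of this sort might run before flagged cues are handed to LLM/VLM correction and human review. It is not the authors' implementation: the SRT parsing, the 17 chars/second reading-speed threshold, and the file name `episode01.srt` are illustrative assumptions.

```python
# Minimal sketch (not the V-SAT code) of rule-based subtitle checks:
# parse an SRT file, then flag cues with timing or reading-speed problems
# so that a downstream LLM/ASR pass and a human reviewer can correct them.
import re
from dataclasses import dataclass
from pathlib import Path

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

@dataclass
class Cue:
    index: int
    start: float   # seconds
    end: float     # seconds
    text: str

def _to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, SRT_TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(path: str) -> list[Cue]:
    cues = []
    for block in Path(path).read_text(encoding="utf-8").strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = (t.strip() for t in lines[1].split("-->"))
        cues.append(Cue(int(lines[0]), _to_seconds(start), _to_seconds(end),
                        " ".join(lines[2:])))
    return cues

def detect_issues(cues: list[Cue], max_cps: float = 17.0) -> list[str]:
    """Flag cues whose timing or reading speed looks wrong; the actual
    correction would be decided by model-based passes and a human reviewer."""
    issues = []
    for prev, cur in zip([None] + cues[:-1], cues):
        duration = cur.end - cur.start
        if duration <= 0:
            issues.append(f"cue {cur.index}: non-positive duration")
            continue
        cps = len(cur.text) / duration
        if cps > max_cps:
            issues.append(f"cue {cur.index}: reading speed {cps:.1f} chars/s")
        if prev and cur.start < prev.end:
            issues.append(f"cue {cur.index}: overlaps previous cue")
    return issues

if __name__ == "__main__":
    for issue in detect_issues(parse_srt("episode01.srt")):  # hypothetical file
        print(issue)
```

Such deterministic checks only cover timing and reading-speed issues; the harmful-text, formatting, and context-mismatch issues described in the abstract would require the LLM/VLM and ASR components plus human validation.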
Related papers
- Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions [74.27249614046309]
ASID-1M is an open-source collection of one million structured, fine-grained audiovisual instruction annotations. ASID-Verify is a scalable data curation pipeline for annotation. ASID-Captioner is a video understanding model trained via Supervised Fine-Tuning.
arXiv Detail & Related papers (2026-02-13T15:20:54Z) - The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
arXiv Detail & Related papers (2025-03-31T03:00:19Z) - Expertized Caption Auto-Enhancement for Video-Text Retrieval [10.250004732070494]
This paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability. Our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
arXiv Detail & Related papers (2025-02-05T04:51:46Z) - Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z) - Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Pre-trained on only 0.9M data, our model achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
arXiv Detail & Related papers (2022-09-27T06:47:42Z) - Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618]
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
arXiv Detail & Related papers (2021-07-13T17:06:04Z) - Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)