Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform
- URL: http://arxiv.org/abs/2504.15182v1
- Date: Mon, 21 Apr 2025 15:44:06 GMT
- Title: Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform
- Authors: Xianpan Zhou,
- Abstract summary: Tiger200K is a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, providing high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline that includes shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: https://tinytigerpan.github.io/tiger200k/
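The abstract describes the curation pipeline only at a high level. As a rough illustration, here is a minimal Python sketch of what the automated filtering stages (shot boundary detection, border detection, motion filtering) could look like using OpenCV heuristics; all function names, thresholds, and decision logic are illustrative assumptions rather than the authors' implementation, and the OCR and captioning stages are omitted.

```python
# Hypothetical sketch of a Tiger200K-style automated filtering pass (not the authors' code).
# Assumes OpenCV and NumPy; the OCR and bilingual captioning stages are omitted.
import cv2
import numpy as np

def read_gray_frames(path, stride=5, max_frames=500):
    """Read grayscale frames from a video, subsampled for speed."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()
    return frames

def shot_boundaries(frames, diff_thresh=30.0):
    """Naive shot boundary detection via mean absolute frame difference."""
    return [
        i for i in range(1, len(frames))
        if np.abs(frames[i].astype(np.float32) - frames[i - 1]).mean() > diff_thresh
    ]

def has_border(frame, var_thresh=5.0, margin=0.05):
    """Flag letterbox/pillarbox borders: near-constant strips along the frame edges."""
    h, w = frame.shape
    mh, mw = max(1, int(h * margin)), max(1, int(w * margin))
    strips = [frame[:mh], frame[-mh:], frame[:, :mw], frame[:, -mw:]]
    return any(s.var() < var_thresh for s in strips)

def motion_score(frames):
    """Average inter-frame difference as a crude motion measure."""
    if len(frames) < 2:
        return 0.0
    diffs = [
        np.abs(frames[i].astype(np.float32) - frames[i - 1]).mean()
        for i in range(1, len(frames))
    ]
    return float(np.mean(diffs))

def keep_clip(path, min_motion=2.0, max_motion=40.0):
    """Keep single-shot, border-free clips with moderate motion."""
    frames = read_gray_frames(path)
    if not frames:
        return False
    if has_border(frames[0]):      # border detection
        return False
    if shot_boundaries(frames):    # reject clips that contain a cut
        return False
    return min_motion <= motion_score(frames) <= max_motion  # motion filter
```

In Tiger200K itself, automated passes like these are only a precursor to manual curation and bilingual captioning, which a script of this kind does not replace.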
Related papers
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding [126.15907330726067]
We build a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps.
arXiv Detail & Related papers (2025-04-17T17:59:56Z)
- Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline.
We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms.
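The RACCOON entry mentions decoupled spatial-temporal attention without implementation detail. The following is a generic factorized spatial-temporal attention block in PyTorch, one common way to realize that idea; the module structure, tensor shapes, and use of nn.MultiheadAttention are assumptions for illustration, not the paper's architecture.

```python
# Generic factorized (decoupled) spatial-temporal attention block (illustrative only).
import torch
import torch.nn as nn

class DecoupledSpatioTemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim), i.e. video tokens per frame
        b, t, n, d = x.shape

        # Spatial attention: attend across tokens within each frame.
        xs = self.norm1(x).reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, d)

        # Temporal attention: attend across frames at each token position.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x

# Example: 2 videos, 8 frames, 16x16 = 256 tokens per frame, 128-dim features.
block = DecoupledSpatioTemporalAttention(dim=128)
out = block(torch.randn(2, 8, 256, 128))  # output keeps the input shape
```

Factorizing attention this way keeps the cost at roughly O(T·N² + N·T²) instead of O((T·N)²) for full spatio-temporal attention.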
arXiv Detail & Related papers (2025-02-28T18:56:35Z)
- VideoAuteur: Towards Long Narrative Video Generation [22.915448471769384]
We present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain.
We introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos.
Our method demonstrates substantial improvements in generating visually detailed and semantically aligned videos.
arXiv Detail & Related papers (2025-01-10T18:52:11Z)
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio.
We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z)
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content [35.02160595617654]
Temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality.
We introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality.
arXiv Detail & Related papers (2024-10-10T17:57:49Z)
- VidGen-1M: A Large-Scale Dataset for Text-to-video Generation [9.726156628112198]
We present VidGen-1M, a superior training dataset for text-to-video models.
This dataset guarantees high-quality videos and detailed captions with excellent temporal consistency.
When used to train a video generation model, this dataset has led to experimental results that surpass those obtained with other models.
arXiv Detail & Related papers (2024-08-05T16:53:23Z)
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
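To make the RCAD task concrete, here is a hypothetical sketch of retrieval from counterfactually augmented data: a video embedding is scored against its true caption and against counterfactual captions that change only the action. The embedding inputs are placeholders for any video-text model, and this is not the Feint6K evaluation protocol.

```python
# Illustrative retrieval-from-counterfactuals check (not the Feint6K protocol).
# The embeddings stand in for the outputs of any video-text model's encoders.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def counterfactual_retrieval_accuracy(video_embs, true_caption_embs, cf_caption_embs):
    """
    video_embs:        (N, D) video embeddings
    true_caption_embs: (N, D) embeddings of the original captions
    cf_caption_embs:   (N, K, D) embeddings of K counterfactual captions per video
                       (same scene, different action)
    Returns the fraction of videos whose true caption outranks every counterfactual.
    """
    correct = 0
    for v, t, cfs in zip(video_embs, true_caption_embs, cf_caption_embs):
        true_score = cosine(v, t)
        if all(true_score > cosine(v, c) for c in cfs):
            correct += 1
    return correct / len(video_embs)

# Toy usage with random embeddings (replace with a real video-text model).
rng = np.random.default_rng(0)
N, K, D = 4, 3, 512
acc = counterfactual_retrieval_accuracy(
    rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, K, D))
)
print(f"counterfactual retrieval accuracy: {acc:.2f}")
```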
arXiv Detail & Related papers (2024-07-18T01:55:48Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Detecting AI-Generated Video via Frame Consistency [25.290019967304616]
We propose an open-source dataset and a detection method for generated video for the first time.
First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions.
Second, we found via probing experiments that spatial artifact-based detectors lack generalizability.
arXiv Detail & Related papers (2024-02-03T08:52:06Z)
- Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video-instruction-tuned (VIIT) model is then used to auto-label millions of videos to generate high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve downstream performance similar to models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
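As a simplified illustration of the idea behind the last entry, the sketch below ranks clips by audio-visual correspondence and keeps the top-k; the embeddings are placeholders from unspecified audio and visual encoders, and plain top-k selection stands in for the paper's actual subset optimization.

```python
# Simplified stand-in for audio-visual subset curation (not the paper's algorithm).
# audio_embs / visual_embs would come from pretrained audio and visual encoders.
import numpy as np

def av_correspondence(audio_embs: np.ndarray, visual_embs: np.ndarray) -> np.ndarray:
    """Per-clip cosine similarity between audio and visual embeddings."""
    a = audio_embs / (np.linalg.norm(audio_embs, axis=1, keepdims=True) + 1e-8)
    v = visual_embs / (np.linalg.norm(visual_embs, axis=1, keepdims=True) + 1e-8)
    return (a * v).sum(axis=1)

def select_subset(audio_embs, visual_embs, k):
    """Keep the k clips whose audio and visual streams agree the most."""
    scores = av_correspondence(audio_embs, visual_embs)
    return np.argsort(-scores)[:k]

# Toy usage: 1,000 clips with 256-dim embeddings, keep the top 100.
rng = np.random.default_rng(0)
keep = select_subset(rng.normal(size=(1000, 256)), rng.normal(size=(1000, 256)), k=100)
```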
This list is automatically generated from the titles and abstracts of the papers in this site.