InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
- URL: http://arxiv.org/abs/2412.09283v1
- Date: Thu, 12 Dec 2024 13:48:40 GMT
- Title: InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
- Authors: Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai
- Abstract summary: Training relies on video-caption paired data, which plays a crucial role in enhancing generation performance.
Current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction.
We propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level, fine-grained video captioning.
- Score: 40.07877801495546
- License:
- Abstract: Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
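The abstract does not spell out the exact caption schema, so the sketch below is only an illustrative guess at what an instance-aware structured caption might look like: a global scene description plus per-instance appearance, action, and position fields that are flattened into a concise prompt. All class and field names here (`StructuredCaption`, `InstanceDescription`, `to_prompt`) are hypothetical, not the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema: field names are illustrative, not the paper's exact format.
@dataclass
class InstanceDescription:
    category: str    # e.g. "person", "dog"
    appearance: str  # fine-grained visual attributes of this instance
    action: str      # precise motion / behaviour over the clip
    position: str    # coarse location or trajectory in the frame

@dataclass
class StructuredCaption:
    scene: str  # global background / camera description
    instances: List[InstanceDescription] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the structured fields into a concise text prompt."""
        parts = [self.scene]
        for i, inst in enumerate(self.instances, 1):
            parts.append(
                f"Instance {i} ({inst.category}): {inst.appearance}; "
                f"{inst.action}; located {inst.position}."
            )
        return " ".join(parts)

caption = StructuredCaption(
    scene="A sunlit park with a gravel path, filmed by a static camera at eye level.",
    instances=[
        InstanceDescription(
            category="dog",
            appearance="a small brown corgi wearing a red collar",
            action="runs from left to right, then stops and sits",
            position="in the lower-right quarter of the frame",
        )
    ],
)
print(caption.to_prompt())
```

The point of such a structure is that each instance's appearance and motion is described separately, so the flattened prompt stays concise while remaining precise about who does what.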
Related papers
- VideoAuteur: Towards Long Narrative Video Generation [22.915448471769384]
We present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain.
We introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos.
Our method demonstrates substantial improvements in generating visually detailed and semantically aligned videos.
arXiv Detail & Related papers (2025-01-10T18:52:11Z) - Progress-Aware Video Frame Captioning [55.23366888264651]
We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence.
We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality.
Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
arXiv Detail & Related papers (2024-12-03T01:21:28Z) - Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z) - SnapCap: Efficient Snapshot Compressive Video Captioning [18.016261978231835]
Video Captioning (VC) is a challenging multi-modal task, since it requires describing the scene in natural language based on an understanding of diverse and complex videos.
In this paper, we propose a novel VC pipeline to generate captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera.
To better extract language-related visual representations from the compressed measurement, we propose to distill the knowledge from videos via a pre-trained CLIP.
arXiv Detail & Related papers (2024-01-10T03:11:21Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z) - Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z) - Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z) - Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)