Reframe Anything: LLM Agent for Open World Video Reframing
- URL: http://arxiv.org/abs/2403.06070v1
- Date: Sun, 10 Mar 2024 03:29:56 GMT
- Title: Reframe Anything: LLM Agent for Open World Video Reframing
- Authors: Jiawang Cao, Yongliang Wu, Weiheng Chi, Wenbo Zhu, Ziyue Su, Jay Wu
- Abstract summary: We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
- Score: 0.8424099022563256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of mobile devices and social media has revolutionized
content dissemination, with short-form video becoming increasingly prevalent.
This shift has introduced the challenge of video reframing to fit various
screen aspect ratios, a process that highlights the most compelling parts of a
video. Traditionally, video reframing is a manual, time-consuming task
requiring professional expertise, which incurs high production costs. A
potential solution is to adopt some machine learning models, such as video
salient object detection, to automate the process. However, these methods often
lack generalizability due to their reliance on specific training data. The
advent of powerful large language models (LLMs) opens new avenues for AI
capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), an
LLM-based agent that leverages visual foundation models and human instructions
to restructure visual content for video reframing. RAVA operates in three
stages: perception, where it interprets user instructions and video content;
planning, where it determines aspect ratios and reframing strategies; and
execution, where it invokes the editing tools to produce the final video. Our
experiments validate the effectiveness of RAVA in video salient object
detection and real-world reframing tasks, demonstrating its potential as a tool
for AI-powered video editing.
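The abstract describes RAVA's perception, planning, and execution stages only at a high level. The sketch below is a minimal, hypothetical Python illustration of such a three-stage agent loop; the function names, the `ReframePlan` fields, and the rule-based stand-in for the LLM planner are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a RAVA-style three-stage reframing agent.
# All names here are hypothetical placeholders; the real system would call
# visual foundation models, an LLM planner, and video editing tools.

from dataclasses import dataclass


@dataclass
class ReframePlan:
    target_aspect_ratio: str   # e.g. "9:16" for vertical short-form video
    strategy: str              # e.g. "track_salient_object" or "static_center_crop"
    regions_of_interest: list  # per-shot boxes the crop window should keep in frame


def perceive(video_path: str, instruction: str) -> dict:
    """Perception: interpret the user instruction and summarize video content.

    A real agent would run detection / salient-object / captioning models here;
    this stub returns a fixed description so the sketch stays runnable.
    """
    return {
        "instruction": instruction,
        "shots": [{"start": 0.0, "end": 5.0, "salient_boxes": [(120, 40, 520, 680)]}],
    }


def plan(perception: dict) -> ReframePlan:
    """Planning: decide aspect ratio and reframing strategy.

    The LLM is replaced by a trivial keyword rule for illustration only.
    """
    wants_vertical = "vertical" in perception["instruction"].lower()
    return ReframePlan(
        target_aspect_ratio="9:16" if wants_vertical else "1:1",
        strategy="track_salient_object",
        regions_of_interest=[s["salient_boxes"] for s in perception["shots"]],
    )


def execute(video_path: str, reframe_plan: ReframePlan) -> str:
    """Execution: invoke editing tools (e.g. a crop filter) to render the output."""
    output_path = video_path.replace(".mp4", "_reframed.mp4")
    # A real implementation would compute a moving crop window from
    # reframe_plan.regions_of_interest and render the edited video here.
    print(f"Rendering {output_path} at {reframe_plan.target_aspect_ratio} "
          f"with strategy '{reframe_plan.strategy}'")
    return output_path


if __name__ == "__main__":
    state = perceive("clip.mp4", "Make a vertical cut that follows the speaker")
    execute("clip.mp4", plan(state))
```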
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to the vast data and processing demands.
We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries.
It features a Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion
Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z) - Retargeting video with an end-to-end framework [14.270721529264929]
We present RETVI, an end-to-end method to retarget videos to arbitrary aspect ratios.
Our system outperforms previous work in quality and running time.
arXiv Detail & Related papers (2023-11-08T04:56:41Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (a minimal sketch of this matching step appears after the list below).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - The Anatomy of Video Editing: A Dataset and Benchmark Suite for
AI-Assisted Video Editing [90.59584961661345]
This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing.
Our benchmark suite focuses on video editing tasks, beyond visual effects, such as automatic footage organization and assisted video assembling.
To enable research on these fronts, we annotate more than 1.5M tags, with concepts relevant to cinematography, from 196,176 shots sampled from movie scenes.
arXiv Detail & Related papers (2022-07-20T10:53:48Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
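As noted in the VidCoM entry above, the InsOVER step relies on Hungarian matching between decompositions of a linguistic instruction and detected video events. The snippet below is a minimal sketch of that matching step using `scipy.optimize.linear_sum_assignment`; the stand-in `embed` function and the cosine-similarity cost are illustrative assumptions, not the authors' code.

```python
# Sketch of Hungarian matching between instruction decompositions and video
# events, in the spirit of VidCoM's InsOVER step. Embedding and cost are
# placeholders; a real system would use a sentence or CLIP encoder.

import numpy as np
from scipy.optimize import linear_sum_assignment


def embed(texts):
    """Stand-in text embedder returning random vectors (illustration only)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))


def match_instructions_to_events(sub_instructions, event_captions):
    """Return (instruction_index, event_index) pairs minimizing total cost."""
    a = embed(sub_instructions)
    b = embed(event_captions)
    # Cosine similarity; higher is better, so negate it to form a cost matrix.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cost = -(a @ b.T)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))


pairs = match_instructions_to_events(
    ["find the goal celebration", "locate the penalty kick"],
    ["players warming up", "a penalty is taken", "the crowd celebrates a goal"],
)
print(pairs)
```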
This list is automatically generated from the titles and abstracts of the papers on this site.