ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
- URL: http://arxiv.org/abs/2204.02874v1
- Date: Wed, 6 Apr 2022 14:43:42 GMT
- Title: ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
- Authors: Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
- Abstract summary: We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
- Score: 103.28102473127748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce an audiovisual method for long-range text-to-video retrieval.
Unlike previous approaches designed for short video retrieval (e.g., 5-15
seconds in duration), our approach aims to retrieve minute-long videos that
capture complex human actions. One challenge of standard video-only approaches
is the large computational cost associated with processing hundreds of densely
extracted frames from such long videos. To address this issue, we propose to
replace parts of the video with compact audio cues that succinctly summarize
dynamic audio events and are cheap to process. Our method, named ECLIPSE
(Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an
audiovisual video setting, by adding a unified audiovisual transformer block
that captures complementary cues from the video and audio streams. In addition
to being 2.92x faster and 2.34x more memory-efficient than long-range video-only
approaches, our method also achieves better text-to-video retrieval accuracy on
several diverse long-range video datasets such as ActivityNet, QVHighlights,
YouCook2, DiDeMo and Charades.
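As a rough illustration of the kind of audiovisual fusion the abstract describes, the sketch below pairs a few sparsely sampled frame embeddings with audio embeddings through cross-modal attention and scores the pooled result against a text embedding. The module layout, dimensions, and mean pooling are illustrative assumptions, not the authors' ECLIPSE implementation:

```python
import torch
import torch.nn as nn

class AudioVisualBlock(nn.Module):
    """Illustrative fusion block: frame tokens attend to audio tokens and vice versa,
    so cheap audio cues can stand in for many of the densely sampled frames."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Tv, D) sparse frame embeddings, e.g. from a CLIP image encoder
        # audio_tokens: (B, Ta, D) audio embeddings, e.g. from a spectrogram encoder
        v, _ = self.v2a_attn(self.norm_v(video_tokens), audio_tokens, audio_tokens)
        a, _ = self.a2v_attn(self.norm_a(audio_tokens), video_tokens, video_tokens)
        return video_tokens + v, audio_tokens + a


if __name__ == "__main__":
    block = AudioVisualBlock()
    frames = torch.randn(2, 16, 512)   # 16 sparsely sampled frames per video
    audio = torch.randn(2, 32, 512)    # 32 audio tokens per video
    fused_v, fused_a = block(frames, audio)
    clip_emb = torch.cat([fused_v, fused_a], dim=1).mean(dim=1)   # (2, 512) video embedding
    text_emb = torch.randn(2, 512)                                # stand-in for a text encoder output
    scores = torch.nn.functional.cosine_similarity(clip_emb, text_emb)
    print(scores.shape)  # torch.Size([2])
```

In a retrieval setting, these similarity scores would rank candidate videos for a text query; processing audio tokens in place of extra frames is where the speed and memory savings come from.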
Related papers
- SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context [19.224601064352846]
We introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions.
We present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video.
Experiments demonstrate that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long video task (Video-MME) and surpasses the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA).
arXiv Detail & Related papers (2024-11-25T09:22:13Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
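The LongVU entry above hinges on dropping frames whose features are nearly identical to frames already kept. A minimal sketch of that general idea follows, assuming pooled per-frame features (e.g. from DINOv2) and a cosine-similarity threshold chosen purely for illustration; it is not the LongVU implementation:

```python
import torch

def prune_redundant_frames(frame_feats: torch.Tensor, threshold: float = 0.85) -> torch.Tensor:
    """Keep a frame only if it is not too similar to the last kept frame.

    frame_feats: (T, D) per-frame features, e.g. pooled DINOv2 embeddings.
    Returns the indices of the frames to keep.
    """
    feats = torch.nn.functional.normalize(frame_feats, dim=-1)
    keep = [0]
    for t in range(1, feats.shape[0]):
        sim = torch.dot(feats[t], feats[keep[-1]])  # cosine similarity to last kept frame
        if sim < threshold:
            keep.append(t)
    return torch.tensor(keep)


# Example: 64 frames of 768-dim features; near-duplicate consecutive frames are dropped.
kept = prune_redundant_frames(torch.randn(64, 768))
print(kept.shape)
```

The kept frames would then be tokenized for the video-language model; the threshold controls how aggressively the video is compressed.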
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
- LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z)
- Beyond the Frame: Single and multiple video summarization method with user-defined length [4.424739166856966]
Video summarization is a difficult but important task, with substantial potential for further research and development.
In this paper, we combine a variety of NLP techniques (extractive and content-based summarizers) with video processing techniques to convert a long video into a single, relatively short video.
arXiv Detail & Related papers (2023-12-23T04:32:07Z)
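The summarization entry above combines text summarization with video cutting. One simple way such a pipeline could work is sketched below: score timestamped transcript segments with a crude extractive word-frequency criterion and keep the highest-scoring ones within a user-defined length budget. The Segment structure and the scoring rule are assumptions for illustration, not the paper's method:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float
    text: str      # transcript sentence covering this span of the video

def summarize(segments: list[Segment], target_len: float) -> list[Segment]:
    """Greedy extractive selection: rank segments by a crude word-frequency score,
    then keep the best ones until the user-defined length budget is filled."""
    freq: dict[str, int] = {}
    for seg in segments:
        for word in seg.text.lower().split():
            freq[word] = freq.get(word, 0) + 1
    scored = sorted(segments,
                    key=lambda s: sum(freq[w] for w in s.text.lower().split()),
                    reverse=True)
    chosen, used = [], 0.0
    for seg in scored:
        dur = seg.end - seg.start
        if used + dur <= target_len:
            chosen.append(seg)
            used += dur
    return sorted(chosen, key=lambda s: s.start)  # restore chronological order


clips = [Segment(0, 20, "the chef chops onions and garlic"),
         Segment(20, 45, "onions and garlic go into the hot pan"),
         Segment(45, 60, "a quick shot of the kitchen window")]
print(summarize(clips, target_len=45.0))
```

The selected segments would then be cut and concatenated from the source video to produce the shortened summary.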
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.