ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
- URL: http://arxiv.org/abs/2204.02874v1
- Date: Wed, 6 Apr 2022 14:43:42 GMT
- Title: ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
- Authors: Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
- Abstract summary: We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
- Score: 103.28102473127748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce an audiovisual method for long-range text-to-video retrieval.
Unlike previous approaches designed for short video retrieval (e.g., 5-15
seconds in duration), our approach aims to retrieve minute-long videos that
capture complex human actions. One challenge of standard video-only approaches
is the large computational cost associated with processing hundreds of densely
extracted frames from such long videos. To address this issue, we propose to
replace parts of the video with compact audio cues that succinctly summarize
dynamic audio events and are cheap to process. Our method, named ECLIPSE
(Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an
audiovisual video setting, by adding a unified audiovisual transformer block
that captures complementary cues from the video and audio streams. In addition
to being 2.92x faster and 2.34x more memory-efficient than long-range video-only
approaches, our method also achieves better text-to-video retrieval accuracy on
several diverse long-range video datasets such as ActivityNet, QVHighlights,
YouCook2, DiDeMo and Charades.
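The key architectural idea is a fusion block in which sparsely sampled video frame tokens and compact audio tokens exchange information through cross-modal attention. Below is a minimal PyTorch sketch of such a block; `AudioVisualBlock` and all names, shapes, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AudioVisualBlock(nn.Module):
    """Hypothetical ECLIPSE-style fusion block: video tokens attend to
    audio tokens and vice versa, so a few cheap audio tokens can stand
    in for many densely extracted frames."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v_norm = nn.LayerNorm(dim)
        self.a_norm = nn.LayerNorm(dim)
        # Video queries attend to audio keys/values, and the reverse.
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Tv, dim) frame features from a CLIP backbone
        # audio_tokens: (B, Ta, dim) compact audio features
        v = self.v_norm(video_tokens)
        a = self.a_norm(audio_tokens)
        v_out, _ = self.v2a_attn(v, a, a)  # sight enriched by sound
        a_out, _ = self.a2v_attn(a, v, v)  # sound enriched by sight
        # Residual connections preserve each stream's original cues.
        return video_tokens + v_out, audio_tokens + a_out

# Usage sketch: 8 sparsely sampled frames plus 16 audio tokens.
block = AudioVisualBlock(dim=512)
video = torch.randn(2, 8, 512)
audio = torch.randn(2, 16, 512)
v_fused, a_fused = block(video, audio)
print(v_fused.shape, a_fused.shape)  # (2, 8, 512) and (2, 16, 512)
```

The efficiency gains follow from the token budget: attending over a handful of audio tokens is far cheaper than processing the hundreds of dense frames they replace, which is consistent with the reported 2.92x speed and 2.34x memory improvements.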
Related papers
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
- LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z)
- Beyond the Frame: Single and multiple video summarization method with user-defined length [4.424739166856966]
Video summarization is a difficult but significant task, with substantial potential for further research and development.
In this paper, we combine a variety of NLP techniques (extractive and context-based summarizers) with video processing techniques to convert a long video into a single, relatively short video.
arXiv Detail & Related papers (2023-12-23T04:32:07Z)
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos, since fixed-length clips are often redundant or uninformative (a toy contrast between uniform and adaptive segmentation is sketched after this list).
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose decoupled audio-video dependence modeling that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate clips that are short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
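To make the fixed-length-clip criticism in the kernel temporal segmentation entry concrete, here is a toy Python sketch contrasting uniform clips with a greedy change-point segmenter. The paper itself revisits kernel temporal segmentation, which optimizes segment boundaries globally; the greedy `adaptive_segments` helper below is only a hypothetical stand-in for intuition.

```python
import numpy as np

def uniform_clips(num_frames: int, clip_len: int):
    """The criticized baseline: fixed-length windows, regardless of content."""
    return [(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, clip_len)]

def adaptive_segments(features: np.ndarray, threshold: float = 0.3):
    """Greedy stand-in for kernel temporal segmentation: open a new
    segment whenever consecutive frame features drift apart."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    boundaries = [0]
    for t in range(1, len(f)):
        if 1.0 - float(f[t] @ f[t - 1]) > threshold:  # cosine distance
            boundaries.append(t)
    boundaries.append(len(f))
    return list(zip(boundaries[:-1], boundaries[1:]))

# 300 frames whose features change abruptly at frame 120.
feats = np.concatenate([np.tile([1.0, 0.0], (120, 1)),
                        np.tile([0.0, 1.0], (180, 1))])
print(uniform_clips(300, 64))    # rigid: [(0, 64), (64, 128), ...]
print(adaptive_segments(feats))  # content-aware: [(0, 120), (120, 300)]
```

Uniform windows split the first event at frame 64 and merge parts of both events into one clip, while the content-aware boundaries land exactly on the change point, yielding segments that are semantically consistent however long they are.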
This list is automatically generated from the titles and abstracts of the papers in this site.