VADER: Video Alignment Differencing and Retrieval
- URL: http://arxiv.org/abs/2303.13193v2
- Date: Sat, 25 Mar 2023 10:00:51 GMT
- Title: VADER: Video Alignment Differencing and Retrieval
- Authors: Alexander Black, Simon Jenni, Tu Bui, Md. Mehrab Tanjim, Stefano
Petrangeli, Ritwik Sinha, Viswanathan Swaminathan, John Collomosse
- Abstract summary: VADER matches and aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over chunked video content.
A space-time comparator module identifies regions of manipulation between aligned content, invariant to changes caused by residual temporal misalignment or by artifacts arising from non-editorial changes to the content.
- Score: 70.88247176534426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose VADER, a spatio-temporal matching, alignment, and change
summarization method to help fight misinformation spread via manipulated
videos. VADER matches and coarsely aligns partial video fragments to candidate
videos using a robust visual descriptor and scalable search over adaptively
chunked video content. A transformer-based alignment module then refines the
temporal localization of the query fragment within the matched video. A
space-time comparator module identifies regions of manipulation between aligned
content, invariant to changes caused by residual temporal misalignment or by
artifacts arising from non-editorial changes to the content. Robustly matching
video to a trusted source enables conclusions to be drawn about video provenance
and supports informed trust decisions about the content encountered.
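The abstract outlines a three-stage pipeline: coarse descriptor-based retrieval over chunked video, transformer-based refinement of the temporal localization, and space-time comparison of the aligned content. The sketch below illustrates that flow with simple placeholder components (a random-projection descriptor, an exhaustive offset search in place of the transformer aligner, and a pixel-difference comparator); none of the names or models come from the paper itself.

```python
# Hypothetical sketch of a match -> align -> compare pipeline (not the authors' code).
import numpy as np

DIM = 128                      # assumed descriptor dimensionality
RNG = np.random.default_rng(0)
PROJ = {}                      # random projections standing in for a learned descriptor


def describe(clip: np.ndarray) -> np.ndarray:
    """Placeholder visual descriptor: mean-pool the frames, centre, project, L2-normalise."""
    flat = clip.reshape(clip.shape[0], -1).mean(axis=0)
    flat = flat - flat.mean()
    if flat.size not in PROJ:
        PROJ[flat.size] = RNG.standard_normal((flat.size, DIM))
    v = flat @ PROJ[flat.size]
    return v / (np.linalg.norm(v) + 1e-8)


def coarse_match(query: np.ndarray, candidate: np.ndarray, chunk: int = 16) -> int:
    """Scalable-search stand-in: fixed-length chunking (the paper chunks adaptively)."""
    q = describe(query)
    starts = list(range(0, len(candidate) - chunk + 1, chunk))
    sims = [q @ describe(candidate[s:s + chunk]) for s in starts]
    return starts[int(np.argmax(sims))]


def refine(query: np.ndarray, candidate: np.ndarray, start: int, radius: int = 16) -> int:
    """Stand-in for the transformer aligner: exhaustive search over small temporal offsets."""
    best, best_sim = start, -np.inf
    lo, hi = max(0, start - radius), min(len(candidate) - len(query), start + radius)
    for s in range(lo, hi + 1):
        sim = describe(query) @ describe(candidate[s:s + len(query)])
        if sim > best_sim:
            best, best_sim = s, sim
    return best


def compare(query: np.ndarray, aligned: np.ndarray, thresh: float = 30.0) -> np.ndarray:
    """Crude space-time comparator: flag frames whose mean pixel difference is large."""
    diff = np.abs(query.astype(float) - aligned.astype(float)).mean(axis=(1, 2, 3))
    return diff > thresh


if __name__ == "__main__":
    trusted = RNG.integers(0, 255, (128, 32, 32, 3))   # stand-in for an indexed trusted video
    fragment = trusted[40:72].copy()
    fragment[10:14] += 60                              # simulate a local manipulation
    start = refine(fragment, trusted, coarse_match(fragment, trusted))
    flags = compare(fragment, trusted[start:start + len(fragment)])
    print("aligned at frame", start, "manipulated frames:", np.flatnonzero(flags))
```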
Related papers
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework.
It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback.
VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [89.49585127724941]
CoDeF is a new type of video representation, which consists of a canonical content field and a temporal deformation field.
We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.
arXiv Detail & Related papers (2023-08-15T17:59:56Z)
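A minimal sketch of the canonical-field idea in the entry above, assuming plain arrays in place of CoDeF's learned neural fields: each frame is reconstructed by sampling one canonical image through a per-frame deformation, so an edit made once in the canonical image propagates consistently to every frame.

```python
# Toy canonical content field + temporal deformation field (illustrative only).
import numpy as np

H, W, T = 64, 64, 8
canonical = np.zeros((H, W, 3))                       # canonical content "field"
canonical[:, W // 2:, 0] = 1.0                        # e.g. paint the right half red once
deform = np.zeros((T, H, W, 2), dtype=int)            # per-frame (dy, dx) offsets
for t in range(T):
    deform[t, ..., 1] = t                             # toy motion: content shifts right over time


def render_frame(t: int) -> np.ndarray:
    """Reconstruct frame t by sampling the canonical image at deformed coordinates."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cy = np.clip(ys + deform[t, ..., 0], 0, H - 1)
    cx = np.clip(xs + deform[t, ..., 1], 0, W - 1)
    return canonical[cy, cx]


video = np.stack([render_frame(t) for t in range(T)])
print(video.shape)  # (8, 64, 64, 3): the single edit appears consistently in every frame
```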
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separately pre-trained feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval [32.11951065619957]
We re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video.
When the clip is short or visually ambiguous, knowledge of its local temporal context can be used to improve the retrieval performance.
We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations.
arXiv Detail & Related papers (2022-10-09T20:11:38Z)
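A minimal sketch of the idea behind ConTra in the entry above, assuming a small PyTorch transformer: the clip embedding is refined by attending over the embeddings of neighbouring clips from the same untrimmed video. The architecture and dimensions are illustrative, not the authors'.

```python
# Illustrative context encoder (assumed architecture, not the paper's code).
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, clip: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        """clip: (B, D); context: (B, N, D) embeddings of surrounding clips."""
        tokens = torch.cat([clip.unsqueeze(1), context], dim=1)  # clip token first
        out = self.encoder(tokens)                               # attend clip <-> context
        return out[:, 0]                                         # context-enhanced clip embedding


enhanced = ContextEncoder()(torch.randn(2, 256), torch.randn(2, 6, 256))
print(enhanced.shape)  # torch.Size([2, 256])
```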
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
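A toy illustration of the caption-aggregation step mentioned in the entry above, under the assumption that "semantic concepts" are content words recurring across a sample's ground-truth captions; the paper's actual concept estimation may differ.

```python
# Hypothetical concept extraction from a sample's reference captions.
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with"}


def concepts(captions: list[str], min_count: int = 2) -> list[str]:
    """Return words that appear in at least `min_count` of the reference captions."""
    counts = Counter(
        word
        for caption in captions
        for word in set(caption.lower().replace(".", "").split()) - STOPWORDS
    )
    return sorted(w for w, c in counts.items() if c >= min_count)


refs = ["A man is slicing a tomato.",
        "A man slices a tomato in the kitchen.",
        "Someone is cutting a tomato with a knife."]
print(concepts(refs))  # ['man', 'tomato']
```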
- VPN: Video Provenance Network for Robust Content Attribution [72.12494245048504]
We present VPN - a content attribution method for recovering provenance information from videos shared online.
We learn a robust search embedding for matching such videos, using full-length or truncated video queries.
Once matched against a trusted database of video clips, associated information on the provenance of the clip is presented to the user.
arXiv Detail & Related papers (2021-09-21T09:07:05Z)
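A hedged sketch of the attribution flow VPN describes: embed the query video, search a trusted index by nearest neighbour, and surface the stored provenance record of the matched clip. The embeddings, similarity threshold, and record fields below are placeholders, not the paper's model or database schema.

```python
# Illustrative provenance lookup against a trusted index (placeholder data).
import numpy as np

rng = np.random.default_rng(1)
index = rng.standard_normal((1000, 128))                 # embeddings of trusted clips
index /= np.linalg.norm(index, axis=1, keepdims=True)
records = [{"clip_id": i, "source": f"publisher-{i % 7}"} for i in range(1000)]


def lookup(query_embedding, min_sim=0.8):
    """Return the provenance record of the closest trusted clip, or None if nothing is close."""
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = index @ q
    best = int(np.argmax(sims))
    return records[best] if sims[best] >= min_sim else None


# A query that is a slightly perturbed (e.g. re-encoded) copy of trusted clip 42:
print(lookup(index[42] + 0.03 * rng.standard_normal(128)))
```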