VTC: Improving Video-Text Retrieval with User Comments
- URL: http://arxiv.org/abs/2210.10820v1
- Date: Wed, 19 Oct 2022 18:11:39 GMT
- Title: VTC: Improving Video-Text Retrieval with User Comments
- Authors: Laura Hanu, James Thewlis, Yuki M. Asano, Christian Rupprecht
- Abstract summary: This paper introduces a new dataset of videos, titles and comments.
By using comments, our method is able to learn better, more contextualised representations for image, video and audio.
- Score: 22.193221760244707
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal retrieval is an important problem for many applications, such as
recommendation and search. Current benchmarks and even datasets are often
manually constructed and consist of mostly clean samples where all modalities
are well-correlated with the content. Thus, current video-text retrieval
literature largely focuses on video titles or audio transcripts, while ignoring
user comments, since users often discuss topics only vaguely related to
the video. Despite the ubiquity of user comments online, there are currently no
multi-modal representation learning datasets that include comments. In this
paper, we a) introduce a new dataset of videos, titles and comments; b) present
an attention-based mechanism that allows the model to learn from sometimes
irrelevant data such as comments; c) show that by using comments, our method is
able to learn better, more contextualised representations for image, video and
audio. Project page: https://unitaryai.github.io/vtc-paper.
Related papers
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Multi-event Video-Text Retrieval [33.470499262092105]
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
arXiv Detail & Related papers (2023-08-22T16:32:46Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
- Modality-Balanced Embedding for Video Retrieval [21.81705847039759]
We identify a modality bias phenomenon in which the video encoder relies almost entirely on text matching.
We propose MBVR (short for Modality Balanced Video Retrieval) with two key components.
We show empirically that our method is both effective and efficient in solving the modality bias problem.
arXiv Detail & Related papers (2022-04-18T06:29:46Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- A Straightforward Framework For Video Retrieval Using CLIP [0.0]
Video Retrieval is a challenging task where a text query is matched to a video or vice versa.
Most of the existing approaches for addressing such a problem rely on annotations made by the users.
In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. (See the sketch after this list for an illustration of this approach.)
arXiv Detail & Related papers (2021-02-24T18:15:12Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
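As referenced in the CLIP entry above, here is a minimal, hypothetical sketch of that kind of annotation-free video retrieval, written against the Hugging Face transformers CLIP API rather than the paper's own code: sampled frames are embedded, mean-pooled into a single video vector, and ranked against text queries by cosine similarity. Frame extraction and the pooling choice are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP checkpoint; any CLIP variant with matching processor works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_video(frames):
    """frames: list of PIL.Image sampled from one video (extraction not shown)."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)          # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)    # L2-normalise per frame
    return feats.mean(dim=0)                            # mean-pool over frames

@torch.no_grad()
def embed_texts(queries):
    """queries: list of text strings to match against videos."""
    inputs = processor(text=queries, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)           # (num_queries, dim)
    return feats / feats.norm(dim=-1, keepdim=True)

# Ranking: higher cosine similarity means a better query-video match.
# video_embs = torch.stack([embed_video(f) for f in all_videos_frames])
# scores = embed_texts(["a dog catching a frisbee"]) @ video_embs.T
```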