The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search
Engines for Large-Scale Video Retrieval
- URL: http://arxiv.org/abs/2008.02749v2
- Date: Thu, 18 Mar 2021 14:37:27 GMT
- Title: The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search
Engines for Large-Scale Video Retrieval
- Authors: Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole,
Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo
- Abstract summary: VISIONE allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity.
The peculiarity of our approach is that we encode all the information extracted from the videos using a convenient textual encoding in a single text retrieval engine.
This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) have to be merged.
- Score: 11.217452391653762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we describe in detail VISIONE, a video search system that
allows users to search for videos using textual keywords, occurrence of objects
and their spatial relationships, occurrence of colors and their spatial
relationships, and image similarity. These modalities can be combined together
to express complex queries and satisfy user needs. The peculiarity of our
approach is that we encode all the information extracted from the keyframes,
such as visual deep features, tags, color and object locations, using a
convenient textual encoding indexed in a single text retrieval engine. This
offers great flexibility when results corresponding to various parts of the
query (visual, text and locations) have to be merged. In addition, we report an
extensive analysis of the system retrieval performance, using the query logs
generated during the Video Browser Showdown (VBS) 2019 competition. This
allowed us to fine-tune the system by choosing the optimal parameters and
strategies among the ones that we tested.
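To make the idea of a single textual encoding more concrete, the following is a minimal Python sketch, not the encoding actually used by VISIONE. It assumes a hypothetical 7x7 grid, a made-up token format (`r{row}c{col}_{label}` for object locations, `f{i}` for feature components), and a plain term-frequency dot product standing in for the text engine's ranking function. All function names are illustrative.

```python
from collections import Counter

GRID = 7  # hypothetical grid resolution (an assumption, not taken from the abstract)


def box_tokens(label, box, grid=GRID):
    """Emit one token per grid cell overlapped by a normalized box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    # Clamp to the last cell so that a coordinate of exactly 1.0 stays in range.
    c0, c1 = min(int(x0 * grid), grid - 1), min(int(x1 * grid), grid - 1)
    r0, r1 = min(int(y0 * grid), grid - 1), min(int(y1 * grid), grid - 1)
    return [f"r{r}c{c}_{label}"
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


def feature_tokens(vector, scale=10):
    """Surrogate text for a feature vector: repeat token f{i} in proportion
    to the (non-negative) value of component i, so term frequency tracks magnitude."""
    tokens = []
    for i, value in enumerate(vector):
        tokens.extend([f"f{i}"] * int(max(value, 0.0) * scale))
    return tokens


def encode_keyframe(tags, objects, vector):
    """Concatenate tags, object-location tokens and feature tokens into one document."""
    tokens = list(tags)
    for label, box in objects:
        tokens.extend(box_tokens(label, box))
    tokens.extend(feature_tokens(vector))
    return " ".join(tokens)


def score(query_doc, frame_doc):
    """Term-frequency dot product, a stand-in for a text engine's ranking function."""
    q, d = Counter(query_doc.split()), Counter(frame_doc.split())
    return sum(q[t] * d[t] for t in q)


doc = encode_keyframe(["dog"], [("dog", (0.0, 0.0, 0.2, 0.1))], [0.25, 0.0, 0.5])
print(doc)
print(score("r0c0_dog", doc), score("r0c0_cat", doc))
```

Because every modality ends up as tokens in the same document, a query that mixes a keyword, an object placed in a grid cell and an example image can be issued as one text query, and the engine's own scoring merges the partial matches.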
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains as high as around 7% in terms of recall@K=1 score.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text
Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- A Feature Analysis for Multimodal News Retrieval [9.269820020286382]
We consider five feature types for image and text and compare the performance of the retrieval system using different combinations.
Experimental results show that retrieval results can be improved when considering both visual and textual information.
arXiv Detail & Related papers (2020-07-13T14:09:29Z)
- Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval [98.62404433761432]
The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems.
Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries.
We propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos.
arXiv Detail & Related papers (2020-07-06T02:50:27Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video
Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)