Image-text matching for large-scale book collections
- URL: http://arxiv.org/abs/2407.19812v1
- Date: Mon, 29 Jul 2024 09:05:04 GMT
- Title: Image-text matching for large-scale book collections
- Authors: Artemis Llabrés, Arka Ujjal Dey, Dimosthenis Karatzas, Ernest Valveny
- Abstract summary: We address the problem of detecting and mapping all books in a collection of images to entries in a given book catalogue.
We use a state-of-the-art segmentation method (SAM) to detect book spines and a commercial OCR engine to extract the book information from each spine.
To evaluate our approach, we publish a new dataset of annotated bookshelf images that covers the whole book collection of a public library in Spain.
- Score: 10.444851303425589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of detecting and mapping all books in a collection of images to entries in a given book catalogue. Instead of performing independent retrieval for each detected book, we treat the image-text mapping problem as a many-to-many matching process, looking for the best overall match between the two sets. We use a state-of-the-art segmentation method (SAM) to detect book spines and a commercial OCR engine to extract the book information from each spine. We then propose a two-stage approach for text-image matching, where CLIP embeddings are used first for fast matching, followed by a second, slower stage that refines the matching, employing either the Hungarian Algorithm or a BERT-based model trained to cope with noisy OCR input and partial text matches. To evaluate our approach, we publish a new dataset of annotated bookshelf images that covers the whole book collection of a public library in Spain. In addition, we provide two target lists of book metadata: a closed set of 15k book titles that corresponds to the known library inventory, and an open set of 2.3M book titles to simulate an open-world scenario. We report results in two settings: a matching-only task, where the book segments and OCR output are given and the objective is to perform many-to-many matching against the target lists, and a combined detection and matching task, where books must first be detected and recognised before they are matched to the target list entries. We show that both the Hungarian Matching and the proposed BERT-based model outperform a fuzzy string matching baseline, and we highlight inherent limitations of the matching algorithms as the target list grows in size, and when either of the two sets (detected books or target book list) is incomplete. The dataset and code are available at https://github.com/llabres/library-dataset
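The two-stage matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code (see the linked repository for that): it assumes CLIP text embeddings for the fast first stage and SciPy's Hungarian solver for the refinement stage, and names such as `spine_texts`, `catalogue_titles`, and `top_k` are illustrative, with arbitrary shortlist size and cost values.

```python
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from scipy.optimize import linear_sum_assignment

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def embed_texts(texts):
    """Encode strings with CLIP's text tower and L2-normalise the features."""
    with torch.no_grad():
        tokens = clip.tokenize(texts, truncate=True).to(device)
        feats = model.encode_text(tokens).float()
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def match_books(spine_texts, catalogue_titles, top_k=50):
    """Stage 1: cosine similarity shortlists top_k catalogue entries per spine.
    Stage 2: the Hungarian algorithm picks the best overall one-to-one
    assignment, restricted to the shortlisted pairs."""
    spines = embed_texts(spine_texts)        # (S, d)
    titles = embed_texts(catalogue_titles)   # (T, d)
    sim = spines @ titles.T                  # (S, T) cosine similarities

    # Pairs outside the per-spine shortlist get a large cost so the
    # assignment avoids them unless nothing better is available.
    cost = np.full(sim.shape, 10.0)
    for i in range(sim.shape[0]):
        shortlist = np.argsort(-sim[i])[:top_k]
        cost[i, shortlist] = 1.0 - sim[i, shortlist]

    rows, cols = linear_sum_assignment(cost)  # handles rectangular matrices
    return [(spine_texts[r], catalogue_titles[c], float(sim[r, c]))
            for r, c in zip(rows, cols)]
```

For the 2.3M-title open set the dense similarity matrix above would not fit in memory; the sketch is only meant to show how a fast embedding stage and a global assignment stage fit together, not to reproduce the paper's BERT-based refinement or its reported results.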
Related papers
- Keyword Spotting Simplified: A Segmentation-Free Approach using Character Counting and CTC re-scoring [8.6134769826665]
Recent advances in segmentation-free keyword spotting treat the problem within an object detection paradigm.
We propose a novel segmentation-free system that efficiently scans a document image to find rectangular areas that include the query information.
arXiv Detail & Related papers (2023-08-07T12:11:04Z) - CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z) - AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation [42.35572014527354]
The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
arXiv Detail & Related papers (2023-04-04T17:11:34Z) - ASIC: Aligning Sparse in-the-wild Image Collections [86.66498558225625]
We present a method for joint alignment of sparse in-the-wild image collections of an object category.
We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches.
Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences.
arXiv Detail & Related papers (2023-03-28T17:59:28Z) - Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: Object-agnostic pre-processing, manual image selection and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z) - Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z) - Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings that represent documents and align them with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z) - It's AI Match: A Two-Step Approach for Schema Matching Using Embeddings [10.732163031244646]
We propose a novel end-to-end approach for schema matching based on neural embeddings.
Our results show that our approach is able to determine correspondences in a robust and reliable way.
arXiv Detail & Related papers (2022-03-08T19:42:28Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z) - Compact Deep Aggregation for Set Retrieval [87.52470995031997]
We focus on retrieving images containing multiple faces from a large scale dataset of images.
Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities.
We show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that.
arXiv Detail & Related papers (2020-03-26T08:43:15Z)