PICS: Pipeline for Image Captioning and Search
- URL: http://arxiv.org/abs/2402.10090v1
- Date: Thu, 1 Feb 2024 03:08:21 GMT
- Title: PICS: Pipeline for Image Captioning and Search
- Authors: Grant Rosario, David Noever
- Abstract summary: This paper introduces PICS (Pipeline for Image Captioning and Search), a novel approach designed to address the complexities inherent in organizing large-scale image repositories.
The approach is rooted in the understanding that meaningful, AI-generated captions can significantly enhance the searchability and accessibility of images in large databases.
The significance of PICS lies in its potential to transform image database systems, harnessing the power of machine learning and natural language processing to meet the demands of modern digital asset management.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing volume of digital images necessitates advanced systems for
efficient categorization and retrieval, presenting a significant challenge in
database management and information retrieval. This paper introduces PICS
(Pipeline for Image Captioning and Search), a novel approach designed to
address the complexities inherent in organizing large-scale image repositories.
PICS leverages the advancements in Large Language Models (LLMs) to automate the
process of image captioning, offering a solution that transcends traditional
manual annotation methods. The approach is rooted in the understanding that
meaningful, AI-generated captions can significantly enhance the searchability
and accessibility of images in large databases. By integrating sentiment
analysis into the pipeline, PICS further enriches the metadata, enabling
nuanced searches that extend beyond basic descriptors. This methodology not
only simplifies the task of managing vast image collections but also sets a new
precedent for accuracy and efficiency in image retrieval. The significance of
PICS lies in its potential to transform image database systems, harnessing the
power of machine learning and natural language processing to meet the demands
of modern digital asset management.
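The abstract describes a three-step flow: generate a caption, attach sentiment, then search the resulting metadata. A minimal sketch of that flow is below; `generate_caption` and `score_sentiment` are hypothetical stubs standing in for the LLM captioner and sentiment model, neither of which is specified in this summary.

```python
# Sketch of a caption-and-search pipeline in the spirit of PICS.
# The two model calls are stubs; a real system would invoke an LLM
# captioner and a sentiment classifier here.

def generate_caption(image_path):
    # Stub: a real implementation would send the image to a captioning model.
    return f"a photo stored at {image_path}"

def score_sentiment(caption):
    # Stub: a real implementation would run a sentiment classifier.
    positive_words = {"sunny", "happy", "beautiful"}
    return "positive" if set(caption.lower().split()) & positive_words else "neutral"

def index_images(image_paths):
    """Build searchable metadata: caption plus sentiment per image."""
    index = []
    for path in image_paths:
        caption = generate_caption(path)
        index.append({"path": path, "caption": caption,
                      "sentiment": score_sentiment(caption)})
    return index

def search(index, query, sentiment=None):
    """Keyword match on captions, optionally filtered by sentiment label."""
    terms = query.lower().split()
    hits = [e for e in index if all(t in e["caption"].lower() for t in terms)]
    if sentiment is not None:
        hits = [e for e in hits if e["sentiment"] == sentiment]
    return [e["path"] for e in hits]
```

The sentiment field is what enables the "nuanced searches that extend beyond basic descriptors" mentioned above: the same keyword query can be filtered to, say, only positively captioned images.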
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models [2.3301643766310374]
By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data.
We show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods.
We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
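The iterative keyword idea above can be sketched as a loop that feeds terms from the top-ranked result back into the query. The scoring function here is plain term overlap, a stand-in only; the paper derives its sparse lexical terms from an M-LLM.

```python
# Hedged sketch of iterative query expansion over a toy text index.
# score() is simple term overlap, not the paper's M-LLM representation.

def score(query_terms, doc_terms):
    return len(query_terms & doc_terms)

def expand_query(query, docs, rounds=2, take=1):
    """Iteratively add unseen keywords from the best-ranked document."""
    query_terms = set(query.lower().split())
    for _ in range(rounds):
        ranked = sorted(docs, key=lambda d: score(query_terms, set(d.split())),
                        reverse=True)
        new_terms = set(ranked[0].split()) - query_terms
        if not new_terms:
            break  # the top hit contributes nothing new
        query_terms |= set(sorted(new_terms)[:take])
    return query_terms
```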
arXiv Detail & Related papers (2024-08-29T06:54:03Z)
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
- Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression [0.6345523830122168]
Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data.
Our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression.
Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
arXiv Detail & Related papers (2024-04-16T02:29:00Z)
- Enhancing Image Retrieval: A Comprehensive Study on Photo Search using the CLIP Model [0.27195102129095]
Photo search has witnessed significant advancements with the introduction of CLIP (Contrastive Language-Image Pretraining) model.
This abstract summarizes the foundational principles of CLIP and highlights its potential impact on advancing the field of photo search.
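CLIP-based photo search boils down to ranking images by cosine similarity between a text embedding and image embeddings in a shared space. A small sketch of that ranking step, with toy placeholder vectors standing in for the embeddings CLIP's encoders would produce:

```python
import math

# Sketch of CLIP-style photo search: rank images by cosine similarity
# between a query-text embedding and precomputed image embeddings.
# The vectors here are placeholders, not real CLIP outputs.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_images(text_emb, image_embs):
    """Return image ids sorted from most to least similar to the query."""
    scored = [(name, cosine(text_emb, emb)) for name, emb in image_embs.items()]
    return [name for name, _ in sorted(scored, key=lambda x: -x[1])]
```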
arXiv Detail & Related papers (2024-01-24T17:35:38Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps the need for expensive supervised training by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features [12.14013374452918]
We present a simple yet effective approach to object-centric open-vocabulary image retrieval.
Our approach aggregates dense embeddings extracted from CLIP into a compact representation.
We show the effectiveness of our scheme to the task by achieving significantly better results than global feature approaches on three datasets.
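Aggregating dense embeddings into a compact representation can be sketched with mean pooling, one plausible aggregation; the summary above does not state the paper's exact scheme.

```python
# Sketch: collapse dense per-region embeddings (e.g. from CLIP) into a
# single compact vector by mean pooling. This is an illustrative
# aggregation, not necessarily the paper's.

def aggregate(dense_embeddings):
    """Mean-pool a list of equal-length vectors into one vector."""
    n = len(dense_embeddings)
    dim = len(dense_embeddings[0])
    return [sum(vec[i] for vec in dense_embeddings) / n for i in range(dim)]
```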
arXiv Detail & Related papers (2023-09-26T15:13:09Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
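Recall@K, the metric quoted above, is the fraction of queries whose relevant item appears among the top K retrieved results. A straightforward implementation, assuming one relevant item per query:

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose relevant item appears in the top-k list.

    ranked_results: dict mapping query -> ranked list of retrieved ids.
    ground_truth:   dict mapping query -> the single relevant id.
    """
    hits = sum(1 for q, ranking in ranked_results.items()
               if ground_truth[q] in ranking[:k])
    return hits / len(ranked_results)
```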
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper presents a holistic goal of maintaining spatially-precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.