Needle: A Generative AI-Powered Multi-modal Database for Answering Complex Natural Language Queries
- URL: http://arxiv.org/abs/2412.00639v2
- Date: Mon, 02 Jun 2025 15:22:19 GMT
- Title: Needle: A Generative AI-Powered Multi-modal Database for Answering Complex Natural Language Queries
- Authors: Mahdi Erfanian, Mohsen Dehghankar, Abolfazl Asudeh
- Abstract summary: Multi-modal datasets often miss the detailed descriptions that properly capture the rich information encoded in each item. This makes answering complex natural language queries a major challenge in this domain. We introduce a Generative-based Monte Carlo method that utilizes foundation models to generate synthetic samples. Our system is open-source and ready for deployment, designed to be easily adopted by researchers and developers.
- Score: 8.779871128906787
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-modal datasets, like those involving images, often miss the detailed descriptions that properly capture the rich information encoded in each item. This makes answering complex natural language queries a major challenge in this domain. In particular, unlike traditional nearest neighbor search, where the tuples and the query are represented as points in a single metric space, these settings involve queries and tuples embedded in fundamentally different spaces, making traditional query answering methods inapplicable. Existing literature addresses this challenge for image datasets through vector representations jointly trained on natural language and images. This technique, however, underperforms for complex queries. This paper takes a step towards addressing this challenge by introducing a Generative-based Monte Carlo method that utilizes foundation models to generate synthetic samples that capture the complexity of the natural language query and represent it in the same metric space as the multi-modal data. Following this method, we propose Needle, a database for image data retrieval. Instead of relying on contrastive learning or metadata-searching approaches, our system is based on synthetic data generation to capture the complexities of natural language queries. Our system is open-source and ready for deployment, designed to be easily adopted by researchers and developers. Comprehensive experiments on various benchmark datasets verify that this system significantly outperforms state-of-the-art text-to-image retrieval methods in the literature. Any foundation model and embedder can be easily integrated into Needle to improve performance, piggybacking on advancements in these technologies.
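The retrieval idea described in the abstract can be sketched in two steps: a text-to-image foundation model turns the query into several synthetic images, an image embedder maps those samples and the database items into one metric space, and each item is scored by its aggregate similarity to the samples. The sketch below illustrates only the scoring step; the function names and the mean-cosine aggregation are our assumptions for illustration, not Needle's actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def monte_carlo_retrieve(synthetic_embeds, db_embeds, k=3):
    """Score each database item by its mean similarity to the
    synthetic samples generated from the query, then return the
    indices of the top-k items."""
    scores = [
        (i, sum(cosine(s, d) for s in synthetic_embeds) / len(synthetic_embeds))
        for i, d in enumerate(db_embeds)
    ]
    scores.sort(key=lambda t: -t[1])
    return [i for i, _ in scores[:k]]
```

Averaging over several synthetic samples is what makes this a Monte Carlo estimate: each generated image is one draw from the distribution of images the query describes, so the mean similarity is more robust to any single off-target generation.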
Related papers
- Multi-turn Natural Language to Graph Query Language Translation [15.249580032219336]
In practical applications, user interactions with graph databases are typically multi-turn, dynamic, and context-dependent. Research focused on single-turn conversion fails to effectively address multi-turn dialogues and complex context dependencies. We propose an automated method for constructing multi-turn NL2GQL datasets based on Large Language Models (LLMs).
arXiv Detail & Related papers (2025-08-03T17:56:52Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - GridMind: A Multi-Agent NLP Framework for Unified, Cross-Modal NFL Data Insights [0.0]
This paper introduces GridMind, a framework that unifies structured, semi-structured, and unstructured data through Retrieval-Augmented Generation (RAG) and large language models (LLMs).
This approach aligns with the evolving field of multimodal representation learning, where unified models are increasingly essential for real-time, cross-modal interactions.
arXiv Detail & Related papers (2025-03-24T18:33:36Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent [6.147666891384964]
XMODE is a system that enables explainable, multi-modal data exploration in natural language. XMODE is inspired by a real-world use case that enables users to explore multi-modal information systems.
arXiv Detail & Related papers (2024-12-24T13:42:44Z) - Leveraging LLMs to Enable Natural Language Search on Go-to-market Platforms [0.23301643766310368]
We implement and evaluate a solution for the Zoominfo product for sellers, which prompts Large Language Models with natural language.
The intermediary search fields offer numerous advantages for each query, including the elimination of syntax errors.
Comprehensive experiments with closed, open source, and fine-tuned LLM models were conducted to demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2024-11-07T03:58:38Z) - Data Fusion of Synthetic Query Variants With Generative Large Language Models [1.864807003137943]
This work explores the feasibility of using synthetic query variants generated by instruction-tuned Large Language Models in data fusion experiments.
We introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques.
Our analysis shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods.
arXiv Detail & Related papers (2024-11-06T12:54:27Z) - MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs). We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs [12.846097618151951]
We develop a dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage.
We synthesize realistic user queries that integrate diverse topological structures, annotated information, and complex textual descriptions.
We introduce an enhanced Monte Carlo Tree Search (MCTS) method, which automatically extracts relational path information from textual graphs for specific queries.
arXiv Detail & Related papers (2024-10-17T19:33:37Z) - A Survey of Multimodal Composite Editing and Retrieval [7.966265020507201]
This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
arXiv Detail & Related papers (2024-09-09T08:06:50Z) - SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs [6.879945062426145]
We generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs.
We demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.
arXiv Detail & Related papers (2024-06-28T01:14:43Z) - ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z) - Unity by Diversity: Improved Representation Learning in Multimodal VAEs [24.691068754720106]
We show that a better latent representation can be obtained by replacing hard constraints with a soft constraint.
We show improved learned latent representations and imputation of missing data modalities compared to existing methods.
arXiv Detail & Related papers (2024-03-08T13:29:46Z) - ReFusion: Improving Natural Language Understanding with Computation-Efficient Retrieval Representation Fusion [22.164620956284466]
Retrieval-based augmentations (RA) incorporating knowledge from an external database into language models have greatly succeeded in various knowledge-intensive (KI) tasks.
Existing works focus on concatenating retrievals with inputs to improve model performance.
This paper proposes a new paradigm of RA named ReFusion, a computation-efficient Retrieval representation Fusion with bi-level optimization.
arXiv Detail & Related papers (2024-01-04T07:39:26Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.