Related papers: Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision

Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision

URL: http://arxiv.org/abs/2506.20070v1
Date: Wed, 25 Jun 2025 00:25:08 GMT
Title: Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision
Authors: KMA Solaiman, Bharat Bhargava,
Abstract summary: "FemmIR" is a framework to retrieve results relevant to information needs expressed with multimodal queries by example without any similarity label.<n>We empirically evaluate FemmIR on a missing person use case with MuQNOL.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Existing multi-media retrieval models either rely on creating a common subspace with modality-specific representation models or require schema mapping among modalities to measure similarities among multi-media data. Our goal is to avoid the annotation overhead incurred from considering retrieval as a supervised classification task and re-use the pretrained encoders in large language models and vision tasks. We propose "FemmIR", a framework to retrieve multimodal results relevant to information needs expressed with multimodal queries by example without any similarity label. Such identification is necessary for real-world applications where data annotations are scarce and satisfactory performance is required without fine-tuning with a common framework across applications. We curate a new dataset called MuQNOL for benchmarking progress on this task. Our technique is based on weak supervision introduced through edit distance between samples: graph edit distance can be modified to consider the cost of replacing a data sample in terms of its properties, and relevance can be measured through the implicit signal from the amount of edit cost among the objects. Unlike metric learning or encoding networks, FemmIR re-uses the high-level properties and maintains the property value and relationship constraints with a multi-level interaction score between data samples and the query example provided by the user. We empirically evaluate FemmIR on a missing person use case with MuQNOL. FemmIR performs comparably to similar retrieval systems in delivering on-demand retrieval results with exact and approximate similarities while using the existing property identifiers in the system.

Related papers

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.<n>To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data.<n>Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z)
MultiConIR: Towards multi-condition Information Retrieval [57.6405602406446]
We introduce MultiConIR, the first benchmark designed to evaluate retrieval models in multi-condition scenarios.<n>We propose three tasks to assess retrieval and reranking models on multi-condition robustness, monotonic relevance ranking, and query format sensitivity.
arXiv Detail & Related papers (2025-03-11T05:02:03Z)
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism.<n>Our MI based model, MI-DETR, outperforms all existing DETR-like models on COCO benchmark.<n>A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
Do Retrieval-Augmented Language Models Adapt to Varying User Needs? [28.729041459278587]
This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases.<n>By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications.<n>Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems.
arXiv Detail & Related papers (2025-02-27T05:39:38Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition [9.506482334842293]
Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task.<n>Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task.<n>We propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels.
arXiv Detail & Related papers (2024-07-17T05:42:43Z)
CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances, which are semantically related to the query through the interaction of different modal data.<n>Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.<n>We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
MMGRec: Multimodal Generative Recommendation with Transformer Model [81.61896141495144]
MMGRec aims to introduce a generative paradigm into multimodal recommendation. We first devise a hierarchical quantization method Graph CF-RQVAE to assign Rec-ID for each item from its multimodal information. We then train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences.
arXiv Detail & Related papers (2024-04-25T12:11:27Z)
How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation [1.7812428873698403]
We propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics. These benchmark data sets can then be used for model training and a variety of evaluation tasks.
arXiv Detail & Related papers (2024-04-08T15:53:29Z)
SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting [71.38754976584009]
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. We propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (Net) It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size.
arXiv Detail & Related papers (2023-11-16T16:50:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.