Related papers: Unsupervised Collaborative Metric Learning with Mixed-Scale Groups for General Object Retrieval

Related papers

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages.<n>ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering [55.49652734090316]
A knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval.<n>We propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages.<n> Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements in answer quality, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T12:10:00Z)
Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook [6.356031718676495]
Large language models (LLMs) have emerged as novel aids for information retrieval during learning.<n>We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering.
arXiv Detail & Related papers (2025-09-20T19:06:49Z)
Composed Object Retrieval: Object-level Retrieval via Composed Expressions [71.47650333199628]
Composed Object Retrieval (COR) is a brand-new task that goes beyond image-level retrieval to achieve object-level precision.<n>We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories.<n>We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning.
arXiv Detail & Related papers (2025-08-06T13:11:40Z)
Are We Done with Object-Centric Learning? [65.67948794110212]
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
arXiv Detail & Related papers (2025-04-09T17:59:05Z)
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines [17.803396998387665]
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task.<n>We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task.<n>Our model functions both as a generative retriever and an accurate answer generator.
arXiv Detail & Related papers (2025-02-23T16:39:39Z)
Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering [20.16172308719101]
Zero-shot visual question answering (ZS-VQA) intends to answer visual questions without providing training samples.<n>Existing research in ZS-VQA has proposed to leverage knowledge graphs or large language models (LLMs) as external information sources.<n>We propose a novel design to combine knowledge graph and LLMs for zero-shot visual question answer.
arXiv Detail & Related papers (2025-01-22T08:14:11Z)
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts [14.631774737903015]
Existing perception models achieve great success by learning from large amounts of labeled data, but they still struggle with open-world scenarios. We present textiti.e., open-ended object detection, which discovers unseen objects without any object categories as inputs. We show that our method surpasses the previous open-ended method on the object detection task and can provide additional instance segmentation masks.
arXiv Detail & Related papers (2024-10-08T12:15:08Z)
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. We have established a new REC dataset characterized by two key features. It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
EchoSight: Advancing Visual-Language Models with Wiki Knowledge [39.02148880719576]
We introduce EchoSight, a novel framework for knowledge-based Visual Question Answering.<n>To strive for high-performing retrieval, EchoSight first searches wiki articles by using visual-only information.<n>Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA.
arXiv Detail & Related papers (2024-07-17T16:55:42Z)
SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM [48.15067480282839]
This work introduces a novel evaluative benchmark named textbfSnapNTell, specifically tailored for entity-centric VQA. The dataset is organized into 22 major categories, containing 7,568 unique entities in total. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BELURT score.
arXiv Detail & Related papers (2024-03-07T18:38:17Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools. AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model ReViz'' that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion. We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph [89.98762327725112]
Multi-hop Question Answering over Knowledge Graph(KGQA) aims to find the answer entities that are multiple hops away from the topic entities mentioned in a natural language question. We propose UniKGQA, a novel approach for multi-hop KGQA task, by unifying retrieval and reasoning in both model architecture and parameter learning.
arXiv Detail & Related papers (2022-12-02T04:08:09Z)
Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images. We follow a retrieval-based strategy and prevent the network from learning object-specific features. Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object. This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen an increased attention in the research community recently. We propose a visual capsule module with a query-based selection mechanism of capsule features. We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Unsupervised Discovery of the Long-Tail in Instance Segmentation Using Hierarchical Self-Supervision [3.841232411073827]
We propose a method that can perform unsupervised discovery of long-tail categories in instance segmentation. Our model is able to discover novel and more fine-grained objects than the common categories. We show that the model achieves competitive quantitative results on LVIS as compared to the supervised and partially supervised methods.
arXiv Detail & Related papers (2021-04-02T22:05:03Z)
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset. The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
Novel Human-Object Interaction Detection via Adversarial Domain Generalization [103.55143362926388]
We study the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios. The challenge mainly stems from the large compositional space of objects and predicates, which leads to the lack of sufficient training data for all the object-predicate combinations. We propose a unified framework of adversarial domain generalization to learn object-invariant features for predicate prediction.
arXiv Detail & Related papers (2020-05-22T22:02:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.