An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2503.17285v1
- Date: Fri, 21 Mar 2025 16:34:04 GMT
- Title: An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection
- Authors: Louis Y. Kim, Michelle Karker, Victoria Valledor, Seiyoung C. Lee, Karl F. Brzoska, Margaret Duff, Anthony Palladino,
- Abstract summary: We present an approach for improving non-technical users' natural language text descriptions of their desired targets of interest.<n>We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly-available open-vocabulary object detection models.
- Score: 0.08974531206817744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in open-vocabulary object detection models will enable Automatic Target Recognition systems to be sustainable and repurposed by non-technical end-users for a variety of applications or missions. New, and potentially nuanced, classes can be defined with natural language text descriptions in the field, immediately before runtime, without needing to retrain the model. We present an approach for improving non-technical users' natural language text descriptions of their desired targets of interest, using a combination of analysis techniques on the text embeddings, and proper combinations of embeddings for contrastive examples. We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly-available open-vocabulary object detection models.
Related papers
- An Application-Agnostic Automatic Target Recognition System Using Vision Language Models [32.858386851006316]
We present a novel Automatic Target Recognition (ATR) system using open-vocabulary object detection and classification models.
A primary advantage of this approach is that target classes can be defined just before runtime by a non-technical end user.
Nuances in the desired targets can be expressed in natural language, which is useful for unique targets with little or no training data.
arXiv Detail & Related papers (2024-11-05T20:16:15Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z) - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive rich, speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z) - Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Improving Adversarial Text Generation by Modeling the Distant Future [155.83051741029732]
We consider a text planning scheme and present a model-based imitation-learning approach to alleviate the aforementioned issues.
We propose a novel guider network to focus on the generative process over a longer horizon, which can assist next-word prediction and provide intermediate rewards for generator optimization.
arXiv Detail & Related papers (2020-05-04T05:45:13Z) - Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
arXiv Detail & Related papers (2020-04-28T00:15:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.