An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set
- URL: http://arxiv.org/abs/2408.05772v1
- Date: Sun, 11 Aug 2024 13:40:02 GMT
- Title: An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set
- Authors: Chaoyi Ai
- Abstract summary: Under default settings, Human-Object Interaction (HOI) performance is nearly saturated.
This study uses two experimental settings: ground truth and random arbitrary combinations.
We find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle human, object, verb \rangle$ triplets. Under default settings, HOI performance is nearly saturated, with many studies focusing on long-tail distribution and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if there is only a test dataset without a training dataset, using a multimodal visual foundation model in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We draw some interesting conclusions and find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.
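The training-free "ground truth" setting can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration, not the authors' exact pipeline: given ground-truth human and object boxes for one pair, it crops the union region and ranks candidate verbs with an off-the-shelf CLIP model via text prompts. The model checkpoint, the prompt template, and the `score_verbs` helper are illustrative assumptions; in the variant the abstract mentions, a detector such as Grounding DINO could supply the boxes or features instead of the ground-truth annotations.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf multimodal foundation model (assumed checkpoint, used training-free).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_verbs(image, human_box, object_box, object_name, verbs):
    # Crop the union of the human and object boxes; boxes are assumed to be
    # ground-truth (x1, y1, x2, y2) pixel coordinates from the test set.
    x1 = min(human_box[0], object_box[0]); y1 = min(human_box[1], object_box[1])
    x2 = max(human_box[2], object_box[2]); y2 = max(human_box[3], object_box[3])
    region = image.crop((x1, y1, x2, y2))

    # One natural-language prompt per candidate verb; this template is an
    # assumption for illustration, not taken from the paper.
    prompts = [f"a photo of a person {verb} a {object_name}" for verb in verbs]

    inputs = processor(text=prompts, images=region, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Image-text similarity over the verb prompts, normalized to a distribution.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return dict(zip(verbs, probs.tolist()))

# Example: rank candidate verbs for one <human, object> pair in one test image.
img = Image.open("example.jpg").convert("RGB")
scores = score_verbs(img,
                     human_box=(30, 40, 200, 480),
                     object_box=(150, 300, 420, 470),
                     object_name="bicycle",
                     verbs=["riding", "holding", "repairing", "walking next to"])
print(max(scores, key=scores.get))
```

Under this kind of setup, the "random arbitrary combinations" setting would simply replace the ground-truth pairing with arbitrary human-object pairs before scoring; no parameter of the foundation model is updated at any point.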
Related papers
- Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training [4.062463195973711]
We investigate the role of 3 primary variables in a limited data regime as part of the BabyLM challenge.
We find that curriculum learning benefits multimodal evaluations compared to non-curriculum learning models.
arXiv Detail & Related papers (2024-10-20T21:03:51Z) - Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z) - Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z) - One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z) - RelVAE: Generative Pretraining for few-shot Visual Relationship Detection [2.2230760534775915]
We present the first pretraining method for few-shot predicate classification that does not require any annotated relations.
We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets.
arXiv Detail & Related papers (2023-11-27T19:08:08Z) - Learn What Is Possible, Then Choose What Is Best: Disentangling One-To-Many Relations in Language Through Text-based Games [3.615981646205045]
We present an approach to train language models that can emulate the desirable behaviours, but not the undesirable ones.
Using text-based games as a testbed, our approach, PASA, uses discrete latent variables to capture the range of different behaviours.
Results show up to 49% empirical improvement over the previous state-of-the-art model.
arXiv Detail & Related papers (2023-04-14T17:11:26Z) - Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models [69.31424345583537]
Humans use cross-modal information to learn new concepts efficiently.
We show that one can indeed build a better $\textbf{visual}$ dog classifier by reading about dogs and listening to them bark.
We construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
arXiv Detail & Related papers (2023-01-16T05:40:42Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z) - Assisting Scene Graph Generation with Self-Supervision [21.89909688056478]
We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks to the main model.
When we train the base model from scratch with these self-supervision tasks, we achieve state-of-the-art results in all metrics and recall settings.
arXiv Detail & Related papers (2020-08-08T16:38:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.