Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor
- URL: http://arxiv.org/abs/2307.02291v3
- Date: Sun, 22 Dec 2024 06:14:28 GMT
- Title: Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor
- Authors: Junwen Chen, Yingcheng Wang, Keiji Yanai
- Abstract summary: We introduce SOV-STG-VLA with three key components: Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA).
Our SOV decoders disentangle object detection and verb recognition with a novel interaction region representation.
Our VLA notably improves our SOV-STG and achieves SOTA performance with one-sixth of the training epochs of recent SOTA methods.
- Score: 15.364628173661778
- Abstract: Recent transformer-based methods achieve notable gains in the Human-Object Interaction Detection (HOID) task by leveraging the detection capability of DETR and the prior knowledge of a Vision-Language Model (VLM). However, these methods suffer from extended training times and complex optimization due to the entanglement of object detection and HOI recognition during the decoding process. In particular, the query embeddings used to predict both labels and boxes suffer from ambiguous representations, and the gap between the prediction of HOI labels and verb labels is not considered. To address these challenges, we introduce SOV-STG-VLA with three key components: Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA). Our SOV decoders disentangle object detection and verb recognition with a novel interaction region representation. The STG denoising strategy learns label embeddings with ground-truth information to guide training and inference. Our SOV-STG achieves fast convergence and high accuracy and builds a foundation for the VLA to incorporate the prior knowledge of the VLM. We introduce a vision advisor decoder to fuse the interaction region information with the VLM's vision knowledge, and a Verb-HOI prediction bridge to promote interaction representation learning. Our VLA notably improves our SOV-STG and achieves SOTA performance with one-sixth of the training epochs of recent SOTA methods. Code and models are available at https://github.com/cjw2021/SOV-STG-VLA
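The disentangled decoding described in the abstract can be pictured as three decoders that share encoder features, with the subject and object states fused into an interaction-region query for the verb decoder. The PyTorch-style sketch below is a minimal illustration under assumed module names, query counts, and label dimensions; it is not the authors' implementation (STG denoising and the vision-language advisor are omitted), so see the linked repository for the real code.

```python
import torch
import torch.nn as nn


class SOVDecoderSketch(nn.Module):
    """Minimal sketch of disentangled Subject-Object-Verb decoding.

    Layer counts, query count, and label dimensions are assumptions for
    illustration; STG denoising and the vision-language advisor are omitted.
    """

    def __init__(self, d_model=256, num_queries=64,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Dedicated decoders: boxes for subjects/objects, labels for verbs.
        self.subject_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.object_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.verb_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.queries = nn.Embedding(num_queries, d_model)
        self.subject_box_head = nn.Linear(d_model, 4)
        self.object_box_head = nn.Linear(d_model, 4)
        self.object_cls_head = nn.Linear(d_model, num_obj_classes)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)
        # Fuse subject/object states into an interaction-region query for verbs.
        self.interaction_fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, memory):
        # memory: encoder features, shape (batch, tokens, d_model)
        queries = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        sub = self.subject_decoder(queries, memory)
        obj = self.object_decoder(queries, memory)
        interaction = self.interaction_fuse(torch.cat([sub, obj], dim=-1))
        verb = self.verb_decoder(interaction, memory)
        return {
            "sub_boxes": self.subject_box_head(sub).sigmoid(),
            "obj_boxes": self.object_box_head(obj).sigmoid(),
            "obj_logits": self.object_cls_head(obj),
            "verb_logits": self.verb_cls_head(verb),
        }
```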
Related papers
- VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis [8.676256316312877]
We introduce a novel approach that explicitly utilizes a VLM as an objective function for the Human-Object Interaction (HOI) detection task (VLM-HOI).
We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature.
Experiments demonstrate the effectiveness of our method, achieving state-of-the-art HOI detection accuracy on benchmarks.
arXiv Detail & Related papers (2024-11-27T04:13:23Z)
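A minimal way to picture the idea in the VLM-HOI entry above: candidate <human, verb, object> triplets are rendered as sentences and ranked by an image-text matching score from a pretrained VLM. The helper below is a hypothetical sketch; the scoring hook, template, and function names are assumptions, not the paper's interface.

```python
from typing import Callable, List, Tuple

# Hypothetical hook: any vision-language model exposing an image-text
# matching score (e.g., an ITM head). Its exact API is an assumption here.
ScoreFn = Callable[[object, str], float]


def verbalize(triplet: Tuple[str, str, str]) -> str:
    """Render an HOI triplet <subject, verb, object> as a sentence."""
    subject, verb, obj = triplet
    return f"a {subject} is {verb} a {obj}"


def rank_triplets(image: object,
                  candidates: List[Tuple[str, str, str]],
                  score_fn: ScoreFn) -> List[Tuple[float, Tuple[str, str, str]]]:
    """Score each verbalized triplet against the image and sort best-first."""
    scored = [(score_fn(image, verbalize(t)), t) for t in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Usage (score_fn would wrap a pretrained VLM's matching head):
# rank_triplets(img, [("person", "riding", "bicycle"),
#                     ("person", "holding", "cup")], score_fn)
```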
- CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models [10.62320998365966]
Vision Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection.
We propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations.
Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations.
arXiv Detail & Related papers (2024-10-21T05:51:51Z)
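The two-stage distillation in the CL-HOI entry above ultimately has to bridge image-level teacher signals and instance-level student predictions. One generic way to do that is multiple-instance-style pooling of per-pair scores followed by a distillation loss, sketched below; this is an illustrative pattern, not necessarily CL-HOI's exact formulation, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F


def image_level_distillation_loss(instance_logits: torch.Tensor,
                                  teacher_image_probs: torch.Tensor) -> torch.Tensor:
    """Generic image-level -> instance-level distillation via MIL-style pooling.

    instance_logits: (num_instances, num_interactions) student scores for each
        human-object pair in one image.
    teacher_image_probs: (num_interactions,) image-level interaction
        probabilities produced by a vision large language model.
    """
    # Aggregate instance scores into an image-level prediction (soft max-pool).
    image_logits = torch.logsumexp(instance_logits, dim=0)
    # Match the pooled student prediction to the teacher's image-level signal.
    return F.binary_cross_entropy_with_logits(image_logits, teacher_image_probs)
```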
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection setting.
Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
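For the ModaMixer mentioned in the entry above, one simple way to mix a language embedding into vision features is channel-wise gating, sketched below; the gating form, dimensions, and module name are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class ChannelGateMixerSketch(nn.Module):
    """Hedged sketch: mix a language embedding into vision features by
    channel-wise gating. Dimensions and the gating form are assumptions."""

    def __init__(self, vision_channels: int = 256, text_dim: int = 768):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Linear(text_dim, vision_channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, vision_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # vision_feat: (B, C, H, W); text_emb: (B, text_dim)
        gate = self.to_gate(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        return vision_feat * gate                        # language-reweighted features
```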
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
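The DANCE entry above hinges on turning knowledge-graph facts into text that can be paired with images as augmented training data. The helper below is a generic linearization sketch; the template and relation formatting are assumptions, not DANCE's actual prompts.

```python
from typing import List, Tuple


def linearize_triples(entity: str, triples: List[Tuple[str, str, str]]) -> str:
    """Turn knowledge-graph triples about an entity into a natural-language
    description that can be paired with its images as extra training text.
    A generic linearization; DANCE's exact templates may differ."""
    facts = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples]
    return f"{entity}: " + "; ".join(facts) + "."


# Example:
# linearize_triples("umbrella", [("umbrella", "used_for", "blocking rain"),
#                                ("umbrella", "at_location", "closet")])
# -> "umbrella: umbrella used for blocking rain; umbrella at location closet."
```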
- Vision-Language Adaptive Mutual Decoder for OOV-STR [39.35424739459689]
We design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly address out-of-vocabulary (OOV) problems.
Our approach achieved 70.31% and 59.61% word accuracy on the IV+OOV and OOV settings, respectively, of the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE Workshop.
arXiv Detail & Related papers (2022-09-02T07:32:22Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by cross-modal encoders' success in visual-language tasks while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
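The XDBERT entry above describes distilling visual knowledge from a cross-modal teacher into a text-only encoder. A generic way to express that is aligning token hidden states through a learned projection and a regression loss, sketched below; the projection, loss choice, and dimensions are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiddenStateDistiller(nn.Module):
    """Generic sketch: align a language encoder's token states with those of a
    cross-modal teacher via a learned projection and an MSE objective."""

    def __init__(self, student_dim: int = 768, teacher_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden, teacher_hidden: (batch, seq_len, dim)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())
```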