Related papers: ART: Adaptive Relation Tuning for Generalized Relation Prediction

URL: http://arxiv.org/abs/2507.23543v1
Date: Thu, 31 Jul 2025 13:34:06 GMT
Title: ART: Adaptive Relation Tuning for Generalized Relation Prediction
Authors: Gopika Sudhakaran, Hikaru Shindo, Patrick Schramowski, Simone Schaub-Meyer, Kristian Kersting, Stefan Roth
Abstract summary: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data.
Score: 33.15138052099355
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.
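
The conversion of a relation annotation into an instruction-tuning sample, as described in the abstract, might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the `RelationAnnotation` type, the `to_instruction_sample` helper, the box convention, and the prompt wording are all hypothetical, and the adaptive sampling step is not shown.

```python
from dataclasses import dataclass


@dataclass
class RelationAnnotation:
    """One annotated relation: two boxed objects and the predicate linking them.

    Hypothetical container; the field names are not taken from the paper.
    """
    subject: str
    subject_box: tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels
    object: str
    object_box: tuple[int, int, int, int]
    predicate: str


def to_instruction_sample(ann: RelationAnnotation) -> dict[str, str]:
    """Turn a relation annotation into an instruction/response pair.

    The subject and object boxes are given in the prompt, and the predicate
    is the expected answer (the relation classification setting).
    The prompt template is an assumption, not the one used by ART.
    """
    instruction = (
        f"What is the relationship between the {ann.subject} "
        f"at {list(ann.subject_box)} and the {ann.object} "
        f"at {list(ann.object_box)}?"
    )
    return {"instruction": instruction, "response": ann.predicate}


sample = to_instruction_sample(
    RelationAnnotation(
        subject="person",
        subject_box=(10, 20, 110, 220),
        object="horse",
        object_box=(60, 90, 300, 260),
        predicate="riding",
    )
)
print(sample["instruction"])
print(sample["response"])
```

Keeping the predicate as free text in the response, rather than as a class index, is one way a tuned model could in principle produce relation words outside the training vocabulary.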

Related papers

Task-Agnostic Contrastive Pretraining for Relational Deep Learning [0.0]
We propose a novel task-agnostic contrastive pretraining approach for RDL that enables database-wide representation learning. We implement the respective pretraining approach through a modular RDL architecture. Our preliminary results demonstrate that finetuning the pretrained models measurably outperforms training from scratch.
arXiv Detail & Related papers (2025-06-27T13:18:13Z)
Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner. Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition [53.02634128715853]
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR. It unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view.
arXiv Detail & Related papers (2025-04-14T10:23:22Z)
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [63.54377402784965]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
DreamRelation: Relation-Centric Video Customization [33.65405972817795]
Video customization refers to the creation of personalized videos that depict user-specified relations between two subjects. While existing methods can personalize subject appearances and motions, they still struggle with complex video customization. We propose DreamRelation, a novel approach capturing a small set of videos, leveraging two key components: Decoupling Learning and Dynamics Enhancement.
arXiv Detail & Related papers (2025-03-10T17:58:03Z)
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models [20.92507667350599]
We introduce a verbalized learning framework named VERA that enables vision-language models to perform video anomaly detection. VERA decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores.
arXiv Detail & Related papers (2024-12-02T04:10:14Z)
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks. Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
Enhancing Low-Resource Relation Representations through Multi-View Decoupling [21.32064890807893]
We propose a novel prompt-based relation representation method, named MVRE. MVRE decouples each relation into different perspectives to encompass multi-view relation representations. Our method can achieve state-of-the-art performance in low-resource settings.
arXiv Detail & Related papers (2023-12-26T14:16:16Z)
Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment. Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations. Our framework preserves the relations between samples well. By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.