Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- URL: http://arxiv.org/abs/2512.10955v1
- Date: Thu, 11 Dec 2025 18:59:56 GMT
- Title: Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
- Abstract summary: We introduce Omni-Attribute, the first open-vocabulary image attribute encoder to learn attribute-specific representations. We use a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation.
- Score: 82.31106470150844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
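As a concrete reading of the dual-objective paradigm, the sketch below combines a standard denoising (generative) loss with an InfoNCE-style contrastive loss over attribute embeddings from a semantically linked image pair. This is a minimal PyTorch illustration under those assumptions; the paper's exact losses, weighting, and architecture may differ.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(emb_a, emb_b, neg_embs, pred_noise, true_noise,
                        temperature=0.07, lam=0.5):
    """Hypothetical combination of generative fidelity and contrastive
    disentanglement; the InfoNCE form and the weight `lam` are assumptions."""
    # Generative fidelity: standard denoising MSE, as in diffusion training.
    gen_loss = F.mse_loss(pred_noise, true_noise)

    # Contrastive disentanglement: embeddings of the shared (positive)
    # attribute across a linked image pair attract; embeddings of the
    # annotated negative attributes repel.
    a = F.normalize(emb_a, dim=-1)          # (B, D)
    b = F.normalize(emb_b, dim=-1)          # (B, D)
    n = F.normalize(neg_embs, dim=-1)       # (K, D)

    pos = (a * b).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    neg = a @ n.t() / temperature                            # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    con_loss = F.cross_entropy(logits, labels)   # positive sits at index 0

    return gen_loss + lam * con_loss
```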
Related papers
- Towards Generalized Multi-Image Editing for Unified Multimodal Models [56.620038824933566]
Unified Multimodal Models (UMMs) integrate multimodal understanding and generation.
However, UMMs struggle to maintain visual consistency and to disambiguate visual cues when referencing details across multiple input images.
We propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts.
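One plausible way to explicitly distinguish image identities while supporting a variable number of inputs is to tag each reference image's tokens with a learned index embedding before they reach the backbone. The sketch below is hypothetical, not necessarily the framework's actual mechanism.

```python
import torch
import torch.nn as nn

class ImageIndexTagger(nn.Module):
    """Hypothetical: add a learned per-image index embedding so the model
    can tell reference images apart, up to `max_images` inputs."""
    def __init__(self, dim, max_images=8):
        super().__init__()
        self.index_embed = nn.Embedding(max_images, dim)

    def forward(self, image_tokens):
        # image_tokens: list of (T_i, dim) token tensors, one per input image
        tagged = [tok + self.index_embed.weight[i]
                  for i, tok in enumerate(image_tokens)]
        return torch.cat(tagged, dim=0)  # (sum T_i, dim) for the backbone
```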
arXiv Detail & Related papers (2026-01-09T06:42:49Z)
- ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation [39.34778197087224]
We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance.
Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model.
This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image.
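A hedged sketch of what attribute-specific token injection could look like: each attribute's reference features are projected into a few conditioning tokens that are appended to the text embeddings consumed by the diffusion model's cross-attention. The names and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeTokenInjector(nn.Module):
    """Hypothetical sketch: each attribute (e.g. 'face', 'hair', 'outfit')
    gets its own projection that maps reference-image features to a few
    conditioning tokens, which are appended to the text embeddings."""
    def __init__(self, feat_dim, cond_dim, attributes, tokens_per_attr=4):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(feat_dim, tokens_per_attr * cond_dim)
            for name in attributes
        })
        self.tokens_per_attr = tokens_per_attr
        self.cond_dim = cond_dim

    def forward(self, text_emb, attr_feats):
        # text_emb: (B, T, cond_dim); attr_feats: {name: (B, feat_dim)}
        extra = []
        for name, feat in attr_feats.items():
            tok = self.proj[name](feat).view(-1, self.tokens_per_attr,
                                             self.cond_dim)
            extra.append(tok)
        # Extended context for the diffusion model's cross-attention.
        return torch.cat([text_emb] + extra, dim=1)
```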
arXiv Detail & Related papers (2025-09-22T17:59:30Z)
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [78.73711446918814]
We propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge.
This allows the model to fully exploit attribute-based text knowledge and improve AG-ReID performance.
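Prompt-tuning over attribute text is commonly implemented CoOp-style: a few learnable context vectors are prepended to embedded attribute words before a frozen text encoder. The generic sketch below assumes that pattern; it is not LATex's exact design, and `text_encoder` is a stand-in for any module that maps embedding sequences to a text feature.

```python
import torch
import torch.nn as nn

class AttributePromptTuner(nn.Module):
    """Generic CoOp-style prompt tuning: learnable context vectors are
    prepended to embedded attribute words; the text encoder stays frozen."""
    def __init__(self, text_encoder, embed_dim, n_ctx=8):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.text_encoder = text_encoder.eval()
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)

    def forward(self, attr_word_emb):
        # attr_word_emb: (B, L, embed_dim), pre-embedded attribute words
        ctx = self.ctx.unsqueeze(0).expand(attr_word_emb.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, attr_word_emb], dim=1))
```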
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
- FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models [112.94440113631897]
Current methods attempt to distill identity and style from source images.
"Style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics.
We formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images.
arXiv Detail & Related papers (2024-12-10T17:02:58Z)
- DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation [22.599542105037443]
DisEnvisioner is a novel approach for effectively extracting and enriching the subject-essential features while filtering out subject-irrelevant information.
Specifically, the features of the subject and other irrelevant components are effectively separated into distinctive visual tokens, enabling much more accurate customization.
Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and the overall image quality.
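A hypothetical sketch of such token-level separation: learned queries cross-attend into image patch features and split them into subject-essential and irrelevant token groups, with only the essential group passed on as conditioning. The query counts and attention setup are assumptions.

```python
import torch
import torch.nn as nn

class SubjectTokenSplitter(nn.Module):
    """Hypothetical sketch of separating image features into
    subject-essential vs. irrelevant visual tokens; only the essential
    tokens would be used as conditioning. `dim` must divide by num_heads."""
    def __init__(self, dim, n_subject=4, n_other=4):
        super().__init__()
        self.subject_queries = nn.Parameter(torch.randn(n_subject, dim) * 0.02)
        self.other_queries = nn.Parameter(torch.randn(n_other, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img_feats):
        # img_feats: (B, N, dim) patch features from an image encoder
        B = img_feats.size(0)
        q = torch.cat([self.subject_queries, self.other_queries], dim=0)
        q = q.unsqueeze(0).expand(B, -1, -1)
        tokens, _ = self.attn(q, img_feats, img_feats)
        n_s = self.subject_queries.size(0)
        return tokens[:, :n_s], tokens[:, n_s:]  # (essential, irrelevant)
```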
arXiv Detail & Related papers (2024-10-02T22:29:14Z)
- CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization [27.114395240088562]
We argue an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling subject-intrinsic attributes from irrelevant attributes via contrastive learning.
Specifically, we propose CustomContrast, a novel framework which includes a Multilevel Contrastive Learning paradigm and a Multimodal Feature Injection (MFI) encoder.
Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
arXiv Detail & Related papers (2024-09-09T13:39:47Z)
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
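Generative retrieval here amounts to scoring each candidate attribute by the log-probability a vision-conditioned language model assigns to a short sentence naming it, then ranking candidates by that score. The sketch below assumes a causal model that returns per-token logits; the `model(image_feats, tokens)` interface is illustrative, not ArtVLM's actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_attribute(model, image_feats, sentence_ids):
    """Sum of log p(token_t | image, tokens_<t) over the sentence.
    `model` is assumed to return logits of shape (1, T, vocab_size)."""
    logits = model(image_feats, sentence_ids[:, :-1])   # predict next tokens
    logp = F.log_softmax(logits, dim=-1)
    target = sentence_ids[:, 1:]                        # shifted targets
    token_logp = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum().item()

# Usage idea: rank candidate sentences such as "a photo of a red car"
# vs. "a photo of a blue car" and pick the highest-scoring attribute:
# best = max(candidates, key=lambda s: score_attribute(model, feats, tok(s)))
```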
arXiv Detail & Related papers (2024-08-07T21:44:29Z)
- Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval [65.43522019468976]
We propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes.
We develop an encoder-decoder network with a reconstruction task to distill high-level attribute-specific vectors without supervision.
Our models are further equipped with a feature decorrelation constraint on these attribute vectors to strengthen their representational ability.
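A feature decorrelation constraint is typically realized by penalizing the off-diagonal entries of the correlation matrix of the attribute vectors; the sketch below makes that assumption rather than reproducing the paper's exact formulation.

```python
import torch

def decorrelation_loss(attr_vecs):
    """Penalize off-diagonal correlations between attribute-vector
    dimensions so each dimension carries distinct information.
    attr_vecs: (B, D) batch of attribute-specific vectors."""
    z = attr_vecs - attr_vecs.mean(dim=0, keepdim=True)
    z = z / (z.std(dim=0, keepdim=True) + 1e-6)
    corr = (z.t() @ z) / z.size(0)                  # (D, D) correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).sum() / (z.size(1) * (z.size(1) - 1))
```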
arXiv Detail & Related papers (2023-11-21T08:20:38Z)
- UMAAF: Unveiling Aesthetics via Multifarious Attributes of Images [16.647573404422175]
We propose the Unified Multi-attribute Aesthetic Assessment Framework (UMAAF) to model both absolute and relative attributes of images.
UMAAF achieves state-of-the-art performance on TAD66K and AVA datasets.
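Modeling absolute and relative attributes usually pairs per-image score regression with a pairwise ranking term. Below is a hedged sketch of such a combined objective, not necessarily UMAAF's exact formulation.

```python
import torch
import torch.nn.functional as F

def aesthetic_loss(pred_a, target_a, pred_b, target_b, margin=0.1):
    """Absolute term: regress each image's attribute score.
    Relative term: margin ranking on pairs, ordered by ground truth."""
    absolute = F.mse_loss(pred_a, target_a) + F.mse_loss(pred_b, target_b)
    order = torch.sign(target_a - target_b)   # +1 if a should outrank b
    relative = F.margin_ranking_loss(pred_a, pred_b, order, margin=margin)
    return absolute + relative
```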
arXiv Detail & Related papers (2023-11-19T11:57:01Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality, semantically consistent representations that capture the intrinsic features of seen images.
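A generic sketch of the encoder-decoder disentangling idea: the latent splits into a semantic-consistent code, aligned with class attributes, and a residual code, with both needed to reconstruct the input feature. The module and loss names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticDisentangler(nn.Module):
    """Generic sketch: encode an image feature into a semantic-consistent
    code and a residual code; their concatenation must reconstruct the
    input, while only the semantic code aligns with class attributes."""
    def __init__(self, feat_dim, sem_dim, res_dim, attr_dim):
        super().__init__()
        self.enc_sem = nn.Linear(feat_dim, sem_dim)
        self.enc_res = nn.Linear(feat_dim, res_dim)
        self.dec = nn.Linear(sem_dim + res_dim, feat_dim)
        self.attr_proj = nn.Linear(sem_dim, attr_dim)

    def forward(self, x, class_attr):
        s, r = self.enc_sem(x), self.enc_res(x)
        recon = self.dec(torch.cat([s, r], dim=-1))
        rec_loss = F.mse_loss(recon, x)                   # reconstruction
        sem_loss = F.mse_loss(self.attr_proj(s), class_attr)  # alignment
        return rec_loss + sem_loss
```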
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- Learning to Infer Unseen Single-/Multi-Attribute-Object Compositions with Graph Networks [47.43595942156663]
In this paper, we propose an attribute-object semantic association graph model to learn the complex relations.
With nodes representing attributes and objects, the graph can be constructed flexibly, which realizes both single- and multi-attribute-object composition recognition.
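With attributes and objects as graph nodes, recognition can proceed by message passing over the association graph. Below is a minimal, hypothetical GCN-style layer over such a graph; the node ordering and adjacency construction are illustrative choices, not the paper's specification.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One GCN-style propagation step over an attribute-object graph.
    adj: (N, N) adjacency with self-loops; by convention here, nodes
    0..A-1 are attributes and A..N-1 are objects (an illustrative choice)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim); mean-aggregate each node's neighborhood.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        msg = (adj @ node_feats) / deg
        return torch.relu(self.lin(msg))
```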
arXiv Detail & Related papers (2020-10-27T14:57:35Z)