Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
- URL: http://arxiv.org/abs/2509.15553v1
- Date: Fri, 19 Sep 2025 03:13:58 GMT
- Title: Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
- Authors: Tian Lan, Yiming Zheng, Jianxin Yin
- Abstract summary: We introduce Diff-Feat, a framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block in Transformer. For language tasks, the best feature occurs at the noise-free step and is located in the deepest block.
- Score: 7.9670666100347765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce Diff-Feat, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block in Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious "Layer 12" consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256×256). We devise a heuristic local-search algorithm that pinpoints the locally optimal "image-text" × "block-timestep" pair among a few candidates, avoiding an exhaustive grid search. A simple fusion (linear projection followed by addition) of the selected representations yields state-of-the-art performance: 98.6% mAP on MS-COCO-enhanced and 45.7% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that Diff-Feat forms tighter semantic clusters than unimodal counterparts. The code is available at https://github.com/lt-0123/Diff-Feat.
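The two mechanisms named in the abstract, a local search over (block, timestep) pairs and fusion by linear projection followed by addition, can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that): the hill-climbing rule, the function names `local_search`, `project`, and `fuse`, the 28 x 10 grid, and the synthetic `toy_score` surface are all assumptions made for illustration, and the toy numbers are not results from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # (block index, timestep index); hypothetical encoding


def local_search(score: Callable[[Pair], float], start: Pair,
                 n_blocks: int, n_steps: int) -> Tuple[Pair, float, int]:
    """Greedy hill climbing over (block, timestep) pairs with 4-neighbour moves.

    Returns the locally optimal pair, its score, and the number of distinct
    pairs that had to be scored (to compare against a full grid search).
    """
    cache: Dict[Pair, float] = {}

    def evaluate(pair: Pair) -> float:
        # Each pair is scored at most once; in practice scoring would mean
        # training a probe on features from that block and timestep.
        if pair not in cache:
            cache[pair] = score(pair)
        return cache[pair]

    current = start
    while True:
        b, t = current
        neighbours = [(b + db, t + dt)
                      for db, dt in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 0 <= b + db < n_blocks and 0 <= t + dt < n_steps]
        best = max(neighbours, key=evaluate)
        if evaluate(best) <= evaluate(current):
            return current, evaluate(current), len(cache)
        current = best


def project(x: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear projection: each row of `weight` produces one output coordinate."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse(img_feat, txt_feat, w_img, w_txt) -> List[float]:
    """Project both modalities to a shared dimension, then add element-wise."""
    z_img = project(img_feat, w_img)
    z_txt = project(txt_feat, w_txt)
    return [a + b for a, b in zip(z_img, z_txt)]


def toy_score(pair: Pair) -> float:
    # Synthetic single-peak surface, NOT measured data; the peak location
    # is chosen arbitrarily for the demonstration.
    b, t = pair
    return -float((b - 12) ** 2 + (t - 5) ** 2)


best_pair, best_score, n_evals = local_search(toy_score, (0, 0), 28, 10)
fused = fuse([1.0, 2.0], [3.0], [[1, 0], [0, 1], [1, 1]], [[1], [2], [0]])
```

On a single-peak surface like the toy one, the search reaches the peak while scoring only the pairs along its path and their neighbours, which is the motivation for avoiding an exhaustive grid search. On a real, noisier surface a greedy search of this kind can stop at a local optimum, which matches the abstract's wording of a "locally optimal" pair.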
Related papers
- FractMorph: A Fractional Fourier-Based Multi-Domain Transformer for Deformable Image Registration [0.6683923149620578]
We present FractMorph, a novel 3D dual-parallel transformer-based architecture that enhances cross-image feature matching. A lightweight U-Net style network then predicts a dense deformation field from the transformer-enriched features. Results show FractMorph achieves state-of-the-art performance with an overall Dice Similarity Coefficient (DSC) of 86.45%, an average per-structure DSC of 75.15%, and a 95th-percentile Hausdorff distance (HD95) of 1.54 mm on our data split.
arXiv Detail & Related papers (2025-08-17T17:42:10Z)
- FTCFormer: Fuzzy Token Clustering Transformer for Image Classification [22.410199372985584]
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks. Most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions. We propose Fuzzy Token Clustering Transformer (FTCFormer) to dynamically generate vision tokens based on the semantic meanings instead of spatial positions.
arXiv Detail & Related papers (2025-07-14T13:49:47Z)
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels [5.095488730708477]
TRM-ML is a novel method for enhancing meaningful cross-modal matching.
We propose a category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels.
Our proposed framework outperforms the state-of-the-art methods by a significant margin.
arXiv Detail & Related papers (2024-07-26T05:29:24Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- Deep Neural Networks Fused with Textures for Image Classification [20.58839604333332]
Fine-grained image classification (FGIC) is a challenging task in computer vision.
We propose a fusion approach to address FGIC by combining global texture with local patch-based information.
Our method attains higher classification accuracy than existing methods by notable margins.
arXiv Detail & Related papers (2023-08-03T15:21:08Z)
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- Hierarchical Forgery Classifier On Multi-modality Face Forgery Clues [61.37306431455152]
We propose a novel Hierarchical Forgery Classifier for Multi-modality Face Forgery Detection (HFC-MFFD).
The HFC-MFFD learns a robust patch-based hybrid representation to enhance forgery authentication in multiple-modality scenarios.
A specific hierarchical face forgery classifier is proposed to alleviate the class imbalance problem and further boost detection performance.
arXiv Detail & Related papers (2022-12-30T10:54:29Z)
- MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.