Cluster and Aggregate: Face Recognition with Large Probe Set
- URL: http://arxiv.org/abs/2210.10864v1
- Date: Wed, 19 Oct 2022 20:01:15 GMT
- Title: Cluster and Aggregate: Face Recognition with Large Probe Set
- Authors: Minchul Kim, Feng Liu, Anil Jain, Xiaoming Liu
- Abstract summary: We propose a two-stage feature fusion paradigm, Cluster and Aggregate, that can both scale to large $N$ and maintain the ability to perform sequential inference with order invariance.
Experiments on IJB-B and IJB-S benchmark datasets show the superiority of the proposed two-stage paradigm in unconstrained face recognition.
- Score: 18.662943303044315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature fusion plays a crucial role in unconstrained face recognition where
inputs (probes) comprise a set of $N$ low-quality images whose individual
qualities vary. Advances in attention and recurrent modules have led to feature
fusion that can model the relationship among the images in the input set.
However, attention mechanisms cannot scale to large $N$ due to their quadratic
complexity and recurrent modules suffer from input order sensitivity. We
propose a two-stage feature fusion paradigm, Cluster and Aggregate, that can
both scale to large $N$ and maintain the ability to perform sequential
inference with order invariance. Specifically, the Cluster stage is a linear
assignment of the $N$ inputs to $M$ global cluster centers, and the Aggregation stage
is a fusion over $M$ clustered features. The clustered features play an
integral role when the inputs are sequential as they can serve as a
summarization of past features. By leveraging the order invariance of the
incremental averaging operation, we design an update rule that achieves
batch-order invariance, which guarantees that the contributions of early
images in the sequence do not diminish as time steps increase. Experiments on IJB-B
and IJB-S benchmark datasets show the superiority of the proposed two-stage
paradigm in unconstrained face recognition. Code and pretrained models are
available at https://github.com/mk-minchul/caface
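To make the two-stage paradigm concrete, here is a minimal sketch in PyTorch, assuming per-image features have already been extracted. It is an illustration of the abstract's description, not the authors' implementation (see the repository above); the module name, feature dimension, cluster count, and fusion head are all assumptions.

```python
# Minimal sketch of the Cluster-and-Aggregate paradigm described above.
# Not the authors' code (see https://github.com/mk-minchul/caface);
# names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

class ClusterAndAggregate(torch.nn.Module):
    def __init__(self, dim=512, num_clusters=4):
        super().__init__()
        # M learnable global cluster centers shared across probe sets.
        self.centers = torch.nn.Parameter(torch.randn(num_clusters, dim))
        # Lightweight head scoring each clustered feature for fusion.
        self.fuse = torch.nn.Linear(dim, 1)

    def cluster(self, feats):
        # Cluster stage: soft linear assignment of N input features
        # (N, dim) to M centers -- cost is O(N * M), linear in N.
        logits = feats @ self.centers.t()                  # (N, M)
        assign = F.softmax(logits, dim=1)                  # rows sum to 1
        mass = assign.sum(dim=0)                           # (M,) cluster mass
        clustered = (assign.t() @ feats) / mass[:, None].clamp_min(1e-6)
        return clustered, mass                             # (M, dim), (M,)

    def aggregate(self, clustered):
        # Aggregate stage: fuse the M clustered features into one
        # template; cost is independent of the probe-set size N.
        w = F.softmax(self.fuse(clustered), dim=0)         # (M, 1)
        return (w * clustered).sum(dim=0)                  # (dim,)

def sequential_update(state, batch_clustered, batch_mass):
    # Order-invariant incremental averaging over sequential batches:
    # the running clustered features are a mass-weighted mean, so the
    # result depends only on which batches arrived, not on their order,
    # and early images never lose influence as time steps grow.
    feats, mass = state
    total = mass + batch_mass
    feats = (mass[:, None] * feats + batch_mass[:, None] * batch_clustered) \
        / total[:, None].clamp_min(1e-6)
    return feats, total
```

Because both the soft assignment and the running update reduce to mass-weighted sums over the inputs, the fused template depends only on which images have been seen, not on their arrival order; this is the batch-order invariance the abstract refers to.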
Related papers
- Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency [57.961869351897384]
We propose a framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency. In the first stage, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. In the second stage, we introduce a Self-Enhanced fine-tuning strategy.
arXiv Detail & Related papers (2025-08-02T08:12:57Z) - InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion [36.27704594180795]
InfiGFusion is a structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. We show that GLD consistently improves fusion quality and stability. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT.
arXiv Detail & Related papers (2025-05-20T03:55:35Z) - PRISM: PRogressive dependency maxImization for Scale-invariant image Matching [4.9521269535586185]
We propose PRogressive dependency maxImization for Scale-invariant image Matching (PRISM).
Our method's superior matching performance and generalization capability are confirmed by leading accuracy across various evaluation benchmarks and downstream tasks.
arXiv Detail & Related papers (2024-08-07T07:35:17Z) - Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$.
arXiv Detail & Related papers (2024-06-27T02:32:46Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Unsupervised Gait Recognition with Selective Fusion [10.414364995179556]
We propose a new task: Unsupervised Gait Recognition (UGR).
We introduce a new cluster-based baseline to solve UGR with cluster-level contrastive learning.
We propose a Selective Fusion method, which includes Selective Cluster Fusion (SCF) and Selective Sample Fusion (SSF).
arXiv Detail & Related papers (2023-03-19T21:34:20Z) - HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
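The clustering-then-attending recipe can be seen in a few lines. Below is a hedged sketch of content-based key/value token clustering, under assumed shapes and a nearest-centroid hard assignment; it illustrates the cost reduction, not ClusTR's actual architecture.

```python
# Hedged sketch of content-based token clustering for efficient
# attention; cluster count, pooling scheme, and names are assumptions.
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, centroids):
    # q, k, v: (n, d) token features; centroids: (m, d) with m << n.
    # Assign each key/value token to its nearest centroid, then average
    # the members of each cluster to get m representative tokens.
    assign = (k @ centroids.t()).argmax(dim=1)              # (n,)
    one_hot = F.one_hot(assign, centroids.size(0)).float()  # (n, m)
    counts = one_hot.sum(dim=0).clamp_min(1.0)              # (m,)
    k_c = (one_hot.t() @ k) / counts[:, None]               # (m, d)
    v_c = (one_hot.t() @ v) / counts[:, None]               # (m, d)
    # Attention over the clustered tokens costs O(n * m), not O(n^2).
    attn = F.softmax(q @ k_c.t() / k.size(1) ** 0.5, dim=1) # (n, m)
    return attn @ v_c                                       # (n, d)
```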
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - API: Boosting Multi-Agent Reinforcement Learning via Agent-Permutation-Invariant Networks [35.63476630248861]
Multi-agent reinforcement learning suffers from poor sample efficiency due to the exponential growth of the state-action space.
We propose two novel designs to achieve permutation invariance (PI).
The first design permutes differently ordered versions of the same inputs back to a canonical order, so the downstream networks only need to learn a function mapping over fixed-order inputs.
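One way to realize that first design is to sort the inputs by a fixed, content-based key, which maps every ordering of the same inputs to one canonical ordering. A minimal sketch follows, with an assumed sort key (feature norm) that is not necessarily the paper's choice.

```python
# Minimal illustration of the input-sorting route to permutation
# invariance; the sort key and shapes are illustrative assumptions.
import torch

def canonicalize(agent_feats):
    # agent_feats: (num_agents, feat_dim). Sorting rows by a fixed,
    # content-based key maps every permutation of the same agents to
    # the same canonical order, so any downstream network sees a
    # fixed-order input and is permutation-invariant by construction.
    key = agent_feats.norm(dim=1)   # one scalar per agent
    order = torch.argsort(key)      # canonical ordering
    return agent_feats[order]
```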
arXiv Detail & Related papers (2022-03-10T11:00:53Z) - Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
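As a rough illustration of integrating features from adjacent levels, the sketch below resamples the neighboring levels to a common resolution and mixes them with a convolution; the class name and resize-then-concatenate scheme are assumptions, not the paper's exact aggregate interaction module.

```python
# Hedged sketch of adjacent-level feature integration; layer names and
# the resize-then-concatenate scheme are assumptions.
import torch
import torch.nn.functional as F

class AdjacentLevelFusion(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mix = torch.nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, lower, same, upper):
        # lower: higher-resolution neighbor, upper: lower-resolution
        # neighbor; both are resampled to `same`'s spatial size so the
        # three adjacent levels can be integrated jointly.
        h, w = same.shape[-2:]
        lower = F.interpolate(lower, size=(h, w), mode="bilinear", align_corners=False)
        upper = F.interpolate(upper, size=(h, w), mode="bilinear", align_corners=False)
        return self.mix(torch.cat([lower, same, upper], dim=1))
```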
arXiv Detail & Related papers (2020-07-17T15:41:37Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering [9.722607434532883]
We propose a self-supervised Gaussian-attention network for image clustering (GATCluster).
Rather than extracting intermediate features first and then performing traditional clustering, GATCluster directly outputs semantic cluster labels without further post-processing.
We develop a two-step learning algorithm that is memory-efficient for clustering large-size images.
arXiv Detail & Related papers (2020-02-27T00:57:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.