Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
- URL: http://arxiv.org/abs/2504.05657v1
- Date: Tue, 08 Apr 2025 04:11:28 GMT
- Title: Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing
- Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li
- Abstract summary: Nested Res2Net (Nes2Net) is a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. We report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline.
- Score: 56.53218228501566
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
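For intuition only, the toy PyTorch sketch below illustrates the general nested, Res2Net-style split-and-aggregate idea described in the abstract, applied directly to high-dimensional foundation-model features with no dimensionality-reduction layer. All module names, scale factors, and layer choices here are illustrative assumptions, not the authors' design; the actual Nes2Net implementation is available in the linked repository.

```python
# Hypothetical sketch of a nested Res2Net-style back-end block (NOT the authors'
# exact Nes2Net; see https://github.com/Liu-Tianchi/Nes2Net for that).
# Idea: split the full high-dimensional feature into channel groups, process each
# group with a small conv that also receives the previous group's output
# (Res2Net-style hierarchy), and apply the same split again inside each group,
# so the block can operate on the raw 1024-dim features without a DR layer.
import torch
import torch.nn as nn


class Res2Layer(nn.Module):
    """One Res2Net-style multi-scale layer over the channel dimension."""
    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, padding=kernel_size // 2)
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        chunks = torch.chunk(x, self.scale, dim=1)
        out = [chunks[0]]                      # first split passes through unchanged
        prev = chunks[0]
        for conv, chunk in zip(self.convs, chunks[1:]):
            prev = conv(chunk + prev)          # hierarchical connection between splits
            out.append(prev)
        return torch.cat(out, dim=1)


class NestedRes2Block(nn.Module):
    """Outer split whose groups are themselves processed by Res2 layers."""
    def __init__(self, channels: int = 1024, outer_scale: int = 2, inner_scale: int = 4):
        super().__init__()
        assert channels % outer_scale == 0
        self.outer_scale = outer_scale
        self.inner = nn.ModuleList(
            Res2Layer(channels // outer_scale, inner_scale)
            for _ in range(outer_scale)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        groups = torch.chunk(x, self.outer_scale, dim=1)
        return x + torch.cat([m(g) for m, g in zip(self.inner, groups)], dim=1)


# Toy usage: 1024-dim foundation-model features, 200 frames, no DR layer.
feats = torch.randn(8, 1024, 200)
print(NestedRes2Block(1024)(feats).shape)  # torch.Size([8, 1024, 200])
```

The point of the sketch is that the multi-scale processing happens inside channel groups of the full high-dimensional feature rather than after projecting it down, which is how the nested structure avoids the parameter overhead and information loss attributed to DR layers.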
Related papers
- Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation [0.0]
This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture.
It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances.
Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 seconds.
arXiv Detail & Related papers (2024-10-15T13:46:19Z)
- FeatUp: A Model-Agnostic Framework for Features at Any Resolution [24.4201195336725]
FeatUp is a task- and model-agnostic framework to restore lost spatial information in deep features.
We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution.
We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
arXiv Detail & Related papers (2024-03-15T17:57:06Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+, which stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also present a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- Deep Axial Hypercomplex Networks [1.370633147306388]
Recent works make it possible to improve representational capabilities by using hypercomplex-inspired networks.
This paper reduces the computational cost of such networks by factorizing a quaternion 2D convolutional module into two consecutive vectormap 1D convolutional modules.
Incorporating both yields our proposed hypercomplex network, a novel architecture that can be assembled to construct deep axial-hypercomplex networks.
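As a rough illustration of the factorization idea only (using ordinary real-valued convolutions rather than the quaternion and vectormap modules the paper actually proposes), the sketch below replaces a k x k 2D convolution with a k x 1 convolution along one spatial axis followed by a 1 x k convolution along the other, cutting the per-channel-pair cost roughly from k*k to 2*k.

```python
# Generic axial factorization sketch (plain convolutions; an assumption for
# illustration, not the paper's quaternion/vectormap modules).
import torch
import torch.nn as nn


class AxialConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv_h = nn.Conv2d(in_ch, out_ch, (k, 1), padding=(k // 2, 0))  # along height
        self.conv_w = nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, k // 2))  # along width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv_w(self.conv_h(x))


x = torch.randn(1, 64, 32, 32)
print(AxialConv2d(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```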
arXiv Detail & Related papers (2023-01-11T18:31:00Z)
- CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates some high-quality 3D proposals by leveraging the class-aware local group strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
- SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a good trade-off between efficiency, rotation robustness, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
- RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds [4.236277880658203]
We show that RipsNet can estimate topological descriptors on test data very efficiently with generalization capacity.
We prove that RipsNet is robust to input perturbations in terms of the 1-Wasserstein distance.
We showcase the use of RipsNet on both synthetic and real-world data.
arXiv Detail & Related papers (2022-02-03T17:40:04Z)
- A novel attention-based network for fast salient object detection [14.246237737452105]
In current salient object detection networks, the most popular approach is the U-shaped structure.
We propose a new deep convolutional network architecture with three contributions.
Results demonstrate that the proposed method can compress the model to 1/3 of its original size with almost no loss in accuracy.
arXiv Detail & Related papers (2021-12-20T12:30:20Z)
- Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data, which is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
- Towards Lossless Binary Convolutional Neural Networks Using Piecewise Approximation [4.023728681102073]
Binary CNNs can significantly reduce the number of arithmetic operations and the size of memory storage.
However, the accuracy degradation of single and multiple binary CNNs is unacceptable for modern architectures.
We propose a Piecewise Approximation scheme for multiple binary CNNs which lessens accuracy loss by approximating full precision weights and activations.
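The summary does not spell out the piecewise scheme itself; the sketch below only illustrates the underlying multi-binary-basis idea it builds on, namely approximating a full-precision weight tensor by a sum of scaled binary tensors fitted greedily on the residual. The function name and the greedy fitting procedure are assumptions for illustration, not the paper's method.

```python
# Greedy multi-binary-basis approximation of full-precision weights
# (an illustrative assumption, not the paper's piecewise scheme).
import torch


def multi_binary_approx(w: torch.Tensor, num_bases: int = 3):
    """Approximate w as sum_i alpha_i * b_i with each b_i in {-1, +1}."""
    residual = w.clone()
    bases, alphas = [], []
    for _ in range(num_bases):
        b = torch.sign(residual)
        b[b == 0] = 1.0                      # keep the basis strictly binary
        alpha = (residual * b).mean()        # least-squares optimal scale for this basis
        bases.append(b)
        alphas.append(alpha)
        residual = residual - alpha * b
    approx = sum(a * b for a, b in zip(alphas, bases))
    return approx, bases, alphas


w = torch.randn(64, 64, 3, 3)
approx, _, _ = multi_binary_approx(w, num_bases=3)
print(torch.norm(w - approx) / torch.norm(w))  # relative error shrinks as bases are added
```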
arXiv Detail & Related papers (2020-08-08T13:32:33Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because of their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.