CoverHunter: Cover Song Identification with Refined Attention and
Alignments
- URL: http://arxiv.org/abs/2306.09025v1
- Date: Thu, 15 Jun 2023 10:34:20 GMT
- Authors: Feng Liu, Deyi Tuo, Yinan Xu, Xintong Han
- Abstract summary: Cover song identification (CSI) focuses on finding the same music with different versions in reference anchors given a query track.
We propose a novel system named CoverHunter that overcomes the shortcomings of existing detection schemes.
- Score: 19.173689175634106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cover song identification (CSI) focuses on finding the same music
with different versions in reference anchors given a query track. In this
paper, we propose a novel system named CoverHunter that overcomes the
shortcomings of existing detection schemes by exploring richer features with
refined attention and alignments. CoverHunter contains three key modules: 1) A
convolution-augmented transformer (i.e., Conformer) structure that captures
both local and global feature interactions in contrast to previous methods
mainly relying on convolutional neural networks; 2) An attention-based time
pooling module that further exploits the attention in the time dimension; 3) A
novel coarse-to-fine training scheme that first trains a network to roughly
align the song chunks and then refines the network by training on the aligned
chunks. At the same time, we also summarize some important training tricks used
in our system that help achieve better results. Experiments on several standard
CSI datasets show that our method significantly improves over state-of-the-art
methods with an embedding size of 128 (2.3% on SHS100K-TEST and 17.7% on
DaTacos).
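The attention-based time pooling named in the abstract can be illustrated with a minimal sketch. It assumes a common formulation (a softmax over per-frame scores followed by a weighted sum over time); the abstract does not give the paper's exact design, so the function name `attention_time_pooling` and the scoring vector `w` are hypothetical.

```python
import math

def attention_time_pooling(frames, w):
    """Pool T frame vectors (each of length D) into a single D-dim vector.

    frames: list of T lists, one feature vector per time step.
    w: list of length D, a hypothetical learned scoring vector
       (an assumption, not taken from the paper).
    """
    # One scalar relevance score per time step.
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frames along the time axis.
    dim = len(frames[0])
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights

# Toy usage: three 2-dim frames; frames with higher scores get more weight.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_time_pooling(frames, w=[0.5, 0.5])
print(pooled, weights)
```

With an all-zero `w` every frame gets the same weight and the result reduces to plain mean pooling over time, which is the baseline that attention pooling generalizes.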
Related papers
- Multi-Correlation Siamese Transformer Network with Dense Connection for 3D Single Object Tracking [14.47355191520578]
Point cloud-based 3D object tracking is an important task in autonomous driving.
It remains challenging to learn the correlation between the template and search branches effectively with the sparse LIDAR point cloud data.
We present a multi-correlation Siamese Transformer network that has multiple stages and carries out feature correlation at the end of each stage.
arXiv Detail & Related papers (2023-12-18T09:33:49Z)
- Frequency Perception Network for Camouflaged Object Detection [51.26386921922031]
We propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain.
Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage.
Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets.
arXiv Detail & Related papers (2023-08-17T11:30:46Z)
- Clustering based Point Cloud Representation Learning for 3D Analysis [80.88995099442374]
We propose a clustering based supervised learning scheme for point cloud analysis.
Unlike the current de facto scene-wise training paradigm, our algorithm conducts within-class clustering on the point embedding space.
Our algorithm shows notable improvements on famous point cloud segmentation datasets.
arXiv Detail & Related papers (2023-07-27T03:42:12Z)
- TC-Net: Triple Context Network for Automated Stroke Lesion Segmentation [0.5482532589225552]
We propose a new network, Triple Context Network (TC-Net), with the capture of spatial contextual information as the core.
Our network is evaluated on the open dataset ATLAS, achieving the highest score of 0.594, Hausdorff distance of 27.005 mm, and average symmetry surface distance of 7.137 mm.
arXiv Detail & Related papers (2022-02-28T11:12:16Z)
- LC3Net: Ladder context correlation complementary network for salient object detection [0.32116198597240836]
We propose a novel ladder context correlation complementary network (LC3Net).
FCB is a filterable convolution block to assist the automatic collection of information on the diversity of initial features.
DCM is a dense cross module to facilitate the intimate aggregation of different levels of features.
BCD is a bidirectional compression decoder to help the progressive shrinkage of multi-scale features.
arXiv Detail & Related papers (2021-10-21T03:12:32Z)
- Supervised Chorus Detection for Popular Music Using Convolutional Neural Network and Multi-task Learning [10.160205869706965]
This paper presents a novel supervised approach to detecting the chorus segments in popular music.
We propose a convolutional neural network with a multi-task learning objective, which simultaneously fits two temporal activation curves.
We also propose a post-processing method that jointly takes into account the chorus and boundary predictions to produce binary output.
arXiv Detail & Related papers (2021-03-26T04:32:08Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- ByteCover: Cover Song Identification via Multi-Loss Training [20.215501383270706]
ByteCover is a new feature learning method for cover song identification (CSI).
Two major improvements are designed to further enhance the capability of the model for CSI.
A set of experiments demonstrated the effectiveness and efficiency of ByteCover on multiple datasets.
arXiv Detail & Related papers (2020-10-27T02:59:54Z)
- A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
- Searching Central Difference Convolutional Networks for Face Anti-Spoofing [68.77468465774267]
Face anti-spoofing (FAS) plays a vital role in face recognition systems.
Most state-of-the-art FAS methods rely on stacked convolutions and expert-designed networks.
Here we propose a novel frame-level FAS method based on Central Difference Convolution (CDC).
arXiv Detail & Related papers (2020-03-09T12:48:37Z)
- Learning to Hash with Graph Neural Networks for Recommender Systems [103.82479899868191]
Graph representation learning has attracted much attention in supporting high quality candidate search at scale.
Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users' preferences in continuous embedding space are tremendous.
We propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes.
arXiv Detail & Related papers (2020-03-04T06:59:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.