Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection
- URL: http://arxiv.org/abs/2601.13647v1
- Date: Tue, 20 Jan 2026 06:31:05 GMT
- Title: Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection
- Authors: Yumin Kim, Seonghyeon Go
- Abstract summary: We propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. We enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer.
- Score: 1.7034813545878587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to address copyright and ownership issues. While existing work has mainly focused on short audio clips, the challenge of full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. Furthermore, we enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach outperforms the previous model and recent baselines, achieving state-of-the-art results in AI-generated music detection.
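The abstract does not give the Gated Fusion Layer's exact formulation, but gated fusion is commonly implemented as a sigmoid gate over the concatenated inputs that convexly mixes the two streams. The sketch below is a minimal NumPy illustration under that assumption; the weights `W` and `b`, the embedding size, and the gating form are all hypothetical, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(content, structure, W, b):
    """Mix a content embedding and a structural embedding per dimension.

    content, structure: (d,) segment embeddings
    W: (d, 2d) gate weights, b: (d,) gate bias -- hypothetical parameters,
    chosen for illustration only.
    """
    gate = sigmoid(W @ np.concatenate([content, structure]) + b)
    # gate is in (0, 1), so the output is an elementwise convex combination
    return gate * content + (1.0 - gate) * structure

rng = np.random.default_rng(0)
d = 8
content = rng.normal(size=d)
structure = rng.normal(size=d)
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

fused = gated_fusion(content, structure, W, b)
print(fused.shape)  # (8,)
```

Because the gate stays in (0, 1), each fused dimension lies between the corresponding content and structure values, which is what lets the layer smoothly trade the two information sources off.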
Related papers
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Segment Transformer: AI-Generated Music Detection via Music Structural Analysis [1.7034813545878587]
We aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models. For long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships.
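The segment-level pipeline described above starts by chopping a full track into fixed-length windows before any embedding or attention is applied. A minimal sketch of that first step, assuming 10-second segments with zero-padding of the tail (segment length and padding policy are assumptions, not details from the abstract):

```python
import numpy as np

def split_into_segments(audio, sr, seg_seconds=10.0):
    """Split a mono waveform into fixed-length segments.

    The final partial segment is zero-padded so every row has equal
    length, which lets a transformer batch the segments directly.
    """
    seg_len = int(seg_seconds * sr)
    n_segs = int(np.ceil(len(audio) / seg_len))
    padded = np.zeros(n_segs * seg_len)
    padded[: len(audio)] = audio
    return padded.reshape(n_segs, seg_len)

sr = 16000
audio = np.random.default_rng(1).normal(size=int(25.0 * sr))  # 25 s clip
segments = split_into_segments(audio, sr)
print(segments.shape)  # (3, 160000)
```

Each row of `segments` would then be passed through the pre-trained feature extractors, and the resulting per-segment embeddings fed to the transformer that models inter-segment relationships.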
arXiv Detail & Related papers (2025-09-10T04:56:40Z) - Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z) - Detecting Musical Deepfakes [0.0]
This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio, then used to train and evaluate a convolutional neural network.
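The augmentation step above perturbs the waveform before spectrogram extraction. The sketch below shows the simplest related operation, naive resampling by linear interpolation; note this is not what the paper uses, since plain resampling changes tempo and pitch together, whereas independent tempo stretching and pitch shifting require a phase-vocoder-style method (for example `librosa.effects.time_stretch` and `pitch_shift`). All names and the 1.25x rate here are illustrative assumptions.

```python
import numpy as np

def naive_speed_change(audio, rate):
    """Resample a waveform by `rate` via linear interpolation.

    Caveat: this shifts pitch along with tempo; it only illustrates the
    idea of time-domain augmentation, not the paper's exact transforms.
    """
    n_out = int(len(audio) / rate)
    old_idx = np.arange(len(audio))
    new_idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(new_idx, old_idx, audio)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)       # 1 s, 440 Hz test tone
faster = naive_speed_change(tone, 1.25)  # ~0.8 s after a 1.25x speed-up
print(len(faster))  # 12800
```

After augmentation, each perturbed waveform would be converted to a mel spectrogram and fed to the CNN classifier described in the abstract.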
arXiv Detail & Related papers (2025-05-03T21:45:13Z) - Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation. High-level semantics are conveyed through a cross-attention mechanism. Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-04-10T09:47:26Z) - A Comprehensive Survey on Generative AI for Video-to-Music Generation [15.575851379886952]
This paper presents a comprehensive review of video-to-music generation using deep generative AI techniques. We focus on three key components: visual feature extraction, music generation frameworks, and conditioning mechanisms.
arXiv Detail & Related papers (2025-02-18T03:18:54Z) - AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task. Our framework incorporates two key components for video understanding and cross-modal learning. Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z) - Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques. We demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
arXiv Detail & Related papers (2023-07-25T03:59:04Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Let's Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.