Automated data curation for self-supervised learning in underwater acoustic analysis
- URL: http://arxiv.org/abs/2505.20066v1
- Date: Mon, 26 May 2025 14:50:04 GMT
- Title: Automated data curation for self-supervised learning in underwater acoustic analysis
- Authors: Hilde I Hummel, Sandjai Bhulai, Burooj Ghani, Rob van der Mei
- Abstract summary: The sustainability of the ocean ecosystem is threatened by increased levels of sound pollution. Passive acoustic monitoring (PAM) systems collect a large amount of underwater sound recordings. Although machine learning offers a potential solution, most underwater acoustic recordings are unlabeled.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sustainability of the ocean ecosystem is threatened by increased levels of sound pollution, making monitoring crucial to understand its variability and impact. Passive acoustic monitoring (PAM) systems collect a large amount of underwater sound recordings, but the large volume of data makes manual analysis impossible, creating the need for automation. Although machine learning offers a potential solution, most underwater acoustic recordings are unlabeled. Self-supervised learning models have demonstrated success in learning from large-scale unlabeled data in various domains like computer vision, Natural Language Processing, and audio. However, these models require large, diverse, and balanced datasets for training in order to generalize well. To address this, a fully automated self-supervised data curation pipeline is proposed to create a diverse and balanced dataset from raw PAM data. It integrates Automatic Identification System (AIS) data with recordings from various hydrophones in the U.S. waters. Using hierarchical k-means clustering, the raw audio data is sampled and then combined with AIS samples to create a balanced and diverse dataset. The resulting curated dataset enables the development of self-supervised learning models, facilitating various tasks such as monitoring marine mammals and assessing sound pollution.
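The curation step described in the abstract, hierarchical k-means over the raw audio followed by balanced sampling, can be sketched in a few lines. The following is a minimal two-level illustration, assuming precomputed fixed-size embeddings for each recording; the function names, cluster counts, and per-leaf quota are hypothetical placeholders, not the paper's actual configuration.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def curate(X, k_top=3, k_sub=2, per_leaf=10, seed=0):
    """Two-level (hierarchical) k-means sampling: cluster the embeddings,
    sub-cluster each cluster, then draw at most `per_leaf` recordings from
    every leaf so that frequent sound types no longer dominate the dataset."""
    rng = np.random.default_rng(seed)
    picked = []
    top = kmeans(X, k_top, seed=seed)
    for c in range(k_top):
        idx = np.flatnonzero(top == c)
        if len(idx) == 0:
            continue
        sub = kmeans(X[idx], min(k_sub, len(idx)), seed=seed)
        for s in np.unique(sub):
            leaf = idx[sub == s]
            take = min(per_leaf, len(leaf))
            picked.extend(rng.choice(leaf, size=take, replace=False).tolist())
    return sorted(picked)

# Imbalanced toy "embeddings": three well-separated blobs of very different sizes,
# standing in for over- and under-represented sound sources in raw PAM data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (500, 4)),
               rng.normal(5, 0.1, (60, 4)),
               rng.normal(10, 0.1, (20, 4))])
subset = curate(X)
```

The returned indices form a subset in which no leaf cluster contributes more than the quota, which is the balancing effect the pipeline relies on before self-supervised pre-training.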
Related papers
- Decodable but not structured: linear probing enables Underwater Acoustic Target Recognition with pretrained audio embeddings
Anthropogenic noise from ships contributes significantly to underwater sound pollution, posing risks to marine ecosystems. Passive Acoustic Monitoring (PAM) systems are widely deployed for this purpose, generating years of underwater recordings across diverse soundscapes. Recent advances in automatic Underwater Acoustic Target Recognition (UATR) have largely relied on supervised learning, which is constrained by the scarcity of labeled data. In this work, we conduct the first empirical comparative study of transfer learning for UATR, evaluating multiple pretrained audio models originating from diverse audio domains.
arXiv Detail & Related papers (2026-01-13T09:15:31Z) - Machine Learning for Proactive Groundwater Management: Early Warning and Resource Allocation
We develop a machine learning pipeline that predicts groundwater level categories using climate data, hydro-meteorological records, and physiographic attributes. Our approach integrates geospatial preprocessing, domain-driven feature engineering, and automated model selection to overcome monitoring limitations.
arXiv Detail & Related papers (2025-06-18T00:41:04Z) - The Computation of Generalized Embeddings for Underwater Acoustic Target Recognition using Contrastive Learning [0.7145837421668514]
Sound pollution in marine environments poses an increased threat to ocean health.<n>By monitoring this noise, the sources responsible for this pollution can be mapped.<n>This generates a large amount of data records, capturing a mix of sound sources such as ship activities and marine mammal vocalizations.
arXiv Detail & Related papers (2025-05-19T09:37:46Z) - Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation [67.23953699167274]
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO)<n>In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery.<n>We propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance.
arXiv Detail & Related papers (2025-04-09T15:13:26Z) - RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection [11.265512559447986]
We introduce RU-AI, a new large-scale multimodal dataset for robust and effective detection of machine-generated content in text, image and voice.<n>Our dataset is constructed on the basis of three large publicly available datasets: Flickr8K, COCO and Places205.<n>The results reveal that existing models still struggle to achieve accurate and robust detection on our dataset.
arXiv Detail & Related papers (2024-06-07T12:58:14Z) - Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z) - Efficacy of MRI data harmonization in the age of machine learning. A
multicenter study across 36 datasets [0.0]
Pooling publicly available MRI data from multiple sites makes it possible to assemble extensive groups of subjects, increase statistical power, and promote data reuse with machine learning techniques.
The harmonization of multicenter data is necessary to reduce the confounding effect associated with non-biological sources of variability in the data.
When applied to the entire dataset before machine learning, harmonization leads to data leakage, because information from outside the training set may affect model building and falsely inflate estimated performance.
arXiv Detail & Related papers (2022-11-08T09:45:39Z) - Representation Learning for the Automatic Indexing of Sound Effects
Libraries [79.68916470119743]
We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size.
Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
arXiv Detail & Related papers (2022-08-18T23:46:13Z) - Impact of Dataset on Acoustic Models for Automatic Speech Recognition [0.0]
In Automatic Speech Recognition, GMM-HMM models have been widely used for acoustic modelling. The GMM models are commonly used to create the alignments of the training data for the hybrid deep neural network model.
This work aims to investigate the impact of dataset size variations on the performance of various GMM-HMM Acoustic Models.
arXiv Detail & Related papers (2022-03-25T11:41:49Z) - BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning
for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z) - Parsing Birdsong with Deep Audio Embeddings [0.5599792629509227]
We present a semi-supervised approach to identify characteristic calls and environmental noise.
We utilize several methods to learn a latent representation of audio samples, including a convolutional autoencoder and two pre-trained networks.
arXiv Detail & Related papers (2021-08-20T14:45:44Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference. We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space. Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.