Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring
- URL: http://arxiv.org/abs/2502.20838v1
- Date: Fri, 28 Feb 2025 08:34:12 GMT
- Title: Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring
- Authors: Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai
- Abstract summary: We introduce DSMIL-LocNet, a framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection.
- Score: 2.7418627495572134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
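For readers unfamiliar with Multiple Instance Learning, the sketch below illustrates the core idea the abstract describes: each instance (a short window of a long recording) is scored by a small attention network, the bag-level "call present?" prediction is made from the attention-weighted pooled embedding, and the attention weights themselves indicate which windows likely contain a call. This is a minimal illustrative sketch in PyTorch, not the authors' dual-stream implementation; the class name, layer sizes, and feature dimensions are assumptions. See the linked repository for the actual DSMIL-LocNet code.

```python
# Minimal sketch of attention-based MIL pooling with bag-level labels.
# NOT the authors' dual-stream DSMIL-LocNet; names and sizes are assumed.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Scores each instance (a short window of a long recording) and pools
    the bag into one embedding for a bag-level classification."""
    def __init__(self, feat_dim: int = 128, attn_dim: int = 64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, instances: torch.Tensor):
        # instances: (num_instances, feat_dim); one bag = one long segment
        scores = self.attention(instances)            # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)        # attention over instances
        bag_embedding = (weights * instances).sum(0)  # weighted pooling
        bag_logit = self.classifier(bag_embedding)    # bag-level logit
        # High-attention instances localize where a call likely occurs,
        # even though only bag-level labels were used for training.
        return bag_logit, weights.squeeze(-1)

# Usage: a 10-minute segment split into 120 five-second instance features.
bag = torch.randn(120, 128)
logit, attn = AttentionMILPooling()(bag)
print(torch.sigmoid(logit).item(), attn.argmax().item())
```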
Related papers
- Lightweight Hopfield Neural Networks for Bioacoustic Detection and Call Monitoring of Captive Primates [0.0]
We present a transparent, lightweight and fast-to-train associative memory AI model with Hopfield neural network architecture.
Adapted from a model developed to detect bat echolocation calls, this model monitors captive endangered black-and-white ruffed lemur (Varecia variegata) vocalisations.
arXiv Detail & Related papers (2025-11-04T17:46:03Z)
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.
Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
- Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment [5.380078543698624]
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization.
We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization.
Our method achieved first place in the BioDCASE 2025 Task 1 challenge with an average MSE of 0.30 across test datasets, compared to 0.58 for the deep learning baseline.
arXiv Detail & Related papers (2025-09-21T05:14:06Z)
- AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification [51.525891360380285]
AHDMIL is an Asymmetric Hierarchical Distillation Multi-Instance Learning framework.
It eliminates irrelevant patches through a two-step training process.
It consistently outperforms previous state-of-the-art methods in both classification performance and inference speed.
arXiv Detail & Related papers (2025-08-07T07:47:16Z)
- CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment [0.22499166814992438]
Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers.
We introduce a chunk-based approach integrating self-supervised learning (SSL) models selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling.
Our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo.
arXiv Detail & Related papers (2025-06-25T08:39:22Z)
- Acoustic Classification of Maritime Vessels using Learnable Filterbanks [0.0]
We present a deep learning model with robust performance across different recording scenarios.
Trained on the VTUAD hydrophone recordings from the Strait of Georgia, our model, CATFISH, achieves a state-of-the-art 96.63% test accuracy.
arXiv Detail & Related papers (2025-05-29T19:41:15Z)
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows.
The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.
We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z)
- Frequency-domain MLPs are More Effective Learners in Time Series Forecasting [67.60443290781988]
Time series forecasting plays a key role in different industrial domains, including finance, traffic, energy, and healthcare.
Most MLP-based forecasting methods suffer from point-wise mappings and information bottlenecks.
We propose FreTS, a simple yet effective architecture built upon frequency-domain MLPs for time series forecasting.
arXiv Detail & Related papers (2023-11-10T17:05:13Z)
- Raising the ClaSS of Streaming Time Series Segmentation [3.572107803162503]
We introduce ClaSS, a novel, efficient, and highly accurate algorithm for streaming time series segmentation.
ClaSS is significantly more precise than eight state-of-the-art competitors.
We also provide ClaSS as a window operator with an average throughput of 1k data points per second for the Apache Flink streaming engine.
arXiv Detail & Related papers (2023-10-31T13:07:41Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Dynamic Spectrum Mixer for Visual Recognition [17.180863898764194]
We propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
arXiv Detail & Related papers (2023-09-13T04:51:15Z)
- Multi-Object Tracking by Iteratively Associating Detections with Uniform Appearance for Trawl-Based Fishing Bycatch Monitoring [22.228127377617028]
The aim of in-trawl catch monitoring for use in fishing operations is to detect, track and classify fish targets in real-time from video footage.
We propose a novel MOT method, built upon an existing observation-centric tracking algorithm, with a new iterative association step.
Our method offers improved performance in tracking targets with uniform appearance and outperforms state-of-the-art techniques on our underwater fish datasets as well as the MOT17 dataset.
arXiv Detail & Related papers (2023-04-10T18:55:10Z)
- TempNet: Temporal Attention Towards the Detection of Animal Behaviour in Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-stage encoder: spatial, then temporal.
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
arXiv Detail & Related papers (2022-11-17T23:55:12Z)
- Balanced Deep CCA for Bird Vocalization Detection [5.635374645175903]
We develop a novel self-supervised learning technique for multi-modal data.
We learn (hidden) correlations between simultaneously recorded microphone (sound) signals and accelerometer (body vibration) signals.
arXiv Detail & Related papers (2022-11-17T07:09:07Z)
- DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
arXiv Detail & Related papers (2021-06-29T09:18:30Z)
- Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates [7.1273332508471725]
This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate.
We propose Temporal Bilinear Networks (TBEN) for encoding both audio and visual long range temporal information.
arXiv Detail & Related papers (2020-12-18T14:59:34Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax [88.11979569564427]
We provide the first systematic analysis of the underperformance of state-of-the-art models under long-tail distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.