Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring
- URL: http://arxiv.org/abs/2502.20838v1
- Date: Fri, 28 Feb 2025 08:34:12 GMT
- Title: Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring
- Authors: Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai
- Abstract summary: We introduce DSMIL-LocNet, a framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection.
- Score: 2.7418627495572134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
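For readers unfamiliar with Multiple Instance Learning, the sketch below illustrates the core idea the abstract describes: each instance (a short window of a long recording) is scored by a small attention network, the bag-level "call present?" prediction is made from the attention-weighted pooled embedding, and the attention weights themselves indicate which windows likely contain a call. This is a minimal illustrative sketch in PyTorch, not the authors' dual-stream implementation; the class name, layer sizes, and feature dimensions are assumptions. See the linked repository for the actual DSMIL-LocNet code.

```python
# Minimal sketch of attention-based MIL pooling with bag-level labels.
# NOT the authors' dual-stream DSMIL-LocNet; names and sizes are assumed.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Scores each instance (a short window of a long recording) and pools
    the bag into one embedding for a bag-level classification."""
    def __init__(self, feat_dim: int = 128, attn_dim: int = 64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, instances: torch.Tensor):
        # instances: (num_instances, feat_dim); one bag = one long segment
        scores = self.attention(instances)            # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)        # attention over instances
        bag_embedding = (weights * instances).sum(0)  # weighted pooling
        bag_logit = self.classifier(bag_embedding)    # bag-level logit
        # High-attention instances localize where a call likely occurs,
        # even though only bag-level labels were used for training.
        return bag_logit, weights.squeeze(-1)

# Usage: a 10-minute segment split into 120 five-second instance features.
bag = torch.randn(120, 128)
logit, attn = AttentionMILPooling()(bag)
print(torch.sigmoid(logit).item(), attn.argmax().item())
```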
Related papers
- Lightweight Hopfield Neural Networks for Bioacoustic Detection and Call Monitoring of Captive Primates [0.0]
We present a transparent, lightweight and fast-to-train associative memory AI model with Hopfield neural network architecture.
Adapted from a model developed to detect bat echolocation calls, this model monitors captive endangered black-and-white ruffed lemur (Varecia variegata) vocalisations.
arXiv Detail & Related papers (2025-11-04T17:46:03Z)
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.
Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
- Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment [5.380078543698624]
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization.
We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization.
Our method achieved first place in the BioDCASE 2025 Task 1 challenge with an average MSE of 0.30 across test datasets, compared to 0.58 for the deep learning baseline.
arXiv Detail & Related papers (2025-09-21T05:14:06Z)
- AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification [51.525891360380285]
AHDMIL is an Asymmetric Hierarchical Distillation Multi-Instance Learning framework.
It eliminates irrelevant patches through a two-step training process.
It consistently outperforms previous state-of-the-art methods in both classification performance and inference speed.
arXiv Detail & Related papers (2025-08-07T07:47:16Z)
- CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment [0.22499166814992438]
Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers.
We introduce a chunk-based approach integrating self-supervised learning (SSL) models selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling.
Our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo.
arXiv Detail & Related papers (2025-06-25T08:39:22Z)
- Acoustic Classification of Maritime Vessels using Learnable Filterbanks [0.0]
We present a deep learning model with robust performance across different recording scenarios.
Trained on the VTUAD hydrophone recordings from the Strait of Georgia, our model, CATFISH, achieves a state-of-the-art 96.63% test accuracy.
arXiv Detail & Related papers (2025-05-29T19:41:15Z)
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows.
The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.
We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z)
- Frequency-domain MLPs are More Effective Learners in Time Series Forecasting [67.60443290781988]
Time series forecasting plays a key role in different industrial domains, including finance, traffic, energy, and healthcare.
Most MLP-based forecasting methods suffer from point-wise mappings and information bottlenecks.
We propose FreTS, a simple yet effective architecture built upon frequency-domain MLPs for time series forecasting.
arXiv Detail & Related papers (2023-11-10T17:05:13Z)
- Raising the ClaSS of Streaming Time Series Segmentation [3.572107803162503]
We introduce ClaSS, a novel, efficient, and highly accurate algorithm for streaming time series segmentation.
ClaSS is significantly more precise than eight state-of-the-art competitors.
We also provide ClaSS as a window operator with an average throughput of 1k data points per second for the Apache Flink streaming engine.
arXiv Detail & Related papers (2023-10-31T13:07:41Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Dynamic Spectrum Mixer for Visual Recognition [17.180863898764194]
We propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
arXiv Detail & Related papers (2023-09-13T04:51:15Z)
- Multi-Object Tracking by Iteratively Associating Detections with Uniform Appearance for Trawl-Based Fishing Bycatch Monitoring [22.228127377617028]
The aim of in-trawl catch monitoring for use in fishing operations is to detect, track and classify fish targets in real-time from video footage.
We propose a novel MOT method, built upon an existing observation-centric tracking algorithm, with a new iterative association step.
Our method offers improved performance in tracking targets with uniform appearance and outperforms state-of-the-art techniques on our underwater fish datasets as well as the MOT17 dataset.
arXiv Detail & Related papers (2023-04-10T18:55:10Z)
- TempNet: Temporal Attention Towards the Detection of Animal Behaviour in Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-stage encoder: spatial, then temporal.
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
arXiv Detail & Related papers (2022-11-17T23:55:12Z)
- Balanced Deep CCA for Bird Vocalization Detection [5.635374645175903]
We develop a novel self-supervised learning technique for multi-modal data.
We learn (hidden) correlations between simultaneously recorded microphone (sound) signals and accelerometer (body vibration) signals.
arXiv Detail & Related papers (2022-11-17T07:09:07Z)
- DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
arXiv Detail & Related papers (2021-06-29T09:18:30Z)
- Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates [7.1273332508471725]
This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate.
We propose Temporal Bilinear Networks (TBEN) for encoding both audio and visual long range temporal information.
arXiv Detail & Related papers (2020-12-18T14:59:34Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax [88.11979569564427]
We provide the first systematic analysis of the underperformance of state-of-the-art models under long-tail distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.