RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
- URL: http://arxiv.org/abs/2504.01128v2
- Date: Thu, 03 Apr 2025 09:29:08 GMT
- Title: RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
- Authors: Andrei Dumitriu, Florin Tatui, Florin Miron, Aakash Ralhan, Radu Tudor Ionescu, Radu Timofte
- Abstract summary: RipVIS is a large-scale video instance segmentation benchmark designed for rip current segmentation. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations. Results are reported in terms of multiple metrics, with a particular focus on the $F_2$ score to prioritize recall and reduce false negatives.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach-related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and the lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large-scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring $184$ videos ($212,328$ frames), of which $150$ videos ($163,528$ frames) contain rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations, across multiple global locations, including the USA, Mexico, Costa Rica, Portugal, Italy, Greece, Romania, Sri Lanka, Australia, and New Zealand. Most videos are annotated at $5$ FPS to ensure accuracy in dynamic scenarios, supplemented by an additional $34$ videos ($48,800$ frames) without rip currents. We conduct comprehensive experiments with Mask R-CNN, Cascade Mask R-CNN, SparseInst and YOLO11, fine-tuning these models for the task of rip current segmentation. Results are reported in terms of multiple metrics, with a particular focus on the $F_2$ score to prioritize recall and reduce false negatives. To enhance segmentation performance, we introduce a novel post-processing step based on Temporal Confidence Aggregation (TCA). RipVIS aims to set a new standard for rip current segmentation, contributing towards safer beach environments. We offer a benchmark website to share data, models, and results with the research community, encouraging ongoing collaboration and future contributions, at https://ripvis.ai.
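The emphasis on the $F_2$ score reflects the safety setting: a missed rip current (false negative) is far costlier than a false alarm, and $F_\beta$ with $\beta = 2$ weights recall twice as heavily as precision. The abstract does not spell out how Temporal Confidence Aggregation works, so the sketch below pairs the standard $F_\beta$ definition with a plausible sliding-window smoothing of per-frame confidences; the window size, threshold, and function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 weights recall twice as much as precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def temporal_confidence_aggregation(scores, window: int = 5, threshold: float = 0.5):
    """Hypothetical TCA-style smoothing: average per-frame detection confidences
    over a sliding window so that brief dips or spikes do not flip the per-frame
    decision. The paper's exact aggregation rule may differ."""
    scores = np.asarray(scores, dtype=float)
    pad = window // 2
    padded = np.pad(scores, pad, mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    return smoothed >= threshold

# A single low-confidence frame inside a confident run is retained.
print(temporal_confidence_aggregation([0.9, 0.8, 0.2, 0.85, 0.9]))
```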
Related papers
- Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results
Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide.
We introduce a comprehensive dataset containing $2,466$ images with newly created polygonal annotations for instance segmentation.
We present a novel dataset of $17$ drone videos (about $24K$ frames) captured at $30$ FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection.
arXiv Detail & Related papers (2025-04-03T13:14:16Z)
- LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark
We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas.
LaRS is composed of over 4000 per-pixel labeled key frames with nine preceding frames to allow utilization of the temporal texture.
We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions.
arXiv Detail & Related papers (2023-08-18T15:21:15Z)
- Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models
We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of recurrent computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
arXiv Detail & Related papers (2023-04-02T19:08:02Z)
- Mask-Free Video Instance Segmentation
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection (see the sketch after this entry).
arXiv Detail & Related papers (2023-03-28T11:48:07Z)
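Since the TK-Loss summary names a concrete mechanism (patch matching followed by K-nearest-neighbor selection), a toy version of the matching step may help make it tangible. The function below is an illustrative reimplementation of that step only, not the authors' code; the patch size and `k` are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def knn_patch_matches(frame_a: torch.Tensor, frame_b: torch.Tensor,
                      patch: int = 7, k: int = 5) -> torch.Tensor:
    """Toy one-to-many patch matching between two frames of shape (C, H, W).
    For every patch location in frame_a, return the indices of its k most
    similar patches in frame_b; mask consistency would then be enforced
    across these matches."""
    # Unfold both frames into flattened patch descriptors: (num_patches, C*patch*patch).
    pa = F.unfold(frame_a.unsqueeze(0), patch, stride=patch)[0].T
    pb = F.unfold(frame_b.unsqueeze(0), patch, stride=patch)[0].T
    dists = torch.cdist(pa, pb)               # pairwise L2 distances
    _, idx = dists.topk(k, dim=1, largest=False)
    return idx                                 # (num_patches_a, k)

a, b = torch.rand(3, 56, 56), torch.rand(3, 56, 56)
print(knn_patch_matches(a, b).shape)           # torch.Size([64, 5])
```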
- Flow-Guided Sparse Transformer for Video Deblurring
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA uses the estimated optical flow as guidance to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames (see the sketch after this entry).
The proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results on real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
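The FGSW-MSA description boils down to sampling neighbor-frame features at flow-displaced locations, which is easy to sketch with `grid_sample`. The snippet below only illustrates that sampling idea; it is not the authors' implementation, and the feature and flow shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_guided_sample(neighbor_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample neighbor-frame features (C, H, W) at locations displaced by an
    optical flow field (2, H, W), so each query position reads the element
    that corresponds to the same scene patch in the neighboring frame."""
    _, H, W = neighbor_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Displace the regular sampling grid by the estimated flow (dx, dy).
    x = xs + flow[0]
    y = ys + flow[1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    return F.grid_sample(neighbor_feat.unsqueeze(0), grid.unsqueeze(0),
                         align_corners=True)[0]

feat, flow = torch.rand(64, 32, 32), torch.rand(2, 32, 32)
print(flow_guided_sample(feat, flow).shape)    # torch.Size([64, 32, 32])
```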
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Occluded Video Instance Segmentation
We collect a large-scale dataset called OVIS for occluded video instance segmentation.
OVIS consists of 296k high-quality instance masks from 25 semantic categories.
The highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario.
arXiv Detail & Related papers (2021-02-02T15:35:43Z)
- Tamed Warping Network for High-Resolution Semantic Video Segmentation
We build a non-key-frame CNN, fusing warped context features with current spatial details.
Based on the feature fusion, our Context Feature Rectification (CFR) module learns the model's difference from a per-frame model to correct the warped features.
Our Residual-Guided Attention (RGA) module utilizes the residual maps in the compressed domain to help CFR focus on error-prone regions (a rough sketch of the warp-fuse-correct pattern follows this entry).
arXiv Detail & Related papers (2020-05-04T09:36:03Z)
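This entry describes a warp-fuse-correct pattern: context features warped from a key frame are fused with cheap current-frame features, and a lightweight module predicts a correction. The module below is a rough sketch of that pattern under assumed layer sizes; the paper's actual CFR and RGA modules are more involved and additionally exploit compressed-domain residual maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpFuseCorrect(nn.Module):
    """Hypothetical warp-fuse-correct block: fuse warped key-frame context
    features with current-frame spatial features, then predict a residual
    correction for the (error-prone) warped features."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.correct = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, warped_key_feat: torch.Tensor,
                current_feat: torch.Tensor) -> torch.Tensor:
        fused = F.relu(self.fuse(torch.cat([warped_key_feat, current_feat], dim=1)))
        return warped_key_feat + self.correct(fused)

m = WarpFuseCorrect()
out = m(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(out.shape)                               # torch.Size([1, 64, 32, 32])
```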