LVOS: A Benchmark for Long-term Video Object Segmentation
- URL: http://arxiv.org/abs/2211.10181v2
- Date: Fri, 18 Aug 2023 12:35:59 GMT
- Title: LVOS: A Benchmark for Long-term Video Object Segmentation
- Authors: Lingyi Hong, Wenchao Chen, Zhongying Liu, Wei Zhang, Pinxue Guo,
Zhaoyu Chen, Wenqiang Zhang
- Abstract summary: We present a new benchmark dataset named LVOS, which consists of 220 videos with a total duration of 421 minutes.
The videos in our LVOS last 1.59 minutes on average, which is 20 times longer than videos in existing VOS datasets.
We propose a Diverse Dynamic Memory network (DDMemory) that consists of three complementary memory banks to exploit temporal information adequately.
- Score: 31.76468328063721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video object segmentation (VOS) benchmarks focus on short-term
videos which just last about 3-5 seconds and where objects are visible most of
the time. These videos are poorly representative of practical applications, and
the absence of long-term datasets restricts further investigation of VOS in
realistic scenarios. In this paper, we present a new
benchmark dataset named \textbf{LVOS}, which consists of 220 videos with a
total duration of 421 minutes. To the best of our knowledge, LVOS is the first
densely annotated long-term VOS dataset. The videos in our LVOS last 1.59
minutes on average, which is 20 times longer than videos in existing VOS
datasets. Each video includes various attributes, especially challenges arising
in the wild, such as long-term reappearing objects and cross-temporally similar
objects. Based on LVOS, we assess existing video object segmentation
algorithms and propose a Diverse Dynamic Memory network (DDMemory) that
consists of three complementary memory banks to exploit temporal information
adequately. The experimental results demonstrate the strengths and weaknesses
of prior methods, pointing to promising directions for further study. Data and code
are available at https://lingyihongfd.github.io/lvos.github.io/.
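As a rough illustration of the abstract's description of DDMemory, the sketch below shows how attention readouts from three complementary memory banks could be fused into a single feature before mask decoding. This is an assumed, minimal PyTorch-style example; the bank names ("cumulative", "global", "local"), module names, and tensor shapes are illustrative and not taken from the authors' released code.

# Hedged sketch of a three-bank memory read for semi-supervised VOS.
# All names and shapes are assumptions for illustration only.
import torch


def read_memory(query_key, mem_keys, mem_values):
    """Attention-style read: query_key (Ck, HW), mem_keys (Ck, N), mem_values (Cv, N)."""
    affinity = torch.softmax(mem_keys.t() @ query_key / mem_keys.shape[0] ** 0.5, dim=0)
    return mem_values @ affinity  # (Cv, HW)


class DiverseMemoryRead(torch.nn.Module):
    """Fuses readouts from three banks that cover different temporal ranges."""

    def __init__(self, val_dim=256):
        super().__init__()
        self.fuse = torch.nn.Conv2d(3 * val_dim, val_dim, kernel_size=1)

    def forward(self, query_key, banks, spatial_size):
        # banks: {'cumulative': (keys, values), 'global': ..., 'local': ...}
        h, w = spatial_size
        readouts = [read_memory(query_key, k, v) for k, v in banks.values()]
        fused = self.fuse(torch.cat(readouts, dim=0).view(1, -1, h, w))
        return fused  # (1, val_dim, h, w), to be passed to a mask decoder


if __name__ == "__main__":
    h, w, c_key, c_val, n = 24, 24, 64, 256, 128
    query = torch.randn(c_key, h * w)
    banks = {name: (torch.randn(c_key, n), torch.randn(c_val, n))
             for name in ("cumulative", "global", "local")}
    print(DiverseMemoryRead(c_val)(query, banks, (h, w)).shape)  # (1, 256, 24, 24)

The three banks differ only in which frames they store; the read and fusion step is shared, which is what lets such a design trade memory growth against temporal coverage.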
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Based on Vript, a video captioning model is trained that is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length with a constant number of video streaming tokens that are encoded in a streaming fashion and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation [29.07092353094942]
Video object segmentation (VOS) aims to distinguish and track target objects in a video.
Existing benchmarks mainly focus on short-term videos, where objects remain visible most of the time.
We propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations.
Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets.
arXiv Detail & Related papers (2024-04-30T07:50:29Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments.
arXiv Detail & Related papers (2023-02-03T17:20:03Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- 5th Place Solution for YouTube-VOS Challenge 2022: Video Object Segmentation [4.004851693068654]
Video object segmentation (VOS) has made significant progress with the rise of deep learning.
Similar objects are easily confused and tiny objects are difficult to find.
We propose a simple yet effective solution for this task.
arXiv Detail & Related papers (2022-06-20T06:14:27Z)
- Dual Temporal Memory Network for Efficient Video Object Segmentation [42.05305410986511]
One of the fundamental challenges in Video Object Segmentation (VOS) is how to make the most of temporal information to boost performance.
We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories.
Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network (a minimal sketch of this dual-memory pattern appears after this list).
arXiv Detail & Related papers (2020-03-13T06:07:45Z)
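Several of the entries above (DDMemory, MAVOS, the motion path memory in RAVOS, and the dual temporal memory network) revolve around how a frame-by-frame segmenter stores and retrieves past features. The snippet below is a minimal, assumed illustration of the short-/long-term split described in the last entry: a FIFO of recent frames plus a sparsely sampled long-term bank. Class and parameter names (DualTemporalMemory, short_capacity, long_stride) are hypothetical and not taken from any of the papers' code.

# Hedged sketch of a dual short-/long-term memory update and read-out policy.
from collections import deque

import torch


class DualTemporalMemory:
    def __init__(self, short_capacity=3, long_stride=5):
        self.short = deque(maxlen=short_capacity)  # most recent frame features
        self.long = []                             # sparsely sampled history
        self.long_stride = long_stride
        self.frame_idx = 0

    def update(self, key, value):
        """Store the key/value features of the frame just segmented."""
        self.short.append((key, value))
        if self.frame_idx % self.long_stride == 0:
            self.long.append((key, value))
        self.frame_idx += 1

    def gather(self):
        """Concatenate both banks along the memory axis for an attention read."""
        entries = list(self.long) + list(self.short)
        keys = torch.cat([k for k, _ in entries], dim=1)    # (C_key, N_total)
        values = torch.cat([v for _, v in entries], dim=1)  # (C_val, N_total)
        return keys, values


if __name__ == "__main__":
    mem = DualTemporalMemory()
    for _ in range(12):  # pretend we segmented 12 frames
        mem.update(torch.randn(64, 128), torch.randn(256, 128))
    k, v = mem.gather()
    print(k.shape, v.shape)  # bounded size: every 5th frame plus the last 3 frames

Because the short-term bank has a fixed capacity and the long-term bank grows only every few frames, the memory read cost stays roughly constant even on the minute-long videos that LVOS targets.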