TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
- URL: http://arxiv.org/abs/2312.06630v3
- Date: Sun, 17 Mar 2024 20:15:45 GMT
- Title: TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
- Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao,
- Abstract summary: Training on large-scale datasets can boost the performance of video instance segmentation while the datasets for VIS are hard to scale up due to the high labor cost.
What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity.
We conduct extensive evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO.
Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all benchmarks.
- Score: 48.75470418596875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training on large-scale datasets can boost the performance of video instance segmentation while the annotated datasets for VIS are hard to scale up due to the high labor cost. What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity. However, due to the heterogeneity in category space, as mask precision increases with the data volume, simply utilizing multiple datasets will dilute the attention of models on different taxonomies. Thus, increasing the data scale and enriching taxonomy space while improving classification precision is important. In this work, we analyze that providing extra taxonomy information can help models concentrate on specific taxonomy, and propose our model named Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this vital challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all benchmarks. These appealing and encouraging results demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/rkzheng99/TMT-VIS .
Related papers
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - Scaling Sequential Recommendation Models with Transformers [0.0]
We take inspiration from the scaling laws observed in training large language models, and explore similar principles for sequential recommendation.
Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application.
We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains.
arXiv Detail & Related papers (2024-12-10T15:20:56Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - LMSeg: Language-guided Multi-dataset Segmentation [15.624630978858324]
We propose a Language-guided Multi-dataset framework, dubbed LMSeg, which supports both semantic and panoptic segmentation.
LMSeg maps category names to a text embedding space as a unified taxonomy, instead of using inflexible one-hot label.
Experiments demonstrate that our method achieves significant improvements on four semantic and three panoptic segmentation datasets.
arXiv Detail & Related papers (2023-02-27T03:43:03Z) - Automatic universal taxonomies for multi-domain semantic segmentation [1.4364491422470593]
Training semantic segmentation models on multiple datasets has sparked a lot of recent interest in the computer vision community.
established datasets have mutually incompatible labels which disrupt principled inference in the wild.
We address this issue by automatic construction of universal through iterative dataset integration.
arXiv Detail & Related papers (2022-07-18T08:53:17Z) - Beyond Transfer Learning: Co-finetuning for Action Localisation [64.07196901012153]
We propose co-finetuning -- simultaneously training a single model on multiple upstream'' and downstream'' tasks.
We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data.
We also show how we can easily extend our approach to multiple upstream'' datasets to further improve performance.
arXiv Detail & Related papers (2022-07-08T10:25:47Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - The Devil is in Classification: A Simple Framework for Long-tail Object
Detection and Instance Segmentation [93.17367076148348]
We investigate performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset.
We unveil that a major cause is the inaccurate classification of object proposals.
We propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach.
arXiv Detail & Related papers (2020-07-23T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.