Segment and Caption Anything
- URL: http://arxiv.org/abs/2312.00869v2
- Date: Tue, 26 Mar 2024 12:56:55 GMT
- Title: Segment and Caption Anything
- Authors: Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu
- Abstract summary: We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
- Score: 126.20201216616137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically on the order of tens of millions), our approach costs less computation, less memory, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining, since the pre-training data contains only category names instead of full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/.
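The listing describes the architecture only in prose. Below is a minimal PyTorch sketch of what a lightweight query-based feature mixer of this kind could look like: a small set of learnable queries cross-attends to SAM's region features and is then projected into a language model's embedding space. The module name, dimensions, number of queries, and the cross-attention layout are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class QueryBasedFeatureMixer(nn.Module):
    """Hypothetical sketch: learnable queries summarize SAM region
    features via cross-attention, then are projected into the (frozen)
    language model's embedding space. Names and sizes are illustrative."""

    def __init__(self, sam_dim=256, lm_dim=4096, num_queries=8,
                 num_heads=8, num_layers=2):
        super().__init__()
        # Learnable queries that will "summarize" the region.
        self.queries = nn.Parameter(torch.randn(num_queries, sam_dim))
        # A tiny stack of cross-attention blocks; at these sizes the
        # trainable parameter count stays in the tens of millions.
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(sam_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList(
            [nn.LayerNorm(sam_dim) for _ in range(num_layers)])
        # Projection into the language model's embedding space.
        self.to_lm = nn.Linear(sam_dim, lm_dim)

    def forward(self, region_feats):
        # region_feats: (B, N, sam_dim) tokens from SAM for one region.
        B = region_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(q, region_feats, region_feats)
            q = norm(q + out)  # residual + LayerNorm
        # (B, num_queries, lm_dim): prefix tokens fed to the language model.
        return self.to_lm(q)
```

Under the weak supervision pretraining described in the abstract, the caption target for such a module would simply be the region's category name (e.g., "dog") rather than a full sentence, which is what lets ordinary detection and segmentation datasets substitute for scarce regional caption data.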
Related papers
- Scalable and Domain-General Abstractive Proposition Segmentation [20.532804009152255]
We focus on the task of abstractive proposition segmentation (APS): transforming text into simple, self-contained, well-formed sentences.
We first introduce evaluation metrics for the task to measure several dimensions of quality.
We then propose a scalable, yet accurate, proposition segmentation model.
arXiv Detail & Related papers (2024-06-28T10:24:31Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- Segment Anything Model is a Good Teacher for Local Feature Learning [19.66262816561457]
Local feature detection and description play an important role in many computer vision tasks.
Data-driven local feature learning methods rely on pixel-level correspondences for training.
We propose SAMFeat, which introduces SAM as a teacher to guide local feature learning.
arXiv Detail & Related papers (2023-09-29T05:29:20Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Boosting Weakly-Supervised Temporal Action Localization with Text Information [94.48602948837664]
We propose a Text-Segment Mining (TSM) mechanism, which constructs a text description from the action class label and regards the text as a query to mine all class-related segments.
We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments from videos to complete the text sentence.
Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin.
arXiv Detail & Related papers (2023-05-01T00:07:09Z)
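As a rough illustration of the text-as-query idea described for TSM above, the following sketch scores per-segment video features against an embedding of a class-label prompt and keeps the segments above a similarity threshold. The shared embedding space, the prompt template, and the threshold rule are assumptions for illustration, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def mine_class_segments(segment_feats, text_feat, threshold=0.5):
    """Hypothetical text-as-query segment mining.

    segment_feats: (T, D) per-segment video features.
    text_feat:     (D,) embedding of a prompt built from the class
                   label, e.g. "a video of a person {label}".
    Returns indices of segments whose cosine similarity to the text
    query exceeds the threshold, along with the similarity scores.
    """
    seg = F.normalize(segment_feats, dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    sims = seg @ txt  # (T,) cosine similarities
    return (sims > threshold).nonzero(as_tuple=True)[0], sims
```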
- SASFormer: Transformers for Sparsely Annotated Semantic Segmentation [44.758672633271956]
We propose a simple yet effective sparsely annotated semantic segmentation framework based on SegFormer, dubbed SASFormer.
Specifically, the framework first generates hierarchical patch attention maps, which are then multiplied by the network predictions to produce correlated regions separated by valid labels.
arXiv Detail & Related papers (2022-12-05T04:33:12Z)
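The attention-times-prediction step described for SASFormer above can be pictured with a small sketch like the following; the tensor shapes and the reduction of the hierarchical patch attention maps to a single per-pixel map are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def correlated_regions(attn_map, logits):
    """Hypothetical sketch: multiply patch attention by predictions.

    attn_map: (B, 1, H, W) saliency, assumed here to be the hierarchical
              patch attention maps already fused into one map.
    logits:   (B, C, H, W) segmentation predictions.
    Weighting the class probabilities by the attention map highlights
    regions correlated with the sparse valid labels.
    """
    probs = logits.softmax(dim=1)  # (B, C, H, W) class probabilities
    return attn_map * probs        # (B, C, H, W) correlated regions
```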
- On the Difficulty of Segmenting Words with Attention [32.97060026226872]
We show, however, that even on monolingual data, attention-based word segmentation is brittle.
In experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task.
arXiv Detail & Related papers (2021-09-21T11:37:08Z)
- Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech [24.187382590960254]
Children do not build their lexicon by segmenting spoken input into phonemes and then building up words from them.
This suggests that the ideal way of learning a language is by starting from full semantic units.
We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient.
arXiv Detail & Related papers (2020-06-15T13:20:13Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision settings.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.