OS-MSL: One Stage Multimodal Sequential Link Framework for Scene
Segmentation and Classification
- URL: http://arxiv.org/abs/2207.01241v1
- Date: Mon, 4 Jul 2022 07:59:34 GMT
- Title: OS-MSL: One Stage Multimodal Sequential Link Framework for Scene
Segmentation and Classification
- Authors: Ye Liu, Lingfeng Qiao, Di Yin, Zhuoxuan Jiang, Xinghua Jiang, Deqiang
Jiang, Bo Ren
- Abstract summary: We propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) to distinguish and leverage the two-fold semantics.
We tailor a specific module called DiffCorrNet to explicitly extract the differences and correlations among shots.
- Score: 11.707994658605546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene segmentation and classification (SSC) serve as a critical step
toward video structuring analysis. Intuitively, learning these two tasks jointly
can be mutually beneficial through shared information. However, scene
segmentation focuses on the local differences between adjacent shots, while
classification needs a global representation of scene segments, which can cause
the model to be dominated by one of the two tasks during training. In this
paper, to overcome these challenges from an alternative perspective, we unite
the two tasks into a single one via a new form of shot-link prediction: a link
connects two adjacent shots, indicating that they belong to the same scene or
category. To this end, we propose a general One Stage Multimodal Sequential
Link Framework (OS-MSL) that both distinguishes and leverages the two-fold
semantics by reforming the two learning tasks into a unified one. Furthermore,
we tailor a specific module called DiffCorrNet to explicitly extract the
differences and correlations among shots. Extensive experiments are conducted
on a brand-new large-scale dataset collected from real-world applications, as
well as on MovieScenes. The results demonstrate the effectiveness of our
proposed method against strong baselines.
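
To make the shot-link formulation concrete, below is a minimal sketch of how per-shot scene and category labels could be converted into binary link labels and decoded back into labeled segments. The function names and decoding convention are illustrative assumptions based on the abstract, not the authors' implementation.

```python
# Minimal sketch of the shot-link formulation described in the abstract.
# All names and conventions here are illustrative assumptions, not the
# authors' implementation.
from typing import List, Tuple

def labels_to_links(scene_ids: List[int]) -> List[int]:
    """Turn per-shot scene annotations into binary link labels.

    link[i] = 1 iff shot i and shot i+1 belong to the same scene
    (and hence the same category), so segmentation and classification
    collapse into one sequence-labeling problem over N-1 links.
    """
    return [
        int(scene_ids[i] == scene_ids[i + 1])
        for i in range(len(scene_ids) - 1)
    ]

def links_to_segments(links: List[int], categories: List[int]) -> List[Tuple[int, int, int]]:
    """Decode predicted links back into (start, end, category) segments.

    A predicted link of 0 marks a scene boundary between shot i and i+1.
    The segment category is taken from its first shot, assuming shot-level
    category predictions are available.
    """
    segments, start = [], 0
    for i, link in enumerate(links):
        if link == 0:  # boundary after shot i
            segments.append((start, i, categories[start]))
            start = i + 1
    segments.append((start, len(links), categories[start]))
    return segments

# Example: 6 shots, scenes [A, A, B, B, B, C]
scene_ids = [0, 0, 1, 1, 1, 2]
categories = [3, 3, 7, 7, 7, 1]
links = labels_to_links(scene_ids)               # [1, 0, 1, 1, 0]
print(links_to_segments(links, categories))      # [(0, 1, 3), (2, 4, 7), (5, 5, 1)]
```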
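The abstract names DiffCorrNet as extracting differences and correlations among shots but gives no architectural detail here, so the following is a purely illustrative toy sketch: given shot embeddings, it builds per-link features from element-wise differences (a segmentation-oriented signal) and cosine correlations (a same-scene-membership signal) of adjacent shots.

```python
# Toy sketch in the spirit of the DiffCorrNet module named in the abstract.
# The actual module design is not specified here; this is an assumption.
import torch
import torch.nn.functional as F

def diff_corr_features(shot_emb: torch.Tensor) -> torch.Tensor:
    """shot_emb: (N, D) shot embeddings -> (N-1, D+1) link features.

    For each adjacent pair, concatenate the element-wise difference
    (useful for locating boundaries) with the cosine correlation
    (useful for judging same-scene membership).
    """
    left, right = shot_emb[:-1], shot_emb[1:]           # (N-1, D) each
    diff = right - left                                 # local difference
    corr = F.cosine_similarity(left, right, dim=-1)     # (N-1,) correlation
    return torch.cat([diff, corr.unsqueeze(-1)], dim=-1)

emb = torch.randn(6, 128)          # 6 shots, 128-d embeddings
feats = diff_corr_features(emb)    # (5, 129) per-link features
print(feats.shape)
```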
Related papers
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised
Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
arXiv Detail & Related papers (2023-07-10T17:59:40Z)
- AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation.
We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Integrative Few-Shot Learning for Classification and Segmentation [37.50821005917126]
We introduce the integrative task of few-shot classification and segmentation (FS-CS).
FS-CS aims to classify and segment target objects in a query image when the target classes are given with a few examples.
We propose the integrative few-shot learning framework for FS-CS, which trains a learner to construct class-wise foreground maps.
arXiv Detail & Related papers (2022-03-29T16:14:40Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- BriNet: Towards Bridging the Intra-class and Inter-class Gaps in One-Shot Segmentation [84.2925550033094]
Few-shot segmentation focuses on the generalization of models to segment unseen object instances with limited training samples.
We propose a framework, BriNet, to bridge the gaps between the extracted features of the query and support images.
Experimental results demonstrate the effectiveness of our framework, which outperforms other competitive methods.
arXiv Detail & Related papers (2020-08-14T07:45:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.