GISE-TTT: A Framework for Global Information Segmentation and Enhancement
- URL: http://arxiv.org/abs/2504.00879v2
- Date: Wed, 30 Apr 2025 00:45:55 GMT
- Title: GISE-TTT: A Framework for Global Information Segmentation and Enhancement
- Authors: Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu
- Abstract summary: GISE-TTT is a novel architecture that integrates Temporal Transformer (TTT) layers into transformer-based frameworks. This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS).
- Score: 0.1826915781917785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal horizons. To overcome this limitation, we introduce GISE-TTT, a novel architecture that integrates Temporal Transformer (TTT) layers into transformer-based frameworks through a co-designed hierarchical approach. The TTT layer systematically condenses historical temporal information into hidden states that encode globally coherent contextual representations. By leveraging multi-stage contextual aggregation through hierarchical concatenation, our framework progressively refines spatiotemporal dependencies across network layers. This design represents the first systematic empirical evidence that distributing global information across multiple network layers is critical for optimal dependency utilization in video segmentation tasks. Ablation studies demonstrate that incorporating TTT modules at high-level feature stages significantly enhances global modeling capabilities, thereby improving the network's ability to capture long-range temporal relationships. Extensive experiments on DAVIS 2017 show that GISE-TTT achieves a 3.2% improvement in segmentation accuracy over the baseline model, providing comprehensive evidence that global information should be strategically leveraged throughout the network architecture. The code will be made available at: https://github.com/uuool/GISE-TTT.
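Since the abstract only sketches the mechanism, the following minimal PyTorch sketch illustrates one plausible reading of it: a recurrent hidden state condenses frame history into a global context, which is then concatenated back into per-frame features at a given stage. All module and parameter names are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
import torch
import torch.nn as nn

class GlobalTemporalStage(nn.Module):
    """Toy stand-in for a TTT-style stage: condense history, inject context."""
    def __init__(self, dim: int):
        super().__init__()
        # A GRU cell stands in for the TTT layer's hidden-state update rule.
        self.update = nn.GRUCell(dim, dim)
        # Fuse the concatenated [frame feature, global state] back to `dim`.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, B, dim), per-frame feature vectors for one video clip.
        T, B, dim = frames.shape
        state = frames.new_zeros(B, dim)  # global context, built frame by frame
        out = []
        for t in range(T):
            state = self.update(frames[t], state)  # condense history into state
            out.append(self.fuse(torch.cat([frames[t], state], dim=-1)))
        return torch.stack(out)  # (T, B, dim), context-enriched features

feats = torch.randn(8, 2, 256)  # 8 frames, batch of 2, 256-dim features
print(GlobalTemporalStage(256)(feats).shape)  # torch.Size([8, 2, 256])
```

Per the abstract, the paper distributes such context injection across several network stages; stacking one module of this kind per stage would be the analogous arrangement in this sketch.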
Related papers
- Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation [70.15341084443236]
We re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. We propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, a Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features, and a Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge.
arXiv Detail & Related papers (2025-03-11T04:49:25Z)
- HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer [5.96521715927858]
HiFiSeg is a novel network for colon polyp segmentation that enhances high-frequency information processing.
GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features.
SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps.
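As a rough illustration of the parallel global-local fusion described above (an assumption for intuition only, not the HiFiSeg implementation, whose GLIM internals are not given here), one might fuse a pooled global context with a local convolutional path:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, 3, padding=1)  # fine-grained local path
        self.globl = nn.Conv2d(ch, ch, 1)             # applied to pooled context
        self.out = nn.Conv2d(2 * ch, ch, 1)           # fuse both paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.globl(F.adaptive_avg_pool2d(x, 1))   # (B, C, 1, 1) global context
        g = g.expand_as(x)                            # broadcast over spatial grid
        return self.out(torch.cat([self.local(x), g], dim=1))

x = torch.randn(1, 64, 32, 32)
print(GlobalLocalFusion(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```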
arXiv Detail & Related papers (2024-10-03T14:36:22Z)
- ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis [24.078490055421852]
Whole slide image (WSI) analysis has become increasingly important in the medical imaging community.
In this paper, we propose the first continual learning framework for WSI analysis, named ConSlide.
arXiv Detail & Related papers (2023-08-25T11:58:25Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification [3.0831477850153224]
We introduce a novel module called the Global-aware Filter layer (GF layer) in this work.
We present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV).
Experiments on the Voxceleb and SITW databases demonstrate that the DS-TDNN achieves a relative improvement of 10% together with a relative decline of 20% in computational cost.
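The summary does not specify the GF layer's internals; one common way to build a "global-aware filter" is a learnable frequency-domain filter (in the style of global-filter networks), sketched below purely as an assumption rather than as the DS-TDNN code:

```python
import torch
import torch.nn as nn

class GlobalFilter1d(nn.Module):
    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        # Complex filter over rFFT bins, stored as (real, imag) pairs.
        self.weight = nn.Parameter(torch.randn(dim, seq_len // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, seq_len), e.g. frame-level speaker features.
        spec = torch.fft.rfft(x, dim=-1)                  # global receptive field
        spec = spec * torch.view_as_complex(self.weight)  # per-bin filtering
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

x = torch.randn(4, 80, 200)  # batch of 4, 80 channels, 200 frames
print(GlobalFilter1d(80, 200)(x).shape)  # torch.Size([4, 80, 200])
```

A frequency-domain filter of this kind touches every time step at once, which is one way to obtain the global view that pure TDNN layers lack.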
arXiv Detail & Related papers (2023-03-20T10:58:12Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three aspects of merit: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Trajectory-User Linking via Hierarchical Spatio-Temporal Attention Networks [39.6505270702036]
Trajectory-User Linking (TUL) is crucial for human mobility modeling by linking trajectories to users.
Existing works mainly rely on the neural framework to encode the temporal dependencies in trajectories.
This work presents a new hierarchical spatio-temporal attention neural network called AttnTUL to encode the local trajectory transitional patterns and global spatial dependencies for TUL.
arXiv Detail & Related papers (2023-02-11T06:22:50Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
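A minimal sketch of what an anchor-free triplet-predicting head could look like (an illustrative assumption, not HTNet's actual head): each temporal position regresses distances to the action's start and end and scores a class.

```python
import torch
import torch.nn as nn

class AnchorFreeTALHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.reg = nn.Conv1d(dim, 2, 1)            # offsets to (start, end)
        self.cls = nn.Conv1d(dim, num_classes, 1)  # per-position class logits

    def forward(self, feats: torch.Tensor):
        # feats: (B, dim, T) temporal feature map from the backbone.
        offsets = self.reg(feats).exp()            # positive start/end distances
        t = torch.arange(feats.shape[-1], device=feats.device).float()
        start = t - offsets[:, 0]                  # decoded start times
        end = t + offsets[:, 1]                    # decoded end times
        return start, end, self.cls(feats)         # <start, end, class> triplets

feats = torch.randn(2, 256, 100)
s, e, c = AnchorFreeTALHead(256, 20)(feats)
print(s.shape, e.shape, c.shape)  # (2, 100) (2, 100) (2, 20, 100)
```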
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
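Snippet-order prediction as a supervisory signal can be sketched as a small permutation-classification task; the sizes and names below are assumptions, and ASOP's adaptive weighting is omitted.

```python
import itertools, random
import torch
import torch.nn as nn

K = 3                                           # snippets per sample
PERMS = list(itertools.permutations(range(K)))  # 6 possible orders

class OrderPredictor(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(K * dim, len(PERMS))  # classify the permutation

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (B, K, dim) snippet embeddings in (possibly shuffled) order.
        return self.head(snippets.flatten(1))

emb = torch.randn(4, K, 128)
label = random.randrange(len(PERMS))
shuffled = emb[:, list(PERMS[label])]           # apply one permutation
logits = OrderPredictor(128)(shuffled)
loss = nn.CrossEntropyLoss()(logits, torch.full((4,), label))
print(loss.item())
```

Predicting the applied permutation forces the encoder to represent temporal order, which is the supervisory signal such modules extract from unlabeled video.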
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- HS3: Learning with Proper Task Complexity in Hierarchically Supervised Semantic Segmentation [81.87943324048756]
We propose Hierarchically Supervised Semantic Segmentation (HS3), a training scheme that supervises intermediate layers in a segmentation network to learn meaningful representations by varying task complexity.
Our proposed HS3-Fuse framework further improves segmentation predictions and achieves state-of-the-art results on two large segmentation benchmarks: NYUD-v2 and Cityscapes.
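A bare-bones sketch of hierarchical supervision with per-stage auxiliary heads follows; HS3 additionally tunes task complexity per stage, which is only crudely mimicked here by giving shallow stages fewer classes (an assumption for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedStages(nn.Module):
    def __init__(self, chs=(32, 64, 128), classes=(2, 10, 21)):
        super().__init__()
        # One auxiliary head per stage; shallow stages get coarser label sets.
        self.heads = nn.ModuleList(nn.Conv2d(c, k, 1) for c, k in zip(chs, classes))

    def forward(self, feats, targets):
        # feats[i]: (B, chs[i], H, W) stage features; targets[i]: (B, H, W) labels.
        losses = [F.cross_entropy(h(f), t)
                  for h, f, t in zip(self.heads, feats, targets)]
        return sum(losses)  # total training loss over all supervised stages

feats = [torch.randn(2, c, 16, 16) for c in (32, 64, 128)]
targets = [torch.randint(0, k, (2, 16, 16)) for k in (2, 10, 21)]
print(SupervisedStages()(feats, targets).item())
```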
arXiv Detail & Related papers (2021-11-03T16:33:29Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, the Structured Graph Module (SGM) divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
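The region-wise neighbor aggregation can be approximated, for intuition only, by pooling frame features within fixed temporal regions and sharing the region summaries with every frame node (names and sizes are assumptions, not the SGM code):

```python
import torch
import torch.nn as nn

class RegionAggregate(nn.Module):
    def __init__(self, dim: int, regions: int = 4):
        super().__init__()
        self.regions = regions
        self.mix = nn.Linear(dim * (regions + 1), dim)  # node + region summaries

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) frame-node features; T must be divisible by `regions`.
        B, T, dim = x.shape
        pooled = x.view(B, self.regions, T // self.regions, dim).mean(2)  # (B, R, dim)
        ctx = pooled.flatten(1).unsqueeze(1).expand(B, T, -1)  # share with each node
        return self.mix(torch.cat([x, ctx], dim=-1))

x = torch.randn(2, 16, 64)
print(RegionAggregate(64)(x).shape)  # torch.Size([2, 16, 64])
```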
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
- Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose a Spatio-Temporal Representation Factorization (STRF) module for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
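STRF's exact design is not detailed here; the generic (2+1)D pattern below only illustrates the basic idea of factorizing a spatio-temporal operator into separate spatial and temporal parts, which is the family of designs such modules plug into.

```python
import torch
import torch.nn as nn

class Factorized3d(nn.Module):
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.spatial = nn.Conv3d(cin, cout, (1, 3, 3), padding=(0, 1, 1))   # per-frame
        self.temporal = nn.Conv3d(cout, cout, (3, 1, 1), padding=(1, 0, 0)) # across frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); apply the spatial factor, then the temporal one.
        return self.temporal(torch.relu(self.spatial(x)))

x = torch.randn(1, 3, 8, 56, 56)
print(Factorized3d(3, 32)(x).shape)  # torch.Size([1, 32, 8, 56, 56])
```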
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and unreliable confidence scores used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation [144.50154657257605]
We propose an efficient framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module.
Our searched architecture, namely Auto-Panoptic, achieves the new state-of-the-art on the challenging COCO and ADE20K benchmarks.
arXiv Detail & Related papers (2020-10-30T08:34:35Z)
- Global Context-Aware Progressive Aggregation Network for Salient Object Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.