Rethinking the Architecture Design for Efficient Generic Event Boundary Detection
- URL: http://arxiv.org/abs/2407.12622v1
- Date: Wed, 17 Jul 2024 14:49:54 GMT
- Title: Rethinking the Architecture Design for Efficient Generic Event Boundary Detection
- Authors: Ziwei Zheng, Zechuan Zhang, Yulin Wang, Shiji Song, Gao Huang, Le Yang
- Abstract summary: Generic event boundary detection (GEBD) is inspired by the human visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks.
SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios.
We experimentally reexamine the architecture of GEBD models and contribute to addressing this challenge.
- Score: 71.50748944513379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image-domain backbones in GEBD models can contain plenty of architectural redundancy, motivating us to gradually "modernize" each component to enhance efficiency. Thirdly, we show that GEBD models using image-domain backbones, which conduct spatiotemporal learning in a spatial-then-temporal greedy manner, can suffer from a distraction issue, which might be the main source of their inefficiency. Using a video-domain backbone to jointly conduct spatiotemporal modeling is an effective solution to this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods by up to a 1.7% performance gain and a 280% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with consideration of model complexity, particularly in resource-aware applications. The code is available at https://github.com/Ziwei-Zheng/EfficientGEBD.
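The abstract contrasts spatial-then-temporal greedy processing (image-domain backbones) with joint spatiotemporal modeling (video-domain backbones). The toy NumPy sketch below illustrates the two orderings on a synthetic clip with one scene change; the function names and the simple pooling/differencing scoring are invented for illustration and are not the EfficientGEBD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_scores_spatial_then_temporal(video):
    # Greedy two-stage scoring: pool each frame spatially first
    # (image-domain backbone analogue), then difference over time.
    frame_feats = video.mean(axis=(1, 2))                        # (T, C)
    return np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)  # (T-1,)

def boundary_scores_joint(video, window=2):
    # Joint scoring: compare pooled features of the short clips
    # before and after each candidate boundary (video-domain analogue).
    t_len = video.shape[0]
    scores = []
    for t in range(1, t_len):
        before = video[max(0, t - window):t].mean(axis=(0, 1, 2))
        after = video[t:t + window].mean(axis=(0, 1, 2))
        scores.append(np.linalg.norm(after - before))
    return np.array(scores)

# Toy clip: 8 noisy 8x8 RGB frames with a scene change between
# frames 3 and 4 (all-zeros content switches to all-ones).
video = np.concatenate([np.zeros((4, 8, 8, 3)), np.ones((4, 8, 8, 3))])
video += 0.05 * rng.standard_normal(video.shape)

print(boundary_scores_spatial_then_temporal(video).argmax())  # 3
print(boundary_scores_joint(video).argmax())                  # 3
```

Both scorers peak at index 3 (the boundary between frames 3 and 4) on this trivial clip; the paper's point is that on real videos the greedy two-stage route can be distracted by spatial content, which joint spatiotemporal modeling avoids.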
Related papers
- Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion [57.09673862519791]
This paper introduces JGA-LBD, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality.
arXiv Detail & Related papers (2026-01-01T12:48:56Z) - Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer [40.778996326009185]
We present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry. We propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention.
arXiv Detail & Related papers (2025-12-26T06:12:17Z) - PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z) - RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation [75.61028930882144]
We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. We introduce Reinforcement Learning with Geometric Feedback (RLGF), which uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. RLGF substantially reduces geometric errors (e.g., VP error by 21%, depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance.
arXiv Detail & Related papers (2025-09-20T02:23:36Z) - Retrieval-augmented reasoning with lean language models [5.615564811138556]
We develop a retrieval-augmented conversational agent capable of interpreting complex, domain-specific queries. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models. All implementation details and code are publicly released to support adaptation across domains.
arXiv Detail & Related papers (2025-08-15T10:38:15Z) - Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach [7.631288333466648]
We conduct the first large-scale study evaluating leading generative models in data-scarce settings. We propose a unified diffusion-based generative framework that can synthesize high-fidelity time series using just a few examples.
arXiv Detail & Related papers (2025-05-26T18:39:04Z) - Super-Resolution Generative Adversarial Networks based Video Enhancement [0.40964539027092906]
This work introduces an enhanced approach to video super-resolution by extending the ordinary single-image super-resolution GAN (SRGAN) structure to handle video data. A modified framework that incorporates 3D Non-Local Blocks is developed, enabling the model to capture relationships across both spatial and temporal dimensions. Results show improved temporal coherence, sharper textures, and fewer visual artifacts compared to traditional single-image methods.
arXiv Detail & Related papers (2025-05-14T20:16:51Z) - RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications. We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE). Experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - BD-Diff: Generative Diffusion Model for Image Deblurring on Unknown Domains with Blur-Decoupled Learning [55.21345354747609]
BD-Diff is a generative-diffusion-based model designed to enhance deblurring performance on unknown domains.
We employ two Q-Formers as structural-representation and blur-pattern extractors, respectively.
We introduce a reconstruction task to make the structural features and blur patterns complementary.
arXiv Detail & Related papers (2025-02-03T17:00:40Z) - Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis [31.43159668073136]
Unsupervised anomaly detection in time series is essential in industrial applications, as it significantly reduces the need for manual intervention.
Traditional methods use Graph Neural Networks (GNNs) or Transformers to analyze spatial dependencies, while RNNs model temporal dependencies.
This paper introduces TopoGDN, a novel temporal model for multivariate time series anomaly detection built on an enhanced Graph Attention Network (GAT).
arXiv Detail & Related papers (2024-08-23T14:06:30Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploiting spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z) - MAE-GEBD: Winning the CVPR'2023 LOVEU-GEBD Challenge [11.823891739821443]
We build a model for segmenting videos into segments by detecting general event boundaries applicable to various classes.
Based on last year's MAE-GEBD method, we have improved our model performance on the GEBD task by adjusting the data processing strategy and loss function.
With our method, we achieve an F1 score of 86.03% on the Kinetics-GEBD test set, which is a 0.09% improvement in the F1 score compared to our 2022 Kinetics-GEBD method.
arXiv Detail & Related papers (2023-06-27T02:35:19Z) - Graph-based Multi-ODE Neural Networks for Spatio-Temporal Traffic Forecasting [8.832864937330722]
Long-range traffic forecasting remains a challenging task due to the intricate and extensive spatio-temporal correlations observed in traffic networks.
In this paper, we propose an architecture called Graph-based Multi-ODE Neural Networks (GRAM-ODE), which is designed with multiple connective ODE-GNN modules to learn better representations.
Our extensive set of experiments conducted on six real-world datasets demonstrate the superior performance of GRAM-ODE compared with state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-30T02:10:42Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images to the latent space of a high quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.