High Fidelity Interactive Video Segmentation Using Tensor Decomposition, Boundary Loss, Convolutional Tessellations and Context-Aware Skip Connections
- URL: http://arxiv.org/abs/2011.11602v1
- Date: Mon, 23 Nov 2020 18:21:42 GMT
- Title: High Fidelity Interactive Video Segmentation Using Tensor Decomposition, Boundary Loss, Convolutional Tessellations and Context-Aware Skip Connections
- Authors: Anthony D. Rhodes, Manan Goel
- Abstract summary: We provide a high fidelity deep learning algorithm (HyperSeg) for interactive video segmentation tasks.
Our model crucially processes and renders all image features in high resolution, without utilizing downsampling or pooling procedures.
Our work can be used across a broad range of application domains, including VFX pipelines and medical imaging disciplines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide a high fidelity deep learning algorithm (HyperSeg) for interactive
video segmentation tasks using a convolutional network with context-aware skip
connections, and compressed, hypercolumn image features combined with a
convolutional tessellation procedure. In order to maintain high output
fidelity, our model crucially processes and renders all image features in high
resolution, without utilizing downsampling or pooling procedures. We maintain
this consistent, high-grade fidelity efficiently in our model chiefly through
two means: (1) We use a statistically-principled tensor decomposition procedure
to modulate the number of hypercolumn features and (2) We render these features
in their native resolution using a convolutional tessellation technique. For
improved pixel level segmentation results, we introduce a boundary loss
function; for improved temporal coherence in video data, we include temporal
image information in our model. Through experiments, we demonstrate the
improved accuracy of our model against baseline models for interactive
segmentation tasks using high resolution video data. We also introduce a
benchmark video segmentation dataset, the VFX Segmentation Dataset, which
contains over 27,046 high resolution video frames, including greenscreen and
various composited scenes with corresponding, hand crafted, pixel level
segmentations. Our work extends and improves upon state-of-the-art
segmentation fidelity for high resolution data and can be used across a broad
range of application domains, including VFX pipelines and medical imaging
disciplines.
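The abstract names a statistically-principled tensor decomposition for modulating the number of hypercolumn features, but does not specify it here. As a hedged illustration only, the sketch below compresses a per-pixel hypercolumn stack with truncated PCA via torch.pca_lowrank; the function name and the choice of PCA are assumptions, not the authors' published procedure.

```python
import torch

def compress_hypercolumns(features: torch.Tensor, k: int) -> torch.Tensor:
    """Reduce a (C, H, W) hypercolumn stack to (k, H, W) channels.

    Each pixel's C-dimensional hypercolumn is treated as one sample and
    projected onto the top-k principal directions. An illustrative
    stand-in for the paper's tensor decomposition step.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w).T           # (H*W, C) pixel samples
    # Truncated PCA; pca_lowrank centers the data internally.
    _, _, v = torch.pca_lowrank(x, q=k)        # v: (C, k) principal axes
    reduced = (x - x.mean(dim=0)) @ v          # (H*W, k) projections
    return reduced.T.reshape(k, h, w)
```

A reduction of this kind keeps every pixel's features at native resolution while shrinking the channel count, which is what makes a downsampling-free pipeline tractable in memory.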
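The "convolutional tessellation" is likewise only named in the abstract. A common way to run full-resolution convolutions under a memory budget is to tile the frame into overlapping patches whose halos cover the kernel's receptive field; the sketch below is one assumed reading of the term, not the paper's definition.

```python
import torch
import torch.nn as nn

def tiled_conv(x: torch.Tensor, conv: nn.Conv2d, tile: int = 512) -> torch.Tensor:
    """Apply `conv` to an (N, C, H, W) tensor tile by tile, full resolution.

    Assumes an odd square kernel with zero padding of kernel_size // 2
    ('same' convolution). Overlapping halos make tile seams exact, so the
    output matches running `conv` on the whole frame at once.
    """
    pad = conv.kernel_size[0] // 2             # halo width
    n, _, h, w = x.shape
    out = torch.empty(n, conv.out_channels, h, w, dtype=x.dtype, device=x.device)
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            ys, xs = max(y0 - pad, 0), max(x0 - pad, 0)   # crop with halo,
            ye, xe = min(y1 + pad, h), min(x1 + pad, w)   # clamped to frame
            patch = conv(x[:, :, ys:ye, xs:xe])
            out[:, :, y0:y1, x0:x1] = patch[:, :, y0 - ys:y1 - ys, x0 - xs:x1 - xs]
    return out
```

Because the seams reproduce the whole-image result exactly, arbitrarily large frames (e.g. 4K VFX plates) can be processed without pooling or downsampling.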
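The boundary loss is also described only by name. One plausible variant, given purely as an assumed sketch, up-weights the per-pixel binary cross-entropy in a thin band around the ground-truth mask edge, with the band found by a morphological gradient.

```python
import torch
import torch.nn.functional as F

def boundary_weighted_bce(logits: torch.Tensor, target: torch.Tensor,
                          kernel: int = 5, boundary_weight: float = 5.0) -> torch.Tensor:
    """BCE that up-weights pixels near ground-truth mask boundaries.

    `target` is an (N, 1, H, W) float mask in {0, 1}. The edge band is a
    morphological gradient (dilation minus erosion of the mask). This is
    a hypothetical stand-in for the paper's unspecified boundary loss.
    """
    pad = kernel // 2
    dilated = F.max_pool2d(target, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-target, kernel, stride=1, padding=pad)
    band = (dilated - eroded).clamp(0, 1)      # 1 inside the edge band
    weights = 1.0 + boundary_weight * band
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (weights * per_pixel).mean()
```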
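For temporal coherence the abstract says only that temporal image information is included in the model. A minimal conditioning scheme, assumed here for illustration, concatenates the previous frame and its predicted mask onto the current frame as extra input channels.

```python
import torch

def temporal_input(frame: torch.Tensor, prev_frame: torch.Tensor,
                   prev_mask: torch.Tensor) -> torch.Tensor:
    """Stack two (N, 3, H, W) RGB frames and an (N, 1, H, W) previous-frame
    mask into an (N, 7, H, W) network input. A hypothetical scheme; the
    paper states only that temporal information is used, not how.
    """
    return torch.cat([frame, prev_frame, prev_mask], dim=1)
```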
Related papers
- Transforming Static Images Using Generative Models for Video Salient Object Detection [15.701293552584863]
We show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components.
This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements.
Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
arXiv Detail & Related papers (2024-11-21T09:41:33Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
Video captioning models aim to translate the content of videos into accurate natural language.
Existing methods often fail to generate sufficiently rich feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, the EndoVis18 Challenge and the CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)
- Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images.
Our core idea is to extract compact basis sets from all the reference images as higher-level representations.
We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- Enhanced Quadratic Video Interpolation [56.54662568085176]
We propose an enhanced quadratic video interpolation (EQVI) model to handle more complicated scenes and motion patterns.
To further boost the performance, we devise a novel multi-scale fusion network (MS-Fusion) which can be regarded as a learnable augmentation process.
The proposed EQVI model won the first place in the AIM 2020 Video Temporal Super-Resolution Challenge.
arXiv Detail & Related papers (2020-09-10T02:31:50Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)