MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing
- URL: http://arxiv.org/abs/2508.14423v3
- Date: Sun, 24 Aug 2025 05:05:33 GMT
- Title: MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing
- Authors: Jeahun Sung, Changhyun Roh, Chanho Eom, Jihyong Oh
- Abstract summary: Frequency aliasing between the camera's color filter array (CFA) and the display's sub-pixels induces moiré patterns that severely degrade captured photos and videos. MoCHA-former comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains.
- Score: 9.869634509510014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera's color filter array (CFA) and the display's sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics, and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.
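The abstract reports results in PSNR, SSIM, and LPIPS without defining them. As a rough illustration of the simplest of the three, the sketch below computes PSNR from the standard definition, 10 * log10(MAX^2 / MSE). The function name `psnr`, the flat-list input format, and the default peak value of 1.0 are assumptions made for this example; they are not taken from the paper or its evaluation code.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Hypothetical helper: inputs are flat sequences of pixel intensities,
    and max_value is the assumed peak intensity (1.0 for normalized frames).
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [0.0, 0.0, 0.0, 0.0]
    noisy = [0.1, 0.1, 0.1, 0.1]
    print(psnr(clean, noisy))
```

Higher PSNR means the restored frame is closer to the reference. SSIM and LPIPS measure structural and learned perceptual similarity respectively and are not covered by this sketch.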
Related papers
- DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation [77.89090846233906]
We propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM). DCDM decomposes video consistency modeling into three dedicated components while sharing a unified video generation backbone. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
arXiv Detail & Related papers (2026-02-14T07:02:36Z)
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
- NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation [88.09231548061295]
Phase-Preserving Diffusion (φ-PD) is a model-agnostic reformulation of the diffusion process. φ-PD preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos.
arXiv Detail & Related papers (2025-12-04T18:59:18Z)
- CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
- Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z)
- GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images [35.42266460525047]
Digital Subtraction Angiography (DSA) images contain complex vascular structures and various motions. Applying natural scene Video Frame Interpolation (VFI) methods results in motion artifacts, structural dissipation, and blurriness. MoSt-DSA has specifically addressed these issues for the first time and achieved SOTA results. We propose GaraMoSt to address these issues within the same computational time scale.
arXiv Detail & Related papers (2024-12-18T18:04:12Z)
- Alignment-free Raw Video Demoireing [18.06907326360215]
Video demoireing aims to remove undesirable interference patterns that arise during the capture of screen content. This paper introduces a novel alignment-free raw video demoireing network with frequency-assisted temporal Mamba (DemMamba). It surpasses state-of-the-art methods by 1.6 dB in PSNR, and also delivers a satisfactory visual experience.
arXiv Detail & Related papers (2024-08-20T09:31:03Z)
- IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging [26.127764855477782]
Multi-frame high dynamic range (HDR) imaging aims to reconstruct ghost-free images with photo-realistic details from content-complementary but spatially misaligned low dynamic range (LDR) images.
Existing HDR algorithms are prone to producing ghosting artifacts as their methods fail to capture long-range dependencies between LDR frames with large motion in dynamic scenes.
We propose a novel image fusion transformer, referred to as IFT, which presents a fast global patch searching (FGPS) module followed by a self-cross fusion module (SCF) for ghost-free HDR imaging.
arXiv Detail & Related papers (2023-09-26T15:38:52Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Burstormer: Burst Image Restoration and Enhancement Transformer [117.56199661345993]
On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image.
The challenge is to properly align the successive image shots and merge their complementary information to achieve high-quality outputs.
We propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement.
arXiv Detail & Related papers (2023-04-03T17:58:44Z)
- Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling [105.69197687940505]
We propose to explore the role of explicit temporal difference modeling in both LR and HR space.
To further enhance the super-resolution result, not only spatial residual features are extracted, but the difference between consecutive frames in high-frequency domain is also computed.
arXiv Detail & Related papers (2022-04-14T17:07:33Z)
- Implicit Motion-Compensated Network for Unsupervised Video Object Segmentation [25.41427065435164]
Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence.
Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic background and inaccurate flow (flow-based).
We propose an implicit motion-compensated network (IMCNet) combining complementary cues (i.e., appearance and motion) with aligned motion information from the adjacent frames to the current frame at the feature level.
arXiv Detail & Related papers (2022-04-06T13:03:59Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2021-04-15T17:59:23Z)
- Zooming SlowMo: An Efficient One-Stage Framework for Space-Time Video Super-Resolution [100.11355888909102]
Space-time video super-resolution aims at generating a high-resolution (HR) slow-motion video from a low-resolution (LR) and low frame rate (LFR) video sequence.
We present a one-stage space-time video super-resolution framework, which can directly reconstruct an HR slow-motion video sequence from an input LR and LFR video.
arXiv Detail & Related papers (2021-04-15T17:59:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.