ATATA: One Algorithm to Align Them All
- URL: http://arxiv.org/abs/2601.11194v1
- Date: Fri, 16 Jan 2026 11:11:33 GMT
- Title: ATATA: One Algorithm to Align Them All
- Authors: Boyi Pang, Savva Ignatyev, Vladimir Ippolitov, Ramil Khafizov, Yurii Melnik, Oleg Voynov, Maksim Nakhodnov, Aibek Alanov, Xiaopeng Fan, Peter Wonka, Evgeny Burnaev
- Abstract summary: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches.
- Score: 74.76451498236437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often produces cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it achieves comparable quality while running orders of magnitude faster.
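The abstract gives no pseudocode, so the sketch below is only a hypothetical illustration of how joint Euler-step Rectified Flow sampling for a pair of latents might look. The toy `velocity` field and the coupling term `v_shared` (a simple average of the two velocities, blended by a `coupling` weight) are our own stand-ins for a pretrained velocity model and for the paper's segment-transport scheme, not the authors' actual update rule.

```python
import numpy as np

def velocity(x, t):
    # Stand-in for a pretrained Rectified Flow velocity field v(x_t, t);
    # a toy linear field is used here so the sketch runs end to end.
    return -x * (1.0 - t)

def joint_euler_sample(x_a, x_b, n_steps=50, coupling=0.5):
    """Hypothetical joint Euler integration of a pair of latents.

    The paper describes transporting a segment in sample space; here the
    two velocities are simply averaged into a shared direction as a
    stand-in coupling term. coupling=0.0 recovers fully independent
    sampling; coupling=1.0 moves both latents along the same direction.
    """
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v_a, v_b = velocity(x_a, t), velocity(x_b, t)
        v_shared = 0.5 * (v_a + v_b)  # shared direction along the segment
        x_a = x_a + dt * ((1.0 - coupling) * v_a + coupling * v_shared)
        x_b = x_b + dt * ((1.0 - coupling) * v_b + coupling * v_shared)
    return x_a, x_b

rng = np.random.default_rng(0)
a, b = rng.standard_normal(4), rng.standard_normal(4)
a_out, b_out = joint_euler_sample(a, b)
```

With coupling enabled, both latents share part of each Euler step, so the segment between them is transported more rigidly than under independent sampling; this is purely illustrative of the idea, not the paper's algorithm.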
Related papers
- TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting [2.441486089588484]
3D Gaussian Splatting is crucial for real-time novel view synthesis due to its efficiency and image rendering quality. This paper introduces a novel method that improves reconstruction by enforcing global geometric consistency through constrained multi-view triangulation. We demonstrate the effectiveness of our method across multiple photorealistic datasets, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-12-06T03:45:39Z) - Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image [68.55613894952177]
We introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. We propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about $3$ minutes in a coarse-to-fine manner.
arXiv Detail & Related papers (2025-11-03T17:24:18Z) - Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation [62.87088388345378]
We introduce a diffusion-based framework that performs aligned novel-view image and geometry generation via a warping-and-inpainting methodology. The method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images. Cross-modal attention distillation is proposed to ensure accurate alignment between generated images and geometry.
arXiv Detail & Related papers (2025-06-13T16:19:00Z) - CLR-Wire: Towards Continuous Latent Representations for 3D Curve Wireframe Generation [11.447223770747051]
CLR-Wire encodes curves as parametric curves into a continuous, fixed-length latent space. This unified approach generates both geometry and topology.
arXiv Detail & Related papers (2025-04-27T09:32:42Z) - MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification [13.872254142378772]
This paper introduces a unified framework for text-to-3D content generation.
Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model.
We also introduce a novel densification algorithm that aligns Gaussians close to the surface.
arXiv Detail & Related papers (2024-09-10T16:16:34Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies. AdaDiff is optimized using a policy method to maximize a carefully designed reward function. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z) - BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models [15.953480573461519]
We propose a novel 3D building shape generation method exploiting point cloud diffusion models with image conditioning schemes.
We validate our framework on two newly built datasets and extensive experiments show that our method outperforms previous works in terms of building generation quality.
arXiv Detail & Related papers (2023-08-31T22:17:48Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision community.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.