From Editor to Dense Geometry Estimator
- URL: http://arxiv.org/abs/2509.04338v1
- Date: Thu, 04 Sep 2025 15:58:50 GMT
- Title: From Editor to Dense Geometry Estimator
- Authors: JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
- Abstract summary: We introduce FE2E, a framework that adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data.
- Score: 77.21804448599009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, the first framework to adapt an advanced editing model built on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective, and we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high-precision demands of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to reinforce each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page is available at https://amap-ml.github.io/FE2E/.
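One design choice from the abstract, the logarithmic quantization of geometry targets, lends itself to a short illustration. The Python sketch below is a minimal, hypothetical encoding of depth values on a logarithmic scale so that relative precision is preserved regardless of magnitude; the depth range, level count, and function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Illustrative sketch only: the paper's exact quantization scheme is not
# specified here; d_min, d_max, and the number of levels are assumptions.

def log_quantize(depth, d_min=0.1, d_max=100.0, levels=2**16):
    """Encode metric depth as integer codes spaced uniformly in log space.

    Uniform spacing in log(depth) preserves relative precision, so near and
    far surfaces are resolved equally well, sidestepping the coarse absolute
    resolution of low-precision formats such as BFloat16 at large magnitudes.
    """
    depth = np.clip(depth, d_min, d_max)
    t = (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(t * (levels - 1)).astype(np.uint16)

def log_dequantize(codes, d_min=0.1, d_max=100.0, levels=2**16):
    """Invert log_quantize back to metric depth."""
    t = codes.astype(np.float64) / (levels - 1)
    return np.exp(t * (np.log(d_max) - np.log(d_min)) + np.log(d_min))

# Round-trip check: relative error stays small across the whole depth range.
d = np.array([0.5, 2.0, 20.0, 95.0])
print(np.abs(log_dequantize(log_quantize(d)) - d) / d)
```

In this hedged sketch, the key property is that quantization error scales with the depth value itself rather than being fixed in absolute terms, which is what makes a log-domain encoding attractive when the backbone operates in BFloat16.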
Related papers
- Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders [74.72147962028265]
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet. We investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation.
arXiv Detail & Related papers (2026-01-22T18:58:16Z) - Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model [32.831576387973875]
We propose a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor.
arXiv Detail & Related papers (2025-11-30T18:57:25Z) - Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Its single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z) - DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models [103.18486625853099]
DEFT, Decompositional Efficient Fine-Tuning, adapts a pre-trained weight matrix by decomposing its update into two components. We conduct experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework.
arXiv Detail & Related papers (2025-09-26T18:01:15Z) - RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z) - Depth Anything with Any Prior [64.39991799606146]
Prior Depth Anything is a framework that combines incomplete but precise metric information from depth measurement with relative but complete geometric structures from depth prediction. We develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets.
arXiv Detail & Related papers (2025-05-15T17:59:50Z) - SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection [3.2586315449885106]
We propose a novel encoder-decoder-style neural network called SODAWideNet++ designed explicitly for Salient Object Detection.
Inspired by vision transformers' ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module.
In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end.
arXiv Detail & Related papers (2024-08-29T15:51:06Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - Recursive Contour Saliency Blending Network for Accurate Salient Object Detection [0.0]
In this work, we designed a network for better edge quality in salient object detection.
We proposed a contour-saliency blending module to exchange information between contour and saliency.
Our model is lightweight and fast, with only 27.9 million parameters and real-time inferencing at 31 FPS.
arXiv Detail & Related papers (2021-05-28T14:19:54Z) - Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimensions, and orientation, and then combine these estimations with perspective geometry constraints to compute the position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.