Fugu-MT Paper Translation (Abstract): VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Paper overview: VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

arxiv url: http://arxiv.org/abs/2603.25181v1
Date: Thu, 26 Mar 2026 08:51:46 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.192693
Title: VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers
Title (reference translation): VolDiT: Controllable Volumetric Medical Image Synthesis Using Diffusion Transformers
Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, Sandy Engelhardt
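Abstract summary: VolDiT is the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. The approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention. Results show improved global coherence, superior generative fidelity, and enhanced controllability.
Reference score (site-computed interest score): 1.2183341965249979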
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
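Abstract (reference translation): Diffusion models have become a leading approach to high-fidelity medical image synthesis. However, most existing 3D medical image generation methods rely on convolutional U-Net backbones within latent diffusion frameworks. Although effective, these architectures impose strong locality biases and limited receptive fields, which can constrain scalability, global context integration, and flexible conditioning. This paper introduces VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. The proposed method extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention that operates directly on 3D tokens. To achieve structured control, the authors propose a timestep-gated control adapter that maps segmentation masks to learnable control tokens, which modulate the transformer layers during denoising. This token-level conditioning mechanism enables precise spatial guidance while retaining the modeling advantages of transformer architectures. The model is evaluated on high-resolution 3D medical image synthesis tasks and compared with state-of-the-art U-Net-based 3D latent diffusion models. The results show improved global coherence, superior generative fidelity, and enhanced controllability. The findings suggest that fully transformer-based diffusion models offer a flexible foundation for volumetric medical image synthesis. The code and the models trained on public data are available at https://github.com/Cardio-AI/voldit.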
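To make the "volumetric patch embeddings" and "3D tokens" of the abstract concrete, the following is a minimal sketch of how a volume can be cut into non-overlapping cubic patches, each flattened into one token. It is not taken from the VolDiT code: the function names `patchify_3d` and `token_count`, the cubic patch shape, and the example sizes are all assumptions chosen for illustration, and the learned projection that would turn each flattened patch into an embedding is omitted.

```python
from itertools import product


def patchify_3d(volume, patch):
    """Split a volume (nested lists indexed [z][y][x]) into flattened cubic patches.

    Hypothetical helper for illustration only; not the VolDiT implementation.
    Returns one flat list of patch**3 voxels per token, in z-major patch order.
    """
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    if d % patch or h % patch or w % patch:
        raise ValueError("every volume dimension must be divisible by the patch size")
    tokens = []
    # Visit patch origins with x varying fastest, then y, then z.
    for z0, y0, x0 in product(range(0, d, patch), range(0, h, patch), range(0, w, patch)):
        tokens.append([
            volume[z][y][x]
            for z in range(z0, z0 + patch)
            for y in range(y0, y0 + patch)
            for x in range(x0, x0 + patch)
        ])
    return tokens


def token_count(shape, patch):
    """Number of 3D tokens produced for a (depth, height, width) volume."""
    d, h, w = shape
    return (d // patch) * (h // patch) * (w // patch)


# Toy 4x4x4 volume whose voxel value encodes its own position (assumed example data).
volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(4)]
tokens = patchify_3d(volume, patch=2)  # 8 tokens, each holding 2*2*2 = 8 voxels
```

Global self-attention compares every token with every other token, so its cost grows with the square of the token count. Under the assumed sizes, `token_count((128, 128, 128), 8)` gives 4096 tokens, while halving the patch edge to 4 gives 32768, which is one reason the choice of patch size matters for 3D data.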


Related papers