Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
- URL: http://arxiv.org/abs/2507.04946v3
- Date: Tue, 30 Sep 2025 11:14:20 GMT
- Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
- Authors: Jianjiang Yang, Ziyan Huang, Yanshu Li, Da Peng, Huaiyuan Yao
- Abstract summary: Text-to-image (T2I) diffusion models exhibit persistent "hallucinations". We propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in T2I systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
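The abstract characterizes ARC only qualitatively (magnitude, direction, imbalance over the three tension axes). A minimal sketch of one plausible reading, assuming simple formulas of my own choosing — Euclidean norm for magnitude, argmax for the dominant axis, max-min spread for imbalance — which are illustrative and not the paper's actual definitions:

```python
import math

def alignment_risk_code(semantic, structural, knowledge):
    """Toy ARC-style summary of per-axis tension scores in [0, 1].

    The three inputs correspond to the Hallucination Tri-Space axes
    (semantic coherence, structural alignment, knowledge grounding);
    all formulas below are illustrative assumptions, not the paper's.
    """
    arc = (semantic, structural, knowledge)
    # Magnitude: overall misalignment, taken here as the Euclidean norm.
    magnitude = math.sqrt(sum(t * t for t in arc))
    # Direction: the axis currently contributing the most tension.
    axes = ("semantic", "structural", "knowledge")
    dominant = axes[max(range(3), key=lambda i: arc[i])]
    # Imbalance: tension asymmetry, taken here as the max-min spread.
    imbalance = max(arc) - min(arc)
    return magnitude, dominant, imbalance
```

A TM-ARC-style controller would presumably threshold these signals each sampling step and apply a correction along the dominant axis; the abstract does not specify the intervention itself.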
Related papers
- Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution [3.0242762196828448]
Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory. We show that models do not "unlearn" or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement.
arXiv Detail & Related papers (2026-02-04T06:13:11Z) - OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in Sar-to-Optical Translation [12.055938312320402]
A novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions. Experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.
arXiv Detail & Related papers (2026-01-11T09:57:04Z) - Agentic Retoucher for Text-To-Image Generation [48.80766311858762]
Agentic Retoucher is a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. Experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment.
arXiv Detail & Related papers (2026-01-05T12:06:43Z) - Rectifying Latent Space for Generative Single-Image Reflection Removal [16.341477336909765]
Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs.
arXiv Detail & Related papers (2025-12-06T09:16:14Z) - AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation [56.399153019429605]
This work shows that ignoring source dynamics yields inconsistent trajectories that suppress or merge semantic cues. We reformulate text-to-3D optimization as mapping a dynamically evolving source distribution to a fixed target distribution. We introduce AnchorDS, an improved score distillation mechanism that provides state-anchored guidance with image conditions.
arXiv Detail & Related papers (2025-11-12T09:51:23Z) - SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning [59.013863248600046]
We propose a spatially re-focused super-resolution framework that refines text conditioning at inference time. We also introduce a Spatially Targeted-Free Guidance mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations.
arXiv Detail & Related papers (2025-10-26T05:03:55Z) - Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z) - Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion [15.384896404310645]
We propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. Our method produces high-quality, semantically coherent, and structurally consistent image generations.
arXiv Detail & Related papers (2025-08-13T07:46:00Z) - Veila: Panoramic LiDAR Generation from a Monocular RGB Image [18.511014983119274]
Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative. We propose Veila, a novel conditional diffusion framework that integrates semantic and depth cues according to their local reliability.
arXiv Detail & Related papers (2025-08-05T17:59:53Z) - Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition [4.196626042312499]
We propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density. With the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird's-eye view and 3D segment perspectives.
arXiv Detail & Related papers (2025-06-17T07:04:07Z) - SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion [0.0]
We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three 2D feature planes that evolve over time, enabling an efficient, compact representation.
arXiv Detail & Related papers (2025-05-22T11:25:38Z) - ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis [45.625062335269355]
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M. We present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, to enhance spatial consistency in generative models.
arXiv Detail & Related papers (2025-04-18T15:21:37Z) - Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning. We introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z) - StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances.
First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss.
Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z) - Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z) - On Robust Cross-View Consistency in Self-Supervised Monocular Depth Estimation [56.97699793236174]
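The snippet above names VISOR but not its formula. A toy sketch assuming it behaves like a conditional accuracy — the fraction of images in which the prompted spatial relation holds, given that both named objects were actually generated. The field names (`obj_a_found`, `obj_b_found`, `relation_correct`) are hypothetical stand-ins, not the benchmark's actual schema:

```python
def visor_like_score(results):
    """Conditional spatial-relation accuracy over a list of per-image
    detection records (hypothetical simplification of VISOR)."""
    # Only images where both prompted objects appear are eligible.
    eligible = [r for r in results if r["obj_a_found"] and r["obj_b_found"]]
    if not eligible:
        return 0.0
    # Among eligible images, count those with the correct spatial relation.
    correct = sum(1 for r in eligible if r["relation_correct"])
    return correct / len(eligible)
```

Conditioning on object presence separates "the model failed to draw both objects" from "the model drew them in the wrong arrangement", which is the distinction the paper's finding turns on.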
We study two kinds of robust cross-view consistency in this paper.
We exploit the temporal coherence in both depth feature space and 3D voxel space for self-supervised monocular depth estimation.
Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques.
arXiv Detail & Related papers (2022-09-19T03:46:13Z) - Averaging Spatio-temporal Signals using Optimal Transport and Soft Alignments [110.79706180350507]
We show that our proposed loss can be used to define spatio-temporal barycenters as Fréchet means.
Experiments on handwritten letters and brain imaging data confirm our theoretical findings.
arXiv Detail & Related papers (2022-03-11T09:46:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.