HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
- URL: http://arxiv.org/abs/2601.04614v1
- Date: Thu, 08 Jan 2026 05:41:06 GMT
- Title: HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
- Authors: Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu, Xinbo Gao
- Abstract summary: We propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Third, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters.
- Score: 84.65251073657883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
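The pipeline described in the abstract (map Euclidean features to hyperbolic space, then use hyperbolic geometry to calibrate cosine similarity) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' method: the abstract does not say which hyperbolic model is used, so the sketch picks the Poincaré ball with curvature -c; random vectors stand in for CLIP features; and `modulated_score` is a hypothetical affine calibration whose parameters `gamma` and `beta` would, in the paper, come from the learned regressor.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball (curvature -c).
    Assumption: the abstract does not name the hyperbolic model."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x ** 2, axis=-1)) * (1 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1 + 2 * c * sq_diff / np.maximum(denom, eps)) / np.sqrt(c)

def modulated_score(img_feat, txt_feat, gamma, beta):
    """Hypothetical affine calibration of Euclidean cosine similarity.
    gamma and beta stand in for sample-level modulation parameters."""
    cos = np.dot(img_feat, txt_feat) / (
        np.linalg.norm(img_feat) * np.linalg.norm(txt_feat)
    )
    return gamma * cos + beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder vectors standing in for CLIP image and text features.
    img, txt = rng.normal(size=8) * 0.1, rng.normal(size=8) * 0.1
    h_img, h_txt = expmap0(img), expmap0(txt)
    print("hyperbolic distance:", poincare_distance(h_img, h_txt))
    print("calibrated score:", modulated_score(img, txt, gamma=1.2, beta=0.05))
```

The mapped points always lie strictly inside the ball of radius 1/sqrt(c), so the distance is well defined for any finite input features.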
Related papers
- Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints [12.704390013489054]
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their relation. We optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
arXiv Detail & Related papers (2026-01-20T18:12:55Z)
- ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points [32.23473666846317]
We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines.
arXiv Detail & Related papers (2025-12-08T12:38:11Z)
- Dense Semantic Matching with VGGT Prior [49.42199006453071]
We propose an approach that retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences. Our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
arXiv Detail & Related papers (2025-09-25T14:56:11Z)
- Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation [62.87088388345378]
We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. The method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images. Cross-modal attention instillation is proposed to ensure accurate alignment between generated images and geometry.
arXiv Detail & Related papers (2025-06-13T16:19:00Z)
- Geometry-Editable and Appearance-Preserving Object Compositon [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z)
- ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints [13.2441524021269]
ShapeShift is a text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations. We introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur. Our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt.
arXiv Detail & Related papers (2025-03-18T20:48:58Z)
- DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry [3.859930277034918]
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). We propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation.
arXiv Detail & Related papers (2025-03-17T12:34:14Z)
- HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space [1.1858475445768824]
This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry.
While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations.
We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization.
arXiv Detail & Related papers (2024-09-25T13:07:37Z)
- From Semantics to Hierarchy: A Hybrid Euclidean-Tangent-Hyperbolic Space Model for Temporal Knowledge Graph Reasoning [1.1372536310854844]
Temporal knowledge graph (TKG) reasoning predicts future events based on historical data.
Existing Euclidean models excel at capturing semantics but struggle with hierarchy.
We propose a novel hybrid geometric space approach that leverages the strengths of both Euclidean and hyperbolic models.
arXiv Detail & Related papers (2024-08-30T10:33:08Z)
- Adaptive Surface Normal Constraint for Geometric Estimation from Monocular Images [56.86175251327466]
We introduce a novel approach to learn geometries such as depth and surface normal from images while incorporating geometric context.
Our approach extracts geometric context that encodes the geometric variations present in the input image and correlates depth estimation with geometric constraints.
Our method unifies depth and surface normal estimations within a cohesive framework, which enables the generation of high-quality 3D geometry from images.
arXiv Detail & Related papers (2024-02-08T17:57:59Z)
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.