Dynamic Double Space Tower
- URL: http://arxiv.org/abs/2506.11394v1
- Date: Fri, 13 Jun 2025 01:27:45 GMT
- Title: Dynamic Double Space Tower
- Authors: Weikai Sun, Shijie Song, Han Wang
- Abstract summary: We study a brand-new approach to replace the attention mechanism in order to enhance the reasoning ability of the model. Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers to observe the image according to the principle of human gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels.
- Score: 4.553359878415195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Visual Question Answering (VQA) task requires the simultaneous understanding of image content and question semantics. However, existing methods often struggle with complex reasoning scenarios due to insufficient cross-modal interaction and difficulty capturing the spatial relationships between entities in the image \cite{huang2023adaptive}\cite{liu2021comparing}\cite{guibas2021adaptive}\cite{zhang2022vsa}. We study a brand-new approach that replaces the attention mechanism in order to enhance the model's reasoning ability and its understanding of spatial relationships. Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers that observe the image according to the principles of human gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels but to make judgments based on more meaningful perceptual units, shifting it from "seeing images" to "perceiving and organizing image content". Extensive experiments show that our module can be plugged into other multimodal models and achieves advanced results, demonstrating its potential for spatial-relationship processing. Meanwhile, July, the multimodal visual question-answering model trained with our method, achieves state-of-the-art results with only 3B parameters, especially on spatial-relation question-answering datasets.
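The abstract describes the tower only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: four bottom-up grouping stages that merge pixels into coarser perceptual units, plus a top-down pass that feeds the coarse structural prior back into the finer maps. Every class name, the strided-convolution grouping, and the refinement rule are assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: layer roles and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTowerLayer(nn.Module):
    """One tower level: merges pixels into coarser perceptual units,
    a stand-in for one gestalt grouping stage. No attention is used."""
    def __init__(self, dim: int):
        super().__init__()
        self.group = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.mix = nn.Sequential(nn.GroupNorm(8, dim), nn.GELU(),
                                 nn.Conv2d(dim, dim, kernel_size=1))

    def forward(self, x):
        x = self.group(x)        # halve resolution: pixels -> units
        return x + self.mix(x)   # local mixing within the coarser grid

class DynamicBidirectionalSpatialTower(nn.Module):
    """Four bottom-up grouping levels plus a top-down pass feeding the
    coarse structural prior back into the finer feature maps."""
    def __init__(self, dim: int = 256, levels: int = 4):
        super().__init__()
        self.up = nn.ModuleList(SpatialTowerLayer(dim) for _ in range(levels))
        self.down = nn.ModuleList(nn.Conv2d(dim, dim, 1) for _ in range(levels - 1))

    def forward(self, x):
        feats = []
        for layer in self.up:                    # bottom-up observation
            x = layer(x)
            feats.append(x)
        top = feats[-1]
        for i in range(len(feats) - 2, -1, -1):  # top-down refinement
            top = F.interpolate(top, size=feats[i].shape[-2:], mode="nearest")
            top = feats[i] + self.down[i](top)
        return top

tower = DynamicBidirectionalSpatialTower()
out = tower(torch.randn(1, 256, 64, 64))  # -> torch.Size([1, 256, 32, 32])
```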
Related papers
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
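As a rough illustration of what such a structured prompt might look like (the stage wording below is our assumption; only the decompose-then-answer idea comes from the abstract):

```python
# Illustrative SpatialMind-style structured prompt; stage text is hypothetical.
def build_spatial_prompt(question: str) -> str:
    stages = [
        "1. List the objects in the scene that the question refers to.",
        "2. Estimate each object's position and extent.",
        "3. Derive the pairwise spatial relations needed by the question.",
        "4. Answer the question using only the relations from step 3.",
    ]
    return ("Decompose the problem before answering.\n"
            + "\n".join(stages)
            + f"\nQuestion: {question}\nWork through the steps in order.")

print(build_spatial_prompt("Is the mug left of the laptop?"))
```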
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- SITE: towards Spatial Intelligence Thorough Evaluation [121.1493852562597]
Spatial intelligence (SI) represents a cognitive ability encompassing visualization, manipulation, and reasoning about spatial relationships. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation. Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets with a top-down strategy drawing upon three classification systems in cognitive science.
arXiv Detail & Related papers (2025-05-08T17:45:44Z)
- DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding [25.32283897448209]
DynamicVis is a dynamic visual perception foundation model for remote sensing imagery. It integrates a novel dynamic region perception backbone based on the selective state space model. It achieves multi-level feature modeling with exceptional efficiency, processing 2048x2048-pixel inputs with 97 ms latency (6% of ViT's) and 833 MB of GPU memory (3% of ViT's).
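For intuition, here is a minimal sketch of the selective state-space idea, in which input-dependent gates decide how much past state each token keeps; DynamicVis's actual backbone is considerably more involved, and all names and shapes here are assumptions:

```python
# Toy selective scan: input-dependent gates over a linear-time recurrence.
import torch
import torch.nn as nn

class SelectiveScan(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.decay_gate = nn.Linear(dim, dim)   # input-dependent "A"
        self.input_gate = nn.Linear(dim, dim)   # input-dependent "B"

    def forward(self, x):                       # x: (batch, seq, dim)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):              # linear in sequence length
            a = torch.sigmoid(self.decay_gate(x[:, t]))
            h = a * h + (1 - a) * self.input_gate(x[:, t])  # selective update
            outs.append(h)
        return torch.stack(outs, dim=1)

scan = SelectiveScan(32)
y = scan(torch.randn(2, 16, 32))   # -> (2, 16, 32)
```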
arXiv Detail & Related papers (2025-03-20T17:59:54Z)
- Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video [3.2195139886901813]
We present a novel framework that learns multi-level semantic dynamics for multi-frame human pose estimation. Specifically, we first design a multi-masked context and pose reconstruction strategy. This strategy stimulates the model to explore multi-temporal semantic relationships among frames by progressively masking the features of optical (patch) cubes and frames.
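A loose sketch of the progressive-masking idea, under stated assumptions (the schedule, shapes, and per-token masking below are ours, not the paper's exact strategy):

```python
# Progressive feature masking: the masked share grows over training so the
# model must lean on cross-frame semantics. Schedule/shapes are assumptions.
import torch

def mask_features(feats: torch.Tensor, step: int, total_steps: int,
                  max_ratio: float = 0.75) -> torch.Tensor:
    """feats: (frames, tokens, dim). Zero out a growing share of tokens."""
    ratio = max_ratio * min(step / total_steps, 1.0)
    keep = (torch.rand(feats.shape[:2]) >= ratio).float()  # per-frame mask
    return feats * keep.unsqueeze(-1)

x = torch.randn(4, 196, 256)        # 4 frames of 14x14 patch features
masked = mask_features(x, step=500, total_steps=1000)   # ~37.5% masked
```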
arXiv Detail & Related papers (2025-02-15T00:35:34Z)
- Interpretable deformable image registration: A geometric deep learning perspective [9.13809412085203]
We present a theoretical foundation for designing an interpretable registration framework. We formulate an end-to-end process that refines transformations in a coarse-to-fine fashion. We conclude by showing significant improvement in performance metrics over state-of-the-art approaches.
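Coarse-to-fine refinement is a generic pattern; a schematic sketch follows, with placeholder predictor callables standing in for the paper's geometric modules:

```python
# Generic coarse-to-fine displacement refinement; predictors are stubs.
import torch
import torch.nn.functional as F

def coarse_to_fine(predictors, moving, fixed):
    """predictors: callables, coarsest level first; images: (B, C, H, W)."""
    flow = None
    for level, predict in enumerate(predictors):
        scale = 2 ** (len(predictors) - 1 - level)
        m = F.avg_pool2d(moving, scale) if scale > 1 else moving
        f = F.avg_pool2d(fixed, scale) if scale > 1 else fixed
        if flow is not None:                 # carry the coarse estimate down
            flow = 2 * F.interpolate(flow, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        delta = predict(m, f)                # residual correction at this level
        flow = delta if flow is None else flow + delta
    return flow

zero = lambda m, f: torch.zeros(m.size(0), 2, *m.shape[-2:])  # stub predictor
print(coarse_to_fine([zero] * 3, torch.rand(1, 1, 64, 64),
                     torch.rand(1, 1, 64, 64)).shape)   # (1, 2, 64, 64)
```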
arXiv Detail & Related papers (2024-12-17T19:47:10Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
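A very loose sketch of the inversion step, with a frozen linear decoder standing in for the text-to-image diffusion model (everything concrete here is an assumption):

```python
# Inversion-style embedding learning against a stand-in frozen decoder.
import torch

torch.manual_seed(0)
decoder = torch.nn.Linear(64, 64)                   # stand-in for the generator
for p in decoder.parameters():
    p.requires_grad_(False)

target = torch.randn(64)                            # stand-in interaction features
relation_emb = torch.zeros(64, requires_grad=True)  # learnable relation "token"
opt = torch.optim.Adam([relation_emb], lr=0.05)

for _ in range(200):                                # only the embedding is fit
    loss = ((decoder(relation_emb) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# relation_emb can now be spliced into textual prompts to steer generation.
```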
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
- Learning In-between Imagery Dynamics via Physical Latent Spaces [0.7366405857677226]
We present a framework designed to learn the underlying dynamics between two images observed at consecutive time steps.
By incorporating a latent variable that follows a physical model expressed in partial differential equations (PDEs), our approach ensures the interpretability of the learned model.
We demonstrate the robustness and effectiveness of our learning framework through a series of numerical tests using geoscientific imagery data.
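As a toy example of a PDE-constrained latent (the specific equation is our assumption; the abstract says only that the latent follows a PDE model), one explicit diffusion step:

```python
# Forward-Euler heat-equation step on a latent field: z_{t+1} = z_t + dt*lap(z_t).
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[[[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]]]])

def pde_step(z: torch.Tensor, dt: float = 0.1) -> torch.Tensor:
    """z: (B, 1, H, W) latent field; one explicit diffusion step."""
    lap = F.conv2d(F.pad(z, (1, 1, 1, 1), mode="replicate"), LAPLACIAN)
    return z + dt * lap

z = torch.randn(1, 1, 32, 32)
z_next = pde_step(z)    # latent state at the next image time step
```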
arXiv Detail & Related papers (2023-10-14T05:14:51Z)
- Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition [5.083140094792973]
SpaCoNet simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:04:22Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
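A minimal sketch of the kind of check such a metric performs; the centroid comparison below is our simplification, not VISOR's exact definition:

```python
# Given detected boxes, verify the spatial relation stated in the text.
def relation_holds(box_a, box_b, relation: str) -> bool:
    """Boxes are (x_min, y_min, x_max, y_max) in image coordinates."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    checks = {
        "left of": ax < bx, "right of": ax > bx,
        "above": ay < by, "below": ay > by,   # y grows downward in images
    }
    return checks[relation]

print(relation_holds((10, 20, 50, 60), (80, 20, 120, 60), "left of"))  # True
```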
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
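Schematically, the cascade reads as below (the callables are stubs; the real networks are learned detectors and recognizers):

```python
# Schematic cascade: each stage refines proposals before recognition.
def cascade_hoi(image, localizers, recognizer, proposals):
    for localize in localizers:                  # multi-stage refinement
        proposals = localize(image, proposals)
    return [recognizer(image, p) for p in proposals]

identity = lambda img, props: props                       # stub localizer
label = lambda img, prop: ("person", "ride", "bicycle")   # stub recognizer
pairs = [((0, 0, 50, 100), (40, 60, 120, 140))]  # one (human, object) box pair
print(cascade_hoi(None, [identity, identity], label, pairs))
```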
arXiv Detail & Related papers (2020-03-09T17:05:04Z)