Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
- URL: http://arxiv.org/abs/2510.22838v1
- Date: Sun, 26 Oct 2025 21:11:46 GMT
- Title: Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
- Authors: Aya Nakayama, Brian Wong, Yuji Nishimura, Kaito Tanaka
- Abstract summary: Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR) is a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) to enforce cross-style semantic invariance.
- Score: 0.2833003196854753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to effectively decouple style from content, hindering generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR's state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation. Comprehensive evaluations, including ablation studies and generalization analysis, confirm SP-CSVR's efficacy in enhancing robustness, generalization, and efficiency across diverse visual styles.
Related papers
- Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration [57.02757226679549]
We introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. We propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between semantic and style visual tokens. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality. (An illustrative sketch of such attention reweighting follows this entry.)
arXiv Detail & Related papers (2026-01-10T16:01:14Z)
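The DSSI summary above gives the mechanism only at a high level. The snippet below is an illustrative guess, not the paper's method: a single-head cross-attention whose style-token columns are scaled against semantic-token columns by a learned, query-dependent gate and then renormalized. All names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReweightedCrossAttention(nn.Module):
    """Hypothetical attention layer that rebalances semantic vs. style tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # per-query style weight
        self.scale = dim ** -0.5

    def forward(self, queries, semantic_tokens, style_tokens):
        # queries: (B, Nq, D); semantic_tokens: (B, Ns, D); style_tokens: (B, Nt, D)
        n_sem = semantic_tokens.size(1)
        kv = torch.cat([semantic_tokens, style_tokens], dim=1)
        q, k, v = self.q_proj(queries), self.k_proj(kv), self.v_proj(kv)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        g = self.gate(queries)                                       # (B, Nq, 1)
        # Scale semantic vs. style columns by the gate, then renormalize each row.
        attn = torch.cat([attn[..., :n_sem] * (1 - g), attn[..., n_sem:] * g], dim=-1)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        return attn @ v
```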
- VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps [3.6380495892295173]
We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
arXiv Detail & Related papers (2025-09-17T20:40:52Z)
- Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling [42.46176089721314]
Large Vision and Language Models (LVLMs) have shown strong performance across various vision-language tasks in natural image domains. Their application to remote sensing (RS) remains underexplored due to significant domain differences in visual appearances, object scales, and semantics. We propose a novel LVLM framework tailored for RS understanding, incorporating two core components: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling.
arXiv Detail & Related papers (2025-06-27T02:31:37Z)
- Improving vision-language alignment with graph spiking hybrid Networks [10.88584928028832]
This paper proposes a comprehensive visual semantic representation module that uses panoptic segmentation to generate fine-grained semantic features. We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z)
- Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z)
- ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model [73.95608242322949]
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images.
We present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion to address challenges such as misinterpreted styles and inconsistent semantics.
arXiv Detail & Related papers (2024-05-24T07:19:40Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation (an illustrative sketch of this loss combination follows this entry).
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
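The SgVA-CLIP summary above names three training signals without specifying how they are combined. The function below is a hedged sketch under that reading: a weighted sum of a vision-specific InfoNCE term, a cross-modal InfoNCE term, and a KL-based distillation term. All names and weights are illustrative assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Generic InfoNCE between two aligned batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def combined_adapter_objective(visual_adapted, visual_aug, text_feats,
                               student_logits, teacher_logits,
                               w_vis=1.0, w_xmod=1.0, w_kd=0.5):
    """Hypothetical combination of the three signals named in the summary."""
    loss_vis = info_nce(visual_adapted, visual_aug)       # vision-specific contrast (two views)
    loss_xmod = info_nce(visual_adapted, text_feats)      # cross-modal contrast (image vs. text)
    loss_kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")             # distill frozen teacher logits
    return w_vis * loss_vis + w_xmod * loss_xmod + w_kd * loss_kd
```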
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval [119.03470556503942]
The cross-modal matching problem is typically solved by learning a joint embedding space in which the semantic content shared between the photo and sketch modalities is preserved.
An effective model needs to explicitly account for this style diversity and, crucially, generalize to unseen user styles.
Our model not only disentangles the cross-modal shared semantic content but also adapts the disentanglement to any unseen user style, making it truly style-agnostic.
arXiv Detail & Related papers (2021-03-29T15:44:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.