See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
- URL: http://arxiv.org/abs/2512.22120v1
- Date: Fri, 26 Dec 2025 18:59:47 GMT
- Title: See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
- Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
- Abstract summary: We propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals. BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
- Score: 58.7125460363147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
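The abstract names two KL-based constraints but does not spell out their exact formulation. Below is a minimal PyTorch sketch of how the two shaping terms could be wired up; the KL direction, the hinge-style separation term, and the `margin` parameter are illustrative assumptions, not details taken from the paper.

```python
import torch.nn.functional as F


def bips_shaping_losses(logits_orig, logits_preserve, logits_ablate, margin=1.0):
    """Sketch of the two BiPS shaping terms described in the abstract.

    logits_orig     -- answer logits for the original image + question
    logits_preserve -- logits for the evidence-preserving view (only
                       question-relevant regions kept)
    logits_ablate   -- logits for the evidence-ablated view (critical
                       pixels masked out)
    """
    log_p_orig = F.log_softmax(logits_orig, dim=-1)
    p_preserve = F.softmax(logits_preserve, dim=-1)
    p_ablate = F.softmax(logits_ablate, dim=-1)

    # KL-consistency: predictions from the evidence-preserving view should
    # match those from the full image (coarse but complete coverage of the
    # supporting pixels).
    kl_consistency = F.kl_div(log_p_orig, p_preserve, reduction="batchmean")

    # KL-separation: predictions from the evidence-ablated view should
    # diverge from the original ones. The hinge below (an assumption, not
    # from the paper) penalizes the model whenever that divergence drops
    # below a margin, discouraging text-only shortcuts.
    kl_gap = F.kl_div(log_p_orig, p_ablate, reduction="batchmean")
    kl_separation = F.relu(margin - kl_gap)

    return kl_consistency, kl_separation
```

In training, both terms would presumably be added to the standard answer loss with suitable weights; the abstract does not state the weighting, so any coefficients would likewise be placeholders.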
Related papers
- Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models [41.59364061354628]
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
arXiv Detail & Related papers (2026-01-12T07:48:26Z)
- VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics [0.0]
Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation.
arXiv Detail & Related papers (2025-12-17T15:05:40Z)
- PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation [56.238478239463575]
PPBoost transforms weak text-derived signals into strong, spatially grounded visual prompts. It operates under a strict zero-shot regime with no image- or pixel-level segmentation labels. It consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines.
arXiv Detail & Related papers (2025-11-26T23:49:44Z)
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection [49.26064449816502]
We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
arXiv Detail & Related papers (2025-09-03T08:13:52Z)
- Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors [71.44213719783703]
An Intra-group Consistency Augmentation Framework (ICAF) is developed to label Cadmium Zinc Telluride (CdZnTe) semiconductor images. ICAF consists of two key modules, the View Augmentation Module (VAM) and the View Correction Module (VCM). ICAF achieves a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data.
arXiv Detail & Related papers (2025-08-18T09:40:36Z)
- Decouple before Align: Visual Disentanglement Enhances Prompt Tuning [85.91474962071452]
Prompt tuning (PT) has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context. We propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept.
arXiv Detail & Related papers (2025-08-01T07:46:00Z)
- CROP: Contextual Region-Oriented Visual Token Pruning [9.099029419132775]
Contextual Region-Oriented Visual Token Pruning (CROP) is a novel framework to compress visual tokens. Two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early layers guided by the identified contextual region.
arXiv Detail & Related papers (2025-05-27T14:16:52Z)
- v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning [27.688428439248607]
We introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream. Our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning.
arXiv Detail & Related papers (2025-05-24T19:30:47Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)