See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
- URL: http://arxiv.org/abs/2512.22120v1
- Date: Fri, 26 Dec 2025 18:59:47 GMT
- Title: See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
- Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
- Abstract summary: We propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals. BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
- Score: 58.7125460363147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
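The abstract names two KL-based constraints but does not spell out their exact formulation. Below is a minimal PyTorch sketch of how the two shaping terms could be wired up; the KL direction, the hinge-style separation term, and the `margin` parameter are illustrative assumptions, not details taken from the paper.

```python
import torch.nn.functional as F


def bips_shaping_losses(logits_orig, logits_preserve, logits_ablate, margin=1.0):
    """Sketch of the two BiPS shaping terms described in the abstract.

    logits_orig     -- answer logits for the original image + question
    logits_preserve -- logits for the evidence-preserving view (only
                       question-relevant regions kept)
    logits_ablate   -- logits for the evidence-ablated view (critical
                       pixels masked out)
    """
    log_p_orig = F.log_softmax(logits_orig, dim=-1)
    p_preserve = F.softmax(logits_preserve, dim=-1)
    p_ablate = F.softmax(logits_ablate, dim=-1)

    # KL-consistency: predictions from the evidence-preserving view should
    # match those from the full image (coarse but complete coverage of the
    # supporting pixels).
    kl_consistency = F.kl_div(log_p_orig, p_preserve, reduction="batchmean")

    # KL-separation: predictions from the evidence-ablated view should
    # diverge from the original ones. The hinge below (an assumption, not
    # from the paper) penalizes the model whenever that divergence drops
    # below a margin, discouraging text-only shortcuts.
    kl_gap = F.kl_div(log_p_orig, p_ablate, reduction="batchmean")
    kl_separation = F.relu(margin - kl_gap)

    return kl_consistency, kl_separation
```

In training, both terms would presumably be added to the standard answer loss with suitable weights; the abstract does not state the weighting, so any coefficients would likewise be placeholders.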
Related papers
- Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models [41.59364061354628]
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
arXiv Detail & Related papers (2026-01-12T07:48:26Z)
- VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics [0.0]
Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation.
arXiv Detail & Related papers (2025-12-17T15:05:40Z)
- PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation [56.238478239463575]
PPBoost transforms weak text-derived signals into strong, spatially grounded visual prompts. It operates under a strict zero-shot regime with no image- or pixel-level segmentation labels. It consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines.
arXiv Detail & Related papers (2025-11-26T23:49:44Z)
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection [49.26064449816502]
We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
arXiv Detail & Related papers (2025-09-03T08:13:52Z)
- Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors [71.44213719783703]
An Intra-group Consistency Augmentation Framework (ICAF) is developed to label Cadmium Zinc Telluride (CdZnTe) semiconductor images. ICAF consists of two key modules, the View Augmentation Module (VAM) and the View Correction Module (VCM). ICAF achieves a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data.
arXiv Detail & Related papers (2025-08-18T09:40:36Z)
- Decouple before Align: Visual Disentanglement Enhances Prompt Tuning [85.91474962071452]
Prompt tuning (PT) has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context. We propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept.
arXiv Detail & Related papers (2025-08-01T07:46:00Z)
- CROP: Contextual Region-Oriented Visual Token Pruning [9.099029419132775]
Contextual Region-Oriented Visual Token Pruning (CROP) is a novel framework to compress visual tokens. Two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early layers guided by the identified contextual region.
arXiv Detail & Related papers (2025-05-27T14:16:52Z)
- v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning [27.688428439248607]
We introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream. Our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning.
arXiv Detail & Related papers (2025-05-24T19:30:47Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)