Perceptual Flow Network for Visually Grounded Reasoning
Abstract Overview
This paper introduces PFlowNet (Perceptual Flow Network), a framework for visually grounded reasoning in Large Vision-Language Models (LVLMs) that decouples perception from reasoning via a structured latent trajectory called a "perceptual flow". The authors observe that geometric priors from visual experts (e.g., GroundingDINO) are biased toward localization precision rather than reasoning utility, and that the optimal evidence region is instance-specific. PFlowNet addresses this by using a self-parameterized variational distribution to approximate the posterior over ideal perceptual behaviors. Training combines a supervised cold start with variational reinforcement fine-tuning that integrates multi-dimensional rewards and vicinal geometric shaping. The paper provides a theoretical analysis showing that, under proper calibration, PFlowNet's total variation (TV) distance bound strictly improves over the bounds of both standard MLE and expert-guided RLVR regimes, and it demonstrates competitive empirical results on general-purpose and fine-grained visual reasoning benchmarks.
Novelty
The main novelty is the reformulation of visually grounded reasoning around a structured perceptual flow that separates perception from reasoning and uses self-conditioned generation for the final answer. The training scheme is distinctive in combining a Sub-Trajectory Balance variational objective, a multi-dimensional reward balancing visual reliability (contrastive caption quality) and reasoning utility (information gain for the target answer), and a soft vicinal geometric constraint around expert priors instead of strict imitation.
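To make the "soft vicinal constraint" idea concrete, here is a minimal Python sketch. It is not the authors' implementation. The summary only states that trajectories outside an ε-vicinity of the expert prior are penalized without forcing exact alignment. The distance measure (1 − IoU), the linear hinge, the `alpha` mixing weight, and all function names are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical box format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def vicinal_penalty(pred: Box, expert: Box, eps: float = 0.5, weight: float = 1.0) -> float:
    """Assumed shaping term: zero inside the eps-vicinity, linear outside it.

    The distance is taken to be 1 - IoU (an assumption), so any predicted box
    with distance <= eps is left unpenalized rather than pulled to the expert box.
    """
    distance = 1.0 - iou(pred, expert)
    return weight * max(0.0, distance - eps)


def shaped_reward(reliability: float, utility: float, pred: Box, expert: Box,
                  alpha: float = 0.5, eps: float = 0.5, weight: float = 1.0) -> float:
    """Assumed combination: weighted reliability/utility minus the vicinal penalty."""
    base = alpha * reliability + (1.0 - alpha) * utility
    return base - vicinal_penalty(pred, expert, eps, weight)
```

Under these assumptions, a predicted box that overlaps the expert box reasonably well receives the full reward, while a disjoint box is penalized in proportion to how far it falls outside the vicinity. Strict imitation would instead penalize any deviation from the expert box.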
Results
PFlowNet, built on Qwen3-VL-8B, reports new best results on V* Bench (90.6%) and MME-RealWorld-Lite (67.0%), with gains of +10.4 points on TreeBench and +18.4 points on MME-RealWorld-Lite over its base model. It outperforms prior grounded RLVR and agentic methods on 17/19 sub-tasks across TreeBench and MME-RealWorld-Lite. Theoretical results establish that under calibrated hyperparameters, PFlowNet's TV distance bound strictly improves over both the MLE limit (1−s_V) and the expert-guided RLVR limit (1−q).
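For readers unfamiliar with the quantity being bounded, the standard definition of total variation distance between two distributions P and Q is given below (the sum form applies to discrete distributions). This is the textbook definition, not a formula taken from the paper. The symbols s_V and q in the limits above are not defined in this summary and are left as the source gives them.

```latex
\mathrm{TV}(P, Q) \;=\; \sup_{A} \,\lvert P(A) - Q(A) \rvert
\;=\; \frac{1}{2} \sum_{x} \lvert P(x) - Q(x) \rvert,
\qquad 0 \le \mathrm{TV}(P, Q) \le 1 .
```

Because TV distance lies in [0, 1], a smaller upper bound means the learned distribution is guaranteed to be closer to the target, which is presumably the sense in which the bound "strictly improves".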
Key Points
- PFlowNet decouples perceptual flow generation (planning state + grounded observation chain) from answer reasoning: the sampled flow and its corresponding zoomed-in visual features condition the subsequent answer generation (self-conditioned generation).
- Its variational reinforcement fine-tuning combines a Sub-Trajectory Balance objective with a multi-dimensional reward (contrastive caption quality plus reasoning efficacy) and vicinal geometric shaping that penalizes trajectories outside an ε-vicinity of expert priors without forcing exact alignment.
- The paper has been accepted at ICML 2026. It reports new best scores on V* Bench and MME-RealWorld-Lite, backed by theoretical guarantees of strict improvement over standard MLE and expert-guided RLVR under proper hyperparameter calibration.
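The Sub-Trajectory Balance objective mentioned above comes from the GFlowNet literature. The sketch below shows its generic form for a single trajectory in plain Python. How PFlowNet parameterizes flows, which states it uses, and how the reward enters are not described in this summary, so the function name, argument layout, and the `lam` weighting parameter are assumptions.

```python
from typing import Sequence


def subtb_loss(log_flows: Sequence[float],
               log_pf: Sequence[float],
               log_pb: Sequence[float],
               lam: float = 0.9) -> float:
    """Generic Sub-Trajectory Balance loss for one trajectory s_0 .. s_n.

    log_flows[i] : log F(s_i), with log_flows[n] typically set to the log reward
    log_pf[k]    : log P_F(s_{k+1} | s_k), forward transition log-probability
    log_pb[k]    : log P_B(s_k | s_{k+1}), backward transition log-probability
                   (all zeros for tree-structured, autoregressive generation)
    lam          : geometric weight over sub-trajectory length (assumed default)
    """
    n = len(log_pf)
    if len(log_flows) != n + 1 or len(log_pb) != n:
        raise ValueError("expected n+1 flows and n forward/backward log-probs")

    # Prefix sums let each sub-trajectory's log-probability be read off in O(1).
    cum_pf, cum_pb = [0.0], [0.0]
    for f, b in zip(log_pf, log_pb):
        cum_pf.append(cum_pf[-1] + f)
        cum_pb.append(cum_pb[-1] + b)

    num, den = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            weight = lam ** (j - i)
            # Flow into s_i pushed forward should equal flow at s_j pulled backward.
            forward = log_flows[i] + (cum_pf[j] - cum_pf[i])
            backward = log_flows[j] + (cum_pb[j] - cum_pb[i])
            num += weight * (forward - backward) ** 2
            den += weight
    return num / den if den > 0 else 0.0
```

The loss is zero exactly when every sub-trajectory satisfies the balance condition, so minimizing it pushes the sampler toward generating trajectories with probability proportional to their reward. Averaging over all sub-trajectories, rather than only the full trajectory, is what distinguishes this objective from plain trajectory balance.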
References
- arXiv: https://arxiv.org/abs/2605.02730v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.02730v1
- Hugging Face Papers: https://huggingface.co/papers/2605.02730