FuguReport

Perceptual Flow Network for Visually Grounded Reasoning

Authors Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu
Affiliations Ant Group / East China Normal University / The Hong Kong University of Science and Technology / Shanghai Jiao Tong University / Shanghai AI Laboratory / Sichuan University
Categories Method / Reinforcement Learning / Separation of inference and perception, Method / Reward Engineering / Visual geometry reward shaping, Evaluation / Performance Guarantee / Provable performance assurance and empirical validation
License CC BY 4.0

Abstract Overview

This paper introduces PFlowNet (Perceptual Flow Network), a framework for visually grounded reasoning in Large Vision-Language Models (LVLMs) that decouples perception from reasoning through a structured latent trajectory called the "perceptual flow". The authors observe that geometric priors from visual experts (e.g., GroundingDINO) are biased toward localization precision rather than reasoning utility, and that the optimal evidence region is instance-specific. PFlowNet addresses this by using a self-parameterized variational distribution to approximate the posterior over ideal perceptual behaviors. Training combines a supervised cold start with variational reinforcement fine-tuning that integrates multi-dimensional rewards and vicinal geometric shaping. The theoretical analysis shows that, under proper calibration, PFlowNet's bound on total variation (TV) distance strictly improves on the bounds for both standard maximum likelihood estimation (MLE) and expert-guided reinforcement learning with verifiable rewards (RLVR). Empirically, the paper reports competitive results on general-purpose and fine-grained visual reasoning benchmarks.

Novelty

The main novelty is the reformulation of visually grounded reasoning around a structured perceptual flow that separates perception from reasoning and uses self-conditioned generation for the final answer. The training scheme is distinctive in combining a Sub-Trajectory Balance variational objective, a multi-dimensional reward balancing visual reliability (contrastive caption quality) and reasoning utility (information gain for the target answer), and a soft vicinal geometric constraint around expert priors instead of strict imitation.
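For orientation, the sub-trajectory balance objective is usually written in the GFlowNet literature as shown below. This is the generic form, not the paper's exact objective, which this summary does not reproduce. The symbols F (state flow), P_F and P_B (forward and backward policies), and the weighting parameter λ are the generic notation and are assumptions here.

```latex
% Generic sub-trajectory balance for a trajectory \tau = (s_0 \to s_1 \to \dots \to s_n).
% Squared log-ratio for the sub-trajectory from s_i to s_j (i < j):
\mathcal{L}_{i,j}(\tau) =
  \left(
    \log \frac{F(s_i) \prod_{k=i}^{j-1} P_F(s_{k+1} \mid s_k)}
              {F(s_j) \prod_{k=i}^{j-1} P_B(s_k \mid s_{k+1})}
  \right)^{2}

% Weighted average over all sub-trajectories, with \lambda > 0:
\mathcal{L}_{\mathrm{SubTB}}(\tau) =
  \frac{\sum_{0 \le i < j \le n} \lambda^{\,j-i}\, \mathcal{L}_{i,j}(\tau)}
       {\sum_{0 \le i < j \le n} \lambda^{\,j-i}}
```

In autoregressive generation each state has a unique parent, so the backward policy P_B is typically identically 1 and drops out of the ratio.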

Results

PFlowNet, built on Qwen3-VL-8B, reports new best results on V* Bench (90.6%) and MME-RealWorld-Lite (67.0%), with gains of +10.4 points on TreeBench and +18.4 points on MME-RealWorld-Lite over its base model. It outperforms prior grounded RLVR and agentic methods on 17/19 sub-tasks across TreeBench and MME-RealWorld-Lite. Theoretical results establish that under calibrated hyperparameters, PFlowNet's TV distance bound strictly improves over both the MLE limit (1−s_V) and the expert-guided RLVR limit (1−q).
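As a reminder of the quantity being bounded, the standard definition of total variation distance between two discrete distributions is given below. The distributions P and Q and the sample space are generic placeholders. The quantities s_V and q are not defined in this summary and are left as stated.

```latex
% Total variation distance between distributions P and Q on a countable set \mathcal{X}:
\mathrm{TV}(P, Q)
  = \frac{1}{2} \sum_{x \in \mathcal{X}} \lvert P(x) - Q(x) \rvert
  = \sup_{A \subseteq \mathcal{X}} \lvert P(A) - Q(A) \rvert ,
\qquad 0 \le \mathrm{TV}(P, Q) \le 1 .
```

Because TV distance never exceeds 1, an upper bound such as 1−q is informative only when it falls below 1, and a smaller bound is a tighter guarantee on how far the learned distribution can be from the target.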

Key Points

  1. PFlowNet decouples perceptual flow generation (planning state + grounded observation chain) from answer reasoning, using the sampled flow and corresponding zoomed-in visual features to condition subsequent self-conditioned generation.
  2. Its variational reinforcement fine-tuning combines a Sub-Trajectory Balance objective with a multi-dimensional reward (contrastive caption quality plus reasoning efficacy) and vicinal geometric shaping that penalizes trajectories outside an ε-vicinity of expert priors without forcing exact alignment.
  3. The paper has been accepted to ICML 2026. It reports strong benchmark performance, including new best scores on V* Bench and MME-RealWorld-Lite, supported by theoretical guarantees of strict improvement over standard MLE and expert-guided RLVR under proper hyperparameter calibration.
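The vicinal geometric shaping described in point 2 can be illustrated with a minimal sketch. It assumes, hypothetically, that distance to the expert prior is measured as 1 − IoU between axis-aligned boxes and that the penalty is a hinge that stays at zero inside the ε-vicinity. The function names, the choice of distance, and the `eps` and `weight` defaults are illustrative assumptions, not the paper's formulation.

```python
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); this representation is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def vicinal_penalty(pred: Box, expert: Box, eps: float = 0.5) -> float:
    """Hinge penalty: zero while the prediction is within eps of the expert box.

    Distance is taken as 1 - IoU (hypothetical choice). Inside the vicinity
    the prediction is free to differ from the expert box at no cost.
    """
    distance = 1.0 - iou(pred, expert)
    return max(0.0, distance - eps)


def shaped_reward(task_reward: float, pred: Box, expert: Box,
                  eps: float = 0.5, weight: float = 1.0) -> float:
    """Subtract the weighted vicinal penalty from a task reward (illustrative)."""
    return task_reward - weight * vicinal_penalty(pred, expert, eps)
```

The design point is that the penalty is flat inside the vicinity: a predicted region that overlaps the expert box reasonably well is neither rewarded nor punished for the remaining mismatch, which leaves the task reward to decide which nearby region is most useful.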

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.