Hierarchical Process Reward Models are Symbolic Vision Learners
- URL: http://arxiv.org/abs/2512.03126v1
- Date: Tue, 02 Dec 2025 18:46:40 GMT
- Title: Hierarchical Process Reward Models are Symbolic Vision Learners
- Authors: Shan Zhang, Aotian Chen, Kai Zou, Jindong Gu, Yuan Xue, Anton van den Hengel,
- Abstract summary: Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. We propose a novel self-supervised auto-encoder that encodes diagrams into primitives and decodes them through our executable engine to reconstruct input diagrams.
- Score: 56.94353087007494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is Symbolic Hierarchical Process Reward Modeling, which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Because vanilla reinforcement learning exhibits poor exploration of the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: a 98.2% reduction in MSE for geometric diagram reconstruction, a 0.6% improvement over GPT-4o with a 7B model on chart reconstruction, a +13% gain on the MathGlance perception benchmark, and +3% gains on the MathVerse and GeoQA reasoning benchmarks.
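To make the abstract's step-level parsing rewards concrete, here is a minimal sketch of what a point-on-line consistency reward could look like: full reward when a parsed point lies on a parsed line, decaying with perpendicular distance. The function name, the tolerance parameter, and the exponential shaping are illustrative assumptions, not the paper's exact formulation.

```python
import math

def point_on_line_reward(point, line, tol=1e-2):
    """Step-level reward for point-on-line consistency: 1.0 when the parsed
    point lies on the line through the parsed segment's endpoints, decaying
    with perpendicular distance. Illustrative sketch, not the paper's exact
    reward function."""
    (ax, ay), (bx, by) = line
    px, py = point
    abx, aby = bx - ax, by - ay
    # Perpendicular distance from the point to the infinite line through a, b.
    cross = abx * (py - ay) - aby * (px - ax)
    dist = abs(cross) / (math.hypot(abx, aby) + 1e-12)
    return math.exp(-dist / tol)

r = point_on_line_reward((0.5, 0.5), [(0.0, 0.0), (1.0, 1.0)])  # → 1.0
```

Analogous checks for line-on-shape and shape-on-relation consistency would then be combined hierarchically into the overall process reward.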
Related papers
- CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process [29.38618453695266]
Engineering design operates through hierarchical abstraction from system specifications to component implementations. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams.
arXiv Detail & Related papers (2025-09-26T13:32:14Z)
- Foundations and Models in Modern Computer Vision: Key Building Blocks in Landmark Architectures [34.542592986038265]
This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem. We examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches.
arXiv Detail & Related papers (2025-07-31T09:08:11Z)
- Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings [67.5600169375126]
We study the task of panoptic symbol spotting in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation. We propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives.
arXiv Detail & Related papers (2025-05-29T12:33:11Z)
- Emergent Language Symbolic Autoencoder (ELSA) with Weak Supervision to Model Hierarchical Brain Networks [0.12075823996747355]
Brain networks display a hierarchical organization, a complexity that poses a challenge for existing deep learning models.
We propose a symbolic autoencoder informed by weak supervision and an Emergent Language (EL) framework.
Our innovation includes a generalized hierarchical loss function designed to ensure that both sentences and images accurately reflect the hierarchical structure of functional brain networks.
arXiv Detail & Related papers (2024-04-15T13:51:05Z)
- Discrete, compositional, and symbolic representations through attractor dynamics [51.20712945239422]
We introduce a novel neural systems model that integrates attractor dynamics with symbolic representations to model cognitive processes akin to the probabilistic language of thought (PLoT).
Our model segments the continuous representational space into discrete basins, with attractor states corresponding to symbolic sequences that reflect the semanticity and compositionality characteristic of symbolic systems, learned through unsupervised learning rather than relying on pre-defined primitives.
This approach establishes a unified framework that integrates both symbolic and sub-symbolic processing through neural dynamics, a neuroplausible substrate with proven expressivity in AI, offering a more comprehensive model that mirrors the complex duality of cognitive operations.
arXiv Detail & Related papers (2023-10-03T05:40:56Z)
- LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training.
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
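The "fuzzy logic-based continuous relaxation" mentioned above can be illustrated with standard fuzzy connectives that turn Boolean formulae into differentiable operations on soft truth values. This sketch uses the product t-norm and probabilistic sum; LOGICSEG's actual choice of relaxation may differ, and the example rule is hypothetical.

```python
def fuzzy_and(a, b):
    # Product t-norm: a differentiable relaxation of logical AND.
    return a * b

def fuzzy_or(a, b):
    # Probabilistic sum: the matching relaxation of logical OR.
    return a + b - a * b

def fuzzy_implies(a, b):
    # a -> b rewritten as (NOT a) OR b, so a hierarchy rule like
    # "pixel is 'dog' implies pixel is 'animal'" becomes differentiable.
    return fuzzy_or(1.0 - a, b)

# Truth degree of the rule given soft class predictions 0.9 and 0.8.
degree = fuzzy_implies(0.9, 0.8)  # 0.1 + 0.8 - 0.1*0.8 = 0.82
```

Because each connective is smooth, a formula's truth degree can serve directly as a training loss term on a neural computational graph.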
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
- Symbolic Visual Reinforcement Learning: A Scalable Framework with Object-Level Abstraction and Differentiable Expression Search [63.3745291252038]
We propose DiffSES, a novel symbolic learning approach that discovers discrete symbolic policies.
By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions.
Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more scalable than state-of-the-art symbolic RL methods.
arXiv Detail & Related papers (2022-12-30T17:50:54Z)
- Graph-based Neural Modules to Inspect Attention-based Architectures: A Position Paper [0.0]
Encoder-decoder models offer an exciting opportunity for visualization and editing by humans of the knowledge implicitly represented in model weights.
In this work, we explore ways to create an abstraction for segments of the network as a two-way graph-based representation.
Such two-way graph representation enables new neuro-symbolic systems by leveraging the pattern recognition capabilities of the encoder-decoder along with symbolic reasoning carried out on the graphs.
arXiv Detail & Related papers (2022-10-13T15:52:12Z)
- pix2rule: End-to-end Neuro-symbolic Rule Learning [84.76439511271711]
This paper presents a complete neuro-symbolic method for processing images into objects, learning relations and logical rules.
The main contribution is a differentiable layer in a deep learning architecture from which symbolic relations and rules can be extracted.
We demonstrate that our model scales beyond state-of-the-art symbolic learners and outperforms deep relational neural network architectures.
arXiv Detail & Related papers (2021-06-14T15:19:06Z)
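The pix2rule entry above hinges on "a differentiable layer from which symbolic relations and rules can be extracted". One common way to build such a layer is a soft conjunction: learnable weights softly select input literals, and a sharp softmin approximates logical AND. The parameterization below is an illustrative sketch in that spirit, not pix2rule's exact layer.

```python
import numpy as np

def soft_conjunction(x, w, beta=6.0):
    """Differentiable rule layer sketch: weights w softly select literals
    from truth values x in [0, 1], and a log-sum-exp softmin approximates
    the AND over the selected literals. Illustrative, not pix2rule's exact
    formulation."""
    sel = 1.0 / (1.0 + np.exp(-w))   # soft selection of each literal
    lit = 1.0 - sel * (1.0 - x)      # unselected literals default to true
    # Softmin via log-sum-exp: a differentiable stand-in for min(lit).
    return float(-np.log(np.mean(np.exp(-beta * lit))) / beta)

x = np.array([1.0, 1.0, 0.0])        # third literal is false
w = np.array([5.0, 5.0, -5.0])       # learned rule uses only the first two
truth = soft_conjunction(x, w)       # close to 1.0: the rule fires
```

After training, thresholding the selection weights recovers a discrete symbolic rule (here, "literal 1 AND literal 2"), which is what makes rule extraction from the trained layer possible.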
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.