Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning
- URL: http://arxiv.org/abs/2503.04900v1
- Date: Thu, 06 Mar 2025 19:02:20 GMT
- Title: Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning
- Authors: Victor Sebastian Martinez Pozos, Ivan Vladimir Meza Ruiz
- Abstract summary: We propose a novel approach for generating symbolic representations from visual data using self-supervised learning (SSL). An advantage of our method is its interpretability: the sequences are produced by a decoder transformer using cross-attention. This approach lays the foundation for creating interpretable symbolic representations with potential applications in high-level scene understanding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the potential of abstracting complex visual information into discrete, structured symbolic sequences using self-supervised learning (SSL). Inspired by how language abstracts and organizes information to enable better reasoning and generalization, we propose a novel approach for generating symbolic representations from visual data. To learn these sequences, we extend the DINO framework to handle visual and symbolic information. Initial experiments suggest that the generated symbolic sequences capture a meaningful level of abstraction, though further refinement is required. An advantage of our method is its interpretability: the sequences are produced by a decoder transformer using cross-attention, allowing attention maps to be linked to specific symbols and offering insight into how these representations correspond to image regions. This approach lays the foundation for creating interpretable symbolic representations with potential applications in high-level scene understanding.
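The paper itself includes no code, so as a rough illustration, here is a minimal PyTorch sketch of the kind of pipeline the abstract describes: a small decoder that cross-attends over patch features from a frozen vision encoder (e.g., DINO) and emits a discrete symbol sequence, returning the attention maps that link each symbol to image regions. Every name, dimension, and the straight-through Gumbel-softmax discretization below is an assumption made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymbolicDecoder(nn.Module):
    """Hypothetical sketch of a cross-attention decoder that maps patch
    features from a frozen vision encoder to a discrete symbol sequence."""

    def __init__(self, d_model=256, n_symbols=512, seq_len=16, n_heads=4):
        super().__init__()
        # One learned query per output symbol position.
        self.queries = nn.Parameter(torch.randn(seq_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.to_logits = nn.Linear(d_model, n_symbols)  # symbol vocabulary

    def forward(self, patch_feats):
        # patch_feats: (B, n_patches, d_model), e.g. DINO patch embeddings.
        B = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Each symbol query attends over image patches; attn has shape
        # (B, seq_len, n_patches) and ties every symbol to image regions.
        ctx, attn = self.cross_attn(q, patch_feats, patch_feats,
                                    average_attn_weights=True)
        h = ctx + self.ff(ctx)
        logits = self.to_logits(h)  # (B, seq_len, n_symbols)
        # Straight-through Gumbel-softmax: discrete one-hot symbols in the
        # forward pass, differentiable in the backward pass (an assumption).
        symbols = F.gumbel_softmax(logits, tau=1.0, hard=True)
        return symbols, attn

# Usage with dummy features standing in for encoder outputs:
feats = torch.randn(2, 196, 256)               # (batch, patches, dim)
symbols, attn = SymbolicDecoder()(feats)
print(symbols.shape, attn.shape)               # (2, 16, 512) (2, 16, 196)
```

Reshaping one row of `attn` back to the 14x14 patch grid would give the kind of per-symbol heatmap the abstract points to as the source of interpretability.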
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and enables better interpretability, aligning with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning [86.59849798539312]
We present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.
arXiv Detail & Related papers (2024-10-30T16:11:05Z) - What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning.
At the core of DSG are schemas: dependency-graph descriptions of abstract concepts that decompose them into more primitive-level symbols.
We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
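As a rough illustration of what a schema-as-dependency-graph could look like as a data structure, here is a hypothetical Python encoding; the paper's concrete schema format may well differ.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """Hypothetical dependency-graph schema: an abstract concept plus the
    more primitive symbols it decomposes into, with their dependencies."""
    concept: str
    primitives: list[str]
    # Edges map a primitive to the primitives it depends on.
    dependencies: dict[str, list[str]] = field(default_factory=dict)

# A toy 'maze' schema: walls and paths would be grounded first, then the
# higher-level notions that depend on them.
maze = Schema(
    concept="maze",
    primitives=["walls", "paths", "entrance", "exit"],
    dependencies={"entrance": ["walls", "paths"],
                  "exit": ["walls", "paths"]},
)
```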
arXiv Detail & Related papers (2024-09-12T16:41:47Z) - How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models [2.048226951354646]
Symbols are an extreme form of such abstract representations.
This work estimates the state of such structured representations in vision encoders.
We show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
arXiv Detail & Related papers (2024-06-13T12:54:20Z) - Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which embeds semantic information into the visual prompt, distilling a semantic-enhanced prompt for visual representation enrichment. AENet comprises two key steps: 1) exploring concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via a visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z) - Discrete, compositional, and symbolic representations through attractor dynamics [51.20712945239422]
We introduce a novel neural systems model that integrates attractor dynamics with symbolic representations to model cognitive processes akin to the probabilistic language of thought (PLoT).
Our model segments the continuous representational space into discrete basins whose attractor states correspond to symbolic sequences, reflecting the semanticity and compositionality characteristic of symbolic systems; this structure emerges through unsupervised learning rather than relying on pre-defined primitives.
This approach establishes a unified framework that integrates symbolic and sub-symbolic processing through neural dynamics, a neuroplausible substrate with proven expressivity in AI, offering a more comprehensive model that mirrors the complex duality of cognitive operations.
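As a toy illustration of the basin idea (not the paper's actual dynamics), the sketch below pulls each continuous state toward its nearest attractor, so every state settles into a discrete basin whose index can serve as a symbol; the attractor set and update rule are assumptions.

```python
import torch

def settle(x, attractors, steps=50, rate=0.2):
    """Toy attractor dynamics: pull each state toward its nearest
    attractor; the basin each state converges to is its symbol id."""
    for _ in range(steps):
        d = torch.cdist(x, attractors)         # (batch, n_attractors)
        nearest = attractors[d.argmin(dim=1)]  # winner-take-all target
        x = x + rate * (nearest - x)           # relax toward the attractor
    return x, d.argmin(dim=1)                  # settled states, symbol ids

attractors = torch.randn(8, 2)                 # 8 basins in a 2-D space
states, symbols = settle(torch.randn(32, 2), attractors)
```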
arXiv Detail & Related papers (2023-10-03T05:40:56Z) - Goal Space Abstraction in Hierarchical Reinforcement Learning via Set-Based Reachability Analysis [0.5409704301731713]
We introduce a Feudal HRL algorithm that concurrently learns both the goal representation and a hierarchical policy.
We evaluate our approach on complex navigation tasks, showing that the learned representation is interpretable and transferable and leads to data-efficient learning.
arXiv Detail & Related papers (2023-09-14T12:39:26Z) - On the Transition from Neural Representation to Symbolic Knowledge [2.2528422603742304]
We propose a Neural-Symbolic Transitional Dictionary Learning (TDL) framework that employs an EM algorithm to learn a transitional representation of data.
We implement the framework with a diffusion model by treating the decomposition of the input as a cooperative game.
We additionally use RL, enabled by the Markovian property of diffusion models, to further tune the learned prototypes.
arXiv Detail & Related papers (2023-08-03T19:29:35Z) - Symbolic Visual Reinforcement Learning: A Scalable Framework with Object-Level Abstraction and Differentiable Expression Search [63.3745291252038]
We propose DiffSES, a novel symbolic learning approach that discovers discrete symbolic policies.
By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions.
Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more scalable than state-of-the-art symbolic RL methods.
arXiv Detail & Related papers (2022-12-30T17:50:54Z) - Compositional Mixture Representations for Vision and Text [43.2292923754127]
A common representation space between vision and language allows deep networks to relate objects in the image to the corresponding semantic meaning.
We present a model that learns a shared Gaussian mixture representation, imposing the compositionality of text onto the visual domain without explicit location supervision.
arXiv Detail & Related papers (2022-06-13T18:16:40Z) - Constellation: Learning relational abstractions over objects for compositional imagination [64.99658940906917]
We introduce Constellation, a network that learns relational abstractions of static visual scenes.
This work is a first step toward explicitly representing visual relationships and using them for complex cognitive procedures.
arXiv Detail & Related papers (2021-07-23T11:59:40Z) - Inducing Meaningful Units from Character Sequences with Dynamic Capacity Slot Attention [12.25208417841772]
We propose an unsupervised distributional method to learn the abstract meaningful units in a sequence of characters.
Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the objects in the sequence.
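For context, below is a compact, simplified sketch of the standard slot-attention update (Locatello et al., 2020) that the method's name references; the dynamic-capacity mechanism itself is not reproduced here, and the fixed learned slot initialization is a simplification of the usual sampled one.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention (Locatello et al., 2020): slots compete
    for input elements via attention normalized over the slot axis."""

    def __init__(self, n_slots=4, dim=64, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, n_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs):                 # inputs: (B, n_inputs, dim)
        B, n, d = inputs.shape
        slots = self.slots_init.expand(B, -1, -1)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = torch.einsum('bid,bjd->bij', q, k) * self.scale
            attn = attn.softmax(dim=1)          # compete over slots, not inputs
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean
            updates = torch.einsum('bij,bjd->bid', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(B, -1, d)
        return slots                            # (B, n_slots, dim) object slots

slots = SlotAttention()(torch.randn(2, 20, 64))  # e.g. 20 character embeddings
```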
arXiv Detail & Related papers (2021-02-01T23:11:57Z)