Slow Perception: Let's Perceive Geometric Figures Step-by-step
- URL: http://arxiv.org/abs/2412.20631v2
- Date: Sun, 26 Jan 2025 23:16:36 GMT
- Title: Slow Perception: Let's Perceive Geometric Figures Step-by-step
- Authors: Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang
- Abstract summary: We believe accurate copying (strong perception) is the first step to visual o1.
We introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations.
- Score: 53.69067976062474
- Abstract: Recently, "visual o1" has begun to enter people's vision, with the expectation that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step toward visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations and, as humans do, reconstruct complex geometric structures progressively. SP consists of two stages: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify the geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" when regressing line segments by using a proposed "perceptual ruler" to trace each line stroke by stroke. Surprisingly, such a human-like perception manner enjoys an inference-time scaling law -- the slower, the better. Researchers have long striven to speed up a model's perception, but we slow it down again, allowing the model to read the image step by step and carefully.
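The "perceptual ruler" idea in the perception flow stage can be sketched in a few lines: instead of regressing a segment's endpoint in one long visual jump, the segment is traced through a sequence of intermediate waypoints no longer than a fixed ruler length. The function name and parameters below are illustrative assumptions, not taken from the paper's code.

```python
import math

def perceptual_ruler_strokes(start, end, ruler_length):
    """Split a line segment into strokes no longer than ruler_length.

    Illustrative sketch of the 'perception flow' stage: the model traces
    the segment through intermediate waypoints rather than predicting the
    far endpoint in a single step.
    """
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    # Number of strokes needed so each stroke fits under the ruler.
    n_strokes = max(1, math.ceil(length / ruler_length))
    # Evenly spaced waypoints from start to end, inclusive.
    return [
        (x0 + (x1 - x0) * i / n_strokes,
         y0 + (y1 - y0) * i / n_strokes)
        for i in range(n_strokes + 1)
    ]

# A 10-unit segment with a ruler of length 4 is traced in 3 strokes.
waypoints = perceptual_ruler_strokes((0, 0), (10, 0), ruler_length=4)
```

A shorter ruler yields more strokes and hence more (slower) perception steps, which is the knob behind the paper's "the slower, the better" inference-time scaling observation.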
Related papers
- Recurrent Joint Embedding Predictive Architecture with Recurrent Forward Propagation Learning [0.0]
We introduce a vision network inspired by biological principles.
The network learns by predicting the representation of the next image patch (fixation) based on the sequence of past fixations.
We also introduce *Recurrent-Forward propagation*, a learning algorithm that avoids biologically unrealistic backpropagation through time or memory-inefficient real-time recurrent learning.
arXiv Detail & Related papers (2024-11-10T01:40:42Z)
- Toon3D: Seeing Cartoons from New Perspectives [52.85312338932685]
We recover the underlying 3D structure from images of cartoons and anime depicting the same scene.
Our key insight is to deform the input images while recovering camera poses and scene geometry.
Our recovered point clouds can be plugged into novel-view synthesis methods to experience cartoons from viewpoints never drawn before.
arXiv Detail & Related papers (2024-05-16T17:59:51Z)
- Geometry-biased Transformers for Novel View Synthesis [36.11342728319563]
We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints.
Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation.
We propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference.
arXiv Detail & Related papers (2023-01-11T18:59:56Z)
- PQA: Perceptual Question Answering [35.051664704756995]
Perceptual organization remains one of the very few established theories on the human visual system.
In this paper, we rejuvenate the study of perceptual organization, by advocating two positional changes.
We examine purposefully generated synthetic data, instead of complex real imagery.
We then borrow insights from human psychology to design an agent that casts perceptual organization as a self-attention problem.
arXiv Detail & Related papers (2021-04-08T08:06:21Z)
- ShaRF: Shape-conditioned Radiance Fields from a Single View [54.39347002226309]
We present a method for estimating neural scene representations of objects given only a single image.
The core of our method is the estimation of a geometric scaffold for the object.
We demonstrate in several experiments the effectiveness of our approach in both synthetic and real images.
arXiv Detail & Related papers (2021-02-17T16:40:28Z)
- Predictive coding feedback results in perceived illusory contours in a recurrent neural network [0.0]
We equip a deep feedforward convolutional network with brain-inspired recurrent dynamics.
We show that the perception of illusory contours could involve feedback connections.
arXiv Detail & Related papers (2021-02-03T09:07:09Z)
- Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature Learning and Lazy Training [4.318555434063275]
Deep learning algorithms are responsible for a technological revolution in a variety of tasks including image recognition or Go playing.
Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension -- a feat that is generically impossible.
We argue that different learning regimes can be organized into a phase diagram.
arXiv Detail & Related papers (2020-12-30T11:00:36Z)
- Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematical object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z)
- Neural Topological SLAM for Visual Navigation [112.73876869904]
We design topological representations for space that leverage semantics and afford approximate geometric reasoning.
We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation.
arXiv Detail & Related papers (2020-05-25T17:56:29Z)
- Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.