Reading a Ruler in the Wild
- URL: http://arxiv.org/abs/2507.07077v1
- Date: Wed, 09 Jul 2025 17:35:58 GMT
- Title: Reading a Ruler in the Wild
- Authors: Yimu Pan, Manas Mehta, Gwen Sincerbeaux, Jeffery A. Goldstein, Alison D. Gernand, James Z. Wang,
- Abstract summary: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision.<n>We introduce RulerNet, a deep learning framework that robustly infers scale "in the wild"<n> Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions.
- Score: 1.4785540163232234
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. We introduce RulerNet, a deep learning framework that robustly infers scale "in the wild" by reformulating ruler reading as a unified keypoint-detection problem and by representing the ruler with geometric-progression parameters that are invariant to perspective transformations. Unlike traditional methods that rely on handcrafted thresholds or rigid, ruler-specific pipelines, RulerNet directly localizes centimeter marks using a distortion-invariant annotation and training strategy, enabling strong generalization across diverse ruler types and imaging conditions while mitigating data scarcity. We also present a scalable synthetic-data pipeline that combines graphics-based ruler generation with ControlNet to add photorealistic context, greatly increasing training diversity and improving performance. To further enhance robustness and efficiency, we propose DeepGP, a lightweight feed-forward network that regresses geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation on mobile or edge devices. Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. These results underscore its utility as a generalizable measurement tool and its potential for integration with other vision components for automated, scale-aware analysis in high-impact domains. A live demo is available at https://huggingface.co/spaces/ymp5078/RulerNet-Demo.
Related papers
- Verification of Visual Controllers via Compositional Geometric Transformations [49.81690518952909]
We introduce a novel verification framework for perception-based controllers that can generate outer-approximations of reachable sets.<n>We provide theoretical guarantees on the soundness of our method and demonstrate its effectiveness across benchmark control environments.
arXiv Detail & Related papers (2025-07-06T20:22:58Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.<n>Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.<n>Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed as Diffusion Transformer Policy, to model continuous end-effector actions.<n>By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z) - A survey on efficient vision transformers: algorithms, techniques, and
performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications.
This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z) - General Neural Gauge Fields [100.35916421218101]
We develop a learning framework to jointly optimize gauge transformations and neural fields.
We derive an information-invariant gauge transformation which allows to preserve scene information inherently and yield superior performance.
arXiv Detail & Related papers (2023-05-05T12:08:57Z) - LGC-Net: A Lightweight Gyroscope Calibration Network for Efficient
Attitude Estimation [10.468378902106613]
We present a calibration neural network model for denoising low-cost microelectromechanical system (MEMS) gyroscope and estimating the attitude of a robot in real-time.
Key idea is extracting local and global features from the time window of inertial measurement units (IMU) measurements to regress the output compensation components for the gyroscope dynamically.
The proposed algorithm is evaluated in the EuRoC and TUM-VI datasets and achieves state-of-the-art on the (unseen) test sequences with a more lightweight model structure.
arXiv Detail & Related papers (2022-09-19T08:03:03Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the
Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Geometry-Guided Progressive NeRF for Generalizable and Efficient Neural
Human Rendering [139.159534903657]
We develop a generalizable and efficient Neural Radiance Field (NeRF) pipeline for high-fidelity free-viewpoint human body details.
To better tackle self-occlusion, we devise a geometry-guided multi-view feature integration approach.
For achieving higher rendering efficiency, we introduce a geometry-guided progressive rendering pipeline.
arXiv Detail & Related papers (2021-12-08T14:42:10Z) - Unsupervised Metric Relocalization Using Transform Consistency Loss [66.19479868638925]
Training networks to perform metric relocalization traditionally requires accurate image correspondences.
We propose a self-supervised solution, which exploits a key insight: localizing a query image within a map should yield the same absolute pose, regardless of the reference image used for registration.
We evaluate our framework on synthetic and real-world data, showing our approach outperforms other supervised methods when a limited amount of ground-truth information is available.
arXiv Detail & Related papers (2020-11-01T19:24:27Z) - Improving the generalization of network based relative pose regression:
dimension reduction as a regularizer [16.63174637692875]
State-of-the-art visual localization methods perform pose estimation using geometry based solver within the RANSAC framework.
End-to-end learning based regression networks provide a solution to circumvent the requirement for precise pixel-level correspondences.
In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from the absolute image feature values.
We implement this dimension regularization strategy within a two-layer pyramid based framework to regress the localization results from coarse to fine.
arXiv Detail & Related papers (2020-10-24T06:20:46Z) - ProAlignNet : Unsupervised Learning for Progressively Aligning Noisy
Contours [12.791313859673187]
"ProAlignNet" accounts for large scale misalignments and complex transformations between the contour shapes.
It learns by training with a novel loss function which is derived an upperbound of a proximity-sensitive and local shape-dependent similarity metric.
In two real-world applications, the proposed models consistently perform superior to state-of-the-art methods.
arXiv Detail & Related papers (2020-05-23T14:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.