DISeR: Designing Imaging Systems with Reinforcement Learning
- URL: http://arxiv.org/abs/2309.13851v1
- Date: Mon, 25 Sep 2023 03:35:51 GMT
- Title: DISeR: Designing Imaging Systems with Reinforcement Learning
- Authors: Tzofi Klinghoffer, Kushagra Tiwary, Nikhil Behari, Bhavya Agrawalla,
Ramesh Raskar
- Abstract summary: We formulate four building blocks of imaging systems as a context-free grammar (CFG), which can be automatically searched over with a learned camera designer.
We show how the camera designer can be implemented with reinforcement learning to intelligently search over the space of possible imaging system configurations.
- Score: 13.783685993646738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imaging systems consist of cameras to encode visual information about the
world and perception models to interpret this encoding. Cameras contain (1)
illumination sources, (2) optical elements, and (3) sensors, while perception
models use (4) algorithms. Directly searching over all combinations of these
four building blocks to design an imaging system is challenging due to the size
of the search space. Moreover, cameras and perception models are often designed
independently, leading to sub-optimal task performance. In this paper, we
formulate these four building blocks of imaging systems as a context-free
grammar (CFG), which can be automatically searched over with a learned camera
designer to jointly optimize the imaging system with task-specific perception
models. By transforming the CFG to a state-action space, we then show how the
camera designer can be implemented with reinforcement learning to intelligently
search over the combinatorial space of possible imaging system configurations.
We demonstrate our approach on two tasks, depth estimation and camera rig
design for autonomous vehicles, showing that our method yields rigs that
outperform industry-wide standards. We believe that our proposed approach is an
important step towards automating imaging system design.
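The CFG-plus-search idea above can be sketched in a few lines. Everything in this sketch is an illustrative assumption rather than the paper's implementation: the grammar's terminals, the `mock_task_score` stand-in for trained perception performance, and plain random search standing in for the learned RL camera designer.

```python
import random

# A toy context-free grammar over the four building blocks of an imaging
# system: illumination, optics, sensor, and algorithm. The terminal choices
# here are illustrative placeholders, not the paper's actual grammar.
GRAMMAR = {
    "SYSTEM": [["ILLUMINATION", "OPTICS", "SENSOR", "ALGORITHM"]],
    "ILLUMINATION": [["ambient"], ["active_ir"], ["structured_light"]],
    "OPTICS": [["pinhole"], ["wide_angle_lens"], ["coded_aperture"]],
    "SENSOR": [["rgb"], ["monochrome"], ["time_of_flight"]],
    "ALGORITHM": [["cnn_depth"], ["stereo_matching"]],
}

def expand(symbol, rng):
    """Derive a configuration by expanding one non-terminal at a time.
    In the RL view, each expansion is an action and the partial
    derivation so far is the state."""
    if symbol not in GRAMMAR:  # terminal symbol: no further expansion
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    out = []
    for s in production:
        out.extend(expand(s, rng))
    return out

def mock_task_score(config):
    """Stand-in for task performance (e.g. depth-estimation accuracy).
    A real camera designer would train and evaluate a perception model
    on the simulated imaging system here."""
    bonus = {"active_ir": 0.2, "coded_aperture": 0.1,
             "time_of_flight": 0.3, "cnn_depth": 0.2}
    return sum(bonus.get(c, 0.0) for c in config)

def random_search(n_samples=200, seed=0):
    """Baseline search: sample derivations and keep the best-scoring one."""
    rng = random.Random(seed)
    return max((expand("SYSTEM", rng) for _ in range(n_samples)),
               key=mock_task_score)
```

A learned designer would replace `rng.choice` with a policy that maps the partial derivation (state) to a production (action) and is trained with the task score as the reward, which is what makes the combinatorial space tractable to search intelligently.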
Related papers
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- ChatCam: Empowering Camera Control through Conversational AI [67.31920821192323]
ChatCam is a system that navigates camera movements through conversations with users.
To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation.
We also develop an Anchor Determinator to ensure precise camera trajectory placement.
arXiv Detail & Related papers (2024-09-25T20:13:41Z)
- Exploring Camera Encoder Designs for Autonomous Driving Perception [36.65794720685284]
We develop an architecture optimized for AV camera encoders, achieving an 8.79% mAP improvement over the baseline.
We believe our effort could serve as a useful cookbook of image encoders for AVs and pave the way to next-level driving systems.
arXiv Detail & Related papers (2024-07-09T23:44:58Z)
- Global Search Optics: Automatically Exploring Optimal Solutions to Compact Computational Imaging Systems [15.976326291076377]
The popularity of mobile vision creates a demand for advanced compact computational imaging systems.
Joint design pipelines come to the forefront, where the two significant components are simultaneously optimized via data-driven learning.
In this work, we present Global Search Optics (GSO) to design compact computational imaging systems.
arXiv Detail & Related papers (2024-04-30T01:59:25Z)
- Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs [57.492124844326206]
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision.
Our innovative framework, unlike any before, seamlessly integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, fostering a synergistic enhancement of these tasks.
arXiv Detail & Related papers (2023-12-12T13:22:44Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Deep Optical Coding Design in Computational Imaging [16.615106763985942]
Computational optical imaging (COI) systems leverage optical coding elements (CE) in their setups to encode a high-dimensional scene in a single or multiple snapshots and decode it by using computational algorithms.
The performance of COI systems highly depends on the design of its main components: the CE pattern and the computational method used to perform a given task.
Deep neural networks (DNNs) have opened a new horizon in CE data-driven designs that jointly consider the optical encoder and computational decoder.
arXiv Detail & Related papers (2022-06-27T04:41:48Z)
- Twins: Revisiting Spatial Attention Design in Vision Transformers [81.02454258677714]
In this work, we demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes.
We propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks.
arXiv Detail & Related papers (2021-04-28T15:42:31Z)
- Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation [89.82169646672872]
We propose an unsupervised vision-based system to estimate the joint configurations of the robot arm from a sequence of RGB or RGB-D images without knowing the model a priori.
We combine a classical geometric formulation with deep learning and extend the use of epipolar multi-rigid-body constraints to solve this task.
arXiv Detail & Related papers (2020-11-30T20:46:48Z)
- A Robotic 3D Perception System for Operating Room Environment Awareness [3.830091185868436]
We describe a 3D multi-view perception system for the da Vinci surgical system to enable Operating room (OR) scene understanding and context awareness.
Based on this architecture, a multi-view 3D scene semantic segmentation algorithm is created.
Our proposed architecture has acceptable registration error ($3.3\% \pm 1.4\%$ of object-camera distance) and can robustly improve scene segmentation performance.
arXiv Detail & Related papers (2020-03-20T20:27:06Z)
- Redesigning SLAM for Arbitrary Multi-Camera Systems [51.81798192085111]
Adding more cameras to SLAM systems improves robustness and accuracy but complicates the design of the visual front-end significantly.
In this work, we aim at an adaptive SLAM system that works for arbitrary multi-camera setups.
We adapt a state-of-the-art visual-inertial odometry with these modifications, and experimental results show that the modified pipeline can adapt to a wide range of camera setups.
arXiv Detail & Related papers (2020-03-04T11:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.