M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on
PRogrammable Multi-Operand Multimode Interference
- URL: http://arxiv.org/abs/2305.19505v2
- Date: Thu, 28 Dec 2023 21:05:09 GMT
- Title: M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on
PRogrammable Multi-Operand Multimode Interference
- Authors: Jiaqi Gu, Hanqing Zhu, Chenghao Feng, Zixuan Jiang, Ray T. Chen, David
Z. Pan
- Abstract summary: Photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint.
We propose an ultra-compact PTC using customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO.
M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness.
- Score: 18.0155410476884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Photonic computing shows promise for transformative advancements in machine
learning (ML) acceleration, offering ultra-fast speed, massive parallelism, and
high energy efficiency. However, current photonic tensor core (PTC) designs
based on standard optical components hinder scalability and compute density due
to their large spatial footprint. To address this, we propose an ultra-compact
PTC using customized programmable multi-operand multimode interference (MOMMI)
devices, named M3ICRO. The programmable MOMMI leverages the intrinsic light
propagation principle, providing a single-device programmable matrix unit
beyond the conventional computing paradigm of one multiply-accumulate (MAC)
operation per device. To overcome the optimization difficulty of customized
devices that often requires time-consuming simulation, we apply ML for optics
to predict the device behavior and enable a differentiable optimization flow.
We thoroughly investigate the reconfigurability and matrix expressivity of our
customized PTC, and introduce a novel block unfolding method to fully exploit
the computing capabilities of a complex-valued PTC for near-universal
real-valued linear transformations. Extensive evaluations demonstrate that
M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x
higher compute density, 3.7-12x higher system throughput, and superior noise
robustness compared to state-of-the-art coherent PTC designs, while maintaining
close-to-digital task accuracy across various ML benchmarks. Our code is
open-sourced at https://github.com/JeremieMelo/M3ICRO-MOMMI.
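
As a rough illustration of why a complex-valued matrix unit can serve real-valued linear transformations, the sketch below checks a textbook identity: a complex product y = Wx equals a real block matrix [[Re W, -Im W], [Im W, Re W]] applied to the stacked real and imaginary parts of x. This is not the paper's block unfolding method, which the abstract does not spell out; the helper names `real_block_form` and `matvec` and the sample values are assumptions made for this example only.

```python
# Illustrative only: the standard real-block form of a complex matrix.
# This is NOT the M3ICRO block unfolding algorithm; names and values are hypothetical.

def real_block_form(W):
    """Return the real matrix [[Re W, -Im W], [Im W, Re W]] as nested lists."""
    top = [[w.real for w in row] + [-w.imag for w in row] for row in W]
    bottom = [[w.imag for w in row] + [w.real for w in row] for row in W]
    return top + bottom

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# A small complex-valued "weight matrix" and input vector (arbitrary sample values).
W = [[1 + 2j, 0.5 - 1j],
     [-1j, 2 + 0j]]
x = [3 - 1j, 1 + 4j]

# Complex-domain product.
y_complex = matvec(W, x)

# Equivalent real-domain product on the stacked vector [Re x; Im x].
x_stacked = [z.real for z in x] + [z.imag for z in x]
y_real = matvec(real_block_form(W), x_stacked)

# Reassemble complex outputs from the stacked real result.
n = len(W)
y_rebuilt = [complex(y_real[i], y_real[n + i]) for i in range(n)]
```

Here a 2x2 complex matrix corresponds to a structured 4x4 real matrix, so the two halves of that real matrix are tied together rather than free.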
Related papers
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions achieve an average latency reduction of 21%.
arXiv Detail & Related papers (2024-09-25T11:10:33Z)
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z)
- PowerYOLO: Mixed Precision Model for Hardware Efficient Object Detection with Event Data [0.5461938536945721]
PowerYOLO is a mixed precision solution to the problem of fitting algorithms of high memory and computational complexity into small low-power devices.
First, we propose a system based on a Dynamic Vision Sensor (DVS), a novel sensor, that offers low power requirements.
Second, to ensure high accuracy and low memory and computational complexity, we propose to use 4-bit width Powers-of-Two (PoT) quantisation.
Third, we replace multiplication with bit-shifting to increase the efficiency of hardware acceleration of such a solution.
arXiv Detail & Related papers (2024-07-11T08:17:35Z)
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention [11.999596399083089]
We propose an MPC-friendly ViT, dubbed MPCViT, to enable accurate yet efficient ViT inference in MPC.
With extensive experiments, we demonstrate that MPCViT achieves 1.9%, 1.3% and 3.6% higher accuracy with 6.2x, 2.9x and 1.9x latency reduction.
arXiv Detail & Related papers (2022-11-25T08:37:17Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy a large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications.
We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS).
Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.