Title: NetPU: Generic Runtime-Reconfigurable Quantized Hardware Accelerator Architecture for Neural Network Inference
Duration: 01/2023 – 12/2025
Research Area: Architectures / Scalability / Security
The growing size of trained network models in recent research has encouraged the design of quantized inference hardware accelerators at the edge. Previous work has widely explored two architectures: 1) the Processing Element Array (PEA) architecture, which offers generic inference for different networks at the cost of a complex runtime environment, and 2) the Heterogeneous Streaming Dataflow (HSD) architecture, which implements customized hardware accelerators for given trained models with simplified runtime control. We explore the design of NetPU, a hybrid architecture between PEA and HSD that supports runtime-reconfigurable, mixed-precision, quantized inference for generic networks. The architecture implements inference control in hardware, reducing the runtime environment to streaming data transmission. Moreover, based on a runtime-reconfigurable multi-precision multi-channel multiplier, NetPU improves the parallel computing performance of low-precision (<8-bit) quantized networks.
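As a rough illustration of the multi-precision multi-channel idea (a sketch under assumed names and widths, not the actual NetPU RTL), the following Verilog module computes either one 8x8 product or two independent packed 4x4 products per cycle, selected by a runtime mode bit:

```verilog
// Hypothetical runtime-reconfigurable multi-precision multiplier channel:
// one full-width unsigned multiply, or two half-width multiplies packed
// into the same operand width. Names and widths are illustrative.
module mp_mult #(
    parameter W = 8            // full operand width
) (
    input  wire           clk,
    input  wire           mode4,   // 0: one WxW product, 1: two (W/2)x(W/2) products
    input  wire [W-1:0]   a,       // packed operand A (two nibbles in 4-bit mode)
    input  wire [W-1:0]   b,       // packed operand B
    output reg  [2*W-1:0] p        // packed product(s)
);
    wire [W-1:0]   p_lo   = a[W/2-1:0] * b[W/2-1:0];  // low-channel 4x4
    wire [W-1:0]   p_hi   = a[W-1:W/2] * b[W-1:W/2];  // high-channel 4x4
    wire [2*W-1:0] p_full = a * b;                     // full 8x8

    always @(posedge clk)
        p <= mode4 ? {p_hi, p_lo} : p_full;            // select per configuration
endmodule
```

Doubling the number of products per multiplier in 4-bit mode is what lets low-precision layers trade operand width for channel-level parallelism.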
The NetPU architecture aims to support generic inference for different mixed-precision quantized network models and their emerging hybrid variants, including MLPs, CNNs, Transformers, hybrid ANN-SNNs, etc. Based on the reconfigurable neuron processing unit and the loop-structured network processing unit, the NetPU architecture can support networks of different kinds and sizes by streaming configuration data that resets the accelerator's function at runtime, without regenerating the hardware design.
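A minimal sketch of how such a configuration stream might be received, assuming a word-serial protocol and a hypothetical field layout (neither is taken from the NetPU design):

```verilog
// Minimal sketch, assuming one 32-bit configuration word per valid beat,
// shifted into a packed configuration register that fields such as layer
// size and operand precision are decoded from. Layout is an assumption.
module cfg_stream #(
    parameter WORDS = 4
) (
    input  wire                clk,
    input  wire                rst,
    input  wire                cfg_valid,       // one config word per beat
    input  wire [31:0]         cfg_data,
    output reg                 cfg_done,        // all WORDS received
    output reg  [32*WORDS-1:0] cfg_reg          // packed configuration
);
    reg [$clog2(WORDS+1)-1:0] cnt;

    always @(posedge clk) begin
        if (rst) begin
            cnt      <= 0;
            cfg_done <= 1'b0;
        end else if (cfg_valid && !cfg_done) begin
            cfg_reg  <= {cfg_reg[32*(WORDS-1)-1:0], cfg_data}; // shift in
            cnt      <= cnt + 1'b1;
            cfg_done <= (cnt == WORDS-1);
        end
    end

    // Example decode (hypothetical layout): precision and layer width
    wire [3:0]  precision  = cfg_reg[3:0];       // e.g. 1..8 bits
    wire [15:0] layer_size = cfg_reg[19:4];
endmodule
```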
Considering the limited hardware resources available for neural network accelerator design at the edge, the NetPU architecture must trade off inference latency, hardware resource consumption, and generic support for different network models. The central question behind this project is how to achieve high-throughput hardware designs by applying parallel, pipelined, and systolic-array-based techniques to mixed-precision, quantized, runtime-reconfigurable inference across generic multi-model networks.
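For reference, the systolic-array technique amounts to tiling many small processing elements that register their inputs at every hop so data flows rhythmically through the array; the sketch below shows a conventional weight-stationary element (widths and names are assumptions, not the NetPU implementation):

```verilog
// Illustrative weight-stationary systolic processing element: the weight
// is held locally, while activations flow right and partial sums flow
// down, each registered once per hop to sustain the pipeline.
module pe #(
    parameter AW = 8,   // activation width
    parameter WW = 8,   // weight width
    parameter PW = 24   // partial-sum width
) (
    input  wire                 clk,
    input  wire                 load_w,   // latch a new stationary weight
    input  wire signed [WW-1:0] w_in,
    input  wire signed [AW-1:0] a_in,     // activation from the left
    input  wire signed [PW-1:0] psum_in,  // partial sum from above
    output reg  signed [AW-1:0] a_out,    // forwarded to the right
    output reg  signed [PW-1:0] psum_out  // forwarded downward
);
    reg signed [WW-1:0] w;

    always @(posedge clk) begin
        if (load_w) w <= w_in;
        a_out    <= a_in;                 // forward activation
        psum_out <= psum_in + a_in * w;   // multiply-accumulate
    end
endmodule
```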
The NetPU architecture is implemented as a pure Verilog project and evaluated on the low-power Ultra96-V2 FPGA platform (Xilinx Zynq UltraScale+ MPSoC ZU3EG A484). Building on state-of-the-art research on binarized and quantized neural network modeling and accelerator design, we explore multi-precision operator, reusable loop structure, and generic non-linear activation module designs in this project.
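One common way to make an activation module generic is to compute cheap functions such as ReLU directly and approximate the rest with a runtime-writable lookup table; the following Verilog sketch follows that pattern (table size and port names are assumptions, not the NetPU module):

```verilog
// Sketch of a generic activation unit: ReLU is computed directly, while
// other non-linearities (sigmoid, tanh, ...) are approximated by a small
// lookup table that can be rewritten at runtime.
module act_unit #(
    parameter DW  = 8,
    parameter LUT = 256
) (
    input  wire                   clk,
    input  wire                   sel_lut,   // 0: ReLU, 1: table lookup
    input  wire                   lut_we,    // write a LUT entry
    input  wire [$clog2(LUT)-1:0] lut_addr,
    input  wire [DW-1:0]          lut_wdata,
    input  wire signed [DW-1:0]   x,
    output reg  signed [DW-1:0]   y
);
    reg [DW-1:0] table_mem [0:LUT-1];

    always @(posedge clk) begin
        if (lut_we) table_mem[lut_addr] <= lut_wdata;
        y <= sel_lut ? table_mem[x[DW-1:0]]           // table-driven non-linearity
                     : (x[DW-1] ? {DW{1'b0}} : x);    // ReLU: clamp negatives to 0
    end
endmodule
```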
NetPU aims to provide a generic reconfigurable accelerator architecture for different networks by simply streaming configuration data to reset the function and behavior of the hardware, without re-implementation for each model. Moreover, thanks to the built-in hardware controller, all inference and reconfiguration operations are scheduled by data streaming, so the NetPU architecture has the potential to be organized for cluster acceleration. We will also explore migrating the current FPGA instance to an ASIC design. Furthermore, we plan to test the NetPU architecture in potential application scenarios, such as bio-image processing.