Model and Code Optimization Methods for Energy-efficient Machine Learning

Project duration: 08/2022 – today

Research Area: Software Engineering

Optimizing Machine Learning models is critical for improving both performance and energy efficiency. Energy efficiency matters not only on IoT edge devices, where inference tasks run in resource-constrained environments, but also in data centers, where the rapid growth of AI model sizes and of the number of users has driven up electricity consumption. Energy consumption must therefore become a primary optimization metric in the development of next-generation AI hardware and software. Machine Learning models offer significant potential for reducing both computation and memory requirements, which is particularly relevant to the design of novel hardware accelerators. Our focus is on post-training analysis of Machine Learning models, conversion techniques, and code optimizations that reduce model size and computational complexity while maintaining model accuracy.

[Figure: Model and Code Optimization Methods for Energy-efficient Machine Learning]

Aims

We aim to advance quantization, pruning, and bitslicing methods so that they can leverage alternative execution models and design methodologies. The goal is faster and more energy-efficient inference on ML accelerators.

Problem

Efficient execution of AI models, from both the algorithmic and the software perspective, typically relies on techniques such as quantization, pruning, and weight sharing, together with code optimizations tailored to specific targets such as GPUs, AI accelerators like TPUs, ASICs, and dataflow processors. Current methods often require retraining, or even training models from scratch, to achieve substantial reductions in model size and computational complexity.

Practical example

We have combined techniques, mainly extreme quantization and sparsification of models, that yield multiplication-free implementations of common operators in Machine Learning frameworks, such as similarity search and convolutional and fully connected layers. This enables implementations that are more efficient in latency, energy, and area, without sacrificing accuracy or extending training time.
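To illustrate how extreme quantization can remove multiplications from a fully connected layer, the following minimal sketch ternarizes a trained weight vector after training and then computes a dot product using only additions and subtractions. It is not the project's implementation: the function names and the threshold rule (a fraction `threshold_ratio` of the mean absolute weight, a common heuristic in the ternary-weight literature) are assumptions chosen for illustration.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Post-training ternarization: map float weights to {-1, 0, +1}.

    threshold_ratio is an assumed heuristic, not a value from the project.
    Returns the ternary weights and one float scale for the whole vector.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    ternary = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale


def ternary_dot(inputs, ternary):
    """Dot product with ternary weights: additions and subtractions only."""
    acc = 0.0
    for x, t in zip(inputs, ternary):
        if t == 1:
            acc += x      # weight +1: add the input
        elif t == -1:
            acc -= x      # weight -1: subtract the input
        # weight 0: the input is skipped entirely (sparsity)
    return acc


def ternary_linear(inputs, quantized_rows):
    """Fully connected layer over pre-quantized rows [(ternary, scale), ...].

    The only multiplication left is one scaling per output neuron.
    """
    return [scale * ternary_dot(inputs, ternary) for ternary, scale in quantized_rows]


if __name__ == "__main__":
    rows = [ternarize([0.9, -0.05, -1.1, 0.02]), ternarize([-0.4, 0.5, 0.01, 0.45])]
    print(ternary_linear([2.0, 5.0, 3.0, 7.0], rows))
```

In this sketch, zero-valued weights are skipped, so sparsification reduces the number of accumulations, and the per-output scale can often be folded into a later layer or normalization step.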

Technology

Our project has optimized Deep Neural Networks and Hyperdimensional Computing models through post-training methods and compiler techniques, such as bitslicing, ternary and binary reformulation of model parameters, distribution-aware quantization, pruning, and computation reuse. For further details and preliminary results, please refer to our publications.
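As a hedged illustration of binary reformulation, the sketch below packs bipolar vectors (entries in {-1, +1}, as commonly used in Hyperdimensional Computing) into integers and computes their similarity with XOR and a bit count instead of multiplications. The function names and the packing layout are assumptions made for this example, and the code shows plain bit-packing rather than the project's bitslicing compiler techniques.

```python
def pack_bipolar(vector):
    """Pack a {-1, +1} vector into an int: bit i is 1 where element i is +1."""
    bits = 0
    for i, v in enumerate(vector):
        if v == 1:
            bits |= 1 << i
    return bits


def bipolar_dot(packed_a, packed_b, dim):
    """Dot product of two packed bipolar vectors of length dim.

    Matching positions contribute +1 and mismatches -1, so the result is
    dim - 2 * (number of mismatching bits), computed with XOR and popcount.
    """
    mismatches = bin(packed_a ^ packed_b).count("1")
    return dim - (mismatches << 1)  # the shift stands in for the factor 2


def nearest_class(query, packed_classes, dim):
    """Similarity search: return the label whose class vector best matches."""
    packed_query = pack_bipolar(query)
    best_label, best_score = None, -dim - 1
    for label, packed in packed_classes.items():
        score = bipolar_dot(packed_query, packed, dim)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    classes = {
        "x": pack_bipolar([1, 1, 1, 1]),
        "y": pack_bipolar([-1, -1, -1, -1]),
    }
    print(nearest_class([1, 1, -1, 1], classes, dim=4))
```

Packing many elements into one machine word lets a single XOR and bit-count instruction process all of them at once, which is where the latency and energy savings of such reformulations would be expected to come from.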

Outlook

Our methods have been tested on GPUs and high-level simulators for emerging hardware accelerators. Moving forward, we will place a particular focus on optimizing for reconfigurable and domain-specific accelerators.

Publications

  • João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, “Full-Stack Optimization for CAM-Only DNN Inference” (to appear), Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024. arXiv preprint arXiv:2309.06418
  • Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, “Hyperdimensional Computing Quantization with Thermometer Codes”, Proceedings of the 6th Workshop on Accelerated Machine Learning (AccML), co-located with the 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 7 pp., Jan 2024.

Team

Lead

Prof. Dr. Jeronimo Castrillon

Team Members

  • João Paulo C. de Lima

Partners

  • Nathan Laubeuf, Debjyoti Bhattacharjee (Imec, BE)
  • Caio Vieira, Antonio Carlos Schneider Beck (UFRGS, BR)
Funded by:
Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).
Free State of Saxony (Freistaat Sachsen).