Title: Model and Code Optimization Methods for Energy-efficient Machine Learning
Project duration: 08/2022 – today
Research Area: Software Engineering
Optimizing Machine Learning models is critical to improving performance and energy efficiency. The latter is key not only in IoT edge devices, where inference tasks are deployed in resource-constrained environments, but also in data centers. The rapid growth in AI model sizes and user numbers has driven up electricity consumption. Energy consumption must thus become a primary optimization metric in the development of next-generation AI hardware and software. Machine Learning models offer significant potential for reducing both computation and memory requirements, particularly in the design of novel hardware accelerators. Our focus lies on post-training analysis of Machine Learning models, conversion techniques, and code optimizations aimed at reducing model size and computational complexity while maintaining model accuracy.
We aim to advance quantization, pruning, and bitslicing methods to leverage alternative execution models and design methodologies, leading to faster and more energy-efficient inference on ML accelerators.
Efficient execution of AI models, from both algorithmic and software perspectives, typically involves techniques such as quantization, pruning, weight sharing, and tailored code optimizations for specific accelerators such as GPUs, TPUs, ASICs, and dataflow processors. Current methods often require retraining, or even training models from scratch, to achieve substantial reductions in model size and computational complexity.
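To illustrate one of these techniques, the following is a minimal sketch of symmetric post-training quantization: weights are mapped to signed integers using a single per-tensor scale factor. This is a generic textbook sketch, not our toolchain; the function name and parameters are illustrative.

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    # Symmetric per-tensor quantization: map float weights to
    # signed integers in [-2^(bits-1), 2^(bits-1) - 1] using one scale.
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.2, 0.03, 0.9])
q, scale = quantize_symmetric(w)
w_hat = q.astype(np.float32) * scale  # dequantized approximation of w
```

The reconstruction error per weight is bounded by half the scale, which is why post-training quantization to 8 bits often preserves accuracy without retraining.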
We have combined techniques, focused mainly on extreme quantization and sparsification of models, that yield multiplication-free implementations of common operators in Machine Learning frameworks, such as similarity search, convolutional, and fully connected layers. This facilitates more efficient implementations in terms of latency, energy, and area, without sacrificing accuracy or extending training time.
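The key observation behind multiplication-free operators can be sketched as follows: once weights are quantized to the ternary set {-1, 0, +1}, a matrix-vector product reduces to additions and subtractions. This is a simplified illustration of the principle, not our actual kernels.

```python
import numpy as np

def ternary_matvec(W_t, x):
    # W_t has entries in {-1, 0, +1}: each output element is the sum of
    # the inputs where the weight is +1 minus those where it is -1,
    # so no multiplications are needed.
    pos = (W_t == 1)
    neg = (W_t == -1)
    return np.array([x[p].sum() - x[n].sum() for p, n in zip(pos, neg)])

rng = np.random.default_rng(0)
W_t = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.standard_normal(8)
y = ternary_matvec(W_t, x)               # matches W_t @ x, add/sub only
```

On hardware, removing the multipliers is what translates into the latency, energy, and area savings mentioned above.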
Our project has optimized Deep Neural Networks and Hyperdimensional Computing models through post-training methods and compiler techniques, such as bitslicing, ternary and binary reformulation of model parameters, distribution-aware quantization, pruning, and computation reuse. For further details and preliminary results, please refer to our publications.
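Among the listed techniques, bitslicing can be illustrated with a short sketch: integer weights are decomposed into binary bit planes, so a matrix-vector product becomes a shift-and-add combination of binary-weight products. This is a generic sketch for unsigned weights, not our compiler's implementation.

```python
import numpy as np

def bitsliced_matvec(W, x, bits=8):
    # Decompose unsigned integer weights into binary bit planes; each
    # plane product uses only 0/1 weights (pure additions), and the
    # partial results are recombined with shifts (powers of two).
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(bits):
        plane = ((W >> b) & 1).astype(np.int64)  # binary weight plane b
        acc += (plane @ x) << b                  # shift-add recombination
    return acc

rng = np.random.default_rng(1)
W = rng.integers(0, 256, size=(3, 5), dtype=np.uint8)
x = rng.integers(-10, 10, size=5).astype(np.int64)
y = bitsliced_matvec(W, x)  # equals W @ x computed in full precision
```

Because each plane involves only binary weights, bitsliced execution maps naturally onto the emerging accelerators targeted by our simulators.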
Our methods have been tested on GPUs and on high-level simulators for emerging hardware accelerators. Moving forward, we will focus in particular on optimizing for reconfigurable and domain-specific accelerators.
Prof. Dr. Jeronimo Castrillon