MIT boffins cram ML training into microcontroller memory
Neat algorithmic tricks squeeze it into 256KB of RAM, barely enough for inference let alone teaching
Researchers claim to have developed techniques to enable the training of a machine learning model using less than a quarter of a megabyte of memory, making it suitable for operation in microcontrollers and other edge hardware with limited resources.
The researchers at MIT and the MIT-IBM Watson AI Lab say they have found "algorithmic solutions" that make the training process more efficient and less memory-intensive.
The techniques can be used to train a machine learning model on a microcontroller in a matter of minutes, it is claimed, and they have produced a paper on the subject, titled "On-Device Training Under 256KB Memory" [PDF].
According to the authors, on-device training of a model will enable it to adapt in response to new data collected by the device's sensors. By training and adapting locally at the edge, the model can learn to continuously improve its predictions for the life of the application.
However, the problem with implementing such a solution is that edge devices are often constrained in their memory size and processing power. At one end of the scale, tiny IoT devices based on microcontrollers may have as little as 256KB of SRAM, the paper states, which is barely enough for the inference work of some deep learning models, let alone the training.
Meanwhile, deep learning training systems like PyTorch and TensorFlow are often run on clusters of servers with gigabytes of memory at their disposal, and while there are edge deep learning inference frameworks, some of these lack support for the back-propagation needed to adjust the models' weights.
In contrast, the intelligent algorithms and framework that the researchers have developed are able to reduce the amount of computation required to train a model, it is claimed.
This is no mean feat, since a typical deep learning model undergoes hundreds of updates as it learns, and because there may be millions of weights and activations involved, training a model requires much more memory than running a pre-trained one.
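To put rough numbers on that gap, here is a back-of-the-envelope sketch. The layer sizes are made up for illustration and do not come from the paper, and the model is assumed to be a simple chain of layers. Inference only needs one layer's input and output in memory at a time, whereas conventional back-propagation keeps every activation until the backward pass.

```python
# Hypothetical activation sizes in bytes for a six-stage chain of layers.
# These figures are illustrative assumptions, not measurements from the paper.
activation_bytes = [48_000, 96_000, 64_000, 64_000, 32_000, 8_000]

BUDGET_BYTES = 256 * 1024  # the 256KB SRAM budget discussed in the article

# Inference: only a layer's input and its output are live at the same time.
inference_peak = max(a + b for a, b in zip(activation_bytes, activation_bytes[1:]))

# Conventional training: all activations are retained for the backward pass.
training_peak = sum(activation_bytes)


def fits(needed_bytes, budget_bytes=BUDGET_BYTES):
    """Return True if the requirement fits inside the memory budget."""
    return needed_bytes <= budget_bytes


print("inference peak:", inference_peak, "fits:", fits(inference_peak))
print("training peak: ", training_peak, "fits:", fits(training_peak))
```

With these invented sizes, inference peaks at 160,000 bytes and fits, while full training needs 312,000 bytes for activations alone and does not, before counting weights, gradients, or optimizer state.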
(That said, if there are similar projects out there doing non-trivial training on microcontroller devices, let us know.)
One of the MIT solutions developed to make the training process more efficient is sparse update, which skips the gradient computation of less important layers and sub-tensors by using an algorithm to identify only the most important weights to update during each round of training.
The algorithm works by freezing the weights one at a time until it detects that accuracy has dipped to a set threshold, at which point it stops. The remaining weights are then updated, while the activations corresponding to the frozen weights do not need to be stored.
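The freezing procedure as described above can be sketched roughly as follows. This is a toy illustration of the description in this article, not the paper's actual selection procedure: the layer names, the per-layer accuracy costs, and the `accuracy_with_frozen` helper are all hypothetical, and accuracy is tracked in tenths of a percent to keep the arithmetic exact.

```python
def choose_frozen(layers, accuracy_with_frozen, min_accuracy):
    """Greedily freeze layers in the given order, stopping before the
    first one that would push accuracy below min_accuracy."""
    frozen = []
    for name in layers:
        trial = frozen + [name]
        if accuracy_with_frozen(trial) < min_accuracy:
            break
        frozen = trial
    return frozen


# Hypothetical accuracy cost of freezing each layer, in tenths of a percent.
COST = {"conv1": 1, "conv2": 2, "conv3": 4, "conv4": 9, "fc": 20}
BASELINE = 900  # 90.0% with every layer trainable (an assumed figure)


def accuracy_with_frozen(frozen):
    """Toy stand-in for a real evaluation run on validation data."""
    return BASELINE - sum(COST[name] for name in frozen)


# Try the least important layers first.
order = sorted(COST, key=COST.get)
frozen = choose_frozen(order, accuracy_with_frozen, min_accuracy=890)
updated = [name for name in order if name not in frozen]

print("frozen: ", frozen)
print("updated:", updated)
```

In this toy run the first three layers are frozen and only `conv4` and `fc` are updated, so only the activations feeding those two layers would need to be kept for the backward pass.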
"Updating the whole model is very expensive because there are a lot of activations, so people tend to update only the last layer, but as you can imagine, this hurts the accuracy," explained MIT Associate Professor Song Han, one of the paper's authors. "For our method, we selectively update those important weights and make sure the accuracy is fully preserved," he added.
The second solution is to reduce the size of the weights using quantization, typically from 32 bits to just 8 bits, to cut the amount of memory needed for both training and inference. Quantization-aware scaling (QAS) is then used to adjust the ratio between weight and gradient, to avoid any drop in accuracy that may result from training with the quantized values.
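To see why the weight-to-gradient ratio needs adjusting at all, consider this minimal sketch. It assumes simple symmetric per-tensor quantization and derives the correction factor from the chain rule. The weights, gradients, and function names are invented for illustration, and the paper's own QAS formulation may differ in detail.

```python
import math


def quantize_int8(weights):
    """Symmetric per-tensor quantization: weights ~= scale * q, q in [-127, 127].
    Assumes at least one weight is non-zero."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def norm(values):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in values))


weights = [0.4, -0.25, 0.125, -1.0]  # hypothetical 32-bit weights
grads = [0.02, -0.01, 0.04, 0.03]    # hypothetical gradients w.r.t. those weights

q, scale = quantize_int8(weights)

# Chain rule: W = scale * q, so dL/dq = scale * dL/dW.
grads_q = [scale * g for g in grads]

# The quantized weights are larger by 1/scale and their gradients smaller by
# scale, so the weight-to-gradient ratio is off by a factor of scale**-2.
# Multiplying the gradient by scale**-2 restores the original ratio.
grads_q_scaled = [g / scale**2 for g in grads_q]

ratio_fp32 = norm(weights) / norm(grads)
ratio_int8_raw = norm(q) / norm(grads_q)
ratio_int8_scaled = norm(q) / norm(grads_q_scaled)

print(ratio_fp32, ratio_int8_raw, ratio_int8_scaled)
```

Without the correction the ratio is thousands of times larger than in the 32-bit case, which means a learning rate tuned for the original model would barely move the quantized weights. After rescaling, the two ratios agree to within the rounding error introduced by quantization.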
The system changes the order of steps in the training process so more work is completed in the compilation stage, before the model is deployed on the edge device, according to Han.
"We push a lot of the computation, such as auto-differentiation and graph optimization, to compile time. We also aggressively prune the redundant operators to support sparse updates. Once at runtime, we have much less workload to do on the device," he said.
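As a rough picture of what moving this work to compile time could mean, the sketch below builds the list of backward operations for a simple chain of layers ahead of time and leaves out everything a sparse update does not need. The `compile_backward` function, the operation names, and the layer names are hypothetical and are not taken from the Tiny Training Engine.

```python
def compile_backward(layers, trainable):
    """Return the pruned backward-pass schedule for a sequential model.

    Weight gradients are emitted only for trainable layers, and gradients are
    propagated no further back than the earliest trainable layer.
    """
    indices = [i for i, name in enumerate(layers) if name in trainable]
    if not indices:
        return []  # nothing to train, so no backward pass at all
    first = min(indices)
    ops = []
    for i in range(len(layers) - 1, first - 1, -1):
        name = layers[i]
        if name in trainable:
            ops.append(("grad_weight", name))
        if i > first:
            # An earlier trainable layer still needs this layer's input gradient.
            ops.append(("grad_input", name))
    return ops


layers = ["conv1", "conv2", "conv3", "conv4", "fc"]

full = compile_backward(layers, trainable=set(layers))
sparse = compile_backward(layers, trainable={"conv4", "fc"})

print(len(full), "ops for a full update")
print(len(sparse), "ops for a sparse update:", sparse)
```

Because the schedule is fixed before deployment, the device would only have to step through a short, precomputed list of operations rather than build and differentiate a graph at runtime.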
The final part of the solution is a lightweight training system, Tiny Training Engine (TTE), that implements these algorithms on a simple microcontroller.
According to the paper, the framework is the first machine learning solution to enable on-device training of convolutional neural networks with a memory budget of less than 256KB.
The authors say that the training system has been demonstrated operating on a commercially available microcontroller, an STM32F746 based on an Arm Cortex-M7 core with 320KB of SRAM and produced by STMicroelectronics.
This was used to train a computer vision model to detect people in images, a task it completed successfully after just 10 minutes of training, the research states.
With this success under their belt, the researchers now say they want to apply what they have learned to other machine learning models and types of data, such as language models and time-series data.
They believe these techniques could be used to shrink the size of larger models without sacrificing accuracy, which could help reduce the carbon footprint of training large-scale machine-learning models in future. ®