On-Device Training Under 256KB Memory

arXiv: 2206.15472

Published 4/4/2024 by Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

Abstract

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

Overview

  • On-device training allows AI models to adapt to new data collected from sensors, protecting user privacy by avoiding cloud data transfer
  • However, the high memory requirements of training make it challenging for IoT devices with limited resources
  • The paper proposes an algorithm-system co-design framework to enable on-device training with only 256KB of memory

Plain English Explanation

The paper explores a way to let IoT devices like smart home sensors or wearables customize their own AI models without having to send private user data to the cloud. Normally, machine learning training requires a lot of memory, way more than what tiny IoT devices have.

The researchers developed a new approach to make on-device training possible even with just 256KB of memory. This is a tiny amount - for comparison, a single high-resolution image can use over 1MB. The key innovations are:

  1. Quantization-Aware Scaling to stabilize the training process when using low-precision 8-bit numbers instead of the usual high-precision floating-point numbers.
  2. Sparse Update to skip computing gradients for parts of the neural network that aren't as important, reducing the memory footprint.
  3. A new training system called Tiny Training Engine that optimizes the computations to further decrease the memory needed.

With these techniques, the researchers were able to train AI models on IoT devices within 256KB of SRAM and 1MB of Flash, without any auxiliary memory. This allows devices to personalize their AI in a privacy-preserving way by learning from user data on-device.

Technical Explanation

The paper addresses two unique challenges of on-device training for constrained IoT devices:

  1. Quantized neural network graphs are difficult to optimize due to low bit-precision and lack of normalization layers.
  2. Limited hardware resources prevent the use of full backpropagation training.
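The first challenge can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes symmetric int8 quantization with a single hypothetical scale of 0.05, and the helper name `quantize_int8` is made up for this example. It shows that a weight update smaller than half a quantization step disappears after rounding, which is one reason low bit-precision makes optimization harder.

```python
def quantize_int8(x, scale):
    """Map a real value to a signed 8-bit integer code (symmetric, hypothetical scheme)."""
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the int8 range


scale = 0.05   # hypothetical quantization step size
weight = 0.50

q_before = quantize_int8(weight, scale)
q_small_step = quantize_int8(weight - 0.01, scale)  # update below half a step
q_large_step = quantize_int8(weight - 0.20, scale)  # update spanning several steps

print(q_before, q_small_step, q_large_step)
```

The small update leaves the integer code unchanged, while the large one moves it by four steps.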

To tackle the optimization challenge, the authors propose Quantization-Aware Scaling. This calibrates the gradient scales to stabilize 8-bit quantized training, overcoming the difficulties of low-precision optimization.
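A short calculation shows why gradient scales need calibrating at all. The sketch below is an assumption-laden illustration, not the paper's implementation: it assumes a per-tensor scale `scale` such that the real-valued weight is approximately `scale * w_q`, and it ignores rounding. Under that assumption the chain rule makes the gradient with respect to `w_q` equal to `scale` times the full-precision gradient, so the weight-to-gradient norm ratio is inflated by `scale ** -2`. Multiplying the gradient by `scale ** -2` restores the full-precision ratio. The function names `l2_norm` and `qas_rescale` are hypothetical.

```python
import math
import random


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def qas_rescale(grad_q, scale):
    """Compensate a gradient taken w.r.t. scaled weights by scale ** -2 (assumed rule)."""
    return [g / (scale * scale) for g in grad_q]


random.seed(0)
scale = 0.02  # hypothetical per-tensor quantization scale
w_fp = [random.gauss(0.0, 1.0) for _ in range(64)]   # full-precision weights
g_fp = [random.gauss(0.0, 0.01) for _ in range(64)]  # full-precision gradients

w_q = [w / scale for w in w_fp]  # scaled weights (rounding omitted)
g_q = [g * scale for g in g_fp]  # chain rule: dL/dw_q = scale * dL/dw

ratio_fp = l2_norm(w_fp) / l2_norm(g_fp)
ratio_q = l2_norm(w_q) / l2_norm(g_q)
ratio_fixed = l2_norm(w_q) / l2_norm(qas_rescale(g_q, scale))

print(ratio_fp, ratio_q, ratio_fixed)
```

Without compensation, the same learning rate produces much smaller relative updates on the scaled weights than it would in full precision; after rescaling, the two ratios agree.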

To reduce memory footprint, the authors introduce Sparse Update. This skips gradient computation for less important layers and sub-tensors, significantly reducing the memory required.
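The idea of updating only the most useful parts of a network under a memory budget can be sketched as a selection problem. The summary does not say how the authors choose which layers to update, so the greedy rule below is a stand-in chosen for brevity, and every name and number in it (`LayerInfo`, `importance`, `train_bytes`, the layer list) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerInfo:
    name: str
    importance: float  # hypothetical estimate of accuracy gain from updating this layer
    train_bytes: int   # hypothetical extra memory needed to update this layer


def select_layers(layers, budget_bytes):
    """Greedily pick layers with the best importance per byte that fit the budget."""
    chosen, used = [], 0
    ranked = sorted(layers, key=lambda l: l.importance / l.train_bytes, reverse=True)
    for layer in ranked:
        if used + layer.train_bytes <= budget_bytes:
            chosen.append(layer.name)
            used += layer.train_bytes
    return chosen, used


layers = [
    LayerInfo("conv1", importance=0.2, train_bytes=60_000),
    LayerInfo("conv2", importance=1.5, train_bytes=40_000),
    LayerInfo("conv3", importance=2.0, train_bytes=50_000),
    LayerInfo("fc", importance=3.0, train_bytes=10_000),
]

chosen, used = select_layers(layers, budget_bytes=100_000)
print(chosen, used)
```

Layers that are not selected stay frozen, so their weight gradients never need to be computed or stored.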

These algorithmic innovations are implemented in a lightweight training system called Tiny Training Engine. It prunes the backward computation graph to enable the sparse updates, and offloads runtime autodifferentiation to compile time.
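Pruning the backward graph ahead of time can be pictured with a simple sequential model. The sketch below is a simplified illustration under stated assumptions, not the Tiny Training Engine itself: it assumes a plain chain of layers, and the function `plan_backward` and the op labels `weight_grad` and `input_grad` are hypothetical. The plan is computed once, before any training step runs, and it stops propagating gradients below the earliest trainable layer.

```python
def plan_backward(layer_names, trainable):
    """Build a static list of backward ops, ordered from output to input.

    Only trainable layers get a weight-gradient op, and input gradients are
    propagated no further than the earliest trainable layer.
    """
    trainable_idx = [i for i, name in enumerate(layer_names) if name in trainable]
    if not trainable_idx:
        return []
    first = min(trainable_idx)
    ops = []
    for i in range(len(layer_names) - 1, first - 1, -1):
        name = layer_names[i]
        if name in trainable:
            ops.append(("weight_grad", name))
        if i > first:
            # Layers above the earliest trainable one must still pass gradients down.
            ops.append(("input_grad", name))
    return ops


layers = ["conv1", "conv2", "conv3", "fc"]
sparse_plan = plan_backward(layers, trainable={"conv3", "fc"})
full_plan = plan_backward(layers, trainable=set(layers))

print(sparse_plan)
print(len(sparse_plan), "ops instead of", len(full_plan))
```

Because the plan is fixed in advance, a runtime in this style would not need to build or walk an autodiff graph on the device.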

The end-to-end framework allows convolutional neural networks to be trained on-device with only 256KB of SRAM and 1MB of Flash, with no auxiliary memory. That is less than 1/1000 of the memory used by PyTorch and TensorFlow, yet the framework matches their accuracy on the tinyML application Visual Wake Words (VWW).

Critical Analysis

The paper presents a compelling solution to a key challenge in deploying AI on resource-constrained IoT devices. The techniques of Quantization-Aware Scaling and Sparse Update are novel and well-designed to overcome the unique obstacles of on-device training.

One limitation is that the results summarized here cover only convolutional neural networks, with VWW as the headline benchmark. Further research is needed to assess how well the approach generalizes to other model architectures and application domains.

Additionally, the paper does not explore the trade-offs between the level of sparsity, training time, and model accuracy. Users may need to experiment to find the right balance for their specific use case.

Overall, this work represents an important step towards enabling lifelong on-device learning for IoT, with compelling implications for privacy-preserving personalization of AI systems. The technical innovations and system-level optimizations provide a strong foundation for future research in this area.

Conclusion

This paper presents a groundbreaking framework that enables on-device training of AI models on IoT devices with only 256KB of memory. By overcoming the challenges of low-precision optimization and limited hardware resources, the researchers have opened the door for IoT devices to continuously adapt and personalize their AI capabilities without compromising user privacy.

The key innovations of Quantization-Aware Scaling and Sparse Update, implemented in the Tiny Training Engine, demonstrate that resource-constrained devices can indeed participate in the benefits of machine learning. This work has significant implications for the future of ubiquitous, intelligent, and privacy-preserving computing at the edge.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.

6/12/2024

DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization

Behnam Ghavami, Amin Kamjoo, Lesley Shannon, Steve Wilton

The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model accuracy. Our proposed technique, named Post-Training Intra-Layer Multi-Precision Quantization (PTILMPQ), employs a post-training quantization approach, eliminating the need for extensive training data. By estimating the importance of layers and channels within the network, the proposed method enables precise bit allocation throughout the quantization process. Experimental results demonstrate that PTILMPQ offers a promising solution for deploying DNNs on edge devices with restricted memory resources. For instance, in the case of ResNet50, it achieves an accuracy of 74.57% with a memory footprint of 9.5 MB, representing a 25.49% reduction compared to previous similar methods, with only a minor 1.08% decrease in accuracy.

4/5/2024

On-device Online Learning and Semantic Management of TinyML Systems

Haoyu Ren, Xue Li, Darko Anicic, Thomas A. Runkler

Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.

5/17/2024

QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models -- Extended Version

David Campos, Bin Yang, Tung Kieu, Miao Zhang, Chenjuan Guo, Christian S. Jensen

We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that decisions can be made instantaneously, rather than first having to transmit incoming data to servers. To enable deployment on edge devices with limited storage and computational capabilities, the full-precision parameters in standard models can be quantized to use fewer bits. The resulting quantized models are then calibrated using back-propagation and full training data to ensure accuracy. This one-time calibration works for deployments in static environments. However, model deployment in dynamic edge environments calls for continual calibration to adaptively adjust quantized models to fit new incoming data, which may have different distributions. The first difficulty in enabling continual calibration on the edge is that the full training data may be too large and thus not always available on edge devices. The second difficulty is that the use of back-propagation on the edge for repeated calibration is too expensive. We propose QCore to enable continual calibration on the edge. First, it compresses the full training data into a small subset to enable effective calibration of quantized models with different bit-widths. We also propose means of updating the subset when new streaming data arrives to reflect changes in the environment, while not forgetting earlier training data. Second, we propose a small bit-flipping network that works with the subset to update quantized model parameters, thus enabling efficient continual calibration without back-propagation. An experimental study, conducted with real-world data in a continual learning setting, offers insight into the properties of QCore and shows that it is capable of outperforming strong baseline methods.

4/23/2024