Case Study
Enhancing Deep Learning on Arm Processors with Embedl’s Hardware-Aware Model Optimization
Introduction
The proliferation of AI in edge devices necessitates efficient, high-performing models capable of operating within the constraints of limited power and computational resources. In this context, optimizing AI models for performance and efficiency is paramount. The Embedl Model Optimization SDK takes a hardware-aware approach to this problem, leveraging pruning, knowledge distillation, quantization, and neural architecture search to significantly boost performance while preserving accuracy.
Challenge
Optimizing AI models for deployment on Arm Cortex processors is key to harnessing the full potential of these powerful yet energy-efficient platforms. The core challenge is maintaining high accuracy while drastically reducing inference time and power consumption, a requirement essential to edge deployment in applications such as mobile devices, IoT, and embedded systems.
Solution
Our solution employs multiple state-of-the-art optimization techniques: pruning, neural architecture search, knowledge distillation, and quantization, all tailored to leverage the capabilities of Arm Cortex-A processors. This approach not only enhances model performance but also ensures compatibility and efficiency on Arm-based devices.
Figure A: Embedl’s model optimization SDK delivers state-of-the-art performance of neural networks on embedded hardware. It does this by utilizing hardware-aware quantization, pruning, knowledge distillation, and neural architecture search.
Quantization
Our advanced quantization process optimizes models by reducing the precision of their parameters, significantly decreasing model size and accelerating inference. This technique alone achieves a 2.25x speedup for the MobileNetV2 model with minimal drop in accuracy. Moreover, our quantization approach preserves model accuracy at 71%, compared with 14% for standard optimization tools, a result that addresses the usual trade-off between speed and accuracy. This is illustrated in Figure B.
Figure B: Models with and without quantization. Embedl provides advanced quantization techniques that fully utilize the capability of Arm architectures to execute quantized operators efficiently, while maintaining more of the original model’s accuracy than other tools.
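Embedl’s quantization pipeline itself is proprietary, but the general mechanics can be sketched with PyTorch’s standard post-training static quantization, targeting the qnnpack backend that PyTorch uses for quantized kernels on Arm CPUs. The sketch below is a minimal illustration, not the SDK’s method; it uses torchvision’s quantization-ready MobileNetV2 and random tensors in place of real calibration data:

```python
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert
from torchvision.models.quantization import mobilenet_v2

# Quantization-ready MobileNetV2 (includes QuantStub/DeQuantStub wrappers).
model = mobilenet_v2(quantize=False).eval()
model.fuse_model()  # fuse Conv+BatchNorm+ReLU so they quantize as one op

# "qnnpack" is PyTorch's quantized-kernel backend for Arm CPUs
# (on an x86 host you may need "fbgemm" instead).
torch.backends.quantized.engine = "qnnpack"
model.qconfig = get_default_qconfig("qnnpack")

prepare(model, inplace=True)  # insert observers to record activation ranges

# Calibrate with representative inputs (random tensors stand in here).
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(1, 3, 224, 224))

convert(model, inplace=True)  # replace float ops with int8 equivalents
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```

Hardware-aware quantization goes further than this generic recipe by accounting for which quantized operators the target processor executes efficiently, which is where the accuracy gap in Figure B comes from.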
Pruning
By systematically removing non-critical neurons and connections within the MobileNetV2 architecture, our pruning technique significantly reduces model complexity and computational requirements. This process results in a 3.5x speedup in inference time with minimal drop in accuracy, allowing for real-time performance even on the resource-constrained Cortex-A53 and Cortex-A72 processors.
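As a rough illustration of the idea (not Embedl’s pruning algorithm), PyTorch’s built-in utilities can zero out whole convolution filters ranked by their norm; the 30% ratio below is an arbitrary example value:

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import mobilenet_v2

model = mobilenet_v2()

# Structured pruning: zero out 30% of output channels in each conv layer,
# ranked by L2 norm (dim=0 selects whole filters, i.e. "neurons").
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module, "weight")  # bake the zeroed filters into weights
```

Note that these utilities only mask weights to zero; realizing a speedup like the 3.5x above also requires physically removing the pruned channels and fine-tuning the network, which is what a hardware-aware optimization pipeline automates.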
Knowledge Distillation
Using knowledge distillation, we leverage the high accuracy of the larger, uncompressed MobileNetV2 model to enhance the smaller, compressed (quantized and pruned) version. Through this method, the compressed MobileNetV2 model retains a higher level of accuracy, closely mirroring the performance of its predecessor. The results of combining quantization, pruning, and knowledge distillation can be seen in Figure C.
Figure C: Combining quantization, pruning, and knowledge distillation yields high-performance models that maintain the high accuracy of the original model. With the addition of knowledge distillation, the pruned model’s accuracy drop shrinks by 33% (1% with distillation vs. 1.5% without).
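For readers unfamiliar with the technique, the classic (Hinton-style) distillation loss blends the teacher’s temperature-softened predictions with the ground-truth labels. The sketch below is a generic formulation, not Embedl’s implementation; the temperature T and mixing weight alpha are illustrative hyperparameters, and tiny linear models stand in for the teacher and student:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy stand-ins for the uncompressed teacher and the compressed student.
teacher = torch.nn.Linear(32, 10).eval()
student = torch.nn.Linear(32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.01)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)  # teacher stays frozen during distillation
loss = distillation_loss(student(x), t_logits, y)
loss.backward()
opt.step()
```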
Neural Architecture Search
Our Neural Architecture Search (NAS) enables the automated discovery and optimization of deep learning architectures. The approach excels at balancing multiple objectives, optimizing for high accuracy on ImageNet classes while simultaneously satisfying diverse latency requirements. See Figure D.
NAS facilitates the exploration of a vast space of architectures that strike this balance. The result is a generation of models that surpass traditional architectures in both accuracy and operational efficiency, tailored to specific performance and latency needs. Our NAS process yields models that are top performers in accuracy while remaining adaptable enough to meet the demanding latency constraints of real-world applications.
Figure D: The top figure illustrates the latency-accuracy trade-off of different variations of MobileNetV2 found with cost-effective neural architecture search, with the Pareto frontier of models highlighted in orange. The bottom figure illustrates the same latency-accuracy trade-off for individual classes of the dataset (note that the Pareto frontier is a different set of models for each class). The Embedl NAS solution allows for efficient exploration and selection of models that fulfill multi-objective criteria, without the need to fully re-train them.
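Embedl’s search algorithm is proprietary, but the multi-objective selection step that Figure D visualizes is straightforward to illustrate: given measured (latency, accuracy) pairs for candidate architectures, keep only the points that no other candidate beats on both axes. A minimal sketch with made-up numbers:

```python
def pareto_frontier(candidates):
    """Return the (latency_ms, accuracy) points not dominated by any other
    point, where lower latency and higher accuracy are both preferred."""
    frontier = []
    for lat, acc in candidates:
        dominated = any(
            l2 <= lat and a2 >= acc and (l2, a2) != (lat, acc)
            for l2, a2 in candidates
        )
        if not dominated:
            frontier.append((lat, acc))
    return sorted(set(frontier))

# Hypothetical measurements for five searched MobileNetV2 variants.
candidates = [(12.0, 0.69), (14.0, 0.70), (15.0, 0.71),
              (18.0, 0.705), (20.0, 0.715)]
print(pareto_frontier(candidates))
# -> [(12.0, 0.69), (14.0, 0.70), (15.0, 0.71), (20.0, 0.715)]
```

Here (18.0, 0.705) is discarded because (15.0, 0.71) is both faster and more accurate; the remaining points form the orange frontier in Figure D, and per-class frontiers follow by re-running the same selection on per-class accuracies.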
Use Case Impact
Optimizing AI models for deployment on Arm Cortex-A processors dramatically expands capabilities across a range of sectors, particularly automotive, aerospace, and IoT applications, by enabling real-time image recognition and analysis with high efficiency. While the specific numbers presented in this article are for a public model, Embedl’s customers in these industries benefit from improved performance on their proprietary models.
Conclusion
Embedl’s Model Optimization SDK enhances models deployed on the Arm Cortex-A53 processor, representing a significant leap forward in edge AI technology. By achieving a 3.5x speedup in performance and a 58% increase in accuracy over standard tools, our solution paves the way for the widespread adoption of advanced AI capabilities in edge devices. This result demonstrates our commitment to pushing the boundaries of AI optimization, ensuring that the potential of edge computing is fully realized across a wide range of applications.