KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer

Aness Al-Qawlaq, Ajay Kumar M, Deepu John · July 22, 2024

Summary

The paper explores the adaptation of Transformer models for edge devices by quantizing and hardware-accelerating the ARM Keyword Transformer (KWT) on a RISC-V platform. The KWT-1 model was retrained to be 369 times smaller, at the cost of a 10% accuracy loss, by reducing the number of output classes from 35 to 2; retraining and quantization together shrank the model from 2.42 MB to 1.65 kB. Custom RISC-V instructions were integrated to accelerate the GELU and SoftMax operations, yielding a 5x speedup and an approximately 5x reduction in power consumed during inference, with a small area overhead of about 29%. The results demonstrate a viable method for porting and accelerating Transformer-based models on low-power IoT devices.

The KWT model adapts the Vision Transformer architecture, which excels in computer vision tasks, to keyword spotting, achieving 98.6% accuracy on the Google Speech Commands dataset. It converts raw audio signals into Mel-scale spectrograms, splits them into patches along the time dimension, applies linear projections, and adds positional embeddings. Self-attention is then used to process these patches, reducing signal path length and improving the learning of long-range dependencies. The Transformer's output is normalized and passed through a multilayer perceptron block. However, the model's large parameter count (607k) exceeds the memory capacity (a few kB) of low-power embedded systems such as the lowRISC Ibex. To address this, a smaller KWT-Tiny model was trained by iteratively removing the layers with minimal impact on inference accuracy, specifically the depth layers, i.e. the sequential Transformer encoder layers.

KWT-Tiny, the downsized version of the Keyword Transformer developed for embedded applications, uses 12 fewer layers than the original KWT, which was designed for 35 output classes. Its input Mel-frequency cepstral coefficients (MFCCs) were reduced from 40 to 16 dimensions, balancing memory constraints against accuracy, and the model was trained using the Torch-KWT library. KWT-Tiny discerns only two output classes, making it suitable for detecting single keywords such as "Hey Google" or "Alexa", unlike the 35-class capability of KWT-1. Its accuracy was tested on the Google Speech Commands dataset with "dog" and "notdog" as the output classes. Compared to KWT-1, KWT-Tiny is 369x smaller with a 10% decrease in accuracy.

Quantizing KWT-Tiny was crucial for reducing memory usage and computational cost on embedded devices. Post-training static quantization was used, with scale factors chosen as powers of 2 so that rescaling reduces to efficient bit shifts. Intermediate residuals were sized as INT16 to prevent data loss. The optimal scale factor varied between weights and inputs; the resulting KWT-Tiny-Q uses 25% of the memory of KWT-Tiny and reaches 82.5% accuracy, a 5% loss relative to KWT-Tiny.

The paper also introduces a C library for Transformer-based computations, aiming to address the challenges of deploying such models on resource-constrained platforms. The library maximizes customizability and memory optimization, supporting both floating-point and INT16 operations for quantized models, with the stack size calculated from the maximal runtime requirement. It includes functions for layer normalization, matrix multiplication, softmax, GELU, linear, and split operations. To manage intermediate results, a manual malloc() implementation allocates from two global memory banks.
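Because the summary does not reproduce the library's actual function signatures, the following is a minimal sketch of what one of its INT16 operations could look like under the scheme described above (power-of-2 scale factors, INT16 tensors, wider accumulation); the name matmul_q16 and its parameters are illustrative assumptions, not the library's real API.

```c
#include <stdint.h>

/* Quantized matrix multiply: out = (A x B) >> shift, with A (rows x inner),
 * B (inner x cols) and out (rows x cols) stored row-major as INT16.
 * A 32-bit accumulator avoids overflow during the dot product, and the
 * right shift implements the power-of-2 rescaling described above. */
void matmul_q16(const int16_t *a, const int16_t *b, int16_t *out,
                int rows, int inner, int cols, int shift)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int32_t acc = 0;
            for (int k = 0; k < inner; k++)
                acc += (int32_t)a[i * inner + k] * (int32_t)b[k * cols + j];
            int32_t v = acc >> shift;            /* rescale by 2^-shift */
            if (v > INT16_MAX) v = INT16_MAX;    /* saturate to INT16 range */
            if (v < INT16_MIN) v = INT16_MIN;
            out[i * cols + j] = (int16_t)v;
        }
    }
}
```

Restricting scale factors to powers of 2 means the rescaling costs a single shift per output element rather than a multiply or divide, which is the main practical benefit of that design choice on a small core.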
The paper also discusses hardware acceleration using RISC-V, a modular and extensible instruction set architecture, on the lowRISC Ibex processor. Custom instructions are introduced to compute inference-time operations such as GELU and SoftMax directly in hardware, enhancing performance on the RISC-V platform. In conclusion, the work demonstrates a viable approach for porting and accelerating Transformer-based models on low-power IoT devices: KWT-Tiny achieves a 369x size reduction with a 10% accuracy loss, the accompanying C library enables deployment on resource-constrained platforms, and the custom RISC-V instructions accelerate inference.
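The summary does not give the exact encodings of the new instructions, so here is a minimal sketch of how a custom GELU instruction might be exposed to C code on a modified Ibex core, assuming the custom-0 opcode (0x0B) with funct3 = 0 and funct7 = 0; the encoding and the gelu_hw wrapper name are illustrative, not the paper's actual interface.

```c
#include <stdint.h>

/* Hypothetical wrapper around a custom RISC-V GELU instruction.
 * The .insn directive emits an R-type instruction in the custom-0
 * opcode space; a real build would use the encodings chosen for the
 * modified Ibex decoder. */
static inline int16_t gelu_hw(int16_t x)
{
    int32_t rd;
    int32_t rs1 = x;
    __asm__ volatile(
        ".insn r 0x0B, 0x0, 0x00, %0, %1, x0"   /* rd <- GELU(rs1) */
        : "=r"(rd)
        : "r"(rs1));
    return (int16_t)rd;
}
```

Wrapping the instruction in a single inline function keeps the rest of the library unchanged: a build for an unmodified core can substitute a software GELU behind the same interface.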

Introduction
Background
Overview of Transformer models and their applications in edge devices
Challenges in deploying large models on resource-constrained platforms
Objective
To explore the adaptation of Transformer models for edge devices by quantizing and hardware-accelerating the ARM Keyword Transformer (KWT) on a RISC-V platform
Method
Data Collection
Selection of the ARM Keyword Transformer (KWT) model for adaptation
Retraining of the KWT-1 model to reduce its size, at the cost of some accuracy
Data Preprocessing
Reduction of output classes from 35 to 2 for KWT-1
Quantization of the KWT-Tiny model for memory and computational cost reduction
Model Adaptation
Custom RISC-V instructions for accelerating GELU and SoftMax operations
Integration of custom instructions to enhance performance on the RISC-V platform
Hardware Acceleration
Utilization of the lowRISC Ibex processor for hardware acceleration
Optimization of the C library for efficient memory usage and operation support (a minimal allocator sketch follows this outline)
Results
Size reduction of the KWT model from 2.42 MB to 1.65 kB
5x speedup and approximately 5x power reduction in inference
Small area overhead of about 29%
Conclusion
Viability of porting and accelerating Transformer-based models in low-power IoT devices
Summary of the KWT-Tiny model's performance and its suitability for embedded applications
Discussion on the C library's role in maximizing customizability and memory optimization
Future directions for further research and development in hardware-accelerated Transformer models
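Expanding on the C library's memory-management point above and the summary's mention of a manual malloc() over two global banks, the following is a minimal sketch of a two-bank bump allocator; the bank size, the bank_alloc/bank_reset names, and the ping-pong usage pattern are assumptions for illustration, not the paper's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Two statically allocated banks for intermediate Transformer results.
 * The size here is illustrative; a real deployment would size the banks
 * from the maximal runtime requirement computed for the model. */
#define BANK_SIZE 4096
static int16_t bank_mem[2][BANK_SIZE];
static size_t  bank_used[2];

/* Bump-allocate `count` INT16 elements from bank 0 or 1.
 * Returns NULL if the bank is exhausted. */
static int16_t *bank_alloc(int bank, size_t count)
{
    if (bank_used[bank] + count > BANK_SIZE)
        return NULL;
    int16_t *p = &bank_mem[bank][bank_used[bank]];
    bank_used[bank] += count;
    return p;
}

/* Reclaim a bank once the tensors it holds have been consumed. */
static void bank_reset(int bank)
{
    bank_used[bank] = 0;
}
```

Alternating layer outputs between the two banks lets each layer read its input from one bank while writing its result to the other, after which the input bank is reset, so intermediate tensors never require a general-purpose heap.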
Subjects: Performance, Hardware Architecture, Artificial Intelligence
Insights
What were the specific improvements in terms of model size, speed, and power consumption achieved by integrating custom RISC-V instructions?
How was the ARM Keyword Transformer (KWT) retrained and quantized for the RISC-V platform?
What is the main idea of the paper regarding the adaptation of Transformer models for edge devices?
How was the KWT-Tiny model developed, and what were the trade-offs in terms of accuracy and size compared to the original KWT-1 model?