An Open-Source Framework for Efficient Numerically-Tailored Computations
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper addresses the problem of adjusting the intermediate precision of arithmetic datapaths in high-performance computing workloads, focusing on General Matrix Multiply (GEMM) kernels. The problem is not entirely new: previous works have proposed large scratchpad accumulators or explored workload-adaptive accumulators on FPGAs to reduce energy costs. The paper's novelty is an open-source software/hardware co-designed framework that allows intuitive adjustment of intermediate precision without code modifications, enabling energy savings without compromising accuracy.
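Why intermediate precision matters can be illustrated with a short sketch (illustrative only, not code from the paper): a reduced-precision accumulator can silently drop every small addend, while a wider intermediate format retains them. Here the narrow datapath is emulated by rounding each partial sum to a float16-like 11-bit significand.

```python
import math

# Illustrative sketch (not the paper's code): emulate a reduced-precision
# accumulator by rounding every partial sum to `bits` significand bits,
# roughly mimicking a float16-like (11-bit) internal datapath.
def round_significand(x, bits):
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def accumulate(values, bits=None):
    acc = 1.0
    for v in values:
        acc = acc + v
        if bits is not None:
            acc = round_significand(acc, bits)  # round after each addition
    return acc

values = [1e-4] * 10_000
wide = accumulate(values)             # full float64 intermediate precision
narrow = accumulate(values, bits=11)  # float16-like intermediate precision

# With 11 significand bits, adding 1e-4 to 1.0 falls below half a unit in
# the last place, so the narrow accumulator never leaves 1.0; the wide one
# reaches roughly 2.0.
print(wide, narrow)
```

This is exactly the failure mode that larger scratchpad accumulators in the GEMM datapath are meant to avoid.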
What scientific hypothesis does this paper seek to validate?
The paper seeks to validate the hypothesis that workload-adaptive accumulators can introduce a controlled amount of noise to reduce energy costs, and that this impact can be measured on end-to-end workloads in terms of both energy and accuracy.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper proposes several ideas, methods, and models for efficient numerically-tailored computations, each offering advantages over previous approaches:
- Alternative number formats: the paper considers formats such as Bfloat16, Tapered Floating-Point (TFP), Posit, FP8-E4M3, and FP8alt-E5M2 to address limitations of the IEEE-754 standard, offering different trade-offs between circuit area and numerical stability.
- Internal arithmetic datapaths: it emphasizes the intermediate precision of the internal arithmetic datapaths in General Matrix Multiply (GEMM) computations, showing that large scratchpad accumulators mitigate the issues arising from long vector dot-products and the non-associativity of floating-point addition.
- Workload-adaptive accumulators: it explores workload-adaptive accumulators on Field-Programmable Gate Arrays (FPGAs) that introduce a controlled amount of noise to reduce energy costs while maintaining accuracy, and it measures the impact of adjusted accumulators on end-to-end workloads in terms of energy and accuracy.
- Parameter tuning for efficiency: by carefully tuning accumulator parameters, the paper achieves significant power savings without compromising accuracy, demonstrating the potential of low-precision accumulators for AI workloads.
- Open-source framework: the paper presents an open-source software/hardware co-designed framework that enables intuitive intermediate-precision adjustments without modifying high-end software code. The framework trades accuracy against energy by adjusting the LUTs/FFs/DSPs of the arithmetic datapaths and seamlessly exposes automatically pipelined systolic MMM kernels to the software.
These characteristics provide a comprehensive approach to improving efficiency, accuracy, and energy consumption in scenarios with diverse numerical requirements, such as scientific computing and deep neural networks.
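The non-associativity issue behind the accumulator discussion can be reproduced in a few lines (an illustrative sketch, not the paper's code): the same three addends, summed in every possible order, do not always give the same floating-point result. A wide scratchpad accumulator removes this order dependence.

```python
from itertools import permutations

# Same three addends, every evaluation order: a large value, a small one,
# and the large value's negation. Floating-point addition is commutative
# but not associative, so the result depends on accumulation order.
addends = [1e16, 1.0, -1e16]

results = set()
for perm in permutations(addends):
    total = 0.0
    for x in perm:
        total += x  # each partial sum is rounded to double precision
    results.add(total)

print(results)  # more than one distinct result for the same mathematical sum
```

When the large values meet first they cancel exactly and the small value survives; when 1.0 is absorbed into 1e16 first, it is rounded away entirely.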
Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?
Several related research efforts exist in this field. Noteworthy researchers on this topic include:
- J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, who worked on "Chisel: constructing hardware in a Scala embedded language" .
- D. H. Bailey, J. M. Borwein, and R. E. Crandall, who explored "Integrals of the Ising class" .
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and others, who contributed to "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems" .
- R. Jain, N. Sharma, F. Merchant, S. Patkar, and R. Leupers, who worked on "CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism" .
The key to the solution is the development of an open-source software/hardware co-designed framework that allows intuitive intermediate-precision adjustments in high-end software code. The framework enables accuracy/energy trade-offs by adjusting the arithmetic datapaths and exposing automatically pipelined systolic MMM kernels to the software, resulting in substantial energy savings during validation-dataset inference without compromising accuracy.
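The effect of a sufficiently wide accumulator, such as a fused dot product unit, can be emulated in software (an illustrative sketch, not the paper's hardware): accumulate exact products in arbitrary precision and round only once at the end. The result then no longer depends on the order of accumulation.

```python
from fractions import Fraction
import random

# Emulate a wide "scratchpad" accumulator: every float converts exactly to
# a Fraction, products and sums are exact, and rounding to double precision
# happens once, at the very end.
def exact_dot(xs, ys):
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)  # exact product, exact accumulation
    return float(acc)  # single final rounding

def naive_dot(xs, ys):
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y  # rounds after every multiply and every add
    return acc

rng = random.Random(0)
xs = [rng.uniform(-1e8, 1e8) for _ in range(1000)]
ys = [rng.uniform(-1e8, 1e8) for _ in range(1000)]

print(exact_dot(xs, ys), naive_dot(xs, ys))
```

Because the exact accumulator is order independent, traversing the vectors in reverse yields bit-identical output, which is the reproducibility property the paper attributes to FDPs.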
How were the experiments in the paper designed?
The experiments were designed to evaluate the effectiveness of hardware units by comparing Fused Dot Products (FDPs) against the double- and quad-precision Fused Multiply-Add (FMA) units commonly found in computational systems: an IEEE-754 double-precision FMA, an IEEE-754 quad-precision FMA, and a 91-bit FDP fed with IEEE754-64 words. The experiments analyzed the average, relative standard deviation, accuracy, and power cost per accurate bit of the Sea Surface Height (SSH) variable for different vector sizes. The values within the dot products were shuffled 1000 times to observe the spread of the SSH variable and assess reproducibility, and correct significant bits were measured after rounding the output to IEEE-754 double precision to ensure a fair comparison.
What is the dataset used for quantitative evaluation? Is the code open source?
The dataset used for quantitative evaluation is the ImageNet dataset. The code is open source; the framework itself is presented as an open-source software/hardware co-designed framework.
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results provide strong support for the hypotheses under verification. The study focuses on the effectiveness of hardware units, comparing Fused Dot Products (FDPs) against double- and quad-precision FMAs, rather than on algorithmic trade-offs. It shows that FDPs significantly improve SSH computation accuracy compared to IEEE754-64 and IEEE754-128, yielding substantial improvements in accuracy per unit of power. In addition, the open-source software/hardware co-designed framework enables accuracy/energy trade-offs through intuitive intermediate-precision adjustments without code modifications. These findings highlight the potential of tailored-precision accumulators for a variety of HPC workloads and underline the importance of numerically tailored accumulators for reproducibility in scientific computing.
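The "correct significant bits" metric used in such evaluations is commonly computed as the negative base-2 logarithm of the relative error against a reference value. A hypothetical helper (the function name and cap are illustrative, not from the paper) might look like this:

```python
import math

# Hypothetical helper: number of correct significand bits of `value`
# relative to `reference`, capped at the 53 bits of an IEEE-754 double.
def correct_bits(value, reference, max_bits=53):
    if value == reference:
        return float(max_bits)      # bit-exact match: all bits correct
    rel_err = abs(value - reference) / abs(reference)
    return max(0.0, min(float(max_bits), -math.log2(rel_err)))

print(correct_bits(3.14159, math.pi))  # ~20 correct bits
print(correct_bits(3.0, math.pi))      # ~4.5 correct bits
```

A result with roughly 20 correct bits is accurate to about six decimal digits, which makes the metric convenient for comparing accumulators of different widths.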
What are the contributions of this paper?
The paper provides several contributions, including:
- Reflections on 10 years of FloPoCo
- Designing custom arithmetic data paths with FloPoCo
- An FPGA-specific approach to floating-point accumulation and sum-of-products
- A study of BFLOAT16
- Evaluating the Numerical Stability of Posit Arithmetic
- Parameterized Posit Arithmetic Hardware Generator
- A matrix-multiply unit for posits in reconfigurable logic leveraging (open)CAPI
- Evaluating the Hardware Cost of the Posit Number System
- PERI: A Configurable Posit Enabled RISC-V Core
- CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism
What work can be continued in depth?
To delve deeper into the topic, further research can be conducted on the following aspects:
- Exploring the trade-offs between accuracy and energy consumption for different scenarios, particularly accumulator/arithmetic combinations for real HPC workloads with diverse numerical requirements.
- Investigating the Sea Surface Height (SSH) computation used in ocean circulation model development to monitor ocean currents, eddies, and climate change, emphasizing the precision required in these calculations and the impact of different arithmetic formats on accuracy.
- Evaluating the numerical stability of Posit arithmetic and its implications for energy consumption and accuracy, to better understand the benefits and challenges of posit-based computational systems.