UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, Pavan Balaji · June 20, 2024

Summary

The paper introduces UpDLRM, a Deep Learning Recommendation Model (DLRM) accelerator that leverages the UPMEM DPU, real-world Processing-In-Memory (PIM) hardware, to enhance performance. By offloading memory-bound embedding operations to the DPUs, UpDLRM reduces contention on CPU memory bandwidth, leading to faster inference. Key contributions include embedding table (EMT) partitioning, cache-aware strategies, and workload balancing based on item popularity, which together address challenges such as limited DPU memory bank capacity and inter-DPU communication. Experiments with six real-world datasets demonstrate UpDLRM's superiority over CPU-only and hybrid architectures, achieving speedups of up to 4.6x, with performance varying with dataset access patterns and cache partitioning. The study also examines scalability, differentiates UpDLRM from related work, and envisions future research on DPU-GPU heterogeneous systems.


Paper digest

What problem does the paper attempt to solve? Is this a new problem?

The paper aims to address the workload imbalance problem introduced by data caching in Deep Learning Recommendation Models (DLRMs) by proposing a cache-aware non-uniform partitioning method to balance memory accesses on cache storage and regular storage. This problem is not entirely new, as caching techniques have been used to reduce DLRM inference latency, but applying them to DPU-based DLRMs exacerbates the workload imbalance among DPUs. The paper introduces a solution to this issue by proposing cache-aware non-uniform partitioning to optimize memory traffic and accelerate DLRM inference.
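
To make the idea concrete, below is a minimal Python sketch of cache-aware non-uniform partitioning under simplified assumptions; the greedy heuristic, names, and data are illustrative and not the paper's actual algorithm. Rows that the cache will serve are excluded from the DPU load estimate, and the remaining rows are spread so that frequency-weighted traffic stays balanced across DPUs.

```python
import numpy as np

def cache_aware_partition(row_freq, num_dpus, cache_rows):
    """Greedy, frequency-weighted partitioning of one embedding table.

    row_freq   : per-row access counts (estimated item popularity)
    num_dpus   : number of DPUs the table is split across
    cache_rows : how many of the hottest rows are served from the cache
    """
    order = np.argsort(row_freq)[::-1]            # hottest rows first
    cached = set(order[:cache_rows].tolist())     # cache hits bypass the DPUs

    load = [0] * num_dpus                         # estimated DPU memory traffic
    assignment = {}
    for row in order:
        if row in cached:
            continue                              # cached rows add no DPU load
        target = min(range(num_dpus), key=load.__getitem__)
        assignment[int(row)] = target
        load[target] += int(row_freq[row])
    return cached, assignment, load

# Toy usage with a skewed (Zipf-like) popularity distribution over 1,000 rows.
rng = np.random.default_rng(0)
freq = rng.zipf(1.5, size=1000)
cached, assignment, load = cache_aware_partition(freq, num_dpus=8, cache_rows=64)
print("per-DPU estimated traffic:", load)
```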


What scientific hypothesis does this paper seek to validate?

This paper aims to validate the scientific hypothesis that utilizing real-world processing-in-memory (PIM) hardware, specifically the UPMEM DPU, can enhance memory bandwidth and reduce recommendation latency in Deep Learning Recommendation Models (DLRMs). The study focuses on optimizing the inference time of DLRM systems by leveraging the parallel nature of DPU memory to provide high aggregated bandwidth for irregular memory accesses in embedding lookups, potentially reducing inference latency significantly. Additionally, the paper explores the embedding table partitioning problem to achieve workload balance and efficient data caching, aiming to improve the overall performance of DLRMs compared to CPU-only and CPU-GPU hybrid counterparts.
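
As a rough, hedged illustration of why aggregated DPU bandwidth matters for irregular embedding lookups, the arithmetic below uses placeholder numbers only (none come from the paper): even modest per-bank bandwidth, multiplied across thousands of independent DPU memory banks, can far exceed a conventional CPU memory system.

```python
# Back-of-the-envelope illustration of the aggregated-bandwidth argument.
# All numbers below are placeholders for illustration, not measurements
# reported in the paper.
per_dpu_bank_bw_gbs = 1.0    # assumed usable bandwidth of one DPU memory bank (GB/s)
num_dpus = 2048              # assumed number of DPUs populated in the server
cpu_mem_bw_gbs = 100.0       # assumed aggregate CPU DRAM bandwidth (GB/s)

aggregated = per_dpu_bank_bw_gbs * num_dpus
print(f"aggregated DPU-side bandwidth: {aggregated:.0f} GB/s")
print(f"conventional CPU memory bandwidth: {cpu_mem_bw_gbs:.0f} GB/s")
print(f"ratio: {aggregated / cpu_mem_bw_gbs:.1f}x")
```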


What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?

The paper "UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture" proposes several innovative ideas, methods, and models to optimize the inference time of Deep Learning Recommendation Models (DLRMs) .

  1. Utilization of Real-World Processing-in-Memory (PIM) Hardware: The paper introduces UpDLRM, which leverages the UPMEM DPU hardware to enhance memory bandwidth and reduce recommendation latency. By utilizing the parallel nature of DPU memory, UpDLRM aims to provide high aggregated bandwidth for embedding lookups, potentially reducing inference latency significantly.

  2. Cache-Aware Partitioning Method: To address workload imbalance and optimize data caching, the paper proposes a cache-aware partitioning method that balances memory accesses on cache storage and regular EMT storage. This method aims to improve the efficiency of data caching and achieve good workload balance, leading to lower inference times for DLRMs.

  3. EMT Partitioning Strategies: The paper studies the EMT partitioning problem at different levels to enhance the efficiency of DPU-supported embedding operations. It considers hardware features, workload balance, and data access frequencies to optimize memory bandwidth utilization and minimize embedding layer processing time (a table-level sketch follows this list).

  4. Performance Evaluation and Sensitivity Study: The paper conducts performance evaluations and sensitivity studies to analyze the effectiveness of UpDLRM under various scenarios. It observes better performance for datasets with higher average reduction frequencies and studies the impact of different partitioning methods on reducing the embedding lookup time in UpDLRM.
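
As noted after item 3, here is a hypothetical sketch of partitioning at the table level: whole EMTs are assigned to groups of DPUs by a simple greedy heuristic so that frequency-weighted lookup work stays balanced. The heuristic, names, and numbers are illustrative only and should not be read as the paper's exact method; the paper studies partitioning at more than one level.

```python
def assign_tables_to_dpu_groups(table_lookup_freq, num_groups):
    """Hypothetical table-level partitioning: whole EMTs are assigned to DPU
    groups so that frequency-weighted lookup work stays roughly balanced."""
    groups = [[] for _ in range(num_groups)]
    load = [0] * num_groups
    # Heaviest tables first; always give the next table to the least-loaded group.
    for table_id, freq in sorted(table_lookup_freq, key=lambda t: -t[1]):
        g = load.index(min(load))
        groups[g].append(table_id)
        load[g] += freq
    return groups, load

# Toy usage: eight EMTs with different per-batch lookup frequencies, four DPU groups.
freqs = [(0, 900), (1, 650), (2, 400), (3, 300), (4, 220), (5, 180), (6, 120), (7, 80)]
groups, load = assign_tables_to_dpu_groups(freqs, num_groups=4)
print("tables per group:", groups)
print("estimated load per group:", load)
```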

In summary, the paper introduces approaches such as utilizing PIM hardware, cache-aware partitioning, and EMT partitioning strategies to optimize the inference time of DLRMs, demonstrating significant speedups over traditional CPU-only and CPU-GPU hybrid counterparts.

Compared to previous methods, UpDLRM offers several key characteristics and advantages:

  1. Cache-Aware Partitioning Method: UpDLRM's cache-aware partitioning effectively reduces the share of DPU lookup time within the total embedding time, mitigating the bottleneck effect of embedding operations through partial-sum caching combined with cache-aware partitioning.

  2. Efficient Memory Utilization: By offloading memory-bound embedding operations to DPUs and leveraging the parallel nature of DPU memory banks, UpDLRM reduces resource contention on CPU memory bandwidth, accelerates inference time, and efficiently processes multiple embedding lookups simultaneously.

  3. Performance Improvement: UpDLRM achieves up to a 4.6x speedup in inference performance compared to CPU-only and CPU-GPU hybrid counterparts, demonstrating significant reductions in recommendation latency.

  4. Workload Balance and Data Caching: The paper addresses the workload imbalance among DPUs by proposing cache-aware non-uniform partitioning, which reduces embedding lookup time by up to 26% compared to methods without caching while ensuring efficient data caching and workload balance (the effect of popularity-skewed access is illustrated with a small example below).

  5. Scalability and Latency Reduction: UpDLRM scales with increasing average reduction frequency: DPU lookup time grows roughly linearly with the reduction frequency, while larger per-lookup data sizes reduce the lookup time, contributing to lower latency and improved performance.

  6. Inference Time Optimization: By utilizing real-world PIM hardware, specifically the UPMEM DPU, UpDLRM boosts memory bandwidth, reduces recommendation latency, and achieves superior speedups compared to traditional methods, especially on datasets with higher average reduction frequencies.

In summary, UpDLRM's cache-aware partitioning, efficient memory utilization, performance improvements, workload-balancing strategies, scalability, and latency reduction give it significant advantages over previous methods, leading to enhanced inference performance and reduced recommendation latency in personalized recommendation systems.
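
To give a feel for why item-popularity-based caching and the resulting cache-aware partitioning matter, the following small experiment (synthetic data, purely illustrative, not results from the paper) shows how skewed access patterns concentrate lookups on a few hot rows; those cached rows stop generating DPU traffic, which is exactly what makes uniform partitioning unbalanced.

```python
import numpy as np

# Synthetic illustration: under a Zipf-like item popularity distribution,
# caching a small fraction of the hottest embedding rows absorbs a
# disproportionately large share of all lookups.
rng = np.random.default_rng(0)
accesses = rng.zipf(1.3, size=200_000)
accesses = accesses[accesses <= 100_000]          # treat values as row ids of one EMT

_, counts = np.unique(accesses, return_counts=True)
counts = np.sort(counts)[::-1]                    # most popular rows first
for frac in (0.001, 0.01, 0.05):
    k = max(1, int(len(counts) * frac))
    hit_rate = counts[:k].sum() / counts.sum()
    print(f"cache the top {frac:.1%} of accessed rows -> {hit_rate:.1%} of lookups hit the cache")
```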


Do any related studies exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?

Several related research works exist in the field of accelerating personalized recommendation systems using real-world processing-in-memory (PIM) architecture. Noteworthy researchers in this field include Muhammad Adnan et al., Paul Covington et al., Udit Gupta et al., Liu Ke et al., Heewoo Kim et al., Daniar H Kurniawan et al., Maxim Naumov et al., Jianmo Ni et al., Jérémie Rappaz et al., Mengting Wan et al., and Tao Yang et al.

The key to the solution mentioned in the paper "UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture" is the utilization of real-world processing-in-memory (PIM) hardware, specifically the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency. By storing large embedding tables (EMTs) in DPU memory and performing memory lookups and reductions on the DPUs, the design reduces resource contention on CPU memory bandwidth, accelerates inference, and processes multiple embedding lookups and reductions simultaneously. The paper also addresses the challenge of workload imbalance by proposing cache-aware partitioning methods that balance memory accesses between cache storage and regular EMT storage, optimizing inference time and achieving up to a 4.6x speedup compared to other architectures.
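
The division of labor described above can be sketched in simplified, simulated form; real UPMEM code would be written against the DPU SDK in C, and the shard sizes, names, and routing scheme here are illustrative assumptions rather than the paper's implementation. Each simulated DPU owns a contiguous shard of one EMT and returns only a partial sum, which the host then accumulates.

```python
import numpy as np

EMB_DIM, ROWS_PER_DPU, NUM_DPUS = 32, 1024, 8

# Each simulated "DPU" owns a contiguous shard of one EMT in its own memory bank.
shards = [np.random.rand(ROWS_PER_DPU, EMB_DIM).astype(np.float32)
          for _ in range(NUM_DPUS)]

def dpu_lookup_reduce(shard, local_rows):
    # Work done inside one DPU: gather its rows and return a single partial sum,
    # so only one vector (not every looked-up row) travels back to the host.
    return shard[local_rows].sum(axis=0)

def embedding_bag(indices):
    # Host side: route global row ids to the owning DPU, then add the partial sums.
    result = np.zeros(EMB_DIM, dtype=np.float32)
    for dpu_id, shard in enumerate(shards):
        local = indices[indices // ROWS_PER_DPU == dpu_id] % ROWS_PER_DPU
        if local.size:
            result += dpu_lookup_reduce(shard, local)
    return result

print(embedding_bag(np.array([3, 1024, 5000, 8000])))
```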


How were the experiments in the paper designed?

The experiments in the paper were designed by adopting Meta's deep learning recommendation model (DLRM) with six real-world datasets, categorized as low hot, medium hot, or high hot based on each dataset's average reduction frequency. Each dataset was duplicated to form eight embedding tables (EMTs), with each embedding vector having 32 dimensions. Inference performance was measured by sampling 12,800 inferences in each set of experiments, with the batch size set to 64. The paper compared UpDLRM with three other open-source DLRM implementations running on different hardware architectures. The experiments evaluated the effectiveness of UpDLRM in reducing inference latency, showing up to a 4.6x speedup in inference performance over the other counterparts.
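
For quick reference, the reported setup can be collected into a small configuration sketch; the keys and structure are illustrative, and only the values restate the parameters described above.

```python
# Hypothetical reconstruction of the reported benchmark setup. The dictionary
# keys are illustrative; only the values restate the description above.
eval_config = {
    "model": "Meta DLRM",
    "num_datasets": 6,
    "dataset_categories": ["low hot", "medium hot", "high hot"],  # by avg. reduction frequency
    "num_emts": 8,                 # each dataset duplicated into eight embedding tables
    "embedding_dim": 32,           # dimensions per embedding vector
    "batch_size": 64,
    "num_inferences": 12_800,      # sampled per set of experiments
}
print(eval_config)
```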


What is the dataset used for quantitative evaluation? Is the code open source?

The Goodreads dataset is one of the datasets used for quantitative evaluation in the study. Regarding code availability, the Deep Learning Recommendation Model (DLRM) implementations used as baselines are open source, and UpDLRM is compared against three other open-source DLRM implementations.


What are the contributions of this paper?

The contributions of the paper "UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture" include:

  • Utilizing Real-World Processing-in-Memory (PIM) Hardware: The paper proposes the use of the UPMEM DPU, a real-world PIM hardware, to enhance memory bandwidth and reduce recommendation latency.
  • Optimizing Inference Time: The study focuses on optimizing the inference time of Deep Learning Recommendation Models (DLRMs) by leveraging the PIM architecture, which can provide high aggregated bandwidth for irregular memory accesses in embedding lookups, thereby reducing inference latency.
  • Workload Balance and Efficient Data Caching: The research addresses the embedding table partitioning problem to achieve good workload balance and efficient data caching, leading to improved inference performance compared to CPU-only and CPU-GPU hybrid counterparts.

What work can be continued in depth?

Further research in this area can focus on designing a DPU-GPU heterogeneous system to optimize the inference time of DLRM systems. This would involve exploring the integration of DPUs and GPUs to enhance the efficiency of memory-intensive operations in recommendation systems. Additionally, investigating the impact of cache-aware partitioning methods on reducing inference latency and improving workload balance could be a valuable direction for future studies. By delving deeper into these aspects, researchers can enhance the performance and scalability of personalized recommendation systems.


Outline

Introduction
Background
Overview of Deep Learning Recommendation Models (DLRM)
Importance of memory optimization in DLRMs
UPMEM DPU: A Processing-In-Memory hardware platform
Objective
To develop UpDLRM, a DLRM accelerator using UPMEM DPU
Improve performance by offloading memory-bound tasks
Address challenges with limited DPU resources
Method
Data Collection and Model Architecture
DLRM architecture overview
Integration of UPMEM DPU in the recommendation pipeline
EMT Partitioning
Explanation of embedding tables (EMTs)
Partitioning strategy for efficient DPU utilization
Cache-Aware Strategies
Cache optimization techniques for DPU and CPU
Handling cache misses and data locality
Workload Balancing
Item popularity-based approach
Managing memory banks and inter-DPU communication
Experimental Setup
Real-world datasets used
Performance metrics and baselines
Experiment Results
Speedup comparison with CPU-only and hybrid architectures
Impact of dataset access patterns and cache partitioning on performance
Scalability and Evaluation
UpDLRM's scalability with increasing dataset size
Performance analysis under varying workload conditions
Differentiation from Related Work
Comparison with existing PIM-based recommendation systems
Advantages and unique features of UpDLRM
Future Research Directions
Heterogeneous systems combining DPU and GPU
Opportunities in DLRM acceleration for next-generation hardware
Conclusion
Summary of UpDLRM's contributions
Implications for real-world recommendation systems and PIM hardware adoption
