Title: Data Placement on Heterogeneous Memory Architectures


Speaker: Mohammad Laghari


Time: August 6th, 2018, 10:30 AM


Place: SOS Z21

Koç University

Rumeli Feneri Yolu

Sariyer, Istanbul


Thesis Committee Members:

Asst. Prof. Dr. Didem Unat (Advisor, Koç University)

Prof. Dr. Yücel Yemez (Koç University)

Asst. Prof. Dr. Ayşe Yılmazer (İstanbul Technical University)



Memory bandwidth has long been a limiting factor in the scaling of high performance applications. To overcome this limitation, various heterogeneous memory systems have emerged. A heterogeneous memory system is equipped with multiple memories, each with distinct characteristics. One characteristic common to these systems is a memory with significantly higher bandwidth than the others, generally known as high bandwidth memory (HBM). Examples of HBM technologies include the Hybrid Memory Cube (HMC) by Micron and the High Bandwidth Memory standard by JEDEC. Intel's latest Xeon Phi processor, Knights Landing (KNL), is equipped with an HBM known as multi-channel DRAM (MCDRAM) alongside a DDR memory. MCDRAM delivers up to 450 GB/s of memory bandwidth, compared to only 88 GB/s for its slower DDR counterpart. Unfortunately, this higher bandwidth comes with higher access latency for MCDRAM. Due to technology limitations and a high price per byte, HBM is offered in a small capacity compared to traditional DDR; to compensate, a heterogeneous memory system is also equipped with a higher-capacity DDR. In such systems, the programmer can either perform explicit allocations to each memory or let the hardware handle data caching into the HBM. An intelligent object allocation scheme can boost application performance; conversely, allocations made without regard to memory and application characteristics can drastically degrade it. The combined burden of choosing an object allocation and a system configuration falls on the programmer, increasing programming effort and development time.
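A back-of-the-envelope comparison using the bandwidth figures above illustrates why placement matters. The sketch below estimates the lower-bound time to stream an object once from each memory; the object size is hypothetical, and real performance also depends on access latency and access pattern.

```python
# Rough illustration of why object placement matters on KNL, using the
# peak bandwidth figures quoted above. A sketch only: it models pure
# streaming and ignores latency, which penalizes MCDRAM in practice.

MCDRAM_BW = 450e9  # bytes/s, peak MCDRAM bandwidth
DDR_BW = 88e9      # bytes/s, peak DDR bandwidth

def stream_time(size_bytes, bandwidth):
    """Lower-bound time to stream an object once at the given bandwidth."""
    return size_bytes / bandwidth

obj = 4 * 2**30  # a hypothetical 4 GiB object
print(f"MCDRAM: {stream_time(obj, MCDRAM_BW) * 1e3:.1f} ms")
print(f"DDR:    {stream_time(obj, DDR_BW) * 1e3:.1f} ms")
```

For a bandwidth-bound object the gap is roughly the ratio of the two bandwidths (about 5x), which is why choosing which objects occupy the limited MCDRAM capacity has such a large performance impact.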


This thesis presents an object allocation scheme based on a system- and application-specific cost model, a combinatorial optimization algorithm commonly known as the 0/1 knapsack problem, and a tool that combines these two components in practice. The cost model considers various characteristics of application data, such as object sizes, memory access counts, and types of accesses. In addition to application characteristics, it considers memory bandwidth under various conditions, including data streaming bandwidth and data copy bandwidth. These characteristics make the cost model rich enough to suggest an intelligent object allocation. Using them, the cost model assigns a score to each object; these scores serve as the object values in the 0/1 knapsack algorithm, which determines the objects to allocate in HBM, with the knapsack capacity set to the HBM capacity. The tool uses the cost model to make an intelligent object placement decision and comes in two flavors: 1) static placement, where object placement is decided at the beginning of application execution, and 2) dynamic placement, where objects are evicted from and admitted to high bandwidth memory at runtime based on application phases, thus incurring an object movement cost. In the dynamic variant, the cost model accounts for the cost of moving objects between memories when deciding which objects to place in HBM. The tool can also conduct object transfers asynchronously, hiding the transfer cost between phases.
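As a rough sketch of the selection step (not the thesis tool itself), the knapsack stage can be expressed with the textbook 0/1 knapsack dynamic program below. The object names, sizes, and cost-model scores are hypothetical, and the 16-unit capacity is merely an example value.

```python
def select_for_hbm(objects, hbm_capacity):
    """0/1 knapsack: choose the subset of objects with the highest total
    cost-model score whose combined size fits in the HBM capacity.

    objects: list of (name, size, score); sizes and capacity must use the
    same integer units. Returns (best_score, names_of_chosen_objects).
    """
    n = len(objects)
    dp = [0] * (hbm_capacity + 1)          # dp[c] = best score at capacity c
    keep = [[False] * (hbm_capacity + 1) for _ in range(n)]
    for i, (_, size, score) in enumerate(objects):
        # Iterate capacities downward so each object is used at most once.
        for c in range(hbm_capacity, size - 1, -1):
            if dp[c - size] + score > dp[c]:
                dp[c] = dp[c - size] + score
                keep[i][c] = True
    # Backtrack through the keep table to recover the chosen objects.
    chosen, c = [], hbm_capacity
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(objects[i][0])
            c -= objects[i][1]
    return dp[hbm_capacity], list(reversed(chosen))

# Hypothetical objects: (name, size, cost-model score).
objs = [("A", 8, 90), ("B", 4, 70), ("C", 6, 40), ("D", 2, 30)]
score, placed = select_for_hbm(objs, hbm_capacity=16)
print(score, placed)  # objects A, B, and D fit and maximize the total score
```

For the dynamic variant described above, the same selection could in principle be rerun at each application phase with every object's score reduced by its estimated movement cost, so that an object is only admitted to HBM when the expected bandwidth benefit outweighs the cost of copying it there.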


We evaluate our allocation scheme using a diverse set of applications from the NAS Parallel and Rodinia benchmark suites. These applications have varying workloads and memory access patterns that exhibit the characteristics of real-world applications. The evaluation uses Intel's Knights Landing and its high bandwidth memory, MCDRAM. The object placement suggested by the tool yields speedups of up to 2.5x. We observe that latency-sensitive applications fail to benefit from high bandwidth memory allocation because of the higher access latency of HBM. We also compare our results with the automatic hardware caching of Intel KNL, in which the application runs on the system unchanged and caching is performed automatically by the hardware. Our allocation scheme yields better results than hardware caching in the majority of cases.