The NIC5 HPC cluster hardware

Slides for this section

The NIC5 CPUs

All NIC5 CPUs are 32-core AMD EPYC 7542 processors from the "Rome" generation, also known as "Zen 2". Rome processors use a multi-chip-module (MCM) design, with multiple dies combined in a single package. These dies fall into two types:

  • one or more Core Complex Dies (CCDs), which contain the cores
  • an I/O die, which connects the CCDs to each other and to external components

Recent AMD CPUs are organized around Core Complex Dies (CCDs). Each CCD comprises eight cores, grouped into two Core Complexes (CCXs). Each CCX is composed of four cores that collectively share 16 MB of L3 cache (4 x 4 MB slices). Each core is equipped with 32 KB of L1 data cache (L1d), 32 KB of L1 instruction cache (L1i), and a private 512 KB L2 cache. The guaranteed clock, provided the cores are correctly cooled, is 2.9 GHz.

Zen 2 cores support AVX2 vector instructions and feature two 256-bit Fused Multiply-Add (FMA) units, which together can deliver 16 double-precision floating-point operations (Flops) per cycle.

Each CCD incorporates an Infinity Fabric on Package (IFOP) interface, linking its two CCXs to the I/O die. Furthermore, the I/O die serves as the connection point between each CCD and two DDR4 memory controllers.

NIC5 CCD

Overview of the core and Core Complex Die of the NIC5 CPUs
More details about the cache hierarchy
  • L1 data cache: 64 bytes cache line with an access latency of 7–8 cycles for floating point and 4–5 cycles for integer. The bandwidth is 2 x 256 bits/cycle load to registers and 1 x 256 bits/cycle store from registers.
  • L2 cache: 64 bytes cache line with a latency >= 12 cycles. The bandwidth is 1 x 256 bits/cycle load to L1 cache and 1 x 256 bits/cycle store from L1 cache.
  • L3 cache: 64 bytes cache lines with an access latency of 39 cycles on average.

If a core misses in its local L2 and in the L3 cache, but the data resides in another L2 within the CCX, a cache-to-cache transfer is initiated. The bandwidth for this transfer is 1 x 256 bits/cycle load and 1 x 256 bits/cycle store.
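
To get a feeling for what these per-cycle figures mean in absolute terms, the small Python sketch below converts them into GB/s, assuming the cores run at the 2.9 GHz guaranteed base clock (actual clocks, and therefore actual bandwidths, can be higher when the cores boost):

    # Rough conversion of the per-cycle cache bandwidths quoted above into GB/s,
    # assuming the cores run at the 2.9 GHz guaranteed base clock.
    base_clock_hz = 2.9e9

    ports_bits_per_cycle = {
        "L1d load (2 x 256-bit)":      2 * 256,
        "L1d store (1 x 256-bit)":     1 * 256,
        "L2 -> L1 load (1 x 256-bit)": 1 * 256,
    }

    for name, bits in ports_bits_per_cycle.items():
        gb_per_s = bits / 8 * base_clock_hz / 1e9
        print(f"{name}: {gb_per_s:.1f} GB/s")   # 185.6 / 92.8 / 92.8 GB/s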

The AMD EPYC 7542 processor has 4 CCDs connected to the I/O die using Infinity Fabric, forming a processor with 32 cores (4 CCDs x 8 cores).

NIC5 CPU

Overview of the NIC5 CPUs

As mentioned before, each CCD is associated with two memory controllers, which means that the complete CPU has eight memory channels. With DDR4-3200 memory, the theoretical total memory bandwidth is therefore

3200 MT/s x 8 bytes/transfer x 8 channels = 204.8 GB/s
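
As a quick sanity check, this back-of-the-envelope calculation can be reproduced with a few lines of Python, using the DDR4-3200 transfer rate, the 64-bit channel width, and the eight channels mentioned above:

    # Theoretical memory bandwidth of one AMD EPYC 7542 CPU
    transfers_per_second = 3200e6   # DDR4-3200: 3200 MT/s
    bytes_per_transfer = 8          # 64-bit wide memory channel
    channels = 8                    # eight memory channels per CPU
    bandwidth = transfers_per_second * bytes_per_transfer * channels
    print(f"{bandwidth / 1e9:.1f} GB/s")   # -> 204.8 GB/s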

In terms of theoretical peak floating-point performance, it has been mentioned earlier that each core can execute 16 Flops per clock cycle. This results in a total theoretical performance for the 32 cores of

2 FMA units x 4 doubles (256-bit AVX) x 2 Flops (FMA) x 2.9 GHz x 32 cores = 1.485 TFlops
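
The same number can be rebuilt step by step in Python, which makes the origin of each factor explicit:

    # Theoretical peak double-precision performance of one AMD EPYC 7542 CPU
    fma_units = 2            # two 256-bit FMA units per core
    doubles_per_vector = 4   # 256 bits / 64 bits per double
    flops_per_fma = 2        # a fused multiply-add counts as two Flops
    base_clock_hz = 2.9e9    # guaranteed base clock
    cores = 32
    peak = fma_units * doubles_per_vector * flops_per_fma * base_clock_hz * cores
    print(f"{peak / 1e12:.3f} TFlops")   # -> 1.485 TFlops
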
What is Infinity Fabric?

Infinity Fabric (IF) is AMD's proprietary interconnect used to transmit data and control signals between interconnected components. This technology is used across all of AMD's latest microarchitectures, spanning both CPUs and GPUs. Within CPUs, it links the Core Complex Dies (CCDs), and it is also used in AMD GPUs that adopt a multi-chip-module design, such as the MI250X GPU.

Beyond its use within a single package, Infinity Fabric also serves as the bridge between different packages. It interconnects the sockets of multi-socket systems (CPU-CPU) and, on some platforms, acts as the link between CPUs and GPUs as well as between GPUs.

Server CPUs, especially those designed for High-Performance Computing applications, are different from consumer-grade CPUs. Beyond core count, one of the primary differences lies in the clock speeds, with server CPUs typically operating at lower clock frequencies than their consumer counterparts. For instance, consider an 8-core laptop CPU with a base clock of 3.2 GHz and a maximum boost clock of 4.4 GHz. The single-threaded performance of this laptop CPU will outperform that of a NIC5 core because of the higher boost clock.

However, in scenarios involving highly parallel workloads, the laptop CPU encounters thermal limitations, triggering throttling. In contrast, NIC5 CPUs are capable of handling such workloads at their base clock.

These considerations underscore the critical significance of code parallelization for harnessing the full potential of an HPC cluster.

The NIC5 compute nodes

The compute nodes of NIC5 feature two sockets, with each socket equipped with an AMD EPYC 7542 CPU. This configuration results in a total of 64 cores per compute node (2 x 32 cores).

These two CPUs are interconnected by a specialized link known as Global Memory Interconnect (GMI). The GMI consists of four 16-bit-wide links, each capable of supporting data transfers at a rate of 18 giga-transfers per second (GT/s). This arrangement provides a theoretical transfer speed of

4 links x 18 GT/s x 16 bits/transfer x 1/8 bytes/bit = 144 GB/s
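
In Python, with the link width expressed in bits to keep the units straight:

    # Theoretical socket-to-socket (GMI) bandwidth
    links = 4
    transfer_rate = 18e9       # 18 GT/s per link
    bits_per_transfer = 16     # 16-bit-wide links
    bandwidth = links * transfer_rate * bits_per_transfer / 8   # bits -> bytes
    print(f"{bandwidth / 1e9:.0f} GB/s")   # -> 144 GB/s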

NIC5 compute node

Overview of a NIC5 compute node

Each memory channel of the standard NIC5 nodes is populated with two 8 GB DIMMs. Consequently, the total available memory is

2 sockets x 8 memory channels x 2 DIMMs x 8 GB = 256 GB
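
Or, spelled out in Python:

    # Total memory of a standard NIC5 compute node
    sockets = 2
    channels_per_socket = 8
    dimms_per_channel = 2
    gb_per_dimm = 8
    print(sockets * channels_per_socket * dimms_per_channel * gb_per_dimm, "GB")   # -> 256 GB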

Due to the architecture of the CPUs, where each Core Complex Die is directly linked to two memory controllers, each CPU socket is partitioned into four Non-Uniform Memory Access (NUMA) nodes. With this structure, the eight CPU cores within a CCD have quicker access to the memory within their own NUMA node: accesses within the same node exhibit lower latency and higher bandwidth than accesses to memory in a remote node. The latency is even higher, and the bandwidth even lower, when a core accesses the memory attached to the other CPU socket.
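
On Linux, this NUMA layout can be inspected directly, for example with numactl --hardware or lscpu. The short Python sketch below does the same by reading the standard sysfs interface (Linux-specific; the output naturally depends on the machine it runs on):

    # List NUMA nodes with their CPUs and total memory by reading sysfs (Linux only)
    from pathlib import Path

    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpus = (node / "cpulist").read_text().strip()
        mem_total = (node / "meminfo").read_text().splitlines()[0].split()[-2:]
        print(f"{node.name}: CPUs {cpus}, MemTotal {' '.join(mem_total)}")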

Each compute node is equipped with a local SSD storage capacity of 450 GiB. However, in practice, only 400 GiB can be used, as a portion of the disk is reserved for the operating system and the swap partition. This local storage serves as a resource for file operations that are not well suited to the parallel scratch file system.

Lastly, the compute nodes are equipped with an InfiniBand HDR100 ConnectX-6 interconnect, providing a bandwidth of 100 Gbps (12.5 GB/s). This interconnect is used for communication between compute nodes in massively parallel jobs employing the distributed-memory paradigm. It also provides access to the global scratch and home file systems.

The NIC5 cluster

The NIC5 cluster features 73 compute nodes (plus 1 private node). Most of these nodes (70) have 256 GB of memory, while three nodes have 1 TB of memory for jobs with large memory requirements.

In addition, there are three other nodes, each with a special purpose:

  • One login node which serves as the entry point to the cluster.
  • One master node which is not accessible to the user. This node serves for the management of the cluster.
  • One node which is the gateway to CÉCI common storage.

NIC5 cluster

Overview of the NIC5 HPC cluster

The nodes are interconnected by a fast 100 Gbps InfiniBand network built around two 80-port HDR100 switches (blocking factor 1.2:1).

Details of the network topology
  • Switch 1 (40 nodes + 34 inter-switch):

    • 36 links to compute nodes
    • 2 links to the NFS servers
    • 2 links to the BeeGFS servers
    • 34 links to the second switch
  • Switch 2 (41 nodes + 34 inter-switch):

    • 38 links to compute nodes
    • 1 link to the login node
    • 1 link to the master node
    • 1 link to the CÉCI common storage gateway
    • 34 links to the first switch

This gives a blocking factor for communication between nodes on different switches of

41 node links / 34 inter-switch links ≈ 1.2
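
Or, recomputed explicitly from the link counts listed above:

    # Inter-switch blocking factor, computed from the busier switch
    node_links = 41            # links to nodes on switch 2
    inter_switch_links = 34
    print(f"{node_links / inter_switch_links:.1f}:1")   # -> 1.2:1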

In addition to the fast InfiniBand network, a 10 Gbps Ethernet network is used for the management and administration of the cluster.

The NIC5 cluster has two network file systems. Network file systems are required because the login node and all the compute nodes need access to the same files.

  • A 50 TB NFS file system used for the home directories (private to each user) and for installing software. This file system is optimized for a high number of small I/O operations rather than for bandwidth (~200 MB/s).
  • A 567 TB parallel file system based on BeeGFS used to store large temporary files generated by calculations. This file system is optimized for bandwidth, providing the best performance for large I/O operations (~3-4 GB/s, with a maximum of around 15 GB/s).