HPE ProLiant XL675d Gen10 PLUS GPU Server + Apollo 6500 Chassis Review

October 16, 2024 | By Lorena Mejia

The HPE ProLiant XL675d Gen10 PLUS GPU Server (SHOP HERE) is somewhat of an older system, but it is still quite powerful and relevant. This server is housed in the HPE Apollo 6500 Chassis and is designed for the exascale era. Exascale refers to computing at the level of one exaFLOPS, or 10^18 Floating Point Operations Per Second (FLOPS), which is a “1” followed by 18 zeros. That said, HPE’s actual phrasing is “exascale era,” so take from that what you will. Purpose-built to support either NVIDIA NVLink technology or AMD’s Infinity Fabric, it can be outfitted with a number of GPU options in either SXM4 or PCIe form factors. Let’s take a look at this 6U beast.
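To put that in perspective, here is a quick back-of-the-envelope sketch (Python, purely illustrative; the ~19.5 TFLOPS figure is NVIDIA’s published FP64 Tensor Core peak for the A100, and real sustained throughput is far lower):

```python
# Rough sense of scale only: how many A100-class GPUs would it take to reach
# one exaFLOPS of peak FP64 Tensor Core throughput?
exaflops = 1e18                  # "1" followed by 18 zeros
a100_fp64_tensor_peak = 19.5e12  # ~19.5 TFLOPS per A100 (published peak, not sustained)

gpus_needed = exaflops / a100_fp64_tensor_peak
print(f"~{gpus_needed:,.0f} A100s at peak")                   # roughly 51,000 GPUs
print(f"~{gpus_needed / 8:,.0f} fully loaded XL675d nodes")   # roughly 6,400 nodes
```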

HPE ProLiant XL675d Gen10 PLUS in Apollo 6500 chassis

This 6U system can tackle AI, Machine Learning, Deep Learning, and other highly complex high-performance computing tasks with ease. It is a configure-to-order system with three main parts: the Apollo 6500 chassis, the dual-socket HPE ProLiant XL675d Gen10 PLUS compute node, and the GPU tray. The Apollo 6500 Chassis can alternatively be configured with 2x single-socket HPE ProLiant XL645d Gen10 PLUS server nodes, each with a single AMD EPYC 7H12 64-core CPU, instead of the single XL675d Gen10 PLUS we have here. The XL675d can support either 8x NVIDIA HGX A100 80GB SXM4 GPUs or AMD Instinct PCIe GPUs, with 8-10 double-wide or 16x single-wide cards. The XL645d, by contrast, offers a single socket and supports either 4x NVIDIA HGX A100 40GB GPUs or AMD Instinct PCIe options, with either 4x double-wide or 8x single-wide GPUs connected via AMD Infinity Fabric.

Potentially, you can install 8x 2.5-inch SAS, SATA, or NVMe drives in each of the two drive cages for up to 16x drives total, with drive box 1 on the left and drive box 2 on the right. If this were configured with the half-width XL645d nodes, those two drive boxes would be divided between NODE 1 and NODE 2, respectively. A control panel for each of the server nodes is useful when you have two nodes; in this case, we just have the one HPE ProLiant XL675d Gen10 PLUS node. Each control panel has a power ON button with integrated LED, a Health LED, a NIC status LED, and a Unit ID button LED. With the full-width XL675d server node installed, only the right control panel is active. Once you remove a few screws on top, pressing the large button in the center allows the front panel to tilt down, exposing the drive backplane, expansion boards, and a bank of 15x hot-swap fans with redundancy.

HPE ProLiant XL645d half-width node

On the back of the system, the upper portion holds the GPU tray, which depending on your needs can carry either NVIDIA or AMD PCIe-based GPUs or SXM4 GPUs. In this case, we have 8x NVIDIA HGX A100 SXM4 GPUs with 80GB of memory each, connected by NVIDIA’s NVLink technology and sitting behind a simple perforated metal panel for airflow. While this system does come built to order, given the list of approved customer repairs it does appear that you can swap the GPU tray for something else at a later date. Other GPU trays can be purchased offering either 8-10 double-wide PCIe cards or 16x single-wide cards, and the PCIe GPU tray definitely has a different look with all of the PCIe slots in back. If you were to order the half-width XL645d, there would be two separate GPU trays, each with either 4x NVIDIA HGX A100 SXM4 GPUs, 4x double-wide GPUs, or up to 8x single-wide GPUs from AMD’s Instinct HPC/AI lineup. Once you release the thumb screws on the levers, you can use the levers to pull out the GPU trays, or in this case the single GPU tray. The compute trays are secured in a similar fashion, with thumb screws holding the release levers in place.

A bank of 6x power supply units sits just below the GPU tray, offering 4+2 redundancy. Beside that is an APM 2.0 connector, short for Apollo Platform Manager 2.0. The APM allows administrators to manage HPE Apollo and Moonshot shared rack infrastructures, the actual systems like this XL675d or the XL645d, and rack-level power settings from a single user interface. Integrated Lights-Out, or iLO, is embedded on the system board of the XL675d and also helps to manage the system both in-band and out-of-band. Power capping can also be established using iLO. Other management options include HPE OneView and HPE Performance Cluster Manager, or HPCM, both of which are designed to manage large clusters or groups of servers in the data center.
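As a rough illustration of that kind of power capping, the sketch below reads and sets a chassis power cap over the Redfish REST API, which iLO exposes. The hostname, credentials, resource path, and the 4,000W figure are placeholders, and the exact path and field names can vary by iLO firmware generation, so treat this as a generic Redfish example rather than HPE’s documented procedure:

```python
# Minimal sketch: query power draw and apply a power cap via a Redfish endpoint.
# The address, credentials, and /redfish/v1/Chassis/1/Power path are assumptions
# following the generic Redfish power schema; verify against your iLO's API tree.
import requests

ILO = "https://ilo-hostname"   # placeholder management address
AUTH = ("admin", "password")   # placeholder credentials

# Read the current power draw and any configured cap
power = requests.get(f"{ILO}/redfish/v1/Chassis/1/Power", auth=AUTH, verify=False).json()
control = power["PowerControl"][0]
print("Consumed watts:", control.get("PowerConsumedWatts"))
print("Current cap:", control.get("PowerLimit", {}).get("LimitInWatts"))

# Apply an example 4,000 W cap by PATCHing the same resource
payload = {"PowerControl": [{"PowerLimit": {"LimitInWatts": 4000}}]}
requests.patch(f"{ILO}/redfish/v1/Chassis/1/Power", json=payload, auth=AUTH, verify=False)
```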

HPE ProLiant XL675d Gen10 PLUS front LEDs

Below all of that is the ProLiant XL675d Gen10 PLUS compute node. Starting on the left, there is a 1Gb Ethernet port with dual USB 3.1 Gen1 connectors, a video connector, and an optional serial port. There is also an OCP 3.0 card slot right in the middle of the chassis, which in this case is used for a pass-through module. After using the levers to pull the compute node out of the Apollo 6500 chassis, you can see two release levers on top of the unit.

HPE ProLiant XL675d Gen10 PLUS switchboard

Once you engage the release levers, the case separates like a clamshell, with the PCIe switchboard on top and the CPU and memory tray on the bottom. The PCIe switchboard has 5x PCIe switches, plus SAS and NVMe SlimSAS connectors for drive support. The top is also where the risers plug in for NICs and drive controllers. For network controllers, there is a choice of high-speed Ethernet, InfiniBand, or HPE Slingshot. The Slingshot option is a 1U switch that offers a high-bandwidth, low-latency solution to optimize performance on converged infrastructures for HPC, machine learning, and analytics applications, and it can scale to suit your needs.

HPE ProLiant XL675d Gen10 PLUS NVMe SlimSAS

The lower portion of the clamshell supports the memory module slots and CPUs, plus 4x NVMe SlimSAS connectors, each with an x8 connector width supporting 2x NVMe drives. The compute unit and the GPU tray are connected to each other through the midplane assembly in the chassis, just behind that bank of 15x fans. This double-wide unit has 4x x16 PCIe 4.0 low-profile slots across the primary and secondary risers, and 2x x16 full-height, half-length PCIe 4.0 slots in the tertiary riser. Those 4x x16 low-profile slots can be used to support a balanced I/O fabric, while the slots in the tertiary riser can be used for HPE Smart Array controllers or the boot device if you want to preserve all of your front drive bays. There is also a place on the system board for a modular HPE Smart Array controller, like the one on this system, which has 2x SlimSAS connectors supporting 4x drives each.
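A quick lane-budget check, using only the numbers quoted above, shows how those x8 SlimSAS connectors line up with one 8-bay drive cage (purely illustrative arithmetic):

```python
# Lane-budget check for the direct-attach NVMe wiring described above
# (figures taken from the paragraph; illustrative arithmetic only).
slimsas_connectors = 4      # NVMe SlimSAS connectors on the lower tray
lanes_per_connector = 8     # each connector is x8
lanes_per_nvme_drive = 4    # a U.2/U.3 NVMe drive typically uses x4

drives_per_connector = lanes_per_connector // lanes_per_nvme_drive
total_direct_nvme = slimsas_connectors * drives_per_connector
print(drives_per_connector, "drives per connector,", total_direct_nvme, "drives total")
# -> 2 drives per connector, 8 drives total, matching one 8-bay drive cage
```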

HPE ProLiant XL675d Gen10 PLUS fans

An OS boot device can be installed in one of the tertiary riser PCIe slots and offers dual 480GB NVMe M.2 drives. Each of the slots in the tertiary riser is PCIe 4.0 x16 with an x8 connector link width. The boot device utilizes a PCIe 3.0 interface and requires no GUI or setup, automatically creating a hardware RAID 1 configuration that mirrors the two drives.

The configuration of the front 2.5-inch drive bays also depends on the installed controllers. The embedded controller can support up to 8x SATA drives, and NVMe drives require the NVMe U.3 premium backplane. The other configurations with SAS or SATA drives use the standard backplane option, like the one on this system. HPE Smart Array controllers, either the E208i-a or the P408i-a, will support up to 8x SAS or SATA drives in a hardware RAID. For the 8+8 bay configuration with SAS and SATA drives, there is the P816i-a controller, but that last option takes up both of the front drive cages and is more suited to the single-node configuration than to the dual-node setup.
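To keep the options straight, here is a small summary of the front-bay controller choices just described, expressed as a lookup table; it simply restates the paragraph above, so check HPE’s QuickSpecs for the authoritative support matrix:

```python
# Front-bay controller options as described in this review (illustrative summary).
front_bay_options = {
    "Embedded controller":        {"drives": "up to 8x SATA",                  "backplane": "standard"},
    "Direct-attach NVMe":         {"drives": "NVMe U.3 drives",                "backplane": "NVMe U.3 premium"},
    "HPE Smart Array E208i-a":    {"drives": "up to 8x SAS/SATA, HW RAID",     "backplane": "standard"},
    "HPE Smart Array P408i-a":    {"drives": "up to 8x SAS/SATA, HW RAID",     "backplane": "standard"},
    "HPE Smart Array P816i-a":    {"drives": "8+8 SAS/SATA across both cages", "backplane": "standard"},
}

for controller, support in front_bay_options.items():
    print(f"{controller}: {support['drives']} ({support['backplane']} backplane)")
```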

Direct liquid cooling is an option on this system, promising increased efficiency and power density. GPUs with a TDP of up to 500W are supported, along with CPUs of up to 280W TDP, so effective cooling is a necessity. There are also other GPU options from NVIDIA, AMD, and others beyond those covered here.

In its current configuration, this XL675d is outfitted with 2x 3rd-generation AMD EPYC 75F3 CPUs with 32 cores each, 1TB of memory, and 8x NVIDIA HGX A100 80GB SXM4 GPUs. Pre-configured CPU options for this model are the AMD EPYC 7H12 with 64 cores per socket or the AMD EPYC 7532 32-core CPU. That said, it can support a number of other options from both the 2nd-generation Rome and 3rd-generation Milan lines. The NVIDIA HGX A100 80GB GPUs feature Tensor Cores, can each be partitioned into up to 7 separate GPU instances, and offer a memory bandwidth of over 2TB/s. At full capacity, the system can support up to 4TB of memory using registered 128GB memory modules in all 32x slots.
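For a quick sanity check on those capacity figures, here is a bit of throwaway arithmetic (Python, based only on the numbers quoted in this review):

```python
# Back-of-the-envelope totals for the configuration described above
# (illustrative arithmetic only).
dimm_slots = 32
dimm_size_gb = 128
print("Max system memory:", dimm_slots * dimm_size_gb // 1024, "TB")       # 4 TB

gpus = 8
hbm_per_gpu_gb = 80
mig_instances_per_gpu = 7   # each A100 can be split into up to 7 GPU instances
print("Total GPU memory:", gpus * hbm_per_gpu_gb, "GB")                     # 640 GB
print("Max GPU instances across the node:", gpus * mig_instances_per_gpu)   # 56
```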

HPE ProLiant XL675d Gen10 PLUS CPUs

With extensive GPU support, this system can handle the most complex AI models, Machine Learning and Deep Learning inference scenarios, and a myriad of high-performance computing simulations. It is built to order, but as outlined above there are quite a few different configurations depending on your workload. This system has a price tag that will make your eyes water, but for certain applications, it is exactly what you want or need.

If you have any questions on the HPE ProLiant XL675d Gen10 PLUS, or any other system, post them in the comments or contact us!