How to keep an NVIDIA Tesla P4 from overheating

The biggest part of authoring the VDI Design Guide is research. Research on all kinds of things: sizing, infrastructure, validation of best practices and, last but not least, performance. One of my goals is to give some insight into (as I call it) The Art of the Possible. Pushing the limits of a virtual desktop is part of that, and that’s why I decided to see if an NVIDIA Tesla P4 and a Xeon E5-2620 v2 are capable of running the 2016 version of Doom. To give you a heads-up: it works. But I had to clear some hurdles to get it working smoothly. One of those hurdles came from hammering the NVIDIA Tesla P4 in such a way that a single desktop could take all of its resources. As a result, my whole system crashed. But let’s go back a couple of days.

Years ago, I was addicted to first-person shooters like Quake, Team Fortress Classic and Half-Life: Counter-Strike. All of them were great games, especially when playing in clans with 8 against 8 team deathmatch or capture the flag. As the years went by, I stopped playing games because my focus shifted elsewhere. That gaming focus came back a little when the “new” Doom was released, 1.5 years ago. A great game with a slick UI and engine, with support for Vulkan.

A couple of weeks ago at the NVIDIA GRID Days in Santa Clara, I was talking with Sean Massey about how cool it would be to run a game like Doom on a VDI. Not because we should, but just because we can. And so, the research began.

To be able to run a game inside a virtual desktop, a powerful GPU is required, as hardware-accelerated graphics are needed to run games smoothly. A powerful CPU is required as well, as the virtual desktop needs to run next to some possibly noisy neighbours like vCenter, vROps, NSX Managers, etc. A full list of components of my homelab can be found on this page.

I decided to use an NVIDIA Tesla P4 as the GPU because it has an 8 GB framebuffer and enough cores to handle the resource requests.

I also noticed that NVIDIA did the same thing at the GTC ’18 conference this week:

 

I deployed a virtual desktop with Windows 10, 6 cores and 8 GB of RAM and enabled it to use vGPU with a grid_p4-8q profile (which means that the full 8 GB of framebuffer is allocated to that one desktop). After the first session to the virtual desktop, I configured the NVIDIA driver in the desktop to use the GPU for Game Development (as this applies the settings intended for running games). I tested both Blast Extreme and PCoIP, but since my home internet connection only has 40 Mbit/s of upstream bandwidth, I chose Blast Extreme over PCoIP because it handles limited bandwidth very well. But more about that in the VDI Design Guide.

Everything was set to start hammering the system. And so it happened. After starting Doom and being really impressed by the overall user experience, the connection dropped, and when I reconnected, the virtual desktop had switched from the GPU to the software 3D renderer from VMware. So, the P4 had some issues. But what? I got the following message from nvidia-smi on my ESXi console:

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

The cause of the card failing turned out to be simple. The P4 has no active cooling, so it relies fully on the cooling in the host. And as my lab is built to be as silent as possible, that became the challenge. The P4 reached temperatures up to 92 degrees Celsius (about 198 degrees Fahrenheit) and was ready to boil eggs.
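A minimal sketch for keeping an eye on the GPU temperature while a session is running. It assumes `nvidia-smi` is on the PATH of the machine where the script runs and that a Python 3.7+ interpreter is available there. The function names and the `WARN_AT_C` threshold are made up for illustration and are not part of any NVIDIA tooling. It does not handle error cases such as a GPU that has already dropped off the bus.

```python
import subprocess
import time

# Hypothetical warning threshold in degrees Celsius; pick one that suits your card.
WARN_AT_C = 85


def parse_temperatures(output):
    """Turn nvidia-smi CSV output (one integer per GPU, one per line) into a list."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU it can see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_temperatures(result.stdout)


def watch(interval_seconds=5):
    """Print temperatures until interrupted, flagging readings at or above WARN_AT_C."""
    while True:
        for index, temp_c in enumerate(read_gpu_temperatures()):
            flag = "  <-- too hot" if temp_c >= WARN_AT_C else ""
            temp_f = celsius_to_fahrenheit(temp_c)
            print(f"GPU {index}: {temp_c} C ({temp_f:.1f} F){flag}")
        time.sleep(interval_seconds)
```

Calling `watch()` in a second console during a game session prints one line per GPU every few seconds, which makes it easy to see how quickly a passively cooled card heats up under load.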

The solution had to be found in active cooling. Unfortunately, no coolers for the P4 exist. So, I had to build something from scratch. At my employer ITQ, we have an Ultimaker 3D printer. And I’m quite handy with Tinkercad. So, the geek adventure of creating something cool began.

The first thing I did was measure all dimensions of the P4, the cooler and my case. I took an 80 mm silent fan:

The next thing was to create an object in Tinkercad that was able to connect the fan to the GPU. This is the result:

Next, I calibrated the Ultimaker Original and generated the G-code file for the printer. Printing had begun!

The final result was awesome. Everything fit as intended:

Next, it was time to fit the card and fan in my case, and that also fit perfectly:

In front of the GPU cooler, there is a second fan that pulls the hot air out of the case.

Next, it was time to run the test again to see if the P4 was able to stay away from thermal challenges. Running Doom in Full HD on a single monitor went pretty smoothly:

The session was hammering the P4 quite hard. So, time to check the final result:

As you can see, the temperature is at 74 degrees Celsius (around 165 degrees Fahrenheit), and it stayed around that value for 10 minutes during a game of Doom. So far, I’m quite satisfied with the end result.
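To judge a test run like this on more than a single reading, the logged temperatures can be summarised afterwards. The sketch below is one way to do that. The function name is hypothetical, the sample values are illustrative and not measurements from this test, and the limit of 92 degrees Celsius is just the temperature the card reached before it failed, used here as an assumed upper bound.

```python
from statistics import mean


def summarize_run(samples_c, limit_c):
    """Return the peak, the average, and whether every sample stayed below limit_c."""
    if not samples_c:
        raise ValueError("no temperature samples")
    peak = max(samples_c)
    return {
        "peak_c": peak,
        "mean_c": round(mean(samples_c), 1),
        "within_limit": peak < limit_c,
    }


# Illustrative values only, not measurements from the article.
example_samples = [70, 73, 74, 75, 74, 74]
summary = summarize_run(example_samples, limit_c=92)
print(summary)
```

Looking at the peak rather than only the average matters here, because a single spike is enough to make the card drop out.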

If you would like to download the files yourself, you can do so here:

The gCode file:

The OBJ file:

Johan van Amersfoort

Johan van Amersfoort is a VCDX-DTM, VMware EUC Champion and vExpert working as a Technical Marketing Architect and EUC specialist at ITQ Consultancy. More about Johan can be found on the about page.