This is the starting point for a series of blog posts about NVAIE (NVIDIA AI Enterprise). I will cover the topic in three parts:
- Part 1, Installation on VCF 4.3 (or any vSphere 7.0U2)
- Part 2, Use Cases
- Part 3, NVAIE with Tanzu
In short, NVAIE is a scalable solution for enterprise Artificial Intelligence. It runs on vSphere with minimal overhead while keeping all the flexibility of the VMware SDDC. NVAIE is also supported with Tanzu.
Disclaimer: we are using non-certified hardware, which is acceptable for a testbed and demos. For a production environment, please select fully certified hardware.
A good start is always to RTFM. You'll find all the details in this link, which describes the joint solution between NVIDIA and VMware.
The prerequisite is a vSphere 7.0U2 environment. Our lab and demo cluster was running VCF 4.1; my first partner in crime, Mikael Tissandier (HCI Specialist, VMware), quickly upgraded it to VCF 4.3, giving us a big refresh of ESX, VCSA and NSX-T without any interruption of our demos, which is quite impressive. We are now running VCF 4.3 with ESX 7.0U2, which is a hard prerequisite for NVAIE.
At the same time, we installed two NVIDIA T4 GPU cards in the PCI-e mezzanine of our Intel hosts, one card in each ESX.
The mezzanine we have is the model with only 8x data and 16x electrical. That is enough for demo purposes, but for production a full 16x PCI-e slot is recommended for better performance.
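To confirm that each host actually enumerates its T4, a quick check from the ESXi shell can help. This is a minimal sketch, assuming SSH access to the host and that the card's PCI description contains "NVIDIA"; the helper name `count_nvidia` is mine, for illustration only.

```shell
#!/bin/sh
# Count PCI devices whose description mentions NVIDIA.
# Reads lspci-style text on stdin, so it also works on saved output.
count_nvidia() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the function from failing in that case.
    grep -ci 'nvidia' || true
}

if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA PCI devices found: $(lspci | count_nvidia)"
else
    echo "lspci not available here; run this on the ESXi host"
fi
```

With one T4 per host, the count printed on each ESX should be at least 1.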
If you don't see the "PCI Devices" menu for the ESX host in VCSA, check that VT-d is enabled in the BIOS.
Now it's time to install the NVIDIA software, with the help of my second partner in crime, Mahmoud El Ghomari (Professional Visualization Senior Solution Architect, NVIDIA).
NVAIE needs a license server, which can optionally be installed in HA mode (primary and secondary VMs). You have the choice of Windows, Red Hat Linux, CentOS and Ubuntu; I chose to run it on CentOS 8 to get the latest release of the license server. I followed this guide. My recommendation is to open port 7070/tcp in the CentOS firewall with:
# firewall-cmd --zone=public --permanent --add-port=7070/tcp
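A rule added with `--permanent` is written to the saved firewalld configuration only, so it does not affect the running firewall until a reload. The sketch below shows the full sequence, assuming firewalld is the active firewall and the commands are run as root. The `run` wrapper and its `DRY_RUN` switch are my own addition: by default it only prints the commands, and `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0 (default is a dry run).
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

PORT="7070/tcp"   # license server port used in this post

run firewall-cmd --zone=public --permanent --add-port="$PORT"
run firewall-cmd --reload                     # apply the saved rule to the running firewall
run firewall-cmd --zone=public --list-ports   # the list should now include 7070/tcp
```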
Ask your NVIDIA sales representative for trial licenses or, even better, purchase them. You need to load them into the license server to use vGPU.
You need to download the NVAIE host software and the vGPU guest driver; for that, refer to the NGC documentation.
Steps to install the VIB on ESX:
- Copy the VIB file to a datastore
- Put ESX in maintenance mode
- SSH to the ESX host as root
- Retrieve the absolute path of the VIB file and install it:
esxcli software vib install --no-sig-check -v /vmfs/volumes/vsan-intel-datastore/BITS/NVIDIA-AIE_ESXi_7.0.2_Driver_470.53-1OEM.702.0.0.17630552.vib
- test it with:
It should give you this:
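The install steps above can be sketched as a small script run from the ESXi shell. Several things here are my assumptions rather than the exact procedure used in this post: entering maintenance mode through `esxcli` (on a vSAN cluster you may prefer to do it from vCenter), and using `esxcli software vib list` and `nvidia-smi` as the checks. The `run` wrapper is a hypothetical helper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, and executed only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Path of the VIB as used in this post; adjust it to your own datastore.
VIB="/vmfs/volumes/vsan-intel-datastore/BITS/NVIDIA-AIE_ESXi_7.0.2_Driver_470.53-1OEM.702.0.0.17630552.vib"

install_vib() {
    run esxcli system maintenanceMode set --enable true
    run esxcli software vib install --no-sig-check -v "$1"
    run esxcli software vib list   # look for the NVIDIA entry in the output
    run nvidia-smi                 # assumption: the tool ships with the host driver
}

install_vib "$VIB"
```

Keeping the dry run as the default makes it easy to review the exact commands before touching a host.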
You also have to verify that the Graphics option is set to "Shared Direct / Spread VMs" in VCSA.
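The same setting can also be read from the ESXi shell. This is a sketch under assumptions: as far as I recall, `esxcli graphics host get` reports "Shared Direct" as `SharedPassthru` and "Spread VMs" as the `Performance` assignment policy, and the field labels below are written from memory, so check them against your own output. The `field` helper is hypothetical.

```shell
#!/bin/sh
# Extract the value of one "Key: Value" line from text on stdin.
field() {
    sed -n "s/^ *$1: *//p"
}

if command -v esxcli >/dev/null 2>&1; then
    cfg=$(esxcli graphics host get)
    echo "type:   $(echo "$cfg" | field 'Default Graphics Type')"
    echo "policy: $(echo "$cfg" | field 'Shared Passthru Assignment Policy')"
else
    echo "esxcli not available here; run this on the ESXi host"
fi
```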
It was not necessary to reboot my ESX host. If in doubt, and since the host is already in maintenance mode, it is safer to reboot it. Then take the host out of maintenance mode.
That's it for vSphere; it's not a big deal.
To run AI workloads we need VMs and containers. Let's start by creating an Ubuntu VM, following this doc.
My recommendation, however, is to increase the disk to 200 GB; personally, I use LVM in Linux to extend the root filesystem. Without this, the VM does not have enough space to download all the frameworks.
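One way to grow the root filesystem after enlarging the virtual disk is sketched below. The device, partition number and logical volume path are assumptions based on a default Ubuntu Server LVM install, not values taken from this lab; check yours with `lsblk` and `lvs` first. `growpart` comes from the `cloud-guest-utils` package, and the `run` wrapper is a hypothetical helper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, and executed only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Assumed layout (default Ubuntu Server LVM install); verify before running.
DISK="/dev/sda"
PART_NUM="3"
LV="/dev/ubuntu-vg/ubuntu-lv"

run growpart "$DISK" "$PART_NUM"      # grow the partition into the new disk space
run pvresize "${DISK}${PART_NUM}"     # let LVM see the larger physical volume
run lvextend -r -l +100%FREE "$LV"    # grow the LV and resize its filesystem
run df -h /                           # confirm the new size of the root filesystem
```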
Then follow the Docker installation and framework download steps.
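Before moving on to validation, a quick sanity check that both the VM and a container can see the GPU may save time. This is a sketch, assuming the guest driver and the NVIDIA container runtime are already installed. `IMAGE` is a placeholder for whichever framework image you pulled, and the `run` wrapper is a hypothetical helper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, and executed only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Placeholder: set IMAGE to the framework image you downloaded.
IMAGE="${IMAGE:-your-registry/your-framework:tag}"

# GPU as seen by the guest OS.
run nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# GPU as seen from inside a container.
run docker run --rm --gpus all "$IMAGE" nvidia-smi
```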
The last step is to validate with this guide. You should get:
I executed this command in the container in verbose mode, like this:
/workspace/tensorrt/bin/trtexec --batch=128 --iterations=400 --workspace=1024 --percentile=99 --deploy=ResNet50_N2.prototxt --model=ResNet50_fp32.caffemodel --output=prob --int8 --verbose
Great! You are now ready to go further with ML/AI.