Setting Up PyTorch with GPU Support on EC2 without Preconfigured AMIs
Amazon Web Services (AWS) Elastic Compute Cloud (EC2) is a powerful and scalable option for computing. It gives developers access to virtual computing environments equipped with high-performance processing units like GPUs (Graphics Processing Units). GPUs accelerate the training of complex machine learning models, enabling tasks that would be impractical or exceedingly slow on standard computers. This is particularly vital for deep learning models, which require substantial computational power to process large datasets and perform intricate calculations.
When you spin up an EC2 instance, AWS offers you the choice of configuring that instance from scratch or leveraging a prebuilt Amazon Machine Image (AMI). A prebuilt AMI is a template that contains a software configuration (an operating system, tools, and applications) for a specific purpose. For example, you might use a prebuilt AMI configured for deep learning.
Although the prebuilt AMIs are great, they aren't free and can increase the cost of your EC2 instance. Over a long enough period of time, these increased costs can become significant. By configuring your EC2 instance from scratch, you not only save on costs but also gain a deeper understanding of the setup process and the ability to tailor your environment to your specific needs.
Recently, I had to configure an EC2 instance from scratch. I spent a whole bunch of hours trying to piece together documentation from a variety of sources. The remainder of this post details the steps I took to configure the machine, and hopefully can save someone some confusion in the future.
Preparing Your Instance
As a disclaimer, this tutorial might not work out of the box. You need an AWS account with the required roles and permissions to create an EC2 instance. Additionally, AWS accounts don't come standard with access to GPU machines – you might have to submit a quota increase request to be able to spin up an EC2 instance with a GPU. Feel free to reach out if you need help.
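If you do need the increase, you can request it from the CLI as well. Below is a sketch; to my knowledge the quota code covers on-demand G and VT instances (the vCPU-based quota family g4dn belongs to), but verify it in the Service Quotas console before submitting.
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-DB2E81BA --desired-value 4
# Requests a limit of 4 vCPUs, enough for a single g4dn.xlarge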
There are a multitude of ways you can interact with AWS, ranging from the AWS Management Console to Terraform. However, for the purposes of this tutorial, we will interact with AWS via the AWS CLI in the terminal.
Set Up AWS CLI
Setting up the CLI is simple! You will need to provide your AWS Access Key ID, Secret Access Key, and region.
aws configure
aws sts get-caller-identity # validate you configured this correctly
# {
# "UserId": ,
# "Account": ,
# "Arn":
# }
Spin up EC2 Instance
Amazon offers a plethora of EC2 instance types with different hardware, and the price of each instance type varies with its specifications. We will spin up a g4dn.xlarge, a low-cost GPU machine designed for machine learning inference. Keep in mind, it's important to select a machine appropriate for whatever you are trying to do.
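If you want to compare GPU instance types from the CLI before committing, something like the sketch below should work (the GpuInfo fields are part of the standard describe-instance-types output):
aws ec2 describe-instance-types --filters "Name=instance-type,Values=g4dn.*" --query "InstanceTypes[].[InstanceType, GpuInfo.Gpus[0].Name, GpuInfo.Gpus[0].Count]" --output table
# Lists each g4dn size alongside its GPU model (T4) and GPU count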
The first thing we need to do is select an AMI. Choose a minimal AMI, in other words, one that hasn't been configured for a specific use case. You can use the command below to get a list of AMIs and their IDs. Once you have selected an AMI, visit AWS's website to ensure it is not a paid AMI.
aws ec2 describe-images --owners amazon --filters "Name=description,Values=Amazon Linux 2023 AMI*" --query "Images[*].[ImageId, Description]" --output text
# Keep track of the AMI ID, you will need it later
Next, we need to set up a key pair: a public and private key you can use to access the EC2 machine securely.
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
chmod 400 MyKeyPair.pem
Additionally, we need to create a security group to dictate how we will access our instance. In our case, we will prepare the machine for SSH access. For security reasons, I recommend replacing the CIDR below with the IP address you want to access the instance from; otherwise it will be accessible from every IP address in the world.
aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
# The command above will output your group ID
aws ec2 authorize-security-group-ingress --group-id YOUR_GROUP_ID --protocol tcp --port 22 --cidr 0.0.0.0/0
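Instead of opening port 22 to the world, one option (a sketch using AWS's public IP-echo service at checkip.amazonaws.com) is to look up your own address first:
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress --group-id YOUR_GROUP_ID --protocol tcp --port 22 --cidr "${MY_IP}/32"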
We are ready to spin up our instance.
aws ec2 run-instances --image-id YOUR_AMI_ID --count 1 --instance-type g4dn.xlarge --key-name MyKeyPair --security-group-ids YOUR_GROUP_ID
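Since the next command needs the instance ID, one convenient variant (a sketch with the same flags) captures it into a shell variable:
INSTANCE_ID=$(aws ec2 run-instances --image-id YOUR_AMI_ID --count 1 --instance-type g4dn.xlarge --key-name MyKeyPair --security-group-ids YOUR_GROUP_ID --query 'Instances[0].InstanceId' --output text)
echo $INSTANCE_ID # something like i-1234567890abcdef0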
Finally, we can connect to our instance.
aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query "Reservations[*].Instances[*].PublicDnsName" --output text
# The command above will output your instance's public DNS
ssh -i MyKeyPair.pem ec2-user@YOUR_INSTANCE_PUBLIC_DNS
In your terminal, you will notice something like [ec2-user@… prefacing each command, indicating you are connected to your EC2 instance.
Configuring the Instance for PyTorch
At this point, we are connected to our EC2 instance and want to enable PyTorch to access the machine's GPU. Before installing PyTorch, we need to install a couple of supporting tools:
- NVIDIA drivers: Allow the operating system to communicate with the NVIDIA GPU hardware.
- CUDA Toolkit: A parallel computing platform that provides a development environment for creating GPU-accelerated applications.
Pre-installation Requirements
There are a variety of pre-installation requirements to ensure a seamless driver installation. NVIDIA describes some of these in its documentation; others are courtesy of experience.
First, we will verify the system has a CUDA-capable GPU. If not, you need to launch a different EC2 instance type. A list of CUDA-capable GPUs can be found on NVIDIA's website.
lspci -vv | grep -i nvidia
# 00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
# Subsystem: NVIDIA Corporation Device 12a2
# Kernel driver in use: nvidia
# Kernel modules: nvidia_drm, nvidia
We will also verify the system is running a supported version of Linux. Our machine uses Amazon Linux 2023, which is a supported version.
cat /etc/os-release
# NAME="Amazon Linux"
# VERSION="2023"
# ID="amzn"
# ID_LIKE="fedora"
# VERSION_ID="2023"
# PLATFORM_ID="platform:al2023"
# PRETTY_NAME="Amazon Linux 2023"
# ANSI_COLOR="0;33"
# CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
# HOME_URL="https://aws.amazon.com/linux/"
# BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
# SUPPORT_END="2028-03-15"
Additionally, we need to verify the system has the correct Linux kernel headers installed.
sudo yum install kernel-devel-$(uname -r)
# Last metadata expiration check: 4:05:48 ago on Thu Mar 7 23:42:21 2024.
# Package kernel-devel-6.1.77-99.164.amzn2023.x86_64 is already installed.
# Dependencies resolved.
# Nothing to do.
# Complete!
Installing the NVIDIA drivers requires build tools such as make and gcc.
sudo yum install gcc make
# Package gcc-11.4.1-2.amzn2023.0.2.x86_64 is already installed.
# Package make-1:4.3-5.amzn2023.0.2.x86_64 is already installed.
# Dependencies resolved.
# Nothing to do.
# Complete!
Next, we need to blacklist the Nouveau drivers. Nouveau is an open-source driver for NVIDIA graphics cards. However, it's often better to use NVIDIA's proprietary drivers for improved performance and expanded compatibility.
sudo bash -c "echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo dracut --force # rebuild the initramfs so the blacklist takes effect (Amazon Linux uses dracut rather than update-initramfs)
The kernel-modules-extra package provides less commonly used kernel modules for the kernel package. The NVIDIA drivers require some of these less commonly used modules, so we need to install it.
sudo yum install kernel-modules-extra
# Last metadata expiration check: 15:08:27 ago on Thu Mar 7 23:42:21 2024.
# Package kernel-modules-extra-6.1.77-99.164.amzn2023.x86_64 is already installed.
# Dependencies resolved.
# Nothing to do.
# Complete!
Finally, reboot your system for changes to take effect.
sudo reboot # You will need to SSH back in once the machine reboots
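Once you SSH back in, a quick optional sanity check confirms Nouveau is no longer loaded:
lsmod | grep -i nouveau
# No output here means the blacklist worked and Nouveau is not loaded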
Installing NVIDIA Drivers, CUDA & PyTorch
Installing the NVIDIA drivers can be quite easy… as long as you know which version of each component you need. Certain versions of PyTorch require certain versions of CUDA, and certain versions of CUDA require certain versions of the NVIDIA driver. In our case, PyTorch 2.2 requires CUDA 12, and CUDA 12.4 requires NVIDIA Linux driver 550.54.14. In my experience, the easiest way to figure this out is via the PyTorch release compatibility matrix and the CUDA release notes.
Once you know which driver you need, locate the URL of its run file. Run files can be found on NVIDIA's driver download page.
wget https://us.download.nvidia.com/tesla/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
sudo sh ./NVIDIA-Linux-x86_64-550.54.14.run
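Assuming the installer finishes cleanly, nvidia-smi (which ships with the driver) is a quick way to confirm the driver is working:
nvidia-smi
# Should print a table showing the Tesla T4, driver version 550.54.14, and the maximum CUDA version the driver supports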
Similarly, to install CUDA you need to find the URL of the run file. CUDA run files can be found on NVIDIA's CUDA downloads page.
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
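The toolkit typically lands in /usr/local/cuda-12.4 (with a /usr/local/cuda symlink); in my experience you may need to put it on your paths before the tools are visible. A sketch:
export PATH=/usr/local/cuda-12.4/bin:$PATH # make nvcc visible
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH # make the CUDA libraries visible
nvcc --version # should report CUDA release 12.4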
We are finally ready to install PyTorch. You might have to install Python and pip separately.
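If they are missing, something like this should work on Amazon Linux (a sketch; package names vary by distribution):
sudo yum install -y python3 python3-pip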
pip3 install torch torchvision torchaudio
Finally, we can confirm PyTorch can access the GPU! Using the GPU within the training pipeline or other parts of the machine learning lifecycle is beyond the scope of this blog post.
$ python3
>>> import torch
>>> print(torch.cuda.is_available())
# True
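For slightly more assurance than a boolean, you can ask PyTorch for the device name and run a tiny computation on the GPU (these are standard torch.cuda calls):
>>> torch.cuda.get_device_name(0)
# 'Tesla T4'
>>> x = torch.ones(2, 2, device="cuda")
>>> (x @ x).sum()
# tensor(8., device='cuda:0')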
Closing Thoughts
Experienced AWS users might be wondering: why not just use SageMaker? AWS SageMaker is a fully managed service designed to let developers build, train, and deploy machine learning models quickly and efficiently. Specifically, it provides an IDE with built-in tools and capabilities such as:
- Integrated Jupyter Notebooks
- Built-in Algorithms & Frameworks
- Automated Model Tuning
- Easy Deployment
In some cases SageMaker might be exactly what you're looking for. However, building your own EC2 instance from scratch offers several benefits, such as:
- Cost control: EC2 gives you more direct control over cost, since you can do things like shut the instance down when it's not in use or bid on Spot prices. SageMaker abstracts much of this away, which can make it more expensive.
- Environment Customization & Flexibility: EC2 allows you to select your preferred operating system and configure the environment however you might like. This is helpful when a project needs a specific version of a library not supported by SageMaker.
- Specific Resource Selection: Unique projects may require certain hardware optimized for compute, memory or storage. EC2 provides a wider variety of choices in this case.
At this point, you know how to configure your own EC2 instance to give PyTorch GPU access, and why doing so can be a better option than SageMaker. The framework described in this post can be used to configure an EC2 instance for any purpose. The power of AWS EC2 lies in its versatility and scalability, making it an ideal platform not just for machine learning and data science projects, but for any computational task that demands flexible resource allocation and management.