Setting up a machine with GPU(s) in Google Cloud — step-by-step instructions

Logging my experience setting up a Google Cloud machine with one or more GPUs (up to 8 possible as of 4 June 2019) for running machine learning models.

If you only need an instance with multiple CPUs (no GPUs), skip this article entirely and go directly to this link; @aankul.a explains those steps very clearly in his article. This article covers procuring GPUs.

The overall sequence of steps is:

  1. Create a Google Cloud account if we don’t have one.
  2. Create a project.
  3. Create a VM instance with the desired number of GPUs, OS, hard disk space, CPUs, CPU memory, etc.
  4. Install Nvidia drivers for the specific OS (links to drivers for different OSes are given below). Confirm the installation with a GPU test.

Steps 1–4 should take 25–45 minutes on average. Subsequent provisioning (repeating steps 3–4) takes about 15–20 minutes on average.

1. Create Google cloud account

Follow the Google Cloud link to create an account/sign up. There is $300 of free credit for the first 12 months. There is also an always-free option with limited access; this FAQ has the details. Note that GPUs/TPUs are not included in the free option.

2. Create Project

Click the three dots in the Google Cloud Platform page to create a new project, then name the project.

3. Create VM instance

Click the create VM instance button.

The key choices to make in this step are GPU count, CPU count, CPU memory, OS type, hard disk space, enabling HTTP, and lastly whether the instance is preemptible or non-preemptible.

The default option does not show GPU choices, just CPUs. Click the customize option to open the GPU selection window.

As of this writing, we can choose from four GPU types: K80, P4, T4, and V100. For the K80 and V100 we can choose up to 8 GPUs per machine; for the P4 and T4, up to 4. The hourly cost (displayed at the top right) varies by machine type: the K80 is the cheapest at $0.349 per hour and the V100 the most expensive at $1.77 per hour. These are non-preemptible prices; preemptible instances are cheaper ($0.146 per hour for the K80, $0.751 per hour for the V100). The preemptible option may be fine for some use cases, but we need to be careful when choosing it, because we lose the entire machine, including our work, when it is preempted.

A few key factors to consider when choosing the GPU type are:

  • On-board GPU memory capacity (non-configurable; it comes with the machine type). This can be a limiting factor when loading some models: for instance, certain BERT models will fail to load onto the GPU with out-of-memory CUDA errors, so we need to factor that in. GPU memory ranges from 8 GB to 16 GB per GPU, depending on the GPU type. This link lists the capacity of each GPU type.
  • The performance of the GPU: the K80 is on the low end and the V100 on the high end.
  • Not all data centers/regions have all GPU types. Also, trying to create new multi-GPU instances in certain regions (US east coast, US central) can fail at certain times of day. When an instance creation fails, I have had very little luck retrying in the same center; I try another one.
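Before trying to create an instance, one way to check which GPU types a given zone offers is the gcloud CLI; the zone and GPU names below are just examples:

```shell
# List the GPU (accelerator) types available in a specific zone.
# us-east1-c is an example zone; substitute the zone you plan to use.
gcloud compute accelerator-types list --filter="zone:us-east1-c"

# Or list every zone that offers a particular GPU type, e.g. the K80.
gcloud compute accelerator-types list --filter="name=nvidia-tesla-k80"
```

Note that this shows what a zone offers in principle, not whether capacity is free right now, so a create can still fail.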

Choose the number of CPUs required for the task. Many CPUs are typically needed only if we are doing some pre-processing or post-processing in parallel. Choosing a large number of CPUs (> 16) can sometimes constrain the GPU count we chose; that is, the center may not have a machine configuration with both the CPU and GPU counts we need.

We can choose the amount of CPU memory based on our task's needs. There is an extended-memory option where pricing for extra memory kicks in above a usage threshold.

The desired OS is chosen by the option below.

We have several choices: Debian, Ubuntu, Red Hat, Windows, etc. (it is a scrollable region). There are also images with PyTorch and TensorFlow pre-installed that are worth considering in some cases.

Disk space is chosen in the same window as the OS. Choosing the right amount of disk space is key; it is worth overestimating to avoid having to start all over again. We need to factor in the disk space for software (e.g. Anaconda, TensorFlow, PyTorch), our machine learning model, data, etc.

The option below enables HTTP access. We need to make sure to enable it if we need HTTP access (e.g. hosting a model for outside testing).

This option lets us choose preemptible machines (the default is non-preemptible). Preemptible instances should be chosen with caution, as described earlier. It is worth checking that the default is indeed non-preemptible if we do not want the rug pulled from under our feet.

We can now click the create instance button at the bottom of the page. If it succeeds (it may fail in some regions due to lack of GPUs, in which case we retry), we will have an instance listed as shown below.

We can now log in to our machine right from this screen (the SSH option next to the red rectangle) or from a shell on our laptop using the external IP address listed (masked in red). Logging in directly from the instance page can sometimes take too long, so it is best to paste our SSH public key into the Metadata/SSH Keys panel shown below.
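As an alternative to pasting keys by hand, the gcloud tool can generate and propagate an SSH key for us on first use; the instance name and zone below are examples:

```shell
# Opens an SSH session; creates and uploads a key pair if none exists yet.
gcloud compute ssh my-gpu-instance --zone=us-west1-b
```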

The create-instance steps we did above could all be done with the gcloud console application.
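For reference, a rough gcloud equivalent of the GUI steps above might look like the following. The instance name, zone, machine type, and image are all illustrative; note that GPU instances must be created with the TERMINATE maintenance policy, since GPU VMs cannot live-migrate:

```shell
# Create a VM with 1 K80 GPU, 8 vCPUs, Ubuntu 18.04, and a 100 GB boot disk.
# All names and values here are examples; adjust to your project and quota.
gcloud compute instances create my-gpu-instance \
    --zone=us-west1-b \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-k80,count=1 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE \
    --restart-on-failure
```

Adding `--preemptible` gives the cheaper preemptible pricing discussed earlier, with the same caveat about losing the machine.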

When working on a remote machine, as with any remote login, it is best to use an application like screen so our work is preserved and keeps running even if we lose the connection.
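A minimal screen workflow looks like this (the session name is arbitrary):

```shell
sudo apt-get install screen   # if not already installed
screen -S training            # start a named session
# ... launch the long-running job inside the session ...
# Detach with Ctrl-a d; the job keeps running on the server.
screen -r training            # reattach later, even after a dropped connection
screen -ls                    # list sessions if we forget the name
```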

4. Install Nvidia drivers

We need to install Nvidia drivers, since we started from a plain OS image as we did above. Before that, we may have to install some basic utilities.

sudo apt-get install zip
sudo apt-get install bzip2
sudo apt-get install vim

The driver-install code for various OS versions is listed here. A sample of that page, with drivers for various OS choices, is shown below.

Cut and paste the code snippet for the OS image we chose into a file, say driver_install.sh. Then, on the remote machine:

sudo bash driver_install.sh

Once the installation completes (some installs may prompt us in between, e.g. for a language choice), run the following command to check that the GPUs are listed:

nvidia-smi

Sample outputs for a single-GPU instance and a 4-GPU instance are shown below. The command above is also useful for checking GPU memory usage while our tasks run.
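For scripted monitoring, nvidia-smi also supports a machine-readable query mode (run on the GPU machine itself):

```shell
# Print per-GPU memory usage in CSV form; handy inside a script or cron job.
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv

# Or refresh the full display every 2 seconds while a training job runs.
watch -n 2 nvidia-smi
```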

Single GPU instance output of nvidia-smi command

4 GPU instance output of nvidia-smi command

8 GPU instance output of nvidia-smi command

Once we are done with our task, we need to remember to release the machine to avoid unnecessarily paying for compute cycles we are not using. The release option (the three vertical dots in the figure below) is a pop-up menu with stop/reset/delete options.
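The same stop and delete actions are available from the command line; the instance name and zone below are examples:

```shell
# Stop the instance: CPU/GPU billing stops, but the disk is kept (and still billed).
gcloud compute instances stop my-gpu-instance --zone=us-west1-b

# Delete the instance entirely, including its boot disk.
gcloud compute instances delete my-gpu-instance --zone=us-west1-b
```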

Lastly, my experience so far with Google Compute machines is that they are reliably accessible through SSH from my laptop (the browser shell login is extremely unreliable; it may at times sit there forever trying to log us in). In the last six months of use I have had one bad experience of a non-preemptible machine becoming completely inaccessible, and I lost about a week's worth of work. Since then I am careful to save my work off the machine at intermediate steps.
