Setting up a machine with GPU(s) in Google cloud — step-by-step instructions
Logging my experience setting up a Google cloud machine with one or more GPUs (up to 8 are possible now — 4 June 2019) for running machine learning models.
For creating an instance with CPUs only, skip this article entirely and go directly to this link — @aankul.a explains the steps very clearly in his article. This article is relevant for procuring GPUs.
The overall sequence of steps is:
- Create a Google cloud account if we don’t have one.
- Create a project.
- Create VM instance with the desired number of GPUs, OS, hard disk space, CPUs, CPU memory, etc.
- Install Nvidia drivers for the specific OS (link to drivers for different OS specified below). Confirm installation with a GPU test.
Steps 1–4 should take 25–45 minutes on average the first time. Subsequent provisioning (steps 3–4 only) should take around 15–20 minutes.
1. Google account creation
Follow the Google cloud account creation/signup process. There is a $300 free credit valid for 12 months. There is also an always-free option with limited access; this FAQ has the details. Note that GPUs/TPUs are not included in the free option.
2. Create Project
Click on the three dots in the Google cloud platform page to create a new project. Name the project.
3. Create VM instance
Click the create VM instance button.
The key choices to make in this step are GPU counts, CPU counts, CPU memory, OS type, hard disk space, enabling HTTP, and lastly preemptible/non-preemptible instance.
3a. GPU selection
The default option does not show GPU choices — just CPUs. Click the customize option to open the GPU selection window.
In the choice of GPUs, we can choose from 4 GPU types to date — K80, P4, T4, and V100. For K80 and V100, we can choose up to 8 GPUs per machine; for P4 and T4, up to 4 GPUs per machine. The costs (displayed on the top right) vary based on the machine type — K80 is the cheapest at $0.349 hourly and V100 the most expensive at $1.77 hourly. These are non-preemptible instance costs. Preemptible instance costs are lower — $0.146 hourly for K80 and $0.751 hourly for V100. The preemptible option may be fine for some use cases, but we need to be careful when choosing it, because we lose the entire machine, including our work, when the machine is preempted.
A few key factors to consider when choosing the type of GPU are:
- On-board GPU memory capacity (non-configurable — it comes with the machine type). This can be a limiting factor when loading some models — for instance, certain BERT models will fail to load into the GPU with out-of-memory CUDA errors — so we need to factor that in. GPU memory ranges from 8GB to 16GB per GPU, depending on the GPU type. This link has the capacity for all GPU types.
- The performance of the GPU — K80 on the low end and V100 on the high end.
- Not all data centers/regions have all the GPU types. Also, trying to create new multi-GPU instances in certain regions (US east coast, US central) can fail at certain times of day. When a create-instance request fails, I have had very little luck retrying in that center; I try another center instead.
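One way to check GPU availability before attempting a create is to list accelerator types per zone with the gcloud CLI. A sketch (the zone and GPU names here are illustrative examples):

```shell
# List the GPU types available in one specific zone
gcloud compute accelerator-types list --filter="zone:us-central1-a"

# Or search all zones for a particular GPU type, e.g. V100
gcloud compute accelerator-types list --filter="name=nvidia-tesla-v100"
```

This only tells us which types a zone offers, not whether capacity is free at that moment, so a create can still fail.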
3b. CPU count
Choose the number of CPUs required for our task. Many CPUs are typically needed only if we are doing some pre-processing or post-processing on multiple CPUs. Choosing a large number of CPUs (> 16) may sometimes have an impact on the GPU count we chose — that is, the center may not have a machine configuration with both the CPU and GPU counts we need.
3c. CPU memory
We can choose the amount of CPU memory we need based on our task. There is an extended memory option where pricing for extra memory kicks in above a usage-based threshold.
3d. OS type
The desired OS is chosen by the option below.
We have several choices — Debian, Ubuntu, Red Hat, Windows, etc. (it is a scrollable region). There are also some images with PyTorch and TensorFlow pre-installed that are worth considering in some cases.
3e. Disk space
Disk space choice is in the same window as the OS choice. Choosing the right amount of disk space is key, and it is worth erring on the generous side to avoid starting all over again — we need to factor in the disk space for software (e.g. anaconda, TensorFlow, PyTorch), our machine learning model, data, etc.
3f. HTTP enable
The option below enables HTTP access. We need to make sure to enable it if we need HTTP access (e.g. hosting a model for outside testing).
3g. Creating non-preemptible machines
This option lets us choose preemptible machines (the default is non-preemptible). As described earlier, preemptible instances should be chosen with caution. It is worth checking that the default is indeed non-preemptible if we do not want the rug pulled out from under our feet.
We can now click the create instance button at the bottom of the page. If we succeed (this may fail in some regions due to lack of GPUs — we need to retry), we will have an instance listed as shown below.
We can now log in to our machine right from this screen (the ssh option next to the red rectangle) or from a shell on our laptop using the external IP address listed (masked in red). Logging in directly from the instance page sometimes takes too long, so it is best to paste our ssh keys into the metadata/SSH keys panel shown below.
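As a sketch of that ssh setup (the username, key file name, and IP below are placeholders, not values from this walkthrough):

```shell
# Generate a key pair locally if we don't already have one
ssh-keygen -t rsa -f ~/.ssh/gcp -C "my_username"

# Show the public key, then paste it into the
# Compute Engine -> Metadata -> SSH Keys panel in the console
cat ~/.ssh/gcp.pub

# Log in using the instance's external IP from the VM instances page
ssh -i ~/.ssh/gcp my_username@EXTERNAL_IP
```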
The create-instance steps we did above could all be done with the gcloud console application instead.
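A hedged sketch of what such a gcloud create command might look like — the instance name, zone, machine type, counts, and image are illustrative choices, not prescriptions from this article:

```shell
# Create a 1 x K80 instance; GPU instances must allow termination on
# host maintenance, hence --maintenance-policy=TERMINATE
gcloud compute instances create my-gpu-instance \
  --zone=us-west1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure
# Add --preemptible for the cheaper preemptible pricing discussed above
```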
When working on a remote machine, as with any remote login, it is best to use an application like screen so that our work is preserved and keeps running even if we lose the connection.
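For example, a typical screen workflow (the session name is arbitrary):

```shell
screen -S training        # start a named session on the remote machine
# ... launch the long-running job inside the session ...
# detach with Ctrl-a d; the job keeps running on the server
screen -ls                # after reconnecting, list existing sessions
screen -r training        # reattach to the named session
```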
4. Install Nvidia drivers
We need to install Nvidia drivers when starting from a plain OS image as we did above. Before that, we may have to install some basic utilities.
```shell
sudo apt-get install zip
sudo apt-get install bzip2
sudo apt-get install vim
```
The driver install code for various OS versions is listed here. A sample of that page, with drivers for various OS choices, is shown below.
Cut and paste the code snippet for the OS image we chose into a file, say driver_install.sh, and run it on the remote machine.
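Running the saved script might look like the following (the file name is simply whatever we chose above):

```shell
# Make the pasted snippet executable and run it with root privileges
chmod +x driver_install.sh
sudo bash driver_install.sh
```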
Once the installation completes (some installs may prompt us in between for a language choice, etc.), run nvidia-smi to check that the GPUs are listed.
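For reference, the check is simply:

```shell
# Lists each GPU with its driver version, memory usage, and utilization
nvidia-smi

# Refresh every second to watch GPU memory while a job runs
watch -n 1 nvidia-smi
```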
Sample outputs for a single GPU instance and a 4 GPU instance are shown below. The same command is also useful for checking GPU memory usage as we run our tasks.
Single GPU instance output of nvidia-smi command
4 GPU instance output of nvidia-smi command
8 GPU instance output of nvidia-smi command
Once we are done with our task, we need to remember to release the machine to avoid unnecessarily paying for compute cycles we didn’t use. The release option (the three vertical dots in the figure below) is a pop-up menu with stop/reset/delete, etc.
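The same can be done from the command line with gcloud — a sketch with illustrative names (note that a stopped instance still incurs disk and static-IP charges, only the compute billing stops):

```shell
# Stop the instance (keeps the boot disk, so our work is preserved)
gcloud compute instances stop my-gpu-instance --zone=us-west1-b

# Delete it entirely once the disk contents are no longer needed
gcloud compute instances delete my-gpu-instance --zone=us-west1-b
```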
Lastly, my experience so far with Google compute machines is that they are reliably accessible through ssh from my laptop (the browser shell login is extremely unreliable — it may at times sit there forever trying to log us in). I have had one bad experience in the last six months of a non-preemptible machine becoming completely inaccessible; I lost about a week’s worth of work. Since then I have been careful to save my work off the machine at intermediate steps.