These days some models are released with implementations that run only on TPUs (e.g. the text-to-text transformer).

This is a log of my experience setting up a Google Cloud TPU node for running machine learning models.

The overall sequence of steps is:

  1. Create a Google Cloud account if we don’t have one.
  2. Create a project.
  3. Create a VM instance, choosing a single CPU or GPU, the OS, hard disk space, memory, etc.
  4. Install the ctpu tool on the VM instance to create, manage and delete TPU instances. This step requires authorization, which can be done using gcloud (it comes installed by default on VM instances).
  5. Set up a Google Cloud Storage bucket.
  6. We can now use TPUs and store data in the bucket. After using the TPUs, remember to release the TPU instance, the bucket storage (after saving any output we need) and the VM instance.

Instructions for steps 1–3 are covered in the article for setting up VM instances.

4. Installing and using ctpu

Fetch ctpu and include its location in PATH.
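A minimal sketch of this step, wrapped in a function so nothing runs until we call it. The download URL is the one I remember from the ctpu README and the install directory `$HOME/bin` is just my choice; treat both as assumptions and check them against the current ctpu documentation.

```shell
# Assumed download location (from the ctpu README); verify before use.
CTPU_URL="https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu"
# Hypothetical install directory; any directory we can write to works.
CTPU_DIR="$HOME/bin"

install_ctpu() {
  # Download the binary, make it executable, and put its directory on PATH.
  mkdir -p "$CTPU_DIR" &&
  wget -O "$CTPU_DIR/ctpu" "$CTPU_URL" &&
  chmod a+x "$CTPU_DIR/ctpu" &&
  export PATH="$CTPU_DIR:$PATH"
}
```

Calling `install_ctpu` changes PATH only for the current shell; to keep it across logins, the same `export` line can be added to `~/.bashrc`.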

The ctpu command lets us provision, manage and delete TPUs. Before it can perform these operations, however, it must be authorized. This can be done with gcloud: run its login command and follow the instructions to copy and paste the key from the browser.
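As a sketch of the authorization step: my assumption is that ctpu picks up gcloud's application-default credentials, so the login command would be the one below. The function name `authorize_ctpu` is mine, used only so that nothing runs until it is called.

```shell
# Assumption: ctpu reads gcloud's application-default credentials.
# The command prints a URL; open it in a browser, approve access,
# and paste the resulting key back into the terminal.
authorize_ctpu() {
  gcloud auth application-default login
}
```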

The simplest usage of ctpu is to create a TPU instance, check its status, and delete it.
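The three operations could look roughly like this. The TPU name and zone are hypothetical placeholders, and the helper function names are mine; `up`, `status` and `delete` are the ctpu subcommands.

```shell
# Hypothetical placeholders; substitute your own values.
TPU_NAME="my-tpu"
ZONE="us-central1-b"

# Provision the TPU.
tpu_up() { ctpu up --name="$TPU_NAME" --zone="$ZONE"; }
# Check whether it is running.
tpu_status() { ctpu status --name="$TPU_NAME" --zone="$ZONE"; }
# Release it when done.
tpu_delete() { ctpu delete --name="$TPU_NAME" --zone="$ZONE"; }
```

A typical session would be `tpu_up`, then `tpu_status` while working, then `tpu_delete` at the end.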

In most cases, however, we need to specify additional options. For instance, a model may require bringing up the TPU with a specific version of TensorFlow (e.g. for the text-to-text transformer, the up command is given additional options).
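A sketch of an `up` command with extra options, modelled on what I recall from the text-to-text transformer instructions. All the values below (name, zone, TPU size, TensorFlow version) are hypothetical placeholders; the model's own README gives the ones it actually needs. As far as I know, `--tpu-only` keeps ctpu from creating a companion VM (we already have one) and `--noconf` skips the confirmation prompt.

```shell
# Hypothetical placeholders; take the real values from the model's README.
TPU_NAME="my-tpu"
ZONE="us-central1-b"
TPU_SIZE="v3-8"
TF_VERSION="1.15"

# Bring up only the TPU, pinned to a specific size and TensorFlow version.
tpu_up_with_options() {
  ctpu up --name="$TPU_NAME" --zone="$ZONE" \
    --tpu-size="$TPU_SIZE" --tf-version="$TF_VERSION" \
    --tpu-only --noconf
}
```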

5. Setting up and managing bucket storage

To run a model, we may have to copy the model to our VM instance or store it in a bucket. We can use gsutil for this (it also comes installed by default on VM instances).
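A minimal sketch of the bucket operations with gsutil. The bucket name and region are hypothetical placeholders (bucket names must be globally unique), and the helper function names are mine.

```shell
# Hypothetical placeholders; substitute your own values.
BUCKET="gs://my-tpu-bucket"
REGION="us-central1"

# Create the bucket in the chosen region.
make_bucket() { gsutil mb -l "$REGION" "$BUCKET"; }
# Copy a local file or directory (recursively) into the bucket.
upload_to_bucket() { gsutil cp -r "$1" "$BUCKET/"; }
# List what the bucket currently holds.
list_bucket() { gsutil ls "$BUCKET"; }
```

For example, `make_bucket` followed by `upload_to_bucket ./model` would place a local `model` directory in the bucket. It probably makes sense to keep the bucket in the same region as the TPU.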

Operations on a bucket require authorization. This can be done by running an authorization command and following its instructions to paste back the auth code from the browser URL it provides.
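One way to do this, assuming the gsutil on the VM shares credentials with gcloud (which I believe is the case when it comes bundled with the Cloud SDK), is a plain gcloud login. The function name `authorize_gsutil` is mine.

```shell
# Assumption: the bundled gsutil reuses gcloud's user credentials.
# The command prints a URL; open it, approve access, and paste the
# auth code back into the terminal.
authorize_gsutil() {
  gcloud auth login
}
```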

6. Deleting instances

Finally, when our work is done, we can delete our TPU, bucket storage and VM instance.
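The cleanup could be sketched as three small helpers, one per resource. All names below are hypothetical placeholders, and the helper function names are mine. Note that `gsutil rm -r` on a bucket removes its contents and the bucket itself, so any output we need should be saved first.

```shell
# Hypothetical placeholders; substitute your own values.
TPU_NAME="my-tpu"
ZONE="us-central1-b"
BUCKET="gs://my-tpu-bucket"
VM_NAME="my-vm"

# Release the TPU.
delete_tpu() { ctpu delete --name="$TPU_NAME" --zone="$ZONE"; }
# Remove every object in the bucket, then the bucket itself.
delete_bucket() { gsutil rm -r "$BUCKET"; }
# Delete the VM; best run from Cloud Shell or another machine,
# since deleting the VM we are logged into ends the session.
delete_vm() { gcloud compute instances delete "$VM_NAME" --zone="$ZONE"; }
```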

One anomalous behavior of ctpu is that it may not show the active TPU node we provisioned when we check its status. This happens at times whether we procure the TPU from a VM instance or from Google Cloud Shell.

We can, however, delete the TPU instance with the delete option in the Google Cloud Platform web interface. Any active TPU node always shows up in the web interface (so does a VM instance, even if we provisioned it with the gcloud utility). The VM instance and the data bucket can also be deleted from the web interface.
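I used the web interface for this, but the gcloud utility should offer a command-line route as well. A minimal sketch, assuming the TPU was created as a TPU node in the given zone; the name and zone are hypothetical placeholders and the helper function names are mine.

```shell
# Hypothetical placeholders; substitute your own values.
TPU_NAME="my-tpu"
ZONE="us-central1-b"

# List the TPU nodes in the zone, including ones ctpu status may miss.
list_tpus() { gcloud compute tpus list --zone="$ZONE"; }
# Delete a TPU node by name.
delete_tpu_with_gcloud() { gcloud compute tpus delete "$TPU_NAME" --zone="$ZONE"; }
```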

One could potentially avoid creating a VM instance in the first place by just using Google Cloud Shell (as opposed to the gcloud utility, which gives us fine-grained control over required memory etc.), but this may not be possible in all cases due to memory/resource constraints in Cloud Shell.

Lastly, the VM instance we provision could be CPU-only or have a GPU, depending on the model requirements.

This answer is imported manually from Quora. https://qr.ae/TWPFEO

Machine learning practitioner