Here we look at running compute-intensive machine learning jobs with TensorFlow on the Google Cloud Platform (GCP) and on AWS/EC2 GPU instances, from the perspective of cost efficiency, training time, and operational issues.
This investigation is part of my effort in the open-source project terraAI, which requires a great deal of computing power for Machine Learning.
Please note that:
- For information on how to set up the GCP+TensorFlow environment, please see my previous post.
- For information about research related to the DCGAN machine learning system tested, please see a separate post here. The possible application of DCGAN is investigated in the How to Build a Holodeck series.
The following was recorded in October 2016. Since I expect GCP and TensorFlow to evolve quickly, this information may become outdated after a while.
Preemptible/Spot instances
Both AWS/EC2 and GCP offer substantial discounts on interruptible VM instances, which use the platform's excess computing capacity. These are called preemptible instances on GCP and spot instances on AWS/EC2. Such instances can be terminated at any time due to system events (such as when available capacity is tight), the market price exceeding your bid price (AWS/EC2), a time limit being exceeded (no more than 24 hours on GCP), and so on.
The following is a pricing example for GCP:
- Machine type: n1-highcpu-32
- vCPUs: 32
- Memory: 28.80GB
- GCEU: 88
- Price (USD) per hour: $0.928
- Preemptible price (USD) per hour: $0.240
As can be seen above, the discount is quite substantial. It is also worth noting that AWS/EC2 supports a bidding mechanism, so it is possible to bid a lower price for spot instances if you are willing to wait for prices to drop.
Note the following limitations for GCP:
Preemptible instances cannot live migrate or be set to automatically restart when there is a maintenance event. Due to the above limitations, preemptible instances are not covered by any Service Level Agreement (and, for clarity, are excluded from the Google Compute Engine SLA).
Handling instance interruption is very important; otherwise you may lose the results of training sessions that take many days to run.
Following are some notable differences between the two platforms in dealing with instance termination:
- AWS/EC2's spot instances can be terminated on a two-minute warning (as opposed to the more predictable 24-hour limit on GCP). On AWS/EC2, the main cause of instance termination is the market price (which changes all the time) exceeding your bid price.
- AWS/EC2 requires you to poll for the termination notice, which is more cumbersome than GCP's asynchronous notification mechanism.
Details about the AWS/EC2 spot interruption polling mechanism can be found here. The following is what happens when a preemption occurs on GCP:
- GCP's Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal. You can use a shutdown script to handle the preemption notice and complete cleanup actions before the instance stops (a sketch is given after this list).
- If the instance does not stop after 30 seconds, Compute Engine sends an ACPI G3 Mechanical Off signal to the operating system.
- Compute Engine transitions the instance to a TERMINATED state. You can simulate an instance preemption by stopping the instance.
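To illustrate, below is a minimal sketch of such a shutdown script, assuming the training job writes TensorFlow checkpoints to a local directory and that a Cloud Storage bucket is available to receive them. The directory and bucket names are hypothetical; the script would be registered as the instance's shutdown-script metadata.

```python
#!/usr/bin/env python
# Minimal shutdown-script sketch for a GCP preemptible instance.
# Assumptions: checkpoints live in CHECKPOINT_DIR, and BUCKET is an existing
# Cloud Storage bucket; both names are hypothetical examples.
import subprocess

CHECKPOINT_DIR = "/home/ml/checkpoints"
BUCKET = "gs://my-ml-results"

def main():
    # gsutil ships with GCP images; -m parallelizes the upload. The copy must
    # finish within the ~30 seconds before the ACPI G3 Mechanical Off signal.
    subprocess.call(["gsutil", "-m", "cp", "-r", CHECKPOINT_DIR, BUCKET])

if __name__ == "__main__":
    main()
```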
Per TensorFlow documentation, an AbortedError exception is raised in case of such preemption.
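The sketch below (not the actual test code) shows one way to combine periodic checkpointing with handling of that exception in a plain TensorFlow training loop. The checkpoint directory is hypothetical, and the trivial train_op stands in for the real DCGAN update ops.

```python
import os
import tensorflow as tf

# A trivial stand-in for the real model: one counter variable and an update op.
global_step = tf.Variable(0, name="global_step", trainable=False)
train_op = tf.assign_add(global_step, 1)   # replace with the real optimizer step

saver = tf.train.Saver(max_to_keep=5)
checkpoint_dir = "/tmp/dcgan-checkpoints"  # hypothetical local checkpoint directory
if not os.path.isdir(checkpoint_dir):
    os.makedirs(checkpoint_dir)
checkpoint_path = os.path.join(checkpoint_dir, "model")

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())  # tf.global_variables_initializer() in later versions

    # Resume from the latest checkpoint if one exists (e.g., after a preemption).
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest:
        saver.restore(sess, latest)

    step = 0
    try:
        while step < 100000:
            step = sess.run(train_op)
            if step % 500 == 0:               # save often enough to bound the lost work
                saver.save(sess, checkpoint_path, global_step=step)
    except tf.errors.AbortedError:
        # Raised when the session is interrupted, e.g., by a preemption.
        saver.save(sess, checkpoint_path, global_step=step)
```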
Restarting an instance
To restart a spot/preemptible instance after an interruption:
- GCP: this is a simple matter of restarting the stopped instance from the console. There is no need to configure a new preemptible instance.
- AWS/EC2: it can be pretty tedious in some situations:
- If the instance is defined as a one-time spot instance, then you will need to relaunch a new spot instance and go through all the configuration choices again, which is a chore. I wish AWS provided a way to save those configuration choices so that a one-time spot instance could be relaunched with a single click.
- If the instance is defined as a persistent spot instance, then it can be restarted automatically when capacity and price allow. Here you need to make sure that your instance is configured to get your Machine Learning job going automatically on reboot (see the sketch after this list), for otherwise you will be wasting money on an idle instance.
- There is no way to temporarily pause (or stop, in AWS/EC2 terminology) a spot instance. The best you can do is to terminate the instance and then relaunch it (going through the hassle of reconfiguring the launch).
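As an illustration of the persistent-instance approach, here is a rough sketch using the boto3 library. The AMI ID matches the one used in the tests below, but the bid price, region, key name, and paths are hypothetical; the user-data script runs when each newly launched instance of the persistent request boots, so the training job resumes without manual intervention.

```python
import base64
import boto3

# Runs on first boot of each instance launched by the persistent spot request;
# the repository path and command are examples only.
USER_DATA = """#!/bin/bash
cd /home/ubuntu/DCGAN-tensorflow
nohup python main.py --dataset celebA --train > train.log 2>&1 &
"""

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.request_spot_instances(
    SpotPrice="0.30",              # roughly half the g2.2xlarge on-demand price
    Type="persistent",             # relaunch automatically when capacity and price allow
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-75e3aa62",          # the TensorFlow AMI used in the tests below
        "InstanceType": "g2.2xlarge",
        "KeyName": "my-key-pair",           # hypothetical key pair
        "UserData": base64.b64encode(USER_DATA.encode("utf-8")).decode("ascii"),
    },
)
```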
Pricing for GCP On-demand Instances
As a comparison, pricing examples for the GCP on-demand instances are given below:
- GCP Debian VM instance, 1 vCPU, CPU: Intel Haswell, 3.75 GB, cost: ~USD$30/month
- GCP Debian VM instance, 8 vCPUs, CPU: Intel Haswell, 30 GB, cost: ~USD$200-300/month
No GPU instance is available on GCP as of this writing.
The pricing above is for reference only. Since our goal is to run compute-intensive deep learning tasks, below we will compare GCP's 8-vCPU instances with AWS/EC2's GPU instances.
GCP vs AWS/EC2's GPU instances
Following are some results from running the same TensorFlow test case on GCP and on AWS/EC2 GPU instances.
Test case used
I used a DCGAN (Deep Convolutional Generative Adversarial Networks) implementation as the test case, mainly because of my interest in image (and later 3D model) generation (see my How to Build a Holodeck series). My thoughts about how to apply DCGAN towards such a goal can be found here.
- A TensorFlow implementation of DCGAN from the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks is used.
- The Celebrity image dataset (celebA) is used.
- The following tests use the AWS/EC2 spot instance and the GCP preemptible instance, which are much cheaper than the regular on-demand instances.
Environments tested
- GCP:
- Hardware: 8 vCPUs, 15GB memory, 20GB disk. Note that as of this writing no GPU instance is available on GCP.
- OS image: Debian GNU/Linux 8 (jessie)
- Software: TensorFlow 0.11, Python 2.7, installed under Anaconda (v 4.2.9)
- Preemptible instances are used for lower cost.
- GCP storage is used to persist changes between VM instances.
- AWS/EC2:
- Hardware: GPU instance g2.2xlarge (current generation), 8 vCPUs, 15GB memory, 60GB SSD
- OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind. This AMI comes with everything needed for the test pre-installed, except for scipy.
- Software: TensorFlow 0.9, Python 2.7, Cuda 7.5
- Spot GPU instances are used for lower cost
Results
Please note that the following are rough comparisons. In particular, costs often vary greatly by region, are adjusted constantly, and can be affected by many different options (such as load balancing).
- Time to execute one epoch:
- AWS/EC2 with GPU: 5032 seconds
- GCP: 34505 seconds
- Cost:
- AWS/EC2:
- Spot instance: USD$0.10 per Hour
- On demand: USD$0.65 per Hour
- GCP pricing:
- Preemptible: USD$0.06/hour
- On demand: USD$0.30/hour
Summary
- The AWS/EC2 instances are substantially more cost-efficient and also many times faster than GCP in training the same DCGAN model with TensorFlow. While the Google Cloud Platform has many things going for it, the lack of GPU instance support (or perhaps TPU support one day) really makes it uncompetitive for training Machine Learning models at this time.
- AWS/EC2's spot instances cannot be paused (i.e., stopped), but GCP's preemptible instances can. This means that if you have allocated a large and very expensive AWS/EC2 GPU spot instance (such as the p2.16xlarge, which costs USD$144.0 per hour) and you wish to pause it for a while for some reason, it is more problematic to deal with in AWS/EC2 than in GCP. You basically need to do the following before terminating the instance (a rough sketch follows the list):
- Make sure that your ML code is well written to checkpoint and reload partial results as needed.
- Copy partial results to a persistent storage (such as a mounted AWS/S3 bucket)
- Make an image out of the current instance, if you have installed or configured something that you wish the next spot instance to pick up.
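A rough sketch of the last two steps using boto3 is shown below; the bucket, file names, and instance ID are placeholders, and this is only one way to script it.

```python
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Copy partial results (e.g., the latest checkpoint file) to a persistent bucket.
s3.upload_file("/home/ubuntu/checkpoints/model-12000.ckpt",  # hypothetical local file
               "my-ml-results",                              # hypothetical S3 bucket
               "dcgan/model-12000.ckpt")

# 2. Snapshot the current instance as an AMI so that the next spot instance
#    starts with the same installed software and data.
ec2.create_image(InstanceId="i-0123456789abcdef0",           # the running spot instance
                 Name="dcgan-training-snapshot",
                 NoReboot=True)
```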
AWS/EC2 offers several GPU tiers, including the following (spot instance pricing as of this writing, all based on Linux and current-generation GPUs):
- g2.2xlarge USD$0.10 per Hour (tested above).
- g2.8xlarge USD$0.611 per Hour
- p2.xlarge USD$0.1675 per Hour
- p2.8xlarge USD$72.0 per Hour
- p2.16xlarge USD$144.0 per Hour
Curiously, the spot pricing for p2.8xlarge and p2.16xlarge is much higher than for the on-demand versions. I am not sure why this is the case.
A strange ramp-up effect was observed for the test case, which runs unusually slowly at the beginning. Details are as follows (for the g2.2xlarge instance):
- Extrapolating from the first 2% of the epoch, the cost would be USD$5.9/epoch.
- Extrapolating from the first 10% of the epoch, the cost would be USD$1.93/epoch.
- Extrapolating from the final 10% of the epoch, one epoch would take 5127 seconds, about the same as a g2.2xlarge instance but at a much higher cost.
For this reason, the computing times reported here are extrapolated from the latter half of the first epoch, to represent the steady-state throughput.
Using the test case above with all other parameters kept the same, the following are partial results from running the test (measured in cost per epoch):
- g2.2xlarge: 5032 seconds/epoch * USD$0.10/hour = USD$0.14/epoch
- g2.8xlarge: 5788 seconds/epoch * USD$0.611/hour = USD$0.98/epoch. Unexpectedly, this turns out to be slower than g2.2xlarge. I suspected some kind of configuration error but found none; one possible explanation is that the test code uses only a single GPU and does not take advantage of this instance's multiple GPUs.
- p2.xlarge: 3795 seconds/epoch * USD$0.1675/hour = USD$0.177/epoch.
- p2.8xlarge: unable to test; the spot request failed with the error "There is no Spot capacity for instance type p2.8xlarge in availability zone ...".
- p2.16xlarge: Not tested.
- Persistent storage. In my tests I use persistent storage (S3 buckets on AWS, or Cloud Storage on GCP) to store computing results independently of the VM instances. This is a very handy arrangement, but the following should be noted:
- Such persistent storage is much slower (on both AWS and GCP) than the local disk of a VM instance. For example, I found that simple operations (listing, unpacking, moving, reading) on a large dataset with 200,000 images could take hours or even days. I ended up putting such datasets on the local disk, which also means that I need to create a launchable image that includes the dataset, so that the next VM instance can pick it up. This is far from ideal.
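One mitigation, sketched below with hypothetical bucket and path names, is to store the dataset in the bucket as a single archive and copy it to local disk once at startup; transferring one large object is far faster than listing and copying a couple hundred thousand small ones. On AWS the gsutil call would be replaced with an equivalent aws s3 cp.

```python
import subprocess

# Fetch the dataset as one large archive instead of ~200,000 individual images;
# the bucket and local paths are examples only.
subprocess.check_call(["gsutil", "cp", "gs://my-ml-data/celebA.tar.gz", "/data/"])
subprocess.check_call(["tar", "xzf", "/data/celebA.tar.gz", "-C", "/data/"])
```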
Recommendations
1. Overall, AWS/EC2's g2.2xlarge seems to be a good value if you are on a budget. It is the least powerful current-generation GPU instance offered on AWS/EC2, but once you have it set up you can easily scale up to a more powerful GPU instance, paying more for speed. If you are not in a hurry to run your experiments, then one strategy is as follows:
- Use spot instances which cost a fraction of the on-demand version.
- Make sure that your program checkpoints its vital contents often, and that it can stand up to frequent unexpected termination and restart. Luckily TensorFlow has good support for saving and restarting models, so most programs written in TensorFlow are pretty good in this respect.
- Set a spot instance bid price at around half of the on-demand instance price. This way, your spot instance won't get terminated too often, while you can still take advantage of the long stretches of low prices that are often available for spot instances.
1. The top-end GPU instances, such as AWS/EC2's p2.8xlarge or p2.16xlarge, are not cheap. If you plan on running heavy machine learning jobs constantly, then buying your own GPU (e.g., the NVIDIA GeForce GTX Titan X) could be more cost effective. However, the cloud environment makes it much simpler to scale computing power up and down at will, and it also simplifies access, monitoring, and management. Which approach is better really depends on the weight you give to each factor (e.g., cost, convenience, scalability, ease of management, etc.).
1. I used a TensorFlow test case in my experiments above, mainly because TensorFlow is an open-source software library designed for scalability. If you expect that your Machine Learning system will need to be deployed at very large scale one day, then I'd recommend that you also implement your code on TensorFlow.
1. Keep your eyes on GCP: while it is not very useful for Deep Learning research at this time, I expect (and hope) that it will catch up with AWS/EC2 soon. Note that GCP does offer a range of Machine Learning services that are supposedly highly scalable, but since my interest is in conducting ground-breaking research, I have no need for those pre-packaged services.
Related posts
- Hands-on with TensorFlow on GCP - set up: my experience with setting up a Machine Learning environment using the Google Cloud Platform.
- Image interpolation, extrapolation, and generation: looking into the possibility of using DCGAN for the purpose of generating images (and eventually 3D models) from textual commands. This is part of the How to Build a Holodeck series.
- How to Build a Holodeck.