Cloud computing with R
25 July 2019
Cloud computing provides computing resources that you can access remotely. This seminar by Ian Durbach gives an accessible, hands-on introduction to setting up R and Rstudio in a cloud environment. Most of the seminar focuses on creating a virtual machine using the cloud resources provided by Amazon Web Services. Ian demonstrates how to set up the virtual machine, get RStudio up and running, run R scripts, and upload/download data.
Cloud computing can be useful for doing large-scale analyses, for speeding up more conventional analyses, or simply to get an analysis off your computer so you can work on something else. Ian also briefly covers some other cloud (RStudio Cloud) and non-cloud options that might be useful in some cases (RStudio's new "jobs" feature).
If you use RStudio on the cloud without using the command line, you have a few options:
This is the fastest and easiest way to use RStudio on the cloud.
Sign up for a shinyapps.io account here. The account is free.
Go to https://protect-za.mimecast.com/s/yFvJCj2gl8i6QAMmIxxkI1 and sign in using your shinyapps.io username and password.
RStudio Cloud is based around the idea of R Projects that we cover in the 'workflow'? notebook. To start with you’ll have no existing projects, so start a new one by clicking on 'New Project'? and filling out a bit of information about your new project.
This automatically starts a remote instance on the RStudio cloud server. You’ll see the usual RStudio windows, but now you’re working on the cloud rather than your computer. You can install packages, upload and download data, and generally do everything that you would usually do in R. For more information select the 'Guide'? from the left-hand taskbar.
Google Colaboratory or Colab is a free service – all you need is a Google account. By default Colab notebooks use Python, but an R kernel is available that lets you run R. The big advantage of Google Colab is that it offers both CPU and GPU instances at no cost.
Go to https://protect-za.mimecast.com/s/tjNhCk5jm7SLz46BI0InrC. The first time you visit this site you should click on any of the three demos. Later on, you can select any notebook that you have already saved in your Google Drive.
Select File and Save a copy in drive. This saves a copy of the demo notebook to your own Google Drive.
Clear the demo code by selecting Edit then Select all cells then Delete selected cells. This clears the notebook. Change the name of the notebook by clicking on the heading Demo.ipynb and giving the notebook a new name. You can then add your own code cells or text cells as required.
To use a GPU instance, select Runtime, Change runtime type, and Hardware accelerator. Choose GPU or TPU as required. Only use these if you are doing some kind of analysis that requires these resources.
Amazon Web Services
Amazon Web Services offer the most flexible of the options we look at, but these can cost money and are potentially tricky to set up. This notebook describes far and away the easiest way to get started using Amazon Web Services, Keras and R (as far as I know). Louis Allett (https://protect-za.mimecast.com/s/v7-iClOkn7I9Nk5EhN7K6e) has set up Amazon Machine Images (AMIs) containing R, R Studio, Python, and a bunch of other software you might want. The best thing about these is that you work directly in R Studio, set up and deployment is all through your web browser, and you can connect to your Dropbox account to access your data and save results. This notebook is basically a minimal set of instructions for getting set up and started on one of these images. More detailed instructions and details are available on Aslett’s website.
Getting started with Amazon Web Services
Go to Amazon Web Services and 'Create a free account'?. Creating an AWS account is free and gives you immediate access to the AWS “Free Tier'?. On the Free Tier, you can use a t2.micro instance for up to 750 hours per month free, for 12 months.
Once you have created your account, go to https://protect-za.mimecast.com/s/NgylCmw0oyU0N6KlIw_VCZ. From the box on the right-hand side of the screen, choose an AMI (I use EU (Ireland) but it doesn’t really matter). This will take you to the AWS log in page. Log in.
After logging in, you will automatically be taken to a page where you can specify settings for the AMI you selected in the previous step. Here the main thing you need to choose is the type of machine (AWS calls these 'instances'?) you want to work on. More powerful instances are more expensive. The t2.micro instance is free and you should start here. Select this instance (it should be selected by default) and click 'Next: Configure Instance Details'?.
Step 3.1 (Configuring Instance Details)
Leave everything as default here and click 'Next: Add Storage'?.
Step 3.2 (Adding Storage)
Recommended step here is to leave as is and click 'Next: Add tags'?. Most instances comes with a root volume of 50Gb, which is more than enough for our needs. You pay for storage space regardless of whether you use it or not, at a rate of about 10 USc per Gb per month. Costs are prorated by the number of hours you have the storage available for i.e. you don’t pay for a month if you terminate an instance after a few days.
Step 3.4 (Configuring Security Group)
This step is important as it specifies the IP addresses that can access your instance. We’ll add one rule:
1. Click "Add rule". 2. Set "Type" to *Custom TCP*, "Protocol" to *TCP*, "Port Range" to *80*.
You can add or modify security at any stage, for example if you only want to allow your IP address to access the instance. We won’t do this here. Click 'Launch'?.
A window will come up 'Select an existing key pair or create a new key pair'?. As you use AWS more, and particularly if you use it to work with private or sensitive data, you will definitely want to set up key pairs to allow you to securely access your instance. The ease-of-use of this AMI comes at some cost to security. That is not too much of a concern when getting started, but it does mean that you must change the username and password from 'rstudio'? (Aslet’s defaults) a bit later on (see Step 11). Click 'Proceed without a key pair'? and tick the box below. Click 'Launch instances'?. On the next screen click 'View instances'?.
You will then be taken to the 'Instances'? dashboard, which you can always access by selecting 'Instances'? from the menu on the left hand side of the screen. The instance you just created will probably say 'Initializing'?, which refers to some checks that are done when an instance is started. These checks can take some time to complete (a few minutes). Wait until the 'Status Checks'? column says 2/2 checks…, which means the checks are complete.
The Instance dashboard is your 'main'? page on EC2. It shows you which of your instances are running (green circle, labelled running) and which are available but not running (red circle, labelled stopped). The running instances are the expensive ones. Do not leave an instance running unless you need to! We need to leave the instance we just created running while we add some software and do the rest of this notebook, but afterwards, remember to come back to this screen and stop the instance! You stop an instance by selecting the box next to its name, clicking the 'Actions'? button and choosing 'Instance state'? and 'Stop'?. When you want to start the instance again, do the same thing but choose 'Start'?. When you are totally done with the instance, do the same but choose 'Terminate'?. Once you have terminated an instance you will not be charged anything for it, but of course if you want to do anything you will need to create a whole new instance from scratch. It really boils down to how much you value your time!
Once the checks are complete, scroll down and find the 'IP v4 Public IP'? address. Copy and paste this into a new tab of your web browser. When prompted, enter 'rstudio'? as both the username and password. You should now see good old R Studio in your browser. You’re on the server!
You can now work in R Studio as you would normally, except that you are now working on your own AWS instance with, depending on the instance you selected, a lot more computing power at your disposal!
Carefully read the 'Welcome.R'? script that appears. This contains a lot of useful information, including how to change your password ('highly recommended'?!) and how you can connect the R Studio Server to your Dropbox account and access data and save results that way.
Install and configure the keras R package by entering the following lines one at a time in the R console.
install.packages("keras") library(keras) install_keras()
You are now ready to do whatever modelling you want to do.
Step 10 (IMPORTANT!!!!)
When you are done, don’t forget to go back to your AWS EC2 dashboard and Stop or Terminate the instance. Stopped instances cost a bit of money, even when stopped (you keep paying for storage) but have the advantage that they can be 'started'? agin (which means you don’t have to go through steps 2-9 each time). Once an instance has been terminated, no costs are incurred but it cannot be started again; it, and anything you stored on it, is gone forever. In any case, you should always remember to either stop or terminate your instances after use. Otherwise you keep paying!
If you are going to use AWS regularly, or use more expensive GPU instances, I strongly recommend finding out how to make use of spot requests (https://protect-za.mimecast.com/s/BWdrCoYnqyto38LYImeQh5).
If you just want to set up a basic 't2'? instance, you can skip this step. For some kinds of servers (e.g. the 'p2'? instances used for GPU computing) you need to make a special request for access before you can launch a instance. Select 'Limits'? from the menu on the left side of the screen. Scroll down until you find 'p2.xlarge'?. If the Current Limit is 0, then click the 'Request limit increase'? link and fill in the form there. It can take a few hours to get access. See here for more details. Note you cannot launch an instance before your limit is increased beyond zero (you will see a 1 or 2 or whatever your limit is reflected in the Current Limit column).
If you are going to use AWS with paid instances (not free ones) its a good idea to set up a billing alert. Unfortunately AWS doesn’t allow you to set a hard budget limit (as far as I know), but bill alerts will trigger an email to you tell you that you have spent a certain amount of money. See here.