What’s really cool is that Keras in R now has an interface to AI Platform (formerly Cloud ML) on Google Cloud Platform. You do some trial and error on your own computer and, once you’re good to go, you can submit your job to the cloud with a couple of lines of code.
This piece is part of a series of blog posts on using Keras in R:
– Part 1: Installing and Debugging
– Part 2: Training a basic model
– Part 3: Hypertuning a model
– Part 4: Running & hypertuning a training job in the cloud (GCP)
Just to be clear: you need a working project in Google Cloud. If you’ve never registered for an account, you get a free $300 to spend. Once that’s done, go to AI Platform in the GCP menu and follow the instructions to enable this component. With that set up, you can proceed to the rest of this blog.
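If you already have the gcloud CLI available, the same component can be enabled from the command line. A minimal sketch; `my-project-id` is a placeholder for your own project ID, and `ml.googleapis.com` is the API that backs AI Platform:

```shell
# Enable the AI Platform (Cloud ML Engine) API for your project.
gcloud services enable ml.googleapis.com --project my-project-id

# Verify it shows up in the list of enabled services.
gcloud services list --enabled --project my-project-id
```
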
There is a compatibility issue with R versions above 3.5. If you see the following error in the logs of your job, it means you are using R 3.6.0 on your local computer. That version saves RDS files in a format that is not backward compatible. Once such an RDS file is uploaded to AI Platform, it’s unreadable there, because the platform runs R 3.5.0.
```
Error in readRDS("cloudml/deploy.rds") :
  cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
The replica master 0 exited with a non-zero status of 1.
```
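Note that `deploy.rds` is written internally by the cloudml package, so you can’t change how that particular file is serialized; the reliable fix is to run R 3.5.x locally. For RDS files you create yourself, though, base R lets you pin the older format explicitly. A minimal sketch; `mtcars` and `data.rds` are just placeholders:

```r
# Check which R version you are running locally.
getRversion()

# R 3.6 writes serialization format version 3 by default.
# version = 2 forces the R 3.5-compatible format.
saveRDS(mtcars, "data.rds", version = 2)

# The file is now readable on both R 3.5 and R 3.6.
readRDS("data.rds")
```
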
What I recommend from this point onward is to stick with R 3.5.x locally, so that everything cloudml serializes and uploads can actually be read on the platform.
The next thing you should do is install the Google Cloud SDK. I tried to use my existing installation, but R was not able to use it, so I reinstalled the Google Cloud SDK through RStudio.
```r
install.packages("cloudml")
library(cloudml)
gcloud_install()
```
You can go into the Google Cloud terminal with the command gcloud_init(). However, I kept getting the following error:
Failed to start terminal: ‘system error 126 (The specified module could not be found)’
It drove me nuts for a while, but I circumvented this step by simply opening the Google Cloud SDK Shell myself. There, run gcloud init to get started and select a project, and log in with gcloud auth login.
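The steps in the SDK Shell boil down to this (run outside RStudio; the browser-based login opens automatically):

```shell
# One-time setup: pick an account and select (or create) a project.
gcloud init

# Authenticate, so the cloudml package can act on your behalf.
gcloud auth login

# Verify which account and project are active.
gcloud config list
```
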
Next: move the data set and the training script (see the previous blog post) into a new directory inside your current working directory. Once you submit your job to GCP, all files in the working directory are uploaded along with it. Those files might contain references to R packages that are not available on GCP, in which case you will get the following error:
Error: Unable to retrieve package records for the following packages
Since you will be deploying the training script to a server somewhere in the world, it must be able to run standalone: loading the data, preprocessing, and so on should all happen inside the training script.
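That isolation step can be sketched in base R. The directory and file names here are placeholders for your own script and data set:

```r
# Create a clean job directory holding only what the job needs.
dir.create("gcp", showWarnings = FALSE)

# Copy just the training script and the data set into it;
# everything else in the working directory stays behind.
file.copy(c("nn_ht_gcp.R", "data.csv"), "gcp", overwrite = TRUE)
```
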
Next up: setting the parameters for the job. This is done through a YAML file, which you can create and edit in RStudio. Once again, you can tune any parameter that is defined as a flag in the training script. You can find a complete list of possible parameters and their values on this reference page.
```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: dropout1
        type: DOUBLE
        minValue: 0.3
        maxValue: 0.6
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: neurons1
        type: DISCRETE
        discreteValues:
          - 128
          - 256
      - parameterName: neurons2
        type: DISCRETE
        discreteValues:
          - 128
          - 256
      - parameterName: neurons3
        type: DISCRETE
        discreteValues:
          - 128
          - 256
      - parameterName: lr
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: l2
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LINEAR_SCALE
```
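Every parameterName in the YAML has to match a flag declared in the training script, since that is how the tuning service passes values in. A minimal sketch using the tfruns flags API (the default values here are placeholders for your local runs):

```r
library(tfruns)

# Declare one flag per parameterName in tuning.yml.
# AI Platform overrides these defaults on each trial.
FLAGS <- flags(
  flag_numeric("dropout1", 0.4),
  flag_integer("neurons1", 128),
  flag_integer("neurons2", 128),
  flag_integer("neurons3", 128),
  flag_numeric("lr", 0.001),
  flag_numeric("l2", 0.001)
)

# Use them in the model, e.g. FLAGS$dropout1, FLAGS$neurons1, ...
```
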
Finally, in R, change to the appropriate working directory and submit the job to AI Platform.
```r
setwd('gcp')
cloudml_train('nn_ht_gcp.R', config = 'tuning.yml')
```
Important: before submitting your training job, make sure that AI Platform (Cloud ML) can write logs. Otherwise, when you submit a job, it will prepare the whole shebang and eventually just tell you “The replica master 0 exited with a non-zero status of 1.” To check, go to IAM in the GCP menu and verify that the Cloud ML Service agent actually has the Logs Writer role; if not, assign it that role.
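The same role can be granted from the command line. A sketch under assumptions: `my-project-id` and the project number `123456789` are placeholders, and the service-agent address follows the usual Cloud ML format, so double-check the exact account name on your IAM page:

```shell
# Grant the Logs Writer role to the Cloud ML service agent.
gcloud projects add-iam-policy-binding my-project-id \
  --member "serviceAccount:service-123456789@cloud-ml.google.com.iam.gserviceaccount.com" \
  --role "roles/logging.logWriter"
```
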
Now you can go to AI Platform in GCP and check whether the job is active. Clicking ‘View logs’ takes you to the Stackdriver logs, where you can monitor what’s going on behind the scenes.
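You can also keep an eye on the job without leaving R, using the job helpers that ship with the cloudml package (shown here for the most recently submitted job):

```r
library(cloudml)

# Inspect the most recently submitted job from the R console.
job_status()        # current state of the job
job_stream_logs()   # stream the Stackdriver logs into R
job_trials()        # hyperparameter trials and their metrics
job_collect()       # download the finished run once it completes
```
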