
Using Keras in R: Submitting a job to AI Platform


What’s really cool is that Keras in R now has an interface to AI Platform (formerly Cloud ML) on Google Cloud Platform. You do some trial and error on your own computer and, once you’re good to go, you submit your job with a couple of lines of code.

Just to be clear: you need a working project in Google Cloud. If you’ve never registered for an account, you can get $300 in free credit. Once that’s done, go to AI Platform in the GCP menu and follow the instructions to enable this component. With that set up, you can proceed with the rest of this post.

Compatibility Issues

There is a compatibility issue with R 3.6.0 and later. If you see the following error in the logs of your job, you are running R 3.6.0 or newer on your local computer. From 3.6.0 onward, R saves RDS files in a new default format (workspace version 3) that R versions older than 3.5.0 cannot read. The RDS file uploaded to AI Platform is therefore unreadable there because, going by the error message, AI Platform runs an R version that predates 3.5.0.

Error in readRDS("cloudml/deploy.rds") :
cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
[…]
The replica master 0 exited with a non-zero status of 1.
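A quick local check can tell you in advance whether you are affected. The sketch below relies only on what the error message implies, namely that R 3.6.0 and later write workspace version 3 by default. The helper name writes_rds_v3 is made up for illustration.

```r
# Hypothetical helper: TRUE when the given R version saves RDS files
# in serialization format 3 by default (R 3.6.0 and later).
writes_rds_v3 <- function(r_version = getRversion()) {
  as.package_version(r_version) >= "3.6.0"
}

if (writes_rds_v3()) {
  message("Local R is ", getRversion(),
          ": files written by saveRDS() default to format version 3.")
}
```

Calling writes_rds_v3() without arguments checks the running session; passing a version string such as "3.5.3" checks another installation.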

What I recommend from this point onward is:

The next thing you should do is install the Google Cloud SDK. I already had an installation, but R was not able to use it, so I reinstalled the Google Cloud SDK through RStudio.

install.packages("cloudml")
library(cloudml)
gcloud_install()

You can open the Google Cloud terminal with the command gcloud_init(). However, I kept getting the following error.

Failed to start terminal: ‘system error 126 (The specified module could not be found)’

It drove me nuts for a while, but I circumvented this step by simply opening the Google Cloud SDK Shell myself. Run gcloud init to get started and select a project, then log in with gcloud auth login.
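If you are unsure whether your R session can see the SDK at all, a small base R check along these lines might help. It is only a sketch: it looks at the PATH of the current session, while the cloudml package may also find an SDK installed elsewhere, and the helper name gcloud_on_path is made up.

```r
# Hypothetical helper: TRUE if a `gcloud` executable is on the PATH
# visible to the current R session.
gcloud_on_path <- function() {
  isTRUE(unname(nzchar(Sys.which("gcloud"))))
}

if (!gcloud_on_path()) {
  message("gcloud not found on PATH; open the Google Cloud SDK Shell ",
          "and run `gcloud init` and `gcloud auth login` there.")
}
```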

Next, move the data set and training script (see previous blog post) to a new directory within the current working directory. That’s because, once you submit your job to GCP, all files in your working directory are passed along with it. These files might contain references to R packages that are not available in GCP, in which case you will get the following error:

Error: Unable to retrieve package records for the following packages

Since you will be deploying the training script to a server somewhere in the world, it should be able to run standalone. Loading, preprocessing, etc. should all be in the training script.
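Moving the files can be done by hand, but here is a minimal base R sketch of the idea. The function stage_job_files is a made-up helper, the directory name gcp matches the one used later in this post, and the data file name in the example call is a placeholder.

```r
# Hypothetical helper: copy the files a job needs into a clean directory
# so that nothing else from the working directory gets uploaded.
stage_job_files <- function(files, dest = "gcp") {
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)
  copied <- file.copy(files, dest, overwrite = TRUE)
  if (!all(copied)) {
    stop("Could not copy: ", paste(files[!copied], collapse = ", "))
  }
  invisible(file.path(dest, basename(files)))
}

# Example call (my_data.csv is a placeholder for your own data set):
# stage_job_files(c("nn_ht_gcp.R", "tuning.yml", "my_data.csv"))
```

The function stops with an error if any file is missing, which is easier to debug locally than a failed job on AI Platform.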

Next up: setting the parameters for the job. This is done through a YAML file, which you can also create and edit in RStudio. Once again, you can tune every hyperparameter that is set as a flag in the training script. You can find a complete list of possible parameters and their values on this reference page.

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: dropout1
        type: DOUBLE
        minValue: 0.3
        maxValue: 0.6
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: neurons1
        type: DISCRETE
        discreteValues:
        - 128
        - 256
      - parameterName: neurons2
        type: DISCRETE
        discreteValues:
        - 128
        - 256
      - parameterName: neurons3
        type: DISCRETE
        discreteValues:
        - 128
        - 256
      - parameterName: lr
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: l2
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LINEAR_SCALE
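Each parameterName in this file has to match a flag of the same name in the training script. The training script itself is in the previous post and is not shown here, so the following is only a sketch of what such a declaration could look like, using flags(), flag_numeric() and flag_integer() from the tfruns package. The default values are placeholders, not the ones I used.

```r
library(tfruns)

# Placeholder defaults; during tuning, each trial is expected to
# override them with values drawn from the ranges in the YAML file.
FLAGS <- flags(
  flag_numeric("dropout1", 0.4),
  flag_integer("neurons1", 128),
  flag_integer("neurons2", 128),
  flag_integer("neurons3", 128),
  flag_numeric("lr", 0.001),
  flag_numeric("l2", 0.001)
)

# Inside the model definition the values are then read as
# FLAGS$dropout1, FLAGS$neurons1, and so on.
```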

Finally, in R, change to the appropriate working directory and submit the job to AI Platform.

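# Switch to the directory that holds only the job files
setwd('gcp')
# Submit the training script together with the tuning configuration
cloudml_train('nn_ht_gcp.R', config = 'tuning.yml')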

Enable logs in your GCP project

Important: before submitting your training job, make sure that Cloud ML (AI Platform) can write logs. Otherwise, when you submit a job, it will prepare the whole shebang and eventually just tell you that “The replica master 0 exited with a non-zero status of 1.” To check, go to IAM in the GCP menu and verify that the Cloud ML Service Agent has the Logs Writer role. If not, assign it that role.

Now you can go to AI Platform in GCP and check whether the job is active. By clicking on ‘View logs’, you can access the Stackdriver logs and monitor what’s going on behind the scenes.
