Using GitHub Actions as a job scheduler for R scripts
Static websites are a cost-effective and performant way to deploy simple websites. By definition, their content remains mostly unchanged, which makes them one of the simplest ways to ship a website. Another benefit is that static sites can be hosted on a simple file server, which reduces costs and shrinks the attack surface.
There are cases, however, where content or data in a static website needs to be updated. We can do that manually every time, but who’s got the time for that? If you use GitHub as your code repository, you can automate this process using a nifty feature called GitHub Actions. We will explore how GitHub Actions can be used to periodically fetch new data, perform preprocessing using an R script, and commit the changes automatically without provisioning any server. Most importantly, it’s free for public repositories!
We will use an example of a Covid-19 case visualizer I made in D3.js using data from Johns Hopkins University’s Covid-19 dataset, which is available on GitHub.
If you look at the repo, you will see a folder called R that contains a script. Every time the script runs, it fetches the latest data from the JHU dataset repo, transforms it for visualization, and writes the result to a file.
The goal here is to run this script periodically so that our repository always holds a reasonably fresh version of the data, which users then see in the visualization.
In a nutshell, GitHub Actions lets you run workflows on your Git repository to perform tasks such as CI/CD or running unit tests. Under the hood, GitHub provisions an isolated runner based on a YAML configuration file located in the .github/workflows folder of the project. This YAML file specifies the environment needed and the jobs to be run. The full syntax reference can be found in the documentation.
The YAML file used in our project is as follows:
Let’s break this down and see what each section is doing.
- name: This is the name of the workflow that will show up in the Actions tab.
- on.schedule: This tells GitHub to run the workflow on a schedule, using standard cron syntax. Other triggers include push and pull_request.
- jobs.<job-name>: The name of the job. There can be multiple jobs in a single workflow.
- jobs.<job-name>.runs-on: The type of machine the job runs on, such as a Linux, macOS, or Windows runner.
- jobs.<job-name>.steps: The list of commands or actions that run in sequential order.
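Put together, the keys above form a skeleton along these lines. This is a minimal sketch rather than the project’s actual file: the workflow name, the job name “refresh-data”, the six-hour cron expression, and the “ubuntu-latest” runner are all placeholder assumptions.

```yaml
# Hypothetical skeleton: every name and value below is a placeholder.
name: Refresh data            # shown in the Actions tab

on:
  schedule:
    # Fields: minute hour day-of-month month day-of-week (evaluated in UTC)
    - cron: "0 */6 * * *"     # at minute 0 of every sixth hour

jobs:
  refresh-data:               # jobs.<job-name>
    runs-on: ubuntu-latest    # type of machine the job runs on
    steps:                    # executed in order, top to bottom
      - name: Say hello
        run: echo "Hello from the runner"
```

Scheduled workflows run from the repository’s default branch, so the file has to be merged there before the schedule takes effect.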
Note that a step takes either a ‘run’ key, which executes a command in the shell, or a ‘uses’ key, which runs an action as a step. According to the documentation, the ‘uses’ key:
“Selects an action to run as part of a step in your job. An action is a reusable unit of code. You can use an action defined in the same repository as the workflow, a public repository, or in a published Docker container image.”
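To make the distinction concrete, here is a small sketch of the step forms described above. The local action path and the Docker image are hypothetical and appear only to illustrate the syntax; neither comes from this project.

```yaml
steps:
  # 'run' executes a command in the runner's shell
  - name: Print a message
    run: echo "Hello from the runner"

  # 'uses' with an action from a public repository, pinned to a ref
  - uses: actions/checkout@v2

  # 'uses' with an action stored in the same repository
  # (hypothetical path; the repository must be checked out first)
  - uses: ./.github/actions/my-action

  # 'uses' with a published Docker container image (hypothetical image)
  - uses: docker://alpine:3.8
```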
Below we explain what each entry in the steps field is doing:
- Check out the current repository into the runner using “actions/checkout@v2”.
- Install and set up R in the runner using the r-lib/actions/setup-r@master action.
- Install and set up renv. The shell key here tells the runner to run this command in the Rscript {0} shell, which is available because setup-r ran in the previous step.
- Cache the installed packages in the renv path so that they are not reinstalled every time the job runs. This cuts job running time considerably.
- Run renv::restore() to install the packages listed in the renv.lock file.
- Run our data processing script “R/data_prep.R”.
- Commit the newly generated files to our git repository. The “git diff-index” command ensures that we commit only when there are changes. Without it, git commit exits with a “Nothing to commit” message and a non-zero status, which the runner treats as an error.
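The steps above could be written roughly as follows. This is a reconstruction under stated assumptions, not the original workflow file: the workflow and job names, the daily cron expression, the runner, the actions/cache version, the RENV_PATHS_ROOT location, the commit identity, and the commit message are all illustrative. The checkout and setup-r refs are the ones named in the text, and newer releases of both actions exist.

```yaml
# Hypothetical reconstruction of the workflow described above.
name: Refresh data

on:
  schedule:
    - cron: "0 0 * * *"                    # assumed: once a day at 00:00 UTC

jobs:
  refresh-data:
    runs-on: ubuntu-latest                 # assumed runner
    env:
      RENV_PATHS_ROOT: ~/.local/share/renv # assumed renv cache location on Linux
    steps:
      - uses: actions/checkout@v2

      - uses: r-lib/actions/setup-r@master

      - name: Install renv
        shell: Rscript {0}                 # run the command with Rscript instead of bash
        run: install.packages("renv")

      - name: Cache packages
        uses: actions/cache@v2             # assumed version
        with:
          path: ${{ env.RENV_PATHS_ROOT }}
          # The cache key changes whenever renv.lock changes
          key: ${{ runner.os }}-renv-${{ hashFiles('**/renv.lock') }}
          restore-keys: |
            ${{ runner.os }}-renv-

      - name: Restore packages
        shell: Rscript {0}
        run: renv::restore()

      - name: Run the data processing script
        run: Rscript R/data_prep.R

      - name: Commit new files
        run: |
          git config --local user.name "github-actions"      # placeholder identity
          git config --local user.email "actions@github.com" # placeholder identity
          git add .
          # Commit only when the staged tree differs from HEAD
          git diff-index --quiet --cached HEAD || git commit -m "Update data"
          git push
```

The “||” in the commit step is what keeps the job green on quiet days: when diff-index finds no staged changes it exits with status 0, so git commit is never attempted.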
And that’s it! Once the changes are committed, a Netlify webhook is triggered, which automatically deploys the latest version to the web.