Caching in GitHub Actions
In this post we will explore caching in GitHub Actions. GitHub Actions is a platform from GitHub for automating workflows, commonly used for CI/CD (continuous integration / continuous delivery) pipelines – e.g. to automatically run unit tests before merging a new PR. Since these pipelines run frequently, and their execution time can grow significantly, it makes sense to look for ways to save time – and caching step outputs is one such method.

In this post we will cover said caching. I felt the official documentation is quite brief and leaves some questions unanswered, so I want to shed some more light on the topic here. We begin with a short introduction to GitHub Actions and how caching works, and then demonstrate both using two examples: the first follows the official toy example of generating prime numbers, while the second is more realistic – we cache a full Python environment.
Introduction to GitHub Actions
In a previous post I introduced this topic in more detail – thus here we only cover it briefly, and refer to the linked article for the rest. In summary, GitHub Actions lets you automate workflows, and is often used for CI/CD pipelines, e.g. for running unit tests, checking style guides etc. Triggered by certain events, runners (which can be hosted by GitHub or self-hosted) pick up jobs consisting of different steps. Let's use an example from the previous post for demonstration:
name: Sample Workflow
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  sample_job:
    runs-on: ubuntu-20.04
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3
      - name: Set up Python 3.10.0
        uses: actions/setup-python@v3
        with:
          python-version: "3.10.0"
      - name: Echo 1
        run: echo "Echo 1"
      - name: Echo 2
        run: |
          echo "Echo 2a"
          echo "Echo 2b"
Here, we define a workflow "Sample Workflow", and set code pushes and the opening of new PRs as event triggers. The workflow consists of a single job running on "ubuntu-20.04" – a freely available GitHub-hosted runner using said Ubuntu version. The job consists of several steps, which check out the repository, set up Python, and output different messages.
For this to run, we need to place it into the .github/workflows folder. Once there and pushed to GitHub, this workflow will automatically run on the defined event triggers – and we can conveniently inspect the status and output of all workflows, e.g. as such:

Caching Actions
With the foundations laid, let's move to caching. Via the cache action, we can define a cache step. Borrowing from the toy example below, this can look as follows:
- name: Cache Primes
  id: cache-primes
  uses: actions/cache@v3
  with:
    path: prime-numbers
    key: ${{ runner.os }}-primes
A cache is identified by a key and a path: the key determines whether a matching cache exists (a cache hit), and the path specifies which files or folders are stored and restored. When the workflow runs for the first time (or the key changes, e.g. because some dependency changed), the cache is generated: the contents of the folder specified under path are uploaded to GitHub's cache storage. This way, the cache is independent of the runner the previous run was executed on, and always available (in particular, you do not need your own runner to persist the cache – the publicly available GitHub-hosted runners work just fine). Note that GitHub imposes limits here: caches that have not been accessed for a while are evicted, and there is a size limit per repository.
In a subsequent step, we can then check whether there was a cache hit, and if so skip the step that would otherwise generate the cache contents:
- name: Generate Prime Numbers
  if: steps.cache-primes.outputs.cache-hit != 'true'
  run: ./generate_primes.sh
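As a side note, the cache action also supports restore-keys: an ordered list of prefix keys that are tried when the exact key misses, restoring the closest partial match. A hedged sketch – the versioned key name here is purely illustrative:

```yaml
- name: Cache Primes
  id: cache-primes
  uses: actions/cache@v3
  with:
    path: prime-numbers
    # The exact key is tried first; on a miss, the newest cache whose
    # key starts with the restore-keys prefix is restored instead
    key: ${{ runner.os }}-primes-v2
    restore-keys: |
      ${{ runner.os }}-primes-
```

Note that a restore via restore-keys does not set the cache-hit output to 'true', so steps guarded as above will still run.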
Caching Generated Prime Numbers
With that said, let's give a first full example: via GitHub Actions we generate the first N prime numbers, and cache the output. This is motivated by the example from the official docs, albeit somewhat more complete in my opinion.
The example consists of two bash scripts. The first, generate_primes.sh, generates the first N prime numbers and writes them to prime-numbers/primes.txt:
#!/bin/bash

N=10  # Number of prime numbers to generate
file_path="prime-numbers/primes.txt"  # Path where to store the primes

# Remove existing file if it exists
rm -f "$file_path"

# Function to check if a number is prime
is_prime() {
    num=$1
    for ((i=2; i*i<=num; i++)); do
        if ((num % i == 0)); then
            return 1
        fi
    done
    return 0
}

# Create directory for prime numbers if it doesn't exist
mkdir -p "$(dirname "$file_path")"

echo "Generating prime numbers ..."
count=0
number=2
while [ $count -lt $N ]; do
    if is_prime $number; then
        echo $number >> "$file_path"
        ((count++))
    fi
    ((number++))
done
The other script, primes.sh, reads this file and simply prints the stored prime numbers:
#!/bin/bash

# Read and print prime numbers from the file
if [ -f prime-numbers/primes.txt ]; then
    echo "Prime numbers:"
    cat prime-numbers/primes.txt
else
    echo "File prime-numbers/primes.txt not found."
fi
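To sanity-check the two scripts locally before wiring them into a workflow, a small smoke test can mirror their logic – a minimal sketch, writing to a temporary directory instead of prime-numbers/:

```shell
#!/bin/bash
# Minimal local smoke test mirroring generate_primes.sh and primes.sh
# (the temporary path is illustrative, not from the original scripts)
file_path="$(mktemp -d)/primes.txt"

is_prime() {
    num=$1
    for ((i=2; i*i<=num; i++)); do
        if ((num % i == 0)); then
            return 1
        fi
    done
    return 0
}

count=0
number=2
while [ $count -lt 10 ]; do
    if is_prime $number; then
        echo $number >> "$file_path"
        ((count++))
    fi
    ((number++))
done

# Read the primes back, as primes.sh does
cat "$file_path"
```

Running this should print the first ten primes, 2 through 29, one per line.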
Assuming we want to find a large amount of prime numbers, and that this takes a while, it is natural to cache this step – which is exactly what we do in prime_workflow.yml:
name: Caching Primes
on: push
jobs:
  build:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Cache Primes
        id: cache-primes
        uses: actions/cache@v3
        with:
          path: prime-numbers
          key: ${{ runner.os }}-primes
      - name: Generate Prime Numbers
        if: steps.cache-primes.outputs.cache-hit != 'true'
        run: ./generate_primes.sh
      - name: Use Prime Numbers
        run: ./primes.sh
We check out the repository, and in step 2 call the cache action: the key is composed of the runner OS and the suffix "-primes", and the cache path is the folder our first script dumps the resulting file into.
Then, we generate the prime numbers (i.e. run generate_primes.sh) – but only if there is no cache hit, e.g. when the workflow runs for the first time.
Lastly, we use the generated or cached prime numbers in primes.sh. Looking at the second run of this workflow, we observe that the "generate" step was indeed skipped:

You can also find the full example on GitHub.
Caching Poetry Environments
Now let's come to a slightly more realistic example: it is strongly recommended to bundle any Python project with an environment definition, such that all developers are guaranteed to work with the same packages and versions. On GitHub Actions runners, the repository is usually checked out from scratch, meaning the environment has to be installed anew on every run. Thus, here we show how to cache this environment – instead of downloading and installing all packages each time, the full environment is cached and restored from the cache. In particular, this example uses poetry, which I prefer for managing my environments.
The sample file main.py in this project looks like this:
import matplotlib.pyplot as plt
import numpy as np


def plot():
    x = np.linspace(0, 10, 50)
    y = np.sin(x)
    plt.plot(x, y)
    plt.savefig("plot.png")


if __name__ == "__main__":
    plot()
I.e., we plot a simple sine curve using matplotlib, and thus need matplotlib and numpy.
Consequently, our pyproject.toml file for poetry contains the following (here, I assume basic knowledge of poetry – otherwise I'd like to refer to the linked post):
[tool.poetry]
name = "myproject"
version = "0.1.0"
description = "..."
authors = ["hermanmichaels <[email protected]>"]

[tool.poetry.dependencies]
python = "3.10"
matplotlib = "3.5.1"
mypy = "0.910"
numpy = "1.22.3"
black = "22.3.0"
As we can see, we install the needed packages – as well as some other useful tools no Python project should miss.
Then, the corresponding Github Actions workflow to setup the environment (including caching) looks as follows:
name: Caching Env
on: push
jobs:
  build:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10.0
        uses: actions/setup-python@v3
        with:
          python-version: "3.10.0"
      - name: Install poetry
        run: curl -sSL https://install.python-poetry.org | python3 -
      - name: Cache Env
        id: cache-env
        uses: actions/cache@v3
        with:
          path: ~/.cache/pypoetry
          key: ${{ runner.os }}-env
      - name: Install poetry dependencies
        if: steps.cache-env.outputs.cache-hit != 'true'
        run: poetry install
We first set up Python and install poetry. In the last step, we run poetry install, which installs all required packages into the poetry environment.
The key step for this post is the second to last one: it defines a cache with target path ~/.cache/pypoetry – which is where poetry stores its environments by default. Thus, if this workflow runs again and the key matches an existing cache, we skip the poetry install – and instead download the full environment from the cloud cache.
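One caveat with a static key like "${{ runner.os }}-env": the cache is never invalidated when the dependencies change. A common pattern – sketched here as a suggestion, not taken from the workflow above – is to include a hash of the lock file in the key via the built-in hashFiles expression, so that editing poetry.lock automatically produces a fresh cache:

```yaml
- name: Cache Env
  id: cache-env
  uses: actions/cache@v3
  with:
    path: ~/.cache/pypoetry
    # Key changes whenever poetry.lock changes, forcing a rebuild
    key: ${{ runner.os }}-env-${{ hashFiles('poetry.lock') }}
```

This assumes a poetry.lock file is committed at the repository root.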
This example, too, is available in the same demo repository.
NOTE: there is a trade-off here. In the version without caching, we download all packages and then install them. In the version with caching, we do not have to install anything, but instead need to download the full installed environment from the cache. Which is faster depends on various factors, such as bandwidth, package sizes and installation times. Let me know in the comments if you have practical data points, and how you prefer to handle this!
Managing Caches
Lastly, a word about monitoring and managing your caches: when opening your repository on GitHub and navigating to "Actions / Caches", we see the following:

All caches used in this repository are listed, together with e.g. their size, and we can remove them if needed.
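If you prefer the command line, recent versions of the GitHub CLI expose the same functionality – a quick sketch, assuming an authenticated gh installation (the key in the delete command is the one from our prime example):

```shell
# List all caches in the current repository
gh cache list

# Delete a cache by its key
gh cache delete Linux-primes
```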
Conclusion
This finishes our introduction to caching in GitHub Actions. It is an extremely useful feature, as CI/CD pipelines often have developers waiting impatiently – and with caching we can reduce both the pipelines' runtime and the load on the overall system.
After a general introduction to the topic, we showed how to apply caching via two concrete examples: we started with a toy example caching generated prime numbers, and then showed how to cache a poetry environment. This GitHub repository contains all sample code.
Thanks for reading!