Statistical Plotting with Julia: AlgebraOfGraphics.jl

Author:Murphy  |  View: 25382  |  Time: 2025-03-23 18:52:11
Photo by Antoine Dautry on Unsplash

The Grammar of Graphics (GoG) is a theoretical concept that underlies many popular graphics packages (like ggplot2 in R or ggplot in Python). Within the Julia ecosystem alone there are several graphics packages based on the GoG, so users have a choice. I've created this series of articles to compare these packages and make that choice easier.

I've started the series with an introduction to the GoG and already presented the graphics packages [Gadfly.jl](http://gadflyjl.org/stable/) (Statistical Plotting with Julia: Gadfly.jl) and [VegaLite.jl](https://www.queryverse.org/VegaLite.jl/stable/) (Statistical Plotting with Julia: VegaLite.jl).

The AlgebraOfGraphics.jl package (AoG) is the third graphics package based on the Grammar of Graphics (GoG) that I present in this series.

For the examples demonstrating AoG in this article, I will use the exact same data as in the previous articles (a detailed explanation of the data can be found here) and I will try to create the exact same visualizations (bar plots, scatter plots, histograms, box plots and violin plots) as I did there, in order to make a 1:1 comparison of all packages possible. I assume that the data for the examples is ready in the DataFrames countries, subregions_cum and regions_cum (as before).

AlgebraOfGraphics

The AoG package is perhaps the purest implementation of the GoG so far, as we will see in the following examples. It is founded on sound mathematical concepts and its authors describe it as "a declarative, question-driven language for data visualizations". Its main developer is Pietro Vertechi.

On a technical level it takes a completely different approach from the packages we've seen up to now: whereas Gadfly.jl is a standalone graphics package written purely in Julia, and VegaLite.jl is a Julia interface to the Vega-Lite graphics engine, AoG is an add-on package to [Makie.jl](https://docs.makie.org/stable/). Makie itself is the youngest graphics package within the Julia ecosystem and is also written entirely in Julia.

The boundaries between AoG and Makie are fluid. Several elements of AoG use Makie attributes, and Makie is always the fallback when something cannot be expressed using the concepts of AoG itself.

It should also be noted that AoG is still a work in progress. Version 0.1 appeared only in 2020. Therefore it is not as complete as the other, more mature packages and a few aspects simply don't work yet.

Bar plots

So let's jump into the first visualizations, which depict the population sizes of the regions (i.e. continents) and the subregions respectively using bar plots.

Population by region

First we want to show the population size (in 2019) for each region (i.e. continent) as a bar within the bar chart. Apart from that, each "region bar" should have a different color.

Using this simple example, we can see how the basic concepts of AoG work: In GoG-terms, this visualization is based on data from the regions_cum DataFrame and it consists of:

  • a mapping of the data attribute Region to the x-axis
  • a mapping of the data attribute Pop2019 to the y-axis
  • a mapping of the data attribute Region to colors
  • use of the "bar" geometry

As I explained in the introduction to the GoG, one of its ideas is that the specification of a visualization can be created from separate building blocks, which may be combined as needed. AoG has fully implemented this idea. Therefore we can translate the GoG description directly to AoG elements:

  • regionPop2xy = mapping(:Region, :Pop2019) is the mapping of Region to the x-axis and Pop2019 to the y-axis
  • region2color = mapping(color = :Region) is the mapping of Region to colors
  • barplot = visual(BarPlot) is the "bar" geometry

Now we can combine these building blocks (using the operator *), taking data from regions_cum and create the plot with a call to draw:

draw(data(regions_cum) * regionPop2xy * region2color * barplot)

This results in the following bar plot:

Population by region (1) [image by author]
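For readers who want to try the building blocks above without the prepared data, here is a minimal self-contained sketch. The DataFrame regions_toy and its values are placeholders I made up for illustration (they are not the real population figures), and CairoMakie is just one possible Makie backend.

```julia
using DataFrames, AlgebraOfGraphics, CairoMakie

# Placeholder table standing in for regions_cum; names and values are illustrative only
regions_toy = DataFrame(Region = ["A", "B", "C"], Pop2019 = [3.0, 1.0, 2.0])

regionPop2xy = mapping(:Region, :Pop2019)   # Region -> x-axis, Pop2019 -> y-axis
region2color = mapping(color = :Region)     # Region -> color
barplot      = visual(BarPlot)              # "bar" geometry

# Combine the building blocks with * and render the result
spec = data(regions_toy) * regionPop2xy * region2color * barplot
fig  = draw(spec)

# The same blocks can be recombined, e.g. without the color mapping
spec_plain = data(regions_toy) * regionPop2xy * barplot
fig_plain  = draw(spec_plain)
```

Because each building block is an ordinary Julia value, it can be stored in a variable and reused in other specifications, which is what the following sections do with region2color.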

As in the previous articles, we also create a beautified version of each visualization by adding labels, a title and a nice background color, among other things. In AoG this can be done by passing the Makie parameters axis and figure to draw:

[Code (GitHub Gist)](https://gist.github.com/roland-KA/a2aaab550b58e40ff84644f34a19a15e)

This leads to the following chart:

Population by region (2) [image by author]
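The gist with the original code is not reproduced in this text. As a rough sketch of the mechanism only, the axis and figure keywords of draw each take a named tuple of Makie attributes. The labels, the title, the background color and the toy data below are my own placeholders and not necessarily what the original code uses.

```julia
using DataFrames, AlgebraOfGraphics, CairoMakie

# Placeholder data standing in for regions_cum (illustrative values only)
regions_toy = DataFrame(Region = ["A", "B", "C"], Pop2019 = [3.0, 1.0, 2.0])

spec = data(regions_toy) *
       mapping(:Region, :Pop2019) *
       mapping(color = :Region) *
       visual(BarPlot)

# axis attributes go to Makie's Axis, figure attributes to Makie's Figure
fig = draw(spec;
    axis = (title = "Population by region, 2019",
            xlabel = "Region",
            ylabel = "Population"),
    figure = (; backgroundcolor = :ivory))
```

Note that the specification spec stays untouched; only the call to draw carries the decorative settings.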

Population by Subregion

Now let's move on to the visualization of the population by subregions. This is basically the same as the plots above, but we take the data from subregions_cum instead of regions_cum.

So our mapping to the axes is now subregionPop2xy = mapping(:Subregion, :Pop2019). As we want the bars for the subregions again colored by region, we can reuse the mapping from above and the basic plot can be drawn with:

draw(data(subregions_cum) * subregionPop2xy * region2color * barplot)

This produces the following plot:

Population by subregion (1) [image by author]

Obviously the subregion labels would be more readable if we chose a horizontal bar plot. This can be achieved by swapping the data attributes in the mapping to the axes, subregionPop2xy_hor = mapping(:Pop2019, :Subregion), and by adding direction = :x to the visual. So the code to draw a horizontal version of this bar plot is:

draw(data(subregions_cum) * subregionPop2xy_hor * region2color * 
     visual(BarPlot; direction = :x))

This is unfortunately a specification where it becomes clear that AoG is still a work in progress. There must be some bug in the rendering process, because the result of this draw command looks as follows:

Population by subregion (2) [image by author]

The ticks on the y-axis as well as the bars are misplaced, and the ticks on the x-axis are not what we want either.

Population by Subregion using Makie.jl

So we take this problem as an opportunity to switch to Makie.jl. Makie is a rather low-level graphics package. Many things we get automatically in the packages we've seen so far have to be specified explicitly in Makie. This gives the programmer a lot of control but makes the specifications quite verbose.

Another shortcoming is that Makie cannot handle nominal data. All nominal data has to be converted to a numeric form before it can be visualized. In our case that means that we have to convert the nominal data of the attributes Region and Subregion to numbers:

  • This is relatively easy for Subregion, because this attribute contains unique values. So we simply use the index values of that column of the DataFrame and store them in the new column subregion_num.
  • The Region values are not unique. Therefore we first convert them to a CategoricalArray, which implicitly maps them to numeric values. We can then obtain the corresponding numbers using the function levelcode and store them in another new column region_num.

Apart from that, we choose a suitable color scheme (Set2_8) from ColorSchemes.jl in order to get nice, distinguishable colors for the regions. This scheme looks as follows:

The color scheme Set2_8 [image by author]

For all these preparations we need the following code:

[Code (GitHub Gist)](https://gist.github.com/roland-KA/98d3ff5624357ca85cc91c94830d5f20)
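The gist with the original code is not reproduced in this text. The two conversions and the color lookup described above can be sketched roughly as follows. The table subregions_toy is a three-row stand-in for subregions_cum, and bar_colors is a hypothetical variable name of mine; the original code may organize these steps differently.

```julia
using DataFrames, CategoricalArrays, ColorSchemes

# Three-row stand-in for subregions_cum (only the nominal columns are needed here)
subregions_toy = DataFrame(
    Region    = ["Asia", "Europe", "Asia"],
    Subregion = ["Eastern Asia", "Western Europe", "Southern Asia"],
)

# Subregion values are unique, so the row index serves as the numeric value
subregions_toy.subregion_num = collect(1:nrow(subregions_toy))

# Region values repeat: convert to a CategoricalArray and read off the level codes
subregions_toy.region_num = levelcode.(categorical(subregions_toy.Region))

# Pick one color per row from the Set2_8 scheme, indexed by the region code
bar_colors = ColorSchemes.Set2_8.colors[subregions_toy.region_num]
```

With string data the levels of the CategoricalArray are ordered alphabetically by default, so in this toy table "Asia" gets code 1 and "Europe" gets code 2.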

We will then directly create a "beautified" version of the bar plot with labels etc. In Makie we need a Figure as a base element, where the barplot can be placed. As Makie cannot handle nominal data, we also have to specify the ticks for the y-axis manually using the yticks attribute as we can see in the following code, which creates our horizontal bar plot:

[Code (GitHub Gist)](https://gist.github.com/roland-KA/601ff5d4039b40fa0c792732798ffd30)

This is a lot of code, but the result looks quite pleasing:

Population by subregion (3) [image by author]
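Since the gist with the original code is not reproduced in this text, here is a much shorter sketch of the same idea in plain Makie: a Figure as the base element, an Axis with manually specified yticks, and a horizontal barplot on numeric positions. The labels, values and titles are placeholders of mine, and the original code contains considerably more styling.

```julia
using CairoMakie

# Placeholder data: subregion names and made-up population values
labels = ["Eastern Asia", "Western Europe", "Southern Asia"]
pop    = [3.0, 1.0, 2.0]
pos    = 1:length(labels)          # numeric stand-ins for the nominal values

fig = Figure()
ax  = Axis(fig[1, 1];
    title  = "Population by subregion",
    xlabel = "Population",
    yticks = (pos, labels))        # tick positions and their text labels

# direction = :x makes the bars run horizontally
barplot!(ax, pos, pop; direction = :x)
fig
```

The tuple passed to yticks pairs each numeric position with the label that should be shown there, which is how the nominal names come back into the plot.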

In order to get a version of this bar plot where the subregions are sorted by population size, we have to sort the data in subregions_cum accordingly using sort!(subregions_cum, :Pop2019) and then execute the code above (including the mapping to numeric data) again. This leads to the following plot:

Population by subregion (4) [image by author]

Scatter Plots

After this excursion to Makie, we return to AoG and try to visualize how population change depends on the size of the population. We can do this using a scatter plot as follows:

popChangeVsPop = data(countries) * 
      mapping(:Pop2019, :PopChangePct) * 
      mapping(color = :Region)
draw(popChangeVsPop)

The specification contains a mapping of Pop2019 to the x-axis and PopChangePct to the y-axis, as well as a mapping of Region to a color (we could have reused region2color at this point, but it is also possible to specify a mapping directly). A visual can be omitted here, because AoG uses the point geometry (Scatter) by default in this context. This gives us the following plot:

Growth rate in relation to population (1) [image by author]

As in the previous articles, we now improve the visualization by using a logarithmic scale on the x-axis, as the data is quite skewed. In addition we do our "beautification" by adding labels, a title etc. All this can be achieved by reusing the plot specification popChangeVsPop and passing the appropriate parameters to draw:

[Code (GitHub Gist)](https://gist.github.com/roland-KA/852b1aff659e455dc3dce72a4479ef30)

This leads to the following plot:

Growth rate in relation to population (2) [image by author]
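The gist with the original code is not reproduced in this text. One way to get a logarithmic x-axis, sketched here under my own assumptions, is to pass Makie's xscale attribute through the axis keyword of draw. The table countries_toy, its values and the label texts are placeholders.

```julia
using DataFrames, AlgebraOfGraphics, CairoMakie

# Placeholder data standing in for countries (illustrative values only)
countries_toy = DataFrame(
    Pop2019      = [1.0e5, 1.0e6, 1.0e7, 1.0e8],
    PopChangePct = [0.5, 1.0, 1.5, 2.0],
    Region       = ["A", "A", "B", "B"],
)

# Same structure as the specification in the article; Scatter is the default visual
popChangeVsPop = data(countries_toy) *
                 mapping(:Pop2019, :PopChangePct) *
                 mapping(color = :Region)

# Reuse the specification and add scale and labels in the call to draw
fig = draw(popChangeVsPop;
    axis = (xscale = log10,
            title  = "Growth rate in relation to population",
            xlabel = "Population (log scale)",
            ylabel = "Population change [%]"))
```

A log scale requires strictly positive x values, which holds for population counts.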

Histograms

Now we switch to histograms, which we use to depict the distribution of GDP per capita among the different countries. As AoG offers a so-called histogram analysis, the specification is quite simple:

draw(data(countries) * mapping(:GDPperCapita) * histogram())

In AoG, an analysis is a way to process data before visualizing it. Often the geometry (visual) depends directly on the analysis, as in this example, where a histogram is automatically displayed using a bar geometry.

Distribution of GDP per capita (1) [image by author]

The creation of the histogram can be influenced by changing the number of bins (via the parameter bins) and by using different normalization algorithms. So we get an improved version by using the following specification:

[Code (GitHub Gist)](https://gist.github.com/roland-KA/d2f0f631e9de51de5c6bd4bc005e7e27)

This code shows again how well AoG separates the specification of the visualization (histGDPperCapita) from its "beautification" (in the call to draw), leading to the following diagram:

Distribution of GDP per capita (2) [image by author]
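The gist with the original code is not reproduced in this text. As a sketch of how the bins and normalization parameters of histogram can be set, consider the following. The bin count of 20, the :pdf normalization, the label texts and the evenly spaced toy values are my own choices and may differ from the original.

```julia
using DataFrames, AlgebraOfGraphics, CairoMakie

# Placeholder data: 100 evenly spaced values instead of real GDP figures
countries_toy = DataFrame(GDPperCapita = collect(1_000.0:1_000.0:100_000.0))

# Specification: data, mapping and the histogram analysis with explicit settings
histGDPperCapita = data(countries_toy) *
                   mapping(:GDPperCapita) *
                   histogram(bins = 20, normalization = :pdf)

# "Beautification" stays in the call to draw
fig = draw(histGDPperCapita;
    axis = (title  = "Distribution of GDP per capita",
            xlabel = "GDP per capita",
            ylabel = "Density"))
```

With normalization = :pdf the bar heights are scaled so that the total area of the histogram is 1.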

Box Plots and Violin Plots

Finally we visualize the distribution of GDP per capita in each region using box plots and violin plots. This can be achieved with the same simplicity as above, since AoG offers specific geometries for both plot variants.

In order to maximize the reuse of elements, we first define the data and the mappings for the distribution (distGDPperCapita) and then add the geometry (using visual). As in all examples, the additional "beautification" can then be added by passing the appropriate parameters to draw.

[Code (GitHub Gist)](https://gist.github.com/roland-KA/31b9633ba170463c0a39abe835e73f02)

This code creates the following two diagrams:

Distribution of GDP per capita by region (1) [image by author]
Distribution of GDP per capita by region (2) [image by author]
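The gist with the original code is not reproduced in this text. The reuse pattern described above might be sketched like this: one shared specification for data and mappings, one small helper for the decoration, and two different visuals. The helper draw_dist_sketch is a hypothetical name of mine (it is not the helper from the original code), and the toy data and labels are placeholders.

```julia
using DataFrames, AlgebraOfGraphics, CairoMakie

# Placeholder data: two regions with ten made-up values each
countries_toy = DataFrame(
    Region       = repeat(["A", "B"], inner = 10),
    GDPperCapita = vcat(collect(1_000.0:1_000.0:10_000.0),
                        collect(5_000.0:5_000.0:50_000.0)),
)

# Shared part: data plus mapping of Region to x and GDPperCapita to y
distGDPperCapita = data(countries_toy) * mapping(:Region, :GDPperCapita)

# Hypothetical helper that applies the same decoration to any specification
draw_dist_sketch(spec) = draw(spec;
    axis = (title  = "Distribution of GDP per capita by region",
            xlabel = "Region",
            ylabel = "GDP per capita"))

# Only the geometry differs between the two plots
boxFig    = draw_dist_sketch(distGDPperCapita * visual(BoxPlot))
violinFig = draw_dist_sketch(distGDPperCapita * visual(Violin))
```

Swapping the plot type only requires changing the last factor of the product, while the data, mappings and decoration are written once.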

Zooming in

As the "most interesting" part in both diagrams lies in the range from 0 to 100,000 (on the y-axis), we want to restrict the plots to that range (doing sort of a zoom-in).

In AoG this is possible using the datalimits parameter for visual. But there seems to be another bug in AoG, since this parameter has the desired effect only when used on the violin plot, but it doesn't change anything when applied to the box plot.

So using the following specification …

violinRestricted = distGDPperCapita * 
                   visual(Violin; show_notch = true, datalimits = (0, 100000))
drawDist(violinRestricted)

… we get this diagram:

Distribution of GDP per capita by region (3) [image by author]

Conclusions

As mentioned above, the AoG package is clearly the purest implementation of the Grammar of Graphics we have seen in this series. It really separates mappings, geometries etc. into different building blocks, which can then be combined using the * operator. It also clearly separates the more "decorative" elements (all the things we called "beautification" above) from the visualization proper, thus making specifications even more modular and giving us more building blocks which can be reused.

I think it is quite normal for such a young package to still have some rough edges, but it really has a sound foundation and looks quite promising. Of course it was not possible to show all the functionality of AoG in this article. So please have a look at the documentation if you want to learn more about it. And last but not least, it is also worth reading about the philosophy underlying this approach, which can be found here.

For those who want to dive deeper into the code, there is also a Pluto notebook containing all the examples shown above in my GitHub repository.

