How to read OSM data with DuckDB

Author:Murphy  |  View: 23914  |  Time: 2025-03-22 22:38:35
Dall-E 3 image: Adorable and cute 3D render duck studying a paper map, bright sky, with blur background, high quality, 8k

This article will provide an in-depth look at how to read OpenStreetMap data using the Duckdb database.

The steps described in this guide will allow the reader to load the OSM data using the Monaco example divided into nodes, ways, and relations.

The final result of OSM elements read using the DuckDB engine. From the left: nodes, ways and relations. Generated by the author using GeoPandas library.

A basic knowledge of the SQL language is expected to fully understand the steps described in this tutorial. Most of the GIS-related operations and special joins are described further in the article.

Outline of the article

  • What is OSM? – introduction to the OSM service.
  • Openstreetmap data model – definition of objects used in the OSM.
  • Reading OSM data – basic operations on the data using DuckDB.
  • Constructing point geometries from the nodes
  • Constructing linestring and polygon geometries from the ways
  • Constructing polygon and multi-polygon geometries from the relations
  • Examples of badly defined relation objects
  • QuackOSM – a hassle-free tool for reading OSM data

What is OSM?

OpenStreetMap (OSM) is the most popular free map of the world and it's kept alive by a growing base of volunteers and contributors.

The data collected and built by the community is available publicly for free and commercial purposes, so many companies, academic researchers and individual developers use this resource in their projects. All data is provided under the Open Data Commons Open Database License (ODbL).

The data can be accessed in multiple ways:

The most space-efficient file type in which the data is stored is Protocolbuffer Binary Format with an extension *.osm.pbf . You can read more about it here.

You can also read this short article about OpenStreetMap from Eugenia Anello:

A comprehensive guide for getting started with OpenStreetMap

OpenStreetMap data model

This section is based on OSM Wiki page about Elements

Conceptually the data in OpenStreetMap is split into 3 components:

Nodes represent points in space. They are represented by a pair of coordinates in a WGS84 Coordinate Reference System – longitude and latitude. Nodes can be used to define a single feature on a map (eg. bench, lamp post, tree) or be used with other nodes to represent a shape of a way.

Example of a node – a tree in the park. Screenshot from the OpenStreetMap by the author.

Ways are shapes representing a polyline by using an ordered list of nodes. Those polylines can be open and represent roads, canals, and walls, or they can be closed to form a polygon and represent buildings, forests, lakes or other simple shapes.

Example of a way – a part of a highway road. Screenshot from the OpenStreetMap by the author.

Relations represent relationships between multiple objects and data elements defined in the OSM. This could be for example a bus route with ways showing roads on which the bus travels and nodes showing stops of a route, or a multi polygon with holes represented by at least 2 ways: these can be an outer polygon and an inner polygon.

An example of a relation— a hotel building outline with holes. Screenshot from the OpenStreetMap by the author.

Each element can, but doesn't have to, have tags attached. Tags describe the meaning of the element. They are composed of a key and a value. There is no fixed dictionary of those values, but users should stay within conventions documented in the OSM Wiki.

Additionally, each element has an ID that is unique in a given element type space (so there can be a node with ID = 100, a way with ID = 100 and a relation with ID = 100).

Reading OSM data

Many tools allow users to transform the OSM data model to file formats commonly used in the GIS domain, such as GDAL. These tools are automatically reconstructing geometries from the raw data. We will try to read it and reconstruct geometries manually.

Examples below show how to access raw data and are written in SQL using DuckDB engine with Spatial extension. All queries with a full Jupyer notebook can be accessed in the GitHub repository. You can run the notebook in the parallel or you can install the DuckDB engine and open the CLI to execute the queries there.

medium-articles/articles/osm-duckdb/code.ipynb at main · RaczeQ/medium-articles

Getting the data

For simplicity and easy access, examples are focused entirely on the Monaco region. You can download the current extract from the Geofabrik download server: https://download.geofabrik.de/europe/monaco.html (and click monaco-latest.osm.pbf download link)

Familiarisation with the data structure

To start, we will use the DESCRIBEfunction to get information about the columns:

DESCRIBE SELECT * FROM ST_READOSM('monaco-latest.osm.pbf');

┌─────────────┬──────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │                 column_type                  │  null   │   key   │ default │  extra  │
│   varchar   │                   varchar                    │ varchar │ varchar │ varchar │ varchar │
├─────────────┼──────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ kind        │ ENUM('node', 'way', 'relation', 'changeset') │ YES     │ NULL    │ NULL    │ NULL    │
│ id          │ BIGINT                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ tags        │ MAP(VARCHAR, VARCHAR)                        │ YES     │ NULL    │ NULL    │ NULL    │
│ refs        │ BIGINT[]                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ lat         │ DOUBLE                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ lon         │ DOUBLE                                       │ YES     │ NULL    │ NULL    │ NULL    │
│ ref_roles   │ VARCHAR[]                                    │ YES     │ NULL    │ NULL    │ NULL    │
│ ref_types   │ ENUM('node', 'way', 'relation')[]            │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴──────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘

There are 8 columns returned by the ST_READOSMfunction:

  • kind – this is the type of an element. It can also have a value of changesetrepresenting the changes after editing the existing element in the OSM. Extracts from the Geofabrik download server don't contain changesets, so we don't have to think about them.
  • id – the element identifier.
  • tags – map (or a dictionary) of two strings: a tag key and a tag value.
  • refs – list of member IDs related to the element. Nodes should have this list empty and ways and relations can't have it empty.
  • lat and lon – latitude and longitude of a node. Ways and relations should have these fields empty since only nodes can have coordinates in the OSM.
  • ref_roles and ref_types – a list of additional information about members: what role is assigned to the member and what type is it (node, way or relation).

Counting the elements

Let's see how many elements there are in total and per element type.

SELECT COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf');

┌────────────────┐
│ total_elements │
│     int64      │
├────────────────┤
│          35782 │
└────────────────┘

SELECT kind, COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf')
GROUP BY 1;

┌──────────────────────────────────────────────┬────────────────┐
│                     kind                     │ total_features │
│ enum('node', 'way', 'relation', 'changeset') │     int64      │
├──────────────────────────────────────────────┼────────────────┤
│ node                                         │          30643 │
│ way                                          │           4849 │
│ relation                                     │            290 │
└──────────────────────────────────────────────┴────────────────┘

Looking at the elements

Let's check the examples of the data for each element type.

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
LIMIT 5;

┌──────────────────────┬──────────┬─────────────────────────────┬─────────┬────────────────────┬────────────────────┬───────────┬───────────────────────────────────┐
│         kind         │    id    │            tags             │  refs   │        lat         │        lon         │ ref_roles │             ref_types             │
│ enum('node', 'way'...  │  int64   │    map(varchar, varchar)    │ int64[] │       double       │       double       │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼──────────┼─────────────────────────────┼─────────┼────────────────────┼────────────────────┼───────────┼───────────────────────────────────┤
│ node                 │ 21911883 │                             │         │ 43.737117500000004 │  7.422909300000001 │           │                                   │
│ node                 │ 21911886 │ {crossing=zebra, crossing...  │         │ 43.737239900000006 │  7.423498500000001 │           │                                   │
│ node                 │ 21911894 │                             │         │ 43.737773100000005 │          7.4259193 │           │                                   │
│ node                 │ 21911901 │                             │         │ 43.737762100000005 │ 7.4267099000000005 │           │                                   │
│ node                 │ 21911908 │                             │         │         43.7381906 │           7.426961 │           │                                   │
└──────────────────────┴──────────┴─────────────────────────────┴─────────┴────────────────────┴────────────────────┴───────────┴───────────────────────────────────┘

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'way'
LIMIT 5;

┌──────────────────────┬─────────┬──────────────────────┬─────────────────────────────────────────┬────────┬────────┬───────────┬───────────────────────────────────┐
│         kind         │   id    │         tags         │                  refs                   │  lat   │  lon   │ ref_roles │             ref_types             │
│ enum('node', 'way'...  │  int64  │ map(varchar, varch...  │                 int64[]                 │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼─────────┼──────────────────────┼─────────────────────────────────────────┼────────┼────────┼───────────┼───────────────────────────────────┤
│ way                  │ 4097656 │ {highway=secondary...  │ [21912089, 7265761724, 1079750744, 21...  │        │        │           │                                   │
│ way                  │ 4098197 │ {highway=tertiary,...  │ [21918500, 10723812919, 1204288480, 2...  │        │        │           │                                   │
│ way                  │ 4224972 │ {highway=residenti...  │ [25177418, 4939779378, 4939779381, 49...  │        │        │           │                                   │
│ way                  │ 4224978 │ {access=no, addr:c...  │ [25178088, 25178091, 25178045, 251780...  │        │        │           │                                   │
│ way                  │ 4226740 │ {highway=secondary...  │ [25192130, 6444966483, 1737389127, 64...  │        │        │           │                                   │
└──────────────────────┴─────────┴──────────────────────┴─────────────────────────────────────────┴────────┴────────┴───────────┴───────────────────────────────────┘

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'relation'
LIMIT 5;

┌──────────────────────┬───────┬──────────────────────┬──────────────────────┬────────┬────────┬──────────────────────┬─────────────────────────────────────────────┐
│         kind         │  id   │         tags         │         refs         │  lat   │  lon   │      ref_roles       │                  ref_types                  │
│ enum('node', 'way'...  │ int64 │ map(varchar, varch...  │       int64[]        │ double │ double │      varchar[]       │      enum('node', 'way', 'relation')[]      │
├──────────────────────┼───────┼──────────────────────┼──────────────────────┼────────┼────────┼──────────────────────┼─────────────────────────────────────────────┤
│ relation             │  7385 │ {ISO3166-2=FR-06, ...  │ [1701090139, 32665...  │        │        │ [admin_centre, lab...  │ [node, node, way, way, way, way, way, way...  │
│ relation             │  8654 │ {ISO3166-2=FR-PAC,...  │ [26761400, 1251610...  │        │        │ [admin_centre, lab...  │ [node, node, way, way, way, way, way, way...  │
│ relation             │ 11980 │ {ISO3166-1=FR, ISO...  │ [17807753, 1363947...  │        │        │ [admin_centre, lab...  │ [node, node, relation, relation, relation...  │
│ relation             │ 36990 │ {ISO3166-1=MC, ISO...  │ [1790048269, 77077...  │        │        │ [admin_centre, out...  │ [node, way, way, way, way, way, way, way,...  │
│ relation             │ 90352 │ {admin_level=2, bo...  │ [770774507, 377944...  │        │        │ [outer, outer, out...  │ [way, way, way, way, way, way, way, way, ...  │
└──────────────────────┴───────┴──────────────────────┴──────────────────────┴────────┴────────┴──────────────────────┴─────────────────────────────────────────────┘

Now we can see how the elements are defined: nodes have coordinates, ways have refs lists filled with node IDs and relations have the most complicated structure with refs lists filled with IDs and _reftypes lists showing which ID correspond to which element type. Additionally, _refroles have information about the role of the member (admin_centre, label, inner, outer).

Constructing point geometries from the nodes

Now that we know what the structure looks like, we can start building some geometries. Starting with nodes should be the easiest since it's just a pair of latitudes and longitudes in the WGS84 Coordinate Reference System.

We should only extract nodes with at least one tag attached since those have any semantical meaning for analytical purposes. Nodes without any tags could be used to construct ways in the later stages.

SELECT
    id,
    tags,
    ST_POINT(lon, lat) geometry
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
    AND tags IS NOT NULL
    AND cardinality(tags) > 0;

┌─────────────┬─────────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│     id      │                          tags                           │                   geometry                   │
│    int64    │                  map(varchar, varchar)                  │                   geometry                   │
├─────────────┼─────────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│    21911886 │ {crossing=zebra, crossing:island=no, crossing_ref=zeb...  │ POINT (7.423498500000001 43.737239900000006) │
│    21912962 │ {crossing=zebra, crossing_ref=zebra, highway=crossing}  │ POINT (7.426912100000001 43.737912800000004) │
│    21914341 │ {crossing=uncontrolled, crossing_ref=zebra, highway=c...  │ POINT (7.4233732 43.737010000000005)         │
│    21915639 │ {highway=traffic_signals}                               │ POINT (7.4256003 43.7404449)                 │
│    21917308 │ {bus=yes, name=Monte-Carlo (Casino), public_transport...  │ POINT (7.4259854 43.740984700000006)         │
│    21918329 │ {crossing=marked, crossing:markings=yes, highway=cros...  │ POINT (7.427889100000001 43.7423616)         │
│    21918589 │ {crossing=marked, crossing:island=yes, crossing:marki...  │ POINT (7.4317478 43.7472774)                 │
│    21918697 │ {crossing=marked, crossing:island=no, crossing:markin...  │ POINT (7.432645000000001 43.747892900000004) │
│    21918939 │ {bus=yes, name=Portier, public_transport=stop_position} │ POINT (7.430429800000001 43.741472800000004) │
│    21919093 │ {crossing=marked, crossing:markings=yes, crossing_ref...  │ POINT (7.4352171 43.748160000000006)         │
│        ·    │                   ·                                     │                  ·                           │
│        ·    │                   ·                                     │                  ·                           │
│        ·    │                   ·                                     │                  ·                           │
│ 11450012980 │ {direction=forward, highway=give_way}                   │ POINT (7.416853100000001 43.735809700000004) │
│ 11450012981 │ {direction=forward, highway=give_way}                   │ POINT (7.416783000000001 43.735872900000004) │
│ 11450012982 │ {direction=forward, highway=give_way}                   │ POINT (7.416664900000001 43.735887000000005) │
│ 11450012983 │ {direction=forward, highway=give_way}                   │ POINT (7.4166968 43.7356945)                 │
│ 11451922579 │ {bus=yes, highway=bus_stop, name=Larvotto, public_tra...  │ POINT (7.435032400000001 43.7481639)         │
│ 11451922580 │ {bus=yes, highway=bus_stop, name=Grimaldi Forum, publ...  │ POINT (7.4311343 43.7442067)                 │
│ 11451922581 │ {bench=yes, bus=yes, highway=bus_stop, name=Portier, ...  │ POINT (7.430357300000001 43.742209100000004) │
│ 11451922582 │ {bench=no, bin=no, bus=yes, highway=bus_stop, lit=yes...  │ POINT (7.4107674 43.7296193)                 │
│ 11451922600 │ {direction=backward, highway=give_way}                  │ POINT (7.4105622 43.7291648)                 │
│ 11452057060 │ {direction=backward, highway=give_way}                  │ POINT (7.419164 43.737116)                   │
├─────────────┴─────────────────────────────────────────────────────────┴──────────────────────────────────────────────┤
│ 3167 rows (20 shown)                                                                                       3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Nodes plotted on a map. Generated by the author using GeoPandas library.

After filtering the nodes based on tags, we are left with 3167 points. It's around 10% of the total number of nodes in the source file.

Constructing linestring and polygon geometries from the ways

With nodes out of the way

Tags: Duckdb Geospatial Geospatial Data GIS Openstreetmap

Comment