How to read OSM data with DuckDB

This article will provide an in-depth look at how to read OpenStreetMap data using the Duckdb database.
The steps described in this guide will allow the reader to load the OSM data using the Monaco example divided into nodes, ways, and relations.



A basic knowledge of the SQL language is expected to fully understand the steps described in this tutorial. Most of the GIS-related operations and special joins are described further in the article.
Outline of the article
- What is OSM? – introduction to the OSM service.
- Openstreetmap data model – definition of objects used in the OSM.
- Reading OSM data – basic operations on the data using DuckDB.
- Constructing point geometries from the nodes
- Constructing linestring and polygon geometries from the ways
- Constructing polygon and multi-polygon geometries from the relations
- Examples of badly defined relation objects
- QuackOSM – a hassle-free tool for reading OSM data
What is OSM?
OpenStreetMap (OSM) is the most popular free map of the world and it's kept alive by a growing base of volunteers and contributors.
The data collected and built by the community is available publicly for free and commercial purposes, so many companies, academic researchers and individual developers use this resource in their projects. All data is provided under the Open Data Commons Open Database License (ODbL).
The data can be accessed in multiple ways:
- Using Overpass API (with Web GUI at Overpass Turbo)
- Downloading full data as Planet OSM (currently over 70 GB in 2024)
- Smaller extract downloads: Geofabrik, BBBike, OpenStreetMap.fr, Protomaps
The most space-efficient file type in which the data is stored is Protocolbuffer Binary Format with an extension *.osm.pbf
. You can read more about it here.
You can also read this short article about OpenStreetMap from Eugenia Anello:
A comprehensive guide for getting started with OpenStreetMap
OpenStreetMap data model
This section is based on OSM Wiki page about Elements
Conceptually the data in OpenStreetMap is split into 3 components:
Nodes represent points in space. They are represented by a pair of coordinates in a WGS84 Coordinate Reference System – longitude and latitude. Nodes can be used to define a single feature on a map (eg. bench, lamp post, tree) or be used with other nodes to represent a shape of a way.

Ways are shapes representing a polyline by using an ordered list of nodes. Those polylines can be open and represent roads, canals, and walls, or they can be closed to form a polygon and represent buildings, forests, lakes or other simple shapes.

Relations represent relationships between multiple objects and data elements defined in the OSM. This could be for example a bus route with ways showing roads on which the bus travels and nodes showing stops of a route, or a multi polygon with holes represented by at least 2 ways: these can be an outer polygon and an inner polygon.

Each element can, but doesn't have to, have tags attached. Tags describe the meaning of the element. They are composed of a key and a value. There is no fixed dictionary of those values, but users should stay within conventions documented in the OSM Wiki.
Additionally, each element has an ID that is unique in a given element type space (so there can be a node with ID = 100, a way with ID = 100 and a relation with ID = 100).
Reading OSM data
Many tools allow users to transform the OSM data model to file formats commonly used in the GIS domain, such as GDAL. These tools are automatically reconstructing geometries from the raw data. We will try to read it and reconstruct geometries manually.
Examples below show how to access raw data and are written in SQL using DuckDB engine with Spatial extension. All queries with a full Jupyer notebook can be accessed in the GitHub repository. You can run the notebook in the parallel or you can install the DuckDB engine and open the CLI to execute the queries there.
medium-articles/articles/osm-duckdb/code.ipynb at main · RaczeQ/medium-articles
Getting the data
For simplicity and easy access, examples are focused entirely on the Monaco region. You can download the current extract from the Geofabrik download server: https://download.geofabrik.de/europe/monaco.html (and click monaco-latest.osm.pbf
download link)
Familiarisation with the data structure
To start, we will use the DESCRIBE
function to get information about the columns:
DESCRIBE SELECT * FROM ST_READOSM('monaco-latest.osm.pbf');
┌─────────────┬──────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼──────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ kind │ ENUM('node', 'way', 'relation', 'changeset') │ YES │ NULL │ NULL │ NULL │
│ id │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ tags │ MAP(VARCHAR, VARCHAR) │ YES │ NULL │ NULL │ NULL │
│ refs │ BIGINT[] │ YES │ NULL │ NULL │ NULL │
│ lat │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ lon │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ref_roles │ VARCHAR[] │ YES │ NULL │ NULL │ NULL │
│ ref_types │ ENUM('node', 'way', 'relation')[] │ YES │ NULL │ NULL │ NULL │
└─────────────┴──────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
There are 8 columns returned by the ST_READOSM
function:
- kind – this is the type of an element. It can also have a value of
changeset
representing the changes after editing the existing element in the OSM. Extracts from the Geofabrik download server don't contain changesets, so we don't have to think about them. - id – the element identifier.
- tags – map (or a dictionary) of two strings: a tag key and a tag value.
- refs – list of member IDs related to the element. Nodes should have this list empty and ways and relations can't have it empty.
- lat and lon – latitude and longitude of a node. Ways and relations should have these fields empty since only nodes can have coordinates in the OSM.
- ref_roles and ref_types – a list of additional information about members: what role is assigned to the member and what type is it (node, way or relation).
Counting the elements
Let's see how many elements there are in total and per element type.
SELECT COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf');
┌────────────────┐
│ total_elements │
│ int64 │
├────────────────┤
│ 35782 │
└────────────────┘
SELECT kind, COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf')
GROUP BY 1;
┌──────────────────────────────────────────────┬────────────────┐
│ kind │ total_features │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┼────────────────┤
│ node │ 30643 │
│ way │ 4849 │
│ relation │ 290 │
└──────────────────────────────────────────────┴────────────────┘
Looking at the elements
Let's check the examples of the data for each element type.
SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
LIMIT 5;
┌──────────────────────┬──────────┬─────────────────────────────┬─────────┬────────────────────┬────────────────────┬───────────┬───────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'... │ int64 │ map(varchar, varchar) │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼──────────┼─────────────────────────────┼─────────┼────────────────────┼────────────────────┼───────────┼───────────────────────────────────┤
│ node │ 21911883 │ │ │ 43.737117500000004 │ 7.422909300000001 │ │ │
│ node │ 21911886 │ {crossing=zebra, crossing... │ │ 43.737239900000006 │ 7.423498500000001 │ │ │
│ node │ 21911894 │ │ │ 43.737773100000005 │ 7.4259193 │ │ │
│ node │ 21911901 │ │ │ 43.737762100000005 │ 7.4267099000000005 │ │ │
│ node │ 21911908 │ │ │ 43.7381906 │ 7.426961 │ │ │
└──────────────────────┴──────────┴─────────────────────────────┴─────────┴────────────────────┴────────────────────┴───────────┴───────────────────────────────────┘
SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'way'
LIMIT 5;
┌──────────────────────┬─────────┬──────────────────────┬─────────────────────────────────────────┬────────┬────────┬───────────┬───────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'... │ int64 │ map(varchar, varch... │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼─────────┼──────────────────────┼─────────────────────────────────────────┼────────┼────────┼───────────┼───────────────────────────────────┤
│ way │ 4097656 │ {highway=secondary... │ [21912089, 7265761724, 1079750744, 21... │ │ │ │ │
│ way │ 4098197 │ {highway=tertiary,... │ [21918500, 10723812919, 1204288480, 2... │ │ │ │ │
│ way │ 4224972 │ {highway=residenti... │ [25177418, 4939779378, 4939779381, 49... │ │ │ │ │
│ way │ 4224978 │ {access=no, addr:c... │ [25178088, 25178091, 25178045, 251780... │ │ │ │ │
│ way │ 4226740 │ {highway=secondary... │ [25192130, 6444966483, 1737389127, 64... │ │ │ │ │
└──────────────────────┴─────────┴──────────────────────┴─────────────────────────────────────────┴────────┴────────┴───────────┴───────────────────────────────────┘
SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'relation'
LIMIT 5;
┌──────────────────────┬───────┬──────────────────────┬──────────────────────┬────────┬────────┬──────────────────────┬─────────────────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'... │ int64 │ map(varchar, varch... │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼───────┼──────────────────────┼──────────────────────┼────────┼────────┼──────────────────────┼─────────────────────────────────────────────┤
│ relation │ 7385 │ {ISO3166-2=FR-06, ... │ [1701090139, 32665... │ │ │ [admin_centre, lab... │ [node, node, way, way, way, way, way, way... │
│ relation │ 8654 │ {ISO3166-2=FR-PAC,... │ [26761400, 1251610... │ │ │ [admin_centre, lab... │ [node, node, way, way, way, way, way, way... │
│ relation │ 11980 │ {ISO3166-1=FR, ISO... │ [17807753, 1363947... │ │ │ [admin_centre, lab... │ [node, node, relation, relation, relation... │
│ relation │ 36990 │ {ISO3166-1=MC, ISO... │ [1790048269, 77077... │ │ │ [admin_centre, out... │ [node, way, way, way, way, way, way, way,... │
│ relation │ 90352 │ {admin_level=2, bo... │ [770774507, 377944... │ │ │ [outer, outer, out... │ [way, way, way, way, way, way, way, way, ... │
└──────────────────────┴───────┴──────────────────────┴──────────────────────┴────────┴────────┴──────────────────────┴─────────────────────────────────────────────┘
Now we can see how the elements are defined: nodes have coordinates, ways have refs lists filled with node IDs and relations have the most complicated structure with refs lists filled with IDs and _reftypes lists showing which ID correspond to which element type. Additionally, _refroles have information about the role of the member (admin_centre, label, inner, outer).
Constructing point geometries from the nodes
Now that we know what the structure looks like, we can start building some geometries. Starting with nodes should be the easiest since it's just a pair of latitudes and longitudes in the WGS84 Coordinate Reference System.
We should only extract nodes with at least one tag attached since those have any semantical meaning for analytical purposes. Nodes without any tags could be used to construct ways in the later stages.
SELECT
id,
tags,
ST_POINT(lon, lat) geometry
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
AND tags IS NOT NULL
AND cardinality(tags) > 0;
┌─────────────┬─────────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│ id │ tags │ geometry │
│ int64 │ map(varchar, varchar) │ geometry │
├─────────────┼─────────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ 21911886 │ {crossing=zebra, crossing:island=no, crossing_ref=zeb... │ POINT (7.423498500000001 43.737239900000006) │
│ 21912962 │ {crossing=zebra, crossing_ref=zebra, highway=crossing} │ POINT (7.426912100000001 43.737912800000004) │
│ 21914341 │ {crossing=uncontrolled, crossing_ref=zebra, highway=c... │ POINT (7.4233732 43.737010000000005) │
│ 21915639 │ {highway=traffic_signals} │ POINT (7.4256003 43.7404449) │
│ 21917308 │ {bus=yes, name=Monte-Carlo (Casino), public_transport... │ POINT (7.4259854 43.740984700000006) │
│ 21918329 │ {crossing=marked, crossing:markings=yes, highway=cros... │ POINT (7.427889100000001 43.7423616) │
│ 21918589 │ {crossing=marked, crossing:island=yes, crossing:marki... │ POINT (7.4317478 43.7472774) │
│ 21918697 │ {crossing=marked, crossing:island=no, crossing:markin... │ POINT (7.432645000000001 43.747892900000004) │
│ 21918939 │ {bus=yes, name=Portier, public_transport=stop_position} │ POINT (7.430429800000001 43.741472800000004) │
│ 21919093 │ {crossing=marked, crossing:markings=yes, crossing_ref... │ POINT (7.4352171 43.748160000000006) │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 11450012980 │ {direction=forward, highway=give_way} │ POINT (7.416853100000001 43.735809700000004) │
│ 11450012981 │ {direction=forward, highway=give_way} │ POINT (7.416783000000001 43.735872900000004) │
│ 11450012982 │ {direction=forward, highway=give_way} │ POINT (7.416664900000001 43.735887000000005) │
│ 11450012983 │ {direction=forward, highway=give_way} │ POINT (7.4166968 43.7356945) │
│ 11451922579 │ {bus=yes, highway=bus_stop, name=Larvotto, public_tra... │ POINT (7.435032400000001 43.7481639) │
│ 11451922580 │ {bus=yes, highway=bus_stop, name=Grimaldi Forum, publ... │ POINT (7.4311343 43.7442067) │
│ 11451922581 │ {bench=yes, bus=yes, highway=bus_stop, name=Portier, ... │ POINT (7.430357300000001 43.742209100000004) │
│ 11451922582 │ {bench=no, bin=no, bus=yes, highway=bus_stop, lit=yes... │ POINT (7.4107674 43.7296193) │
│ 11451922600 │ {direction=backward, highway=give_way} │ POINT (7.4105622 43.7291648) │
│ 11452057060 │ {direction=backward, highway=give_way} │ POINT (7.419164 43.737116) │
├─────────────┴─────────────────────────────────────────────────────────┴──────────────────────────────────────────────┤
│ 3167 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

After filtering the nodes based on tags, we are left with 3167 points. It's around 10% of the total number of nodes in the source file.
Constructing linestring and polygon geometries from the ways
With nodes out of the way