Spatial Data Engineering with TypeScript


Introduction

We can think of data as if it were water and companies as towns. Just as a growing town needs more water to serve its residents, a growing company needs ready-to-consume data to support its operations. Such companies require a data pipeline system that acts like the pipes and infrastructure that bring water to homes in a town. In this analogy, data engineers are the people who build and maintain the data pipelines. This is fairly straightforward for regular array- or table-like data; however, it is a bit more complicated when it comes to spatial data.

Spatial data is a bit different from regular data: it contains spatial attributes. These attributes allow us to establish spatial relationships, also known as geospatial topology. Even if two tables share no primary and foreign keys, we can still join them as long as they both have spatial attributes. And if we visualise the spatial attribute, we have a map!
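For example, with PostGIS we can join two tables purely on a spatial condition. The table and column names below are made up for illustration:

-- hypothetical tables: boroughs (polygons) and stations (points)
-- the join condition is spatial (containment), not a key match
SELECT b.name AS borough, s.name AS station
FROM boroughs AS b
JOIN stations AS s
  ON ST_Contains(b.geometry, s.geometry);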

Spatial Data Science: SQL Join Spatially

Building spatial data pipelines differs from creating ordinary data pipelines. Here we primarily handle spatial attributes using spatial SQL, which is a less common skill within the data science community. Once a data pipeline is built, data analysts can analyse the spatial data it delivers. This means analysts can focus on generating spatial insights without having to worry about data availability. It also means that maps can be generated automatically from the latest spatial data.

This Article

This article discusses how to build spatial data pipelines using TypeScript and Node.js. We can call this an ETL (Extract, Transform, Load) process, but with spatial data. First, we will walk through how to fetch the data from the source using TypeScript (Extract). We will then transform this data into the right structure for storage (Transform). Finally, we will use spatial SQL to manage and store the data in our database (Load).

The development is done with TypeScript and Node.js, primarily using the node-postgres library. The demo code is available at the following link.

GitHub – sutanmufti/spatial-data-engineering-typescript: spatial data engineering with typescript…

Prerequisite

There are three prerequisites: Node.js, TypeScript, and an on-demand PostGIS server. I am using macOS (a Unix-based OS) for this article, but everything here applies to any Linux or Unix-like server OS.

Node.js

Node.js is a JavaScript runtime that executes JavaScript code on the server. It functions similarly to Python, serving as an interpreter for the code. For this article, a basic understanding of JavaScript syntax suffices.

Download | Node.js

TypeScript

TypeScript is a superset of JavaScript that adds type checking. This type checking restricts how we write JavaScript in return for a lower risk of bugs. The code we develop in TypeScript is compiled into JavaScript, which Node.js then executes. TypeScript is optional, but I like having type checking in my projects.
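As a tiny illustration (not code from this project), the type annotations let the compiler catch mistakes before the code runs:

// the compiler rejects calls that do not match the declared types
function area(width: number, height: number): number {
    return width * height
}

area(3, 4)       // ok
// area(3, "4")  // compile-time error: argument of type 'string' is not assignable to 'number'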

How to set up TypeScript

PostGIS Server

You must have a PostgreSQL server running with PostGIS installed. PostGIS is a PostgreSQL extension that allows us to handle and store geospatial data; it is what lets us conduct spatial SQL. Of course, you must know fundamental SQL to operate PostgreSQL.

Spatial Data Science: Spatial Queries

You can run your own PostGIS server with Docker or install it directly on your PC. Another alternative is to deploy a cloud instance, such as Amazon RDS, Google Cloud SQL, or a general virtual machine with PostgreSQL installed.

Docker

The Idea

We can break this job into several tasks:

  • Extract: fetching the data from the data source. For this article, the data source can be anything as long as we receive it in GeoJSON format (RFC 7946); a minimal example of this structure is sketched after this list. In this demo we take the source from Stuart Grange's data, which originates from the London government's data store. We could retrieve other geospatial formats such as shapefiles, but that would mean handling them with other libraries. For the sake of simplicity, let's stick with GeoJSON.
  • Transform: validating the data (again, not the focus of this article) and handling it if it is not valid. Then, we transform the raw data into the structure the data analysts will use; in other words, how we design our data environment.
  • Load: inserting the data into the table. This uses spatial SQL.
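For reference, a single GeoJSON feature looks roughly like this. The feature below is made up, and it assumes the GeoJSON types from @types/geojson:

import { Feature } from 'geojson' // assuming types from @types/geojson

// a made-up feature following the RFC 7946 structure; property names are illustrative only
const exampleFeature: Feature = {
    type: 'Feature',
    properties: { name: 'Example zone', type: 'low_emission_zone' },
    geometry: {
        type: 'Polygon',
        coordinates: [
            [[-0.1, 51.5], [-0.09, 51.5], [-0.09, 51.51], [-0.1, 51.51], [-0.1, 51.5]],
        ],
    },
}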

At the end of the article we will see our data stored inside our PostGIS server. Let's get into the code.

Building The Data Pipeline

The main code is available in the following link. Let's break down the tasks and walk through the code.

GitHub – sutanmufti/spatial-data-engineering-typescript: spatial data engineering with typescript…

The main function is the following code. As you can see, it mainly comprises two functions: ExtractData and transformAndLoad . Additionally, there is a constant named pool that handles the connection and authentication to the PostGIS server.

// imports added for completeness; the exact paths may differ in the repo
import { Pool } from 'pg';
import { ExtractData, transformAndLoad } from './lib/functions';

async function main(){
    // this creates the pool connection to the postgis server
    const pool = new Pool({
        user: process.env.POSTGRES_USER,
        host: process.env.POSTGRES_HOST,
        database: process.env.POSTGRES_DB,
        password: process.env.POSTGRES_PASSWORD,
        port: Number(process.env.POSTGRES_PORT),
      });

    // this fetches the data
    const data = await ExtractData()
    // this transforms and loads the data using the Pool connection
    await transformAndLoad(data.features, pool)

}

// run the ETL process
main()
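The connection settings above are read from environment variables. As a side note, node-postgres also accepts a single connection string, which can be convenient for local experiments; the credentials below are made up:

import { Pool } from 'pg'

// alternative: configure the pool with one connection string (made-up local credentials)
const localPool = new Pool({
    connectionString: 'postgresql://postgres:mysecretpassword@localhost:5432/spatial_demo',
})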

Extract the Data

This task is done by the ExtractData() function. We expect GeoJSON-like data from this function.

// located in /src/lib/functions.ts

export async function ExtractData() {
    // data source used in this demo: London low emission zones (GeoJSON)
    const res = await fetch('https://skgrange.github.io/www/data/london_low_emission_zones.json')
    const data = await res.json()
    // add a validation function to handle errors here.
    return data
}

Here we define an asynchronous function named ExtractData . The export statement means this function can be imported into another TypeScript file, which keeps the project modular. The async keyword marks it as an asynchronous function. This lets us run multiple operations concurrently, mainly because we will use the Fetch API to make an HTTP GET request to the server.

The Fetch API is itself asynchronous, which is why it is easiest to wrap it inside an asynchronous function. The await statement waits for the HTTP GET request to finish, and the result comes back in GeoJSON format.
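As a quick illustration (a sketch, not code from the repo), here is the same call written with await and with promise chaining; await simply reads more like sequential code:

// inside an async function: await pauses until the promise resolves
const data = await ExtractData()
console.log(data.features.length)

// equivalent with .then(): the callback runs once the request completes
ExtractData().then((result) => {
    console.log(result.features.length)
})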

I find the Fetch API and asynchronous functions confusing for beginners. More on how asynchronous functions complement the Fetch API is covered in my other article.

Using Fetch API

Transform and Load the Data

After we obtain the data, we can transform it and load the transformed data into PostGIS. We will use node-postgres as the client that executes SQL from Node.js. I prefer node-postgres because I like writing raw SQL and it gives us granular control over the spatial SQL. The drawback is that our code can look more complicated, as we end up writing many lines of SQL queries.

node-postgres

This one is a bit long. I will highlight the important parts, mainly the spatial SQL bit.

// located in /src/lib/functions.ts; pg and the GeoJSON Feature type are imported at the top of that file
// This function transforms and loads the data into postgis
export async function transformAndLoad(geojsonFeatures: Feature[], pool: pg.Pool) {
    const client = await pool.connect();

    try {
      await client.query('BEGIN'); // Start a new transaction

      for (const geojsonPolygon of geojsonFeatures) {
        // transformation: keep only the attributes we need
        const geometryType = geojsonPolygon.geometry.type;
        const coordinates = geojsonPolygon.geometry.coordinates;

        const insertQuery = `
          INSERT INTO public.data (geometry, name, type, id)
          VALUES (ST_setsrid(ST_GeomFromGeoJSON($1), 4326), $2, $3, $4)
        `;

        const values = [
          JSON.stringify({ type: geometryType, coordinates }),
          geojsonPolygon.properties.name,
          geojsonPolygon.properties.type,
          getISOstring() // helper defined elsewhere in the repo; used as the row id
        ];

        await client.query(insertQuery, values); // Insert the GeoJSON data
      }

      await client.query('COMMIT'); // Commit the transaction if successful
      console.log('Bulk insert of POLYGON features successful');
    } catch (error) {
      await client.query('ROLLBACK'); // Roll back the transaction on error
      console.error('Bulk insert failed:', error);
    } finally {
      client.release(); // Release the database connection
      pool.end()
    }
  }

Safe Data Transaction with Node-postgres

The essential recipe of inserting bulk rows in PostgreSQL is the following part of the code:

await client.query('BEGIN');
try {
// ... start inserting data...
    await client.query('COMMIT');
} catch {
    // this code runs when we can't insert the data somehow. handle the error.
    // cancels the data transactions as a whole, avoiding duplicates.
    await client.query('ROLLBACK');
}

The client.query("BEGIN") allows us to stage the inserting process for each record of the data. This means that if there is something wrong along the inserting process, (for instance, the data at row-n is invalid thus postgresql throws an error) we can cancel the data transaction as a whole. This cancelation is done with client.query('ROLLBACK'). If all of the rows are inserted correctly without errors, then we can declare this as a safe insert and invoke client.query("COMMIT") to commit the data transaction.

If we do not begin with client.query("BEGIN") , every row inserted before an error occurs stays in the database anyway. In most cases, we do not want this.

Parameterised Query

Let's see what we do after we invoke client.query("BEGIN") . This is where we transform and insert our spatial data. There isn't much to the transformation step, as I only extract some of the properties.

// transformation
const geometryType = geojsonPolygon.geometry.type;
const coordinates = geojsonPolygon.geometry.coordinates;

The most important bit is the parameterised query. A common SQL vulnerability is SQL injection. Parameterised queries mitigate this vulnerability by keeping the values out of the raw SQL text. Look at the following SQL command.

// Correct way
const insertQuery = `
          INSERT INTO public.data (geometry, name, type,id)
          VALUES (ST_setsrid(ST_GeomFromGeoJSON($1),4326), $2, $3,$4)
        `;

Note the placeholders $1 , $2 , $3 , and $4 . They stand in for the actual values we insert. We should not use a template literal to format the values into the string ourselves. With node-postgres, you should NOT do the following.

// DON'T DO THIS
let value: string; // imagine this comes from user input
const insertQuery = `
          INSERT INTO public.data (name)
          VALUES ('${value}')
        `;

You are exposing the database to malicious queries. For instance, if ${value} contains a DROP TABLE command, the database will happily execute it! This is not possible when we use the $1 parameterised query.
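To make the risk concrete, here is a hypothetical malicious value. With string interpolation it becomes part of the SQL text; passed as $1, node-postgres sends it separately from the query, so it is treated as plain data:

// hypothetical attacker-controlled input
const value = `x'); DROP TABLE public.data; --`

// UNSAFE: the input is spliced into the SQL text itself
const unsafeQuery = `INSERT INTO public.data (name) VALUES ('${value}')`

// SAFE: the value is passed separately and never parsed as SQL
await client.query('INSERT INTO public.data (name) VALUES ($1)', [value])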

Inserting with Spatial SQL

Now let's see the insert statement and how I handle the spatial data.

-- inserting spatial data
INSERT INTO public.data (geometry, name, type,id)
VALUES (ST_setsrid(ST_GeomFromGeoJSON($1),4326), $2, $3,$4)

ST_GeomFromGeoJSON is a PostGIS function that takes a stringified GeoJSON geometry and converts it into a geometry object. For instance,

ST_GeomFromGeoJSON('{"type":"Point","coordinates":[-48.23456,20.12345]}')

We then wrap this inside the ST_SetSRID function to declare its spatial reference system. I am using EPSG code 4326 here to indicate that these are latitude and longitude values (WGS 84). You might receive data in different projections; in the United Kingdom, for instance, data is often projected in the British National Grid (EPSG code 27700), with easting and northing values. That is not the case for this demo.
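For completeness, the target table public.data could be created with something like the following sketch; the actual schema lives in the repository and may differ:

-- enable PostGIS first if it is not already enabled
CREATE EXTENSION IF NOT EXISTS postgis;

-- a sketch of the target table; the geometry column is constrained to EPSG:4326
CREATE TABLE IF NOT EXISTS public.data (
    id text,
    name text,
    type text,
    geometry geometry(Geometry, 4326)
);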

Release the Connection

Finally, we release the client so the server can serve other clients, and close the pool since this script is done with the database.

client.release(); 
pool.end();

Running the Code

Let's run the code with npm run start , which runs npx tsc (compiling the TypeScript code into JavaScript) and node build (executing the compiled JavaScript that runs the ETL process). The log says it has succeeded.

Executing the code (source: author, 2023)

Let's look at the data with PgAdmin or plain psql by selecting the data. Something like,

SELECT name, type, id, geometry FROM data
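In plain psql the geometry column prints as hex-encoded WKB, so wrapping it in ST_AsText (or ST_AsGeoJSON) makes the output easier to eyeball:

-- human-readable check of the stored geometries
SELECT name, type, id, ST_AsText(geometry) FROM public.data LIMIT 5;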

Using pgAdmin, we can see that the data makes sense on the map! This means the data analysts can now start analysing it.

viewing the data (source: author, 2023)

We can also start playing with the data in QGIS, an open-source GIS application for analysing geospatial data.

viewing the data with QGIS (source: author, 2023)

Conclusion

The main idea of spatial data engineering is creating a data pipeline that draws raw data from a source, treats it, and stores it for use. This allows spatial data analysts to focus on analysing data without having to worry about data availability, which is handled by the spatial data engineers. What makes spatial data engineering special is how we handle spatial data, that is, using spatial SQL. Using TypeScript with Node.js, powered by node-postgres, we constructed a simple data pipeline. The pipeline fetches GeoJSON data, transforms it, and stores it inside a PostgreSQL database server with spatial capabilities enabled by PostGIS. The stored data can then be analysed using GIS software.

Tags: Data Engineering GIS Nodejs Sql Typescript
