Geospatial Data Analysis with GeoPandas

This is the third article of the series regarding Geospatial Data Analysis:
- Geospatial Data Analysis using QGIS
- Guide for getting started with OpenStreetMap
- Geospatial Data Analysis with GeoPandas (this post)
- Geospatial Data Analysis with OSMnx
- Geocoding for Data Scientists
- Geospatial Data Analysis with Geemap
This article continues the stories A Practical Introduction to Geospatial Data Analysis using QGIS and A comprehensive guide for getting started with OpenStreetMap. In the previous tutorials, I provided an overview of geospatial data analysis, a discipline that can be applied in many fields, such as logistics, transportation, and insurance.
This discipline is focused on analyzing a special type of data, geospatial data, which is characterized by having a location, described by one or more pairs of coordinates. Examples can be restaurants, roads, and boundaries between countries. To show a continuous surface, like a satellite image, a geographical table is not enough anymore and you need an array with one or more channels.
In this article, I am going to focus on the simplest case, the geographical table, also called vector data. For this task, GeoPandas is the Python library that will be used to manipulate and visualize this type of geospatial data. As you may guess, it's an extension of Pandas, a popular Python package, that allows you to work with geospatial data easily and quickly. Let's get started!
Table of contents:
- Import census data
- Add geometry to census data
- Create a map with GeoPandas
- Extract centroid from geometry
- Create a more complex map
Import census data
The best way to begin the journey with geospatial data analysis is to practice with census data, which gives a granular picture of the people and households of a country.
In this tutorial, we are going to use a dataset that provides the number of cars or vans in the United Kingdom and comes from the UK Data Service. The link to the dataset is here.
I will start with a dataset that doesn't contain geographic information:
Each row of the dataset corresponds to a specific output area, which is the lowest geographical level at which census is provided in the UK. There are three features: the geocode, the country and the number of cars or vans that are owned by one or more members of a household.
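To make the structure of this table concrete, here is a minimal sketch that loads a tiny stand-in for the census file with Pandas. The geocodes and counts are invented for illustration, and the column name cars is an assumption (only geo_code and Country are named in this article). With the real download, you would pass the path of the file to pd.read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Toy stand-in for the census file: codes and counts are invented.
# With the real download, call pd.read_csv on the path of the file instead.
csv_text = """geo_code,Country,cars
NI-0001,Northern Ireland,210
NI-0002,Northern Ireland,185
EN-0001,England,120
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of output areas, number of columns)
print(df.dtypes)  # column types inferred by Pandas
```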
If we tried to draw a map right now, we couldn't, because we don't have the necessary geographical information. We need one more step before showing the potential of GeoPandas.
Add geometry to census data
To visualize our census data, we need to add a column that stores the geographical information. The process for adding geographical information, for example adding latitude and longitude for each city, is called geocoding.
In this case, the geometry is not just a pair of coordinates: each output area is described by a sequence of coordinate pairs that are connected and closed, forming its boundary. We need to download the Shapefile from this link. It provides the boundary for each output area.
Once the dataset is imported, we can merge these two tables using their common field, geo_code:
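The merge could look like the sketch below. It uses two toy tables with invented values so that it runs on its own. With the real Shapefile, the boundaries would come from gpd.read_file on the path of the downloaded .shp file. The validate argument is an extra safeguard I am adding here: it makes Pandas raise an error if a geocode appears more than once on either side.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy census table (values invented for illustration).
df = pd.DataFrame(
    {
        "geo_code": ["NI-0001", "NI-0002"],
        "Country": ["Northern Ireland", "Northern Ireland"],
        "cars": [210, 185],
    }
)

# Toy boundaries; with the real data this would be gpd.read_file(<path to .shp>).
boundaries = gpd.GeoDataFrame(
    {"geo_code": ["NI-0001", "NI-0002"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    ],
)

n_rows_before = len(df)

# Left join on the common field: every census row is kept.
df = df.merge(
    boundaries[["geo_code", "geometry"]],
    on="geo_code",
    how="left",
    validate="one_to_one",
)

print(len(df) == n_rows_before)  # the join should not add or drop rows
```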
After checking that the number of rows didn't change with the left join, we need to check whether there are null values in the new column:
df.geometry.isnull().sum()
# 0
Luckily, there are no null values, and we can convert our dataframe into a GeoDataFrame using the GeoDataFrame class, setting the geometry column as the geometry of our geodataframe:
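A minimal sketch of the conversion, run on a toy dataframe with invented values so that it is self-contained:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy merged table: a plain dataframe whose geometry column holds shapely objects.
df = pd.DataFrame(
    {
        "geo_code": ["NI-0001", "NI-0002"],
        "cars": [210, 185],
        "geometry": [box(0, 0, 1, 1), box(1, 0, 2, 1)],
    }
)

# Tell GeoPandas which column holds the geometries.
gdf = gpd.GeoDataFrame(df, geometry="geometry")

print(type(gdf))
print(gdf.geometry.name)  # name of the active geometry column
```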
Now, geographical and non-geographical information are combined into a unique table. All the geographical information is contained in a single field, called geometry. Like in a normal dataframe, we can print the information of this geodataframe:
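On a toy geodataframe (values invented for illustration), the call looks like this:

```python
import geopandas as gpd
from shapely.geometry import box

# Toy geodataframe with one attribute column and one geometry column.
gdf = gpd.GeoDataFrame(
    {"geo_code": ["NI-0001", "NI-0002"], "cars": [210, 185]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Same method as in Pandas: class, columns, non-null counts and dtypes.
gdf.info()

# The dtypes alone are enough to spot which column stores the geometries.
print(gdf.dtypes)
```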
From the output, we can see that our geodataframe is an instance of geopandas.GeoDataFrame and that the geometry column is encoded using the geometry type. To have a better understanding, we can also display the type of the geometry column in the first row:
type(gdf.geometry[0])
# shapely.geometry.polygon.Polygon
It's important to know that there are three common classes of geometric objects: Points, Lines and Polygons. In our case, we are dealing with Polygons, which makes sense since they are the boundaries of the output areas. The dataset is now ready, and we can start building nice visualizations.
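To see the three classes side by side, here is a small sketch that builds one of each with shapely, the library GeoPandas uses for its geometries. The coordinates are arbitrary and only serve as an illustration.

```python
from shapely.geometry import LineString, Point, Polygon

# A single location, e.g. a restaurant.
restaurant = Point(1.0, 1.0)

# An open sequence of connected coordinates, e.g. a road.
road = LineString([(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)])

# A closed sequence of coordinates, e.g. the boundary of an output area.
area = Polygon([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])

for geom in (restaurant, road, area):
    print(geom.geom_type)

# Only polygons enclose a surface, so only they have a non-zero area.
print(area.area)
```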
Create a Map with GeoPandas
Now, we have all the ingredients to visualize the map with GeoPandas. Since one of the drawbacks of GeoPandas is the fact that it struggles with huge amounts of data and we have more than 200 thousand rows, we'll just focus on the census data of Northern Ireland:
gdf_ni = gdf.query('Country=="Northern Ireland"').copy()
To create a map, you just need to call the plot() method on the GeoDataFrame:
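A minimal sketch of this call, using a toy gdf_ni made of two invented squares instead of the real output areas. The styling arguments are my own choices, not necessarily those used for the original figure.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for the Northern Ireland subset (values invented).
gdf_ni = gpd.GeoDataFrame(
    {"geo_code": ["NI-0001", "NI-0002"], "cars": [210, 185]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
)

# plot() draws the active geometry column and returns a Matplotlib Axes.
ax = gdf_ni.plot(figsize=(8, 8), edgecolor="black", linewidth=0.3)
ax.set_axis_off()  # hide the coordinate axes for a cleaner map
```

In a notebook, the figure is displayed automatically. In a script, you would typically finish with Matplotlib's plt.show().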
We also would like to see how the number of cars/vans is distributed within Northern Ireland by coloring each output area based on its frequency:
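A choropleth of this kind can be sketched by passing the numerical column to plot(). The data below is a toy example with invented counts, and the column name cars is an assumption.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for the Northern Ireland subset (values invented).
gdf_ni = gpd.GeoDataFrame(
    {"cars": [210, 185, 420]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2), box(4, 0, 6, 2)],
)

# column= colours each polygon by its value; legend=True adds a colour bar.
ax = gdf_ni.plot(
    column="cars",
    legend=True,
    legend_kwds={"label": "Cars or vans per output area"},
    figsize=(8, 8),
)
ax.set_axis_off()
```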
From this plot, we can observe that most of the areas have around 200 vehicles, except for small areas marked in green colour.
Extract centroid from geometry
Let's suppose that we want to change the geometry and have the coordinates of the centre of each output area instead of the polygons. This is possible by using the gdf.geometry.centroid property to compute the centroid of each output area:
gdf_ni['centroid'] = gdf_ni.geometry.centroid
gdf_ni.sample(3)

If we display the information of the dataframe again, we can see that both geometry and centroid are encoded as geometry types.
The best way to understand what we obtained is to visualize both the geometry and centroid columns on a single map. To plot the centroids, we need to switch the active geometry by using the set_geometry() method.
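The overlay can be sketched as follows, again on toy squares with invented values. The polygons are drawn first, and the centroids are then drawn on the same Axes after switching the active geometry. Colours and marker size are arbitrary choices.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for the Northern Ireland subset (values invented).
gdf_ni = gpd.GeoDataFrame(
    {"cars": [210, 185]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
)
gdf_ni["centroid"] = gdf_ni.geometry.centroid

# Base layer: the polygons stored in the active geometry column.
ax = gdf_ni.plot(color="lightgrey", edgecolor="black", figsize=(8, 8))

# set_geometry() returns a copy whose active geometry is the centroid column.
centroids = gdf_ni.set_geometry("centroid")
centroids.plot(ax=ax, color="red", markersize=20)
ax.set_axis_off()
```

Note that set_geometry() does not modify gdf_ni, so the polygons remain the active geometry of the original geodataframe.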
Create more complex maps
There are some advanced features to visualize more details in the map without creating any other informative column. Earlier, we showed the number of cars or vans in each output area, but the result was more confusing than informative. It would be better to create a categorical feature based on our numerical column. With GeoPandas, we can skip that step and plot it directly. By specifying the argument scheme='equal_interval', which relies on the mapclassify package, we can create classes of cars/vans based on equal intervals.
The map didn't change much, but the legend is much clearer than in the previous version. A better way to visualize the map would be to colour it based on levels built using quantiles:
Now, it's possible to spot more variability within the map since each level contains a more balanced number of areas. It's worth noticing that most areas belong to the last two levels, corresponding to the highest number of vehicles. In the first visualization, 200 vehicles seemed a low number, but there was instead a high number of outliers with high frequencies that distorted our interpretation.
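To see why outliers affect the two classifications differently, here is a small Pandas-only sketch on invented counts. pd.cut builds equal-width classes and pd.qcut builds quantile classes. These are stand-ins for what the plotting schemes compute, not the code GeoPandas runs internally.

```python
import pandas as pd

# Invented vehicle counts: most areas are similar, two are extreme.
cars = pd.Series([100, 150, 180, 200, 210, 220, 250, 300, 900, 2000])

# Equal-width classes: the range is stretched by the two outliers,
# so almost every area falls into the first class.
equal_counts = pd.cut(cars, bins=5).value_counts().sort_index()

# Quantile classes: boundaries follow the data, so classes are balanced.
quantile_counts = pd.qcut(cars, q=5).value_counts().sort_index()

print(equal_counts)
print(quantile_counts)
```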
At this point, we would also like to have a background map to better contextualize our results. The most popular way to do it is by using the contextily library, which fetches a background map from web tile providers. By default, contextily serves its tiles in the Web Mercator coordinate reference system (EPSG:3857), so we need to convert our data to this CRS. The code to plot the map remains the same, except for an additional line to add the base map from the contextily library:
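The two steps can be sketched as two small helpers. The function names are hypothetical, the toy polygon is an arbitrary longitude/latitude box roughly around Belfast, and the styling values are my own choices. contextily is imported inside the plotting helper because it is an optional package and needs an internet connection to download the tiles, so the runnable part below only performs the reprojection.

```python
import geopandas as gpd
from shapely.geometry import box


def to_web_mercator(gdf):
    """Return a copy of gdf reprojected to Web Mercator (EPSG:3857)."""
    return gdf.to_crs(epsg=3857)


def plot_with_basemap(gdf, column):
    """Plot a column over web tiles (requires contextily and internet access)."""
    import contextily as cx  # optional dependency, imported lazily

    gdf_wm = to_web_mercator(gdf)
    # alpha < 1 keeps the background map visible through the polygons.
    ax = gdf_wm.plot(column=column, legend=True, alpha=0.6, figsize=(8, 8))
    cx.add_basemap(ax)  # the additional line that draws the background tiles
    ax.set_axis_off()
    return ax


# Toy geodataframe in longitude/latitude (EPSG:4326); values are invented.
gdf_ni = gpd.GeoDataFrame(
    {"cars": [210]},
    geometry=[box(-6.0, 54.5, -5.9, 54.6)],
    crs="EPSG:4326",
)

gdf_wm = to_web_mercator(gdf_ni)
print(gdf_wm.crs.to_epsg())
```

With contextily installed and a network connection available, the map would then be drawn with ax = plot_with_basemap(gdf_ni, "cars").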
That's cool! Now, we have a more professional and detailed map!
Final thoughts:
This was an introductory tutorial for getting started with geospatial data in Python. GeoPandas is a Python library specialized in working with vector data. It's very easy and intuitive to use since it has properties and methods similar to Pandas, but it becomes very slow as soon as the amount of data grows, in particular when plotting the data.
In addition to this drawback, there is the fact that it depends on the Fiona library for reading and writing vector data formats. If Fiona doesn't support a format, GeoPandas can't support it either. One solution is to combine GeoPandas, to manipulate the data, with QGIS, to visualize the map. Another is to try other Python libraries to visualize the data, like Folium. Do you know other alternatives? Suggest them in the comments if you have other ideas.
The code can be found here. I hope you found the article useful. Have a nice day!
Disclaimer: The data sets are licensed under UK Open Government License (OGL)
Useful Resources: