Geospatial Data Science: Points Pattern Analysis

Photo by Bernard Hermant on Unsplash

Introduction

Geospatial Data Science is a subfield of Data Science that deals with the analysis of data points while taking into consideration where in space each event happened.

Let's say we own a chain of retail stores that sells smartphones. Our chain has a few distribution centers and we are about to open a couple of new stores. Where can we do that?

Such insight could come from a geospatial analysis showing us where sales are concentrated, whether there are clusters where sales are higher or lower, and more.

Point pattern analysis enters the game when we want to make sure we are looking at a geographically clustered dataset. Like much of our work as Data Scientists, point pattern analysis is about forming a hypothesis and reducing uncertainty to confirm it, or not, using statistics applied to data. This case is no different: there are a couple of statistical tests to run, which will be shown in this post.

By the way, we have been studying Geospatial Data Science lately here on my blog. If you don't know much about the subject, here are two good reads before you dive into this post.

Analyzing Geospatial Data with Python

Analyzing Geospatial Data with Python (Part 2 – Hypothesis Test)

Coding

Packages

Let's start with the packages used in this exercise. If any of them are not installed in your environment, don't forget to install them with pip install (or conda install, for Anaconda users) followed by the package name.

import pandas as pd
import numpy as np
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import contextily

# Spatial Stats
from pointpats import distance_statistics, QStatistic, random, PointPattern

Dataset

The dataset used is, once again, the Airbnb listings for the city of Asheville, North Carolina, USA. The data can be retrieved from the independent project Inside Airbnb (http://insideairbnb.com/), where anyone can download the datasets for analysis. The data is open under the Creative Commons Attribution 4.0 International License.

I have downloaded the file listings.csv.gz.

To load the data into a Python session, here's the code. The first snippet is a simple read_csv() from Pandas, where we specify which columns to pull from the raw data. Then we use gpd.GeoDataFrame to convert the dataset to the GeoPandas object type, specifying the columns to use as the X and Y axes, plus the coordinate reference system (crs; use EPSG:4326, the same system used by GPS and one of the most common reference systems).

# Import the file to this exercise
# Open listings file
listings = pd.read_csv('/content/listings.csv',
                       usecols=['id', 'property_type', 'neighbourhood_cleansed',
                                'bedrooms', 'beds', 'bathrooms_text', 'price',
                                'latitude','longitude'])

# Convert the file to GeoPandas
points_gpd = gpd.GeoDataFrame(listings,
                              geometry= gpd.points_from_xy(
                                  x=listings.longitude,
                                  y=listings.latitude),
                              crs= "EPSG:4326")

Essentially, the transformation to GeoPandas consists of creating this geometry column and changing the object type.

Geopandas dataset. Image by the author.
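Conceptually, the conversion just adds a geometry column built from the longitude/latitude pairs of each row. Here is a pandas-only sketch of the idea, using a couple of hypothetical coordinates (real GeoPandas stores shapely Point objects rather than plain tuples):

```python
import pandas as pd

# two hypothetical listings, just to illustrate the mechanics
listings = pd.DataFrame({
    'id': [1, 2],
    'longitude': [-82.55, -82.52],
    'latitude': [35.60, 35.58],
})

# what points_from_xy does, conceptually: one (x, y) pair per row
listings['geometry'] = list(zip(listings.longitude.tolist(),
                                listings.latitude.tolist()))
print(listings['geometry'].tolist())  # [(-82.55, 35.6), (-82.52, 35.58)]
```

GeoPandas wraps each pair as a shapely Point and attaches the CRS, which is what enables the spatial plotting and analysis below.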

Great. With that done, let's take a quick look at the map. The code is simple: a figure fig and axes ax are created with subplots, and two layers are plotted on the same axes, one for the base map and another for the points.

# Quick check of the gpd dataframe
# (asheville is a GeoDataFrame with the city boundary, loaded beforehand)
fig, ax = plt.subplots(figsize=(8,8))
# zorder=1 is the bottom layer (base map)
asheville.plot(ax=ax, color=None, zorder=1)
# zorder=2 is the top layer (listing points)
points_gpd.plot(ax=ax, zorder=2, color='black', markersize=8)

This code yields the next plot.

Asheville, NC listings from Airbnb. Image by the author.

Cool. The map is looking ok, but it does not give us too much information. Let's enhance our analysis now.

Point Pattern

The first analysis we can do when thinking about point pattern analysis is checking how concentrated the data points are in geographical terms.

Here, the seaborn library can help. The jointplot method gives us a scatterplot with histograms on the margins. It is an awesome addition to the analysis, as it shows how concentrated the points are, and where, just by looking at where the bars are taller.

To create it, we pass x and y from the geometry column, plus the dataset (data), the size of the points (s), the color, and the height of the graphic. The second snippet adds a basemap to the jointplot with contextily, passing the jointplot's axes to the method.

# Check concentration of the points
plot2 = sns.jointplot(
    x= points_gpd.geometry.x,
    y= points_gpd.geometry.y,
    data= points_gpd,
    s=5, height=7, color='k')

# Add a basemap to the jointplot
contextily.add_basemap(plot2.ax_joint,
                       crs="EPSG:4326",
                       source=contextily.providers.Stamen.TonerLite)

As a result, we see this beautiful map.

Jointplot over a base map of Asheville, NC. Image by the author.

I love this plot. From it we can already get some good insights. The downtown area of the city (marked in red) and its surroundings are indeed where the listings are concentrated. We can also notice that, as we move farther from that region, the listings become fewer and more sparse.

That makes perfect sense. Think about it: Airbnb is a platform for people to rent out their houses or bedrooms. Those rental properties are normally in residential areas, since their main purpose is to serve as a home, not a business. And housing communities are normally located in areas with urban infrastructure around them, like malls, grocery stores, pharmacies, and banks. As houses in remote mountain locations are more challenging to build, it is expected that there would be fewer listing points there too.

Statistical Tests

Now that we have plotted the jointplot and gathered some good insights, we still need to test the point pattern to know whether the points are statistically clustered or not. After all, they could be close together just by chance.

So, to make sure you're working with a real pattern, there are two good tests:

  • Ripley's G: This test checks the cumulative distribution of the distances from each point to its nearest neighbors. The test measures the distance from a given house to neighbors 1, 2, 3, …, n and compares that distribution of distances with a simulated random distribution of points. If the observed data behaves differently from the simulation over a given range of distances, we can conclude that the data has a pattern, i.e., it is clustered.
  • Ripley's K: This test makes a similar comparison between the observed data and a randomly simulated distribution. The difference from the G test is that the K test considers all pairwise distances in the data, not just those to the nearest neighbors.

Ripley's G tests the distribution of distances to the nearest neighbors. Ripley's K tests the distribution of distances to the entire dataset.
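To build intuition for what the G test measures, here is a minimal numpy-only sketch (on synthetic points, not the Airbnb data) of the empirical G function: the share of points whose nearest neighbor lies within a distance r.

```python
import numpy as np

rng = np.random.default_rng(42)
pts = rng.uniform(0, 1, size=(200, 2))  # 200 synthetic points in a unit square

# pairwise distances; mask the diagonal so a point is not its own neighbor
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn = d.min(axis=1)  # distance from each point to its nearest neighbor

# empirical G(r): share of points whose nearest neighbor is within distance r
support = np.linspace(0, nn.max(), 20)
g_hat = np.array([(nn <= r).mean() for r in support])

print(g_hat[0], g_hat[-1])  # starts at 0.0 and climbs to 1.0
```

pointpats' g_test does essentially this for the observed data, then repeats it for many random simulations and compares the two curves.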

Ripley's G

Let's learn how to perform those tests now. First, the G test. It is simple enough to write, but it can take a while to run, depending on the size of the dataset. Here, the test evaluates the statistic at 40 support distances (support=40). It took about 6 minutes to run.

# Coding Ripley's G (6 mins to run)
ripley_g = distance_statistics.g_test(points_gpd[['longitude', 'latitude']].values,
                                      support=40,
                                      keep_simulations= True)

To plot the result, the code snippet is as follows. We plot a black line for the median of the simulations and a red line with the observed statistic at each support distance.

# Plot G test
plt.figure(figsize=(20,7))
# Simulated Data line plot
plt.plot(ripley_g.support,
         np.median(ripley_g.simulations, axis=0),
         color='k', label= 'Randomly Simulated Data')
# Ripley Stat plot for Observed data
plt.plot(ripley_g.support,
         ripley_g.statistic, marker='o',
         color='red', label= 'Observed Data')
# Plot setup
plt.legend(loc=4)
plt.xlabel('Distance')
plt.xticks( np.arange(0.0, 0.023, 0.001) )
plt.ylabel('Ripley G function statistic')
plt.title("Ripley's G Test")
plt.show()

As a result, the code displays the next figure.

Ripley's G test. Image by the author.

We can see that, for distances between 0 and 0.003, the observed data grows faster than the simulated data, confirming that there is a significant spatial pattern in the dataset.
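The intuition behind that reading: in a clustered pattern, nearest neighbors sit closer together, so the cumulative curve rises earlier. A small numpy-only experiment (synthetic points, not the listings data) makes this visible:

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_nn_distance(pts):
    """Mean distance from each point to its nearest neighbor."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

# random (CSR-like) points vs. points clustered tightly around 5 centers
random_pts = rng.uniform(0, 1, size=(250, 2))
centers = rng.uniform(0, 1, size=(5, 2))
clustered_pts = (centers[rng.integers(0, 5, size=250)]
                 + rng.normal(0, 0.01, size=(250, 2)))

print(mean_nn_distance(clustered_pts) < mean_nn_distance(random_pts))  # True
```

The clustered set has a much smaller mean nearest-neighbor distance, which is exactly why its G curve climbs faster than the random simulations.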

Ripley's K

The second test is the K test. It works with a null hypothesis of complete spatial randomness and an alternative hypothesis of a spatial pattern in the data.
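As a rough sketch of what the K statistic captures, here is a simplified, edge-effect-free version on synthetic data (the real pointpats implementation handles the details): K(r) measures the average number of other points within distance r of a point, scaled by the point density, and under complete spatial randomness it is close to πr².

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(300, 2))  # synthetic random points, unit square
n, area = len(pts), 1.0

d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

def k_hat(r):
    # average count of other points within r of each point, scaled by intensity
    return area * (d <= r).sum() / (n * (n - 1))

# under complete spatial randomness, k_hat(r) should land near pi * r**2
print(k_hat(0.1), np.pi * 0.1**2)
```

For clustered data, k_hat would rise well above πr² at small distances, because every point sees many close neighbors.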

Running the K test is as easy as the G test. However, it is more computationally expensive: it took more than 30 minutes to run in a Google Colab session.

# Coding Ripley's K
ripley_k = distance_statistics.k_test(points_gpd[['longitude', 'latitude']].values,
                                      keep_simulations= True)

# Plot K test
plt.figure(figsize=(20,7))

# Simulated Data line plots (one faint line per simulation)
plt.plot(ripley_k.support,
         ripley_k.simulations.T,
         color='k', alpha=.1)
# Ripley Stat plot for Observed data
plt.plot(ripley_k.support,
         ripley_k.statistic, marker='x',
         color='orange')

# p<0.05 = alternative hypothesis: spatial pattern
plt.scatter(ripley_k.support,
            ripley_k.statistic,
            cmap='viridis', c=ripley_k.pvalue < .05,
            zorder=4)
# Plot setup
plt.xlabel('Distance')
plt.ylabel('Ripley K function')
plt.title("Ripley's K Function Plot")
plt.show()

The code above will give us the next plot.

Ripley's K test. Image by the author.

Once again, the observed statistic is much higher than the simulations, confirming the spatial pattern.

Before You Go

In this post, we learned how to run statistical tests that confirm a geospatial point pattern.

Once we plot data points on a map, they could be close together just by chance, without a real pattern. A good way to confirm a geospatial pattern is to run Ripley's G and Ripley's K tests.

One can use these tests, for example, to confirm that there is a pattern of clusters with high and low prices among the city listings of Asheville, NC. Then, if you're listing a rental property, you would know the best price to compete at in each neighborhood of that city.

Now you can download the code from my repo on GitHub and apply it to your data.

Studying/Python/Geospatial/Points_Pattern_Python.ipynb at master · gurezende/Studying

If you liked my content, don't forget to follow me or find me on LinkedIn.

Gustavo Santos – Medium

Reference

JORDAN, David S. (2023). Applied Geospatial Data Science with Python. 1st ed. Packt Publishing.

A Statistical Test for Ripley's Function Rejection of Poisson Null Hypothesis

pointpats/notebooks/distance_statistics-numpy-oriented.ipynb at main · pysal/pointpats

Tags: Data Science Geospatial Geospatial Data Points Pattern Python
