Exploratory Data Analysis: Lost Property Items on Transport for London

Author: Murphy  |  Views: 26648  |  Time: 2025-03-22 22:06:58
London Underground, Image by author

As readers may guess, this story had a trivial start: I forgot my bag on the bus. Five minutes later, I realized that the bag was missing, but the bus had already left. After coming home, I checked the bus company's website to see whether I could claim the lost bag, and several days later, I was lucky enough to get it back. I live in Amsterdam, and public transport here has a partnership with iLost, a company through which people can claim their lost property. The site has a pretty clear structure and does not even require registration to view items forgotten by other people (personal details are, obviously, hidden). Having a data-oriented mindset, I had a "Eureka" moment: data of this kind can be fascinating from a cultural anthropology perspective, and we can learn a lot about what kinds of goods get lost in public transport and other places. Alas, the iLost license agreement does not allow the use of the data without written consent, and nobody responded to my request. But still having this idea in mind, I started to search online for alternative sources, and it turned out that:

  • Transport for London (TfL) also has a good service for claiming lost property items.
  • The United Kingdom has a Freedom of Information Act that creates a public "right of access" to information held by public authorities. Every person has the right to make a request free of charge and to get a response within 20 working days. So, I asked TfL to send me (if possible) a CSV file with "raw data" of lost items, and after 2–3 weeks (as promised, it was not fast:), I indeed got the file. I also asked a TfL officer if I could use this data for my TDS publication, and I got a positive answer.

I think it's a great service (by the way, the US has had a similar law since 1967) and a good opportunity for scientists and data enthusiasts to use public data for their research. So, without further ado, let's see what kind of information we can get. If someone wants to get a copy of the original file to reproduce the results, write a comment below, and I will share a link.

Loading The Data

First, let's load the data and see what dimensionality and type it has:

import pandas as pd

df = pd.read_csv("tfl.csv")
display(df.head())
display(df.info(verbose=True))

The output looks like this:

A dataframe summary, Image by author

As we can see, we have 5245 items, and there are no NULL values in the dataframe. The date format looks non-standard, so let's convert it:

df["Date Found"] = pd.to_datetime(df["Date Found"], format="%d/%m/%Y")

As a next step, let's figure out what kind of categories we have:

display(df["Category"].unique())

The output shows that the number is not that large:

array(['Bags', 'Electronics & Technology', 'Wallets & Purses',
       'Baby & Nursery', 'ID & Personal Documents', 'Health & Beauty',
       'Tools / Garden / DIY', 'Jewellery', 'Travel Cards & Ticket',
       'Household & General Items', 'Keys', 'Sports & Leisure', 'Eyewear',
       'Currency (cash)', 'Stationery & Books', 'Clothing',
       'Financial Documents'], dtype=object)

To make the visualization look better, I decided to rename the "Electronics & Technology" category to "Electronics" and combine all "Document" categories into a single one:

def update_category(name: str) -> str:
    """ Update category name """
    if "Documents" in name:
        return "Documents"
    if "Electronics" in name:
        return "Electronics"
    return name

df["Category"] = df["Category"].map(update_category)
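
As a quick sanity check, the renaming can be verified on a few of the category names listed above (the function is repeated here so the snippet runs on its own):

```python
def update_category(name: str) -> str:
    """ Update category name (same logic as above) """
    if "Documents" in name:
        return "Documents"
    if "Electronics" in name:
        return "Electronics"
    return name

samples = ["Electronics & Technology", "ID & Personal Documents",
           "Financial Documents", "Keys"]
print([update_category(s) for s in samples])
# → ['Electronics', 'Documents', 'Documents', 'Keys']
```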

After the date and category conversion, our final dataframe looks like this:

Image by author

Items Per Day

Now, we are ready to have some fun. Let's group all items by date and draw a bar chart:

import plotly.express as px

gr_day = df[["Date Found"]].groupby(["Date Found"], as_index=False).size()

fig = px.bar(gr_day, x="Date Found", y="size",
             title="TfL, Lost Items Per Day",
             width=1280, height=500)
fig.update_xaxes(tickformat="%a, %d-%m-%Y", showline=True, linecolor="black")
fig.update_layout(xaxis_title=None, yaxis_title=None, plot_bgcolor="#F5F5F5",
                  margin=dict(l=50, r=50, t=30, b=50))
fig.show()

Here, I used the open-source Plotly library to build the chart. This library is based on plotly.js, so we can use HTML formatting to change the style. The output is also interactive, and we can zoom or pan the chart directly in the notebook.

The output looks like this:

Lost items per day chart, Image by author

With these results, it's interesting to calculate the probability of losing something on public transport. I don't know the exact number of passengers on these particular days, but generally, we know that London's public transport handles about 6 million passenger journeys per day. As we can see from the graph, the number of lost items is pretty consistent. A simple calculation shows that the probability of losing something in transport is about 0.01%. The value itself is not big but is also not minuscule: roughly one passenger in every 10,000 loses something each day.
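
The estimate can be reproduced with simple arithmetic (the 6 million journeys per day figure is an approximation of TfL's public ridership numbers):

```python
items_per_week = 5245            # size of the dataset (one week of data)
journeys_per_day = 6_000_000     # approximate daily passenger journeys

p = (items_per_week / 7) / journeys_per_day
print(f"{p:.4%}")                # ≈ 0.0125%, about 1 lost item per 8,000 journeys
```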

Another interesting finding is a peak on Thursday. I don't know why it happens, but another TfL press release also mentioned Thursday as the day with the maximum number of passengers. Is there any social explanation behind this? I don't know, but it looks interesting.
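
The weekday pattern itself is easy to check: pandas can extract the day name from the parsed dates and group by it. A minimal, self-contained sketch on a hypothetical mini-sample (in the article, the same grouping would run on the full dataframe):

```python
import pandas as pd

# Hypothetical mini-sample; the real df has 5245 rows
df = pd.DataFrame({"Date Found": pd.to_datetime(
    ["20/03/2025", "20/03/2025", "21/03/2025"], format="%d/%m/%Y")})

gr_weekday = (df.assign(Weekday=df["Date Found"].dt.day_name())
                .groupby("Weekday", as_index=False).size())
print(gr_weekday)
```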

Categories

As a next step, let's group all goods by category and sub-category:

gr_cat = df[["Category",
             "Sub-Category"]].groupby(["Category",
                                       "Sub-Category"], as_index=False).size()

I tried different types of visualization, and two of them, in my opinion, are the most informative. First, let's draw a sunburst chart, which is a sort of hierarchical pie chart:

fig = px.sunburst(gr_cat, width=1280, height=800,
                  path=["Category", "Sub-Category"], values="size",
                  color="Category",
                  title="TfL, Lost Items Chart"
                  )
fig.update_layout(font_size=10, margin=dict(l=10, r=10, t=30, b=50))
fig.update_traces(textinfo="label+percent parent")
fig.show()

The output looks like this:

A sunburst diagram, Image by author

It is easy to compare different segments; for example, we can see that "Bags" and "Electronics" are at the top of our "lost chart" (19% and 15%, respectively). And in the "Bags" category, passengers most often (36%) lose their backpacks. This makes practical sense because it's not convenient to stand or sit with a backpack, and people take them off. In the "Electronics" category, mobile phones are at the top (58%). As for other categories, 10% of unlucky passengers lose their documents, and 4% lose their keys.

A sunburst chart looks interesting, but it may be hard to read narrow segments. Another alternative is a treemap chart:

fig = px.treemap(gr_cat, width=1280, height=800,
                 path=['Category', 'Sub-Category'], values='size',
                 color='Category')
fig.update_traces(textinfo="label+percent parent")
fig.show()

The output looks like this:

A treemap chart, Image by author

In this case, it's easier to read data from minor sub-categories by hovering the mouse over them. On the other hand, with the sunburst chart, it is easier to grasp the relative sizes of the different categories.

Sub-Categories

Let's investigate the "Electronics" category in more detail. To do this, we can filter the dataframe, group all items by a sub-category, and draw a bar chart:

df_ = df[df["Category"] == "Electronics"]
gr_electronics = df_[["Sub-Category"]].groupby(["Sub-Category"], as_index=False).size().sort_values(by="size", ascending=True)

fig = px.bar(gr_electronics, width=1280, height=600, 
             title="TfL, Lost Items Per Week, Electronics",
             x="size", y="Sub-Category", orientation="h")
fig.update_layout(xaxis_title="Amount", yaxis_title=None,
                  plot_bgcolor="#F5F5F5",
                  margin=dict(l=50, r=50, t=30, b=50))
fig.show()

Here, I used a horizontal bar chart because it's easier to read the labels. The output looks like this:

Image by author

As we can see, most passengers lost their phones or phone accessories, which looks obvious. Some people managed to lose their laptops, tablets, or e-readers, and one passenger lost an MP3 player (an unusual device nowadays). Surprisingly, the numbers are pretty large. The dataset represents one week of data, and as we can see, 467 passengers lost their phones during this period.
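
To put that number into perspective, a rough extrapolation (assuming this week is typical) looks like this:

```python
phones_per_week = 467

print(phones_per_week / 7)    # ≈ 67 phones found per day
print(phones_per_week * 52)   # ≈ 24,000 phones per year
```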

Locations

As a warm-up, let's group data by location and get the top 10 places where some goods were found:

gr_location = df[["Location"]].groupby(['Location'], as_index=False).size().sort_values(by="size", ascending=False)
display(gr_location[:10])

The output looks like this:

Image by author

I have been to London several times, but my knowledge of London stations is not that good. Let's draw all locations on the map so it will be easier to see the places. I will use the geopy Python library to get the coordinates:

from functools import lru_cache
from typing import Tuple

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Python3.9")

@lru_cache(maxsize=None)
def get_coord_lat_lon(full_addr: str) -> Tuple[float, float]:
    """ Get coordinates for address """
    pt = geolocator.geocode(full_addr + ", London, UK")
    return (pt.latitude, pt.longitude) if pt else (None, None)

Here, I used the lru_cache decorator, which may be helpful if I want to run the code several times; the data will be taken from the cache instead of making a new API call. I also used the tqdm Python library, which shows a progress bar during the processing; it's useful because the process takes several minutes.
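
The processing loop itself is not shown above, but a sketch of how it could look is below. The coordinate lookup is replaced with a hypothetical offline stand-in so the snippet runs without network access; in the article, get_coord_lat_lon would be applied instead:

```python
import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()  # enables .progress_apply with a progress bar

# Hypothetical offline stand-in for get_coord_lat_lon
known = {"Victoria": (51.4952, -0.1441)}

def lookup(name):
    return known.get(name, (None, None))

gr_location = pd.DataFrame({"Location": ["Victoria", "Somewhere else"]})
gr_location["Coordinates"] = gr_location["Location"].progress_apply(lookup)
```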

When the processing is finished, the dataframe looks like this:

Image by author

Now, we're ready to draw a map. I will use a folium Python library for that:

import folium
from branca.element import Figure

fig = Figure(width=1024, height=600)
fmap = folium.Map(location=(51.5, -0.104),
                  tiles="openstreetmap", zoom_start=12)

for _, row in gr_location.iterrows():
    point = row["Coordinates"]
    name, amount = row["Location"], row["size"]
    # Both coordinates must be present to place the marker
    if point[0] is not None and point[1] is not None:
        add_to_map(fmap, name, point, amount)

fig.add_child(fmap)
display(fig)

I also created the add_to_map and value_to_color helper functions, which add a station to the map:

import matplotlib.colors
import matplotlib.cm as colormap

def value_to_color(value: int) -> str:
    """ Convert value to an HTML color string """
    norm = matplotlib.colors.Normalize(vmin=0, vmax=255, clip=True)
    mapper = colormap.ScalarMappable(norm=norm, cmap=colormap.inferno)
    r, g, b, _ = mapper.to_rgba(value, alpha=None, bytes=True)
    return f"#{r:02x}{g:02x}{b:02x}"

def add_to_map(fmap: folium.Map, name: str,
               location: Tuple[float, float],
               value: int):
    """ Add a point to the map """
    color_str = value_to_color(value)
    folium.Circle(
        location=location,
        radius=10*value//2,
        popup=name + ": " + str(value),
        color=color_str,
        fill=True,
        fill_color=color_str
    ).add_to(fmap)

The result looks like this:

A London map with markers, Image by author

Obviously, this map shows the places where the goods were found, and we don't know where they were actually lost (especially on a moving train), but it is still interesting to see the results. Apparently, the biggest circles on the map correspond to the final train or bus stations, but a lot of items were found at other stations as well. Interested readers can also change parameters like the color map (I used the "Inferno" palette) and the circle sizes to get a better visualization.

Conclusion

In this article, I described the possibility of asking official institutions for public data, and we were able to analyze this data and get interesting results about property items lost on London's transport. Generally speaking, giving people access to public information is a great idea that can help researchers and data enthusiasts find interesting pieces of data. After all, statistics is a science about us. As for the results, we can see that there is about a 0.01% chance of losing something on public transport, and I wish all readers not to be counted as a part of this statistic in the future.

Tags: Data Science Data Visualization Programming Python Statistics
