HERE Probe Data Visualization with Python

Our series on HERE Probe Data analysis and visualization is designed to provide developers with a range of options for processing and visualizing data. Regardless of your preferred GIS tool, we aim to make it easy for you to get started with HERE Probe Data.

In previous blog posts, we covered the use of ArcGIS Pro and QGIS. In this post, we will walk you through using the highly versatile Python programming language to preprocess probe data and visualize the results.

Our dataset in this exercise includes two days of HERE Probe Data in Paris. We will first create a static plot with all 16.4 million points in our dataset, after which we will render a time lapse video showing only the high-speed probe traffic for the two-day period.

Initial setup and required libraries

While this step is optional, we recommend creating a new virtual environment for the project to isolate the Python library dependencies. We will not cover how to create a virtual environment in this guide, but many good tutorials online can help you if you need assistance.

The required Python libraries are geopandas, shapely, duckdb, matplotlib, datashader, moviepy, and pillow. You can install them by running the following command in your terminal:

        pip install geopandas shapely duckdb datashader matplotlib moviepy pillow

You can check whether the libraries have been correctly installed by running the 'pip list' command in your terminal and verifying that the output contains the libraries above.
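
The same check can also be scripted. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name missing_packages is hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install command
REQUIRED = ["geopandas", "shapely", "duckdb", "datashader",
            "matplotlib", "moviepy", "pillow"]


def missing_packages(names):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises if the distribution is not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing


missing = missing_packages(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

If the printed list is not empty, re-run the pip command inside the environment you intend to use for the project.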

Importing the required libraries and creating a DuckDB database

Given the large file size of probe datasets, we will use DuckDB, a popular in-process SQL OLAP database management system for preprocessing. DuckDB comes with a native Python client API, which provides an easy way to interact with our database through Python.

In a previous blog post, we showed how to convert probe data from protobuf into CSV files with attributes such as latitude, longitude, speed, heading, timestamp, and traceID.

Let's import the libraries we have installed in the previous step and copy the probe data into a new DuckDB database.

import geopandas as gpd
import duckdb
import datashader as ds
import pandas as pd
import glob
import os
from PIL import Image, ImageDraw, ImageFont
from matplotlib.cm import inferno, viridis, Greys
import moviepy.editor as mp
import moviepy.video.fx.all as mpfx

# Create the DuckDB database, establish a connection, and install and load the spatial extension
conn = duckdb.connect("ParisProbeData")
conn.execute("INSTALL spatial;")
conn.execute("LOAD spatial;")

# Ingest all probe data csv files into the database
conn.execute(r"""
             CREATE TABLE probe_data
             AS SELECT *
             FROM 'probe_dataset/*.csv';""")

# Print summary statistics of the table
print(conn.sql("SUMMARIZE probe_data;"))

The print statement's output should provide a basic overview of our dataset:

┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬────────────────────┬────────────────────┬────────────────────┬──────────┬─────────────────┐
│ column_name │ column_type │         min          │         max          │ approx_unique │ … │        q25         │        q50         │        q75         │  count   │ null_percentage │
│   varchar   │   varchar   │       varchar        │       varchar        │    varchar    │   │      varchar       │      varchar       │      varchar       │  int64   │     varchar     │
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼────────────────────┼────────────────────┼────────────────────┼──────────┼─────────────────┤
│ column0     │ BIGINT      │ 0                    │ 667034               │ 652568        │ … │ 81622              │ 178417             │ 314751             │ 16454620 │ 0.0%            │
│ heading     │ DOUBLE      │ 0.0                  │ 359.0                │ 1200          │ … │ 83.10102004138378  │ 176.0372118490491  │ 262.46800697623644 │ 16454620 │ 0.0%            │
│ latitude    │ DOUBLE      │ 48.812255859375      │ 48.9111328           │ 7058689       │ … │ 48.83452853162131  │ 48.8603472899127   │ 48.8811803960392   │ 16454620 │ 0.0%            │
│ longitude   │ DOUBLE      │ 2.2192383            │ 2.47192363           │ 8810441       │ … │ 2.2906945866110604 │ 2.3382695703643774 │ 2.3898880023723112 │ 16454620 │ 0.0%            │
│ traceid     │ VARCHAR     │ 0008ZGZRHiQ2GOqkHXwg │ zzzuwbFQScm2ATRGz9…  │ 308539        │ … │ NULL               │ NULL               │ NULL               │ 16454620 │ 0.0%            │
│ sampledate  │ TIMESTAMP   │ 2023-03-15 00:00:00  │ 2023-03-16 23:59:59  │ 170097        │ … │ NULL               │ NULL               │ NULL               │ 16454620 │ 0.0%            │
│ speed       │ DOUBLE      │ 0.0                  │ 240.0                │ 205           │ … │ 8.935393671301886  │ 21.67506148307818  │ 39.01931355193009  │ 16454620 │ 0.0%            │
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴────────────────────┴────────────────────┴────────────────────┴──────────┴─────────────────┤
│ 7 rows                                                                                                                                                            12 columns (10 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Data preprocessing

Our next step is to amend the data table and prepare it for visualization.

First up, we will create a bounding box based on the minimum and maximum lat/lon values. This will be needed to define the canvas boundaries in our visualization.

We will also convert the 'speed' variable into categorical values and verify the result:

# Get the bounding box of the dataset; aggregating in DuckDB avoids loading every row into pandas
lon_min, lat_min, lon_max, lat_max = conn.execute("""
             SELECT MIN(longitude), MIN(latitude), MAX(longitude), MAX(latitude)
             FROM probe_data;""").fetchone()

bbox = (lon_min, lat_min, lon_max, lat_max)
print(f"Bounding box: {bbox}")

# Create speed_category column
conn.execute("""
             ALTER TABLE probe_data
             ADD COLUMN speedcat VARCHAR;
             """)

conn.execute("""
             UPDATE probe_data
             SET speedcat = (CASE
                                WHEN speed > 50 THEN 'high_speed_above_50'
                                WHEN speed BETWEEN 30 AND 50 THEN 'medium_speed_30_50'
                                WHEN speed < 30 THEN 'low_speed_sub_30'
                                ELSE 'unknown_speed'
                                END
                            );
             """)

print(conn.execute("""
             SELECT speedcat, COUNT(*) AS counts
             FROM probe_data
             GROUP BY speedcat;""").fetch_df())

If successful, the terminal will print out our bounding box and the probe point counts per speed category:

Bounding box: (2.2192383, 48.812255859375, 2.47192363, 48.9111328)

              speedcat    counts
0     low_speed_sub_30  10389883
1   medium_speed_30_50   3347051
2  high_speed_above_50   2717686
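
To make the category boundaries explicit, here is a small pure-Python sketch that mirrors the SQL CASE expression. The function name speed_category and the sample speeds are made up for illustration; the point is that BETWEEN is inclusive, so speeds of exactly 30 and 50 both land in the medium category, and a missing speed falls through to 'unknown_speed'.

```python
from collections import Counter


def speed_category(speed):
    """Mirror the SQL CASE expression used for the speedcat column."""
    if speed is None:
        return "unknown_speed"       # NULL fails every comparison in SQL
    if speed > 50:
        return "high_speed_above_50"
    if 30 <= speed <= 50:            # BETWEEN is inclusive on both ends
        return "medium_speed_30_50"
    if speed < 30:
        return "low_speed_sub_30"
    return "unknown_speed"           # kept to mirror the ELSE branch


sample_speeds = [0.0, 29.9, 30.0, 50.0, 50.1, None]  # made-up values
counts = Counter(speed_category(s) for s in sample_speeds)
print(counts)

# Sanity check: the three counts printed above add up to the table's row count
total = 10389883 + 3347051 + 2717686
print(total)
```

The three category counts sum to 16,454,620, the row count reported by SUMMARIZE, which is consistent with no rows falling into 'unknown_speed'.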

Plotting the dataset

For this step we will be using Datashader, a highly optimized rendering pipeline which can handle massive datasets.

We will start by preparing the canvas size and location, using the bounding box values we computed from our data. Once our canvas is ready, we will plot our probe points by passing the data as a pandas dataframe and defining the color map.

Finally, we'll save our plot to a png image in our project folder.

# Datashader canvas preparation
cvs = ds.Canvas(plot_width=1000,
                plot_height=600,
                x_range=[bbox[0], bbox[2]],
                y_range=[bbox[1], bbox[3]]
                )

# Pull the probe dataset from duckdb into a pandas dataframe 
df = conn.execute("SELECT column0, longitude, latitude, speedcat, sampledate FROM probe_data;").fetch_df()
# plot the probe points on the canvas
agg = cvs.points(df, x='longitude', y='latitude')

# Use datashader's transfer_functions to adjust the visualization parameters then save to file
img=(ds.transfer_functions.shade(agg, cmap = viridis, how='log')).to_pil()
img.save("frame.png")

If successful, your newly created probe data plot with more than 16 million points should be waiting for you in your project folder!
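
Conceptually, the canvas step counts how many points fall into each pixel, and the shading step maps those counts to colors. The pure-Python sketch below illustrates that idea on three made-up points and a tiny 10 x 6 grid. It is not how Datashader is implemented, the function name bin_points is hypothetical, and the log1p scaling is only meant to convey the spirit of how='log'.

```python
import math


def bin_points(points, bbox, width, height):
    """Count (lon, lat) points per cell of a width x height grid."""
    lon_min, lat_min, lon_max, lat_max = bbox
    grid = [[0] * width for _ in range(height)]
    for lon, lat in points:
        if not (lon_min <= lon <= lon_max and lat_min <= lat <= lat_max):
            continue  # point lies outside the canvas
        # Scale to cell indices; clamp so the maximum edge lands in the last cell
        col = min(int((lon - lon_min) / (lon_max - lon_min) * width), width - 1)
        row = min(int((lat - lat_min) / (lat_max - lat_min) * height), height - 1)
        grid[row][col] += 1
    return grid


bbox = (2.2192383, 48.812255859375, 2.47192363, 48.9111328)
points = [(2.2192383, 48.812255859375), (2.35, 48.86), (2.35, 48.86)]
grid = bin_points(points, bbox, width=10, height=6)

# Log scaling compresses the gap between busy and quiet cells
log_grid = [[math.log1p(count) for count in row] for row in grid]
```

Here the two identical points share the cell at row 2, column 5, and the point on the lower-left corner of the bounding box lands in the cell at row 0, column 0.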

Feel free to experiment with the color map and other parameters in the Datashader functions; you can achieve quite striking results.

Timelapse visualization

As a final exercise, we will create a timelapse video from our dataset to understand where and at what time high speed traffic occurs in Paris.

# Filter on high speed points and resample the dataframe into thirty minute splits
df = df[df['speedcat'] == 'high_speed_above_50'].copy()
df['sampledate'] = pd.to_datetime(df['sampledate'])
resampled = df.resample("30min", on='sampledate')

# Iterate over the resampler object and keep the non-empty slices in a list
frames = [df_slice for _, df_slice in resampled if not df_slice.empty]

# Make sure the output folder exists before saving any frames
os.makedirs("frames", exist_ok=True)

# Font files are platform-dependent; point these at a .ttf file available on your system
font_time = ImageFont.truetype("verdana.ttf", 24)
font_count = ImageFont.truetype("verdana.ttf", 18)

# Loop through the sampled dataframes and create frames
for i, df_frame in enumerate(frames):
    first_sample = df_frame['sampledate'].iloc[0]
    # get day value of sample
    day = first_sample.day_name()
    # create datashader canvas
    cvs = ds.Canvas(plot_width=1000,
                    plot_height=600,
                    x_range=[bbox[0], bbox[2]],
                    y_range=[bbox[1], bbox[3]]
                    )
    agg = cvs.points(df_frame, x='longitude', y='latitude')
    img = (ds.transfer_functions.shade(agg, cmap=inferno, how='log')).to_pil()
    # add time of day to frame image
    draw = ImageDraw.Draw(img)
    draw.text((10, 10), '{:d}:{:02d}'.format(first_sample.hour, first_sample.minute), 'white', font=font_time)
    # add number of probe points to frame image
    draw.text((10, 40), '# of probe points: {:,}'.format(len(df_frame.index)), 'grey', font=font_count)
    img.save(f"frames/frame{i}{day}.png")

# Compile individual frames into video file
# (moviepy.editor and set_duration belong to the moviepy 1.x API)
imgs = glob.glob("frames/*.png")
imgs = sorted(imgs, key=lambda t: os.stat(t).st_mtime)
clips = [mp.ImageClip(m).set_duration(0.10) for m in imgs]
concat_clip = mp.concatenate_videoclips(clips, method="compose")
concat_clip.write_videofile("probe_animation.mp4", fps=60)
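
Two details of the timelapse are worth spelling out: how a timestamp maps to its thirty-minute bin, and how frame order is preserved when the images are compiled. The sketch below uses only the standard library. The helper names floor_to_bucket and frame_name are hypothetical, and the zero-padded file name is an alternative to sorting by modification time, not what the script above does.

```python
from datetime import datetime, timedelta


def floor_to_bucket(ts, minutes=30):
    """Round a timestamp down to the start of its bin (minutes must divide 60)."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def frame_name(index, ts):
    """Zero-padded name, so sorting by name matches sorting by time."""
    return f"frames/frame{index:03d}_{ts:%A}.png"


# Two days of half-hour bins starting at the first timestamp in the dataset
start = datetime(2023, 3, 15)
buckets = [start + timedelta(minutes=30 * i) for i in range(96)]
names = [frame_name(i, ts) for i, ts in enumerate(buckets)]

# Each frame is shown for 0.10 seconds in the compiled clip
video_seconds = len(names) * 0.10
print(len(names), round(video_seconds, 1))
```

With 96 half-hour frames at 0.10 seconds each, the resulting video should run for roughly 9.6 seconds, assuming every bin contains at least one high-speed point.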

Conclusion

We hope this quick tutorial will be useful in getting you started with HERE's extensive Probe Data library.

Stay tuned for future updates, tutorials, and use cases showcasing HERE Probe Data. For more information about HERE Probe Data, visit the HERE Probe Data User Guide.