HERE Probe Data Visualization with Python
Our series on HERE Probe Data analysis and visualization is designed to provide developers with a range of options for processing and visualizing data. Regardless of your preferred GIS tool, we aim to make it easy for you to get started with HERE Probe Data.
In previous blog posts, we covered the use of ArcGIS Pro and QGIS. In this post, we will walk you through using the highly versatile Python programming language to preprocess probe data and visualize the results.
Our dataset in this exercise includes two days of HERE Probe Data in Paris. We will first create a static plot with all 16.4 million points in our dataset, after which we will render a time lapse video showing only the high-speed probe traffic for the two-day period.
Initial setup and required libraries
While this step is optional, we recommend creating a new virtual environment for the project to isolate the Python library dependencies. We won't cover virtual environments in depth in this guide, but there are many great tutorials online that can help if you need assistance.
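As a minimal sketch, creating and activating an environment from the terminal typically looks like this (the .venv name is just a convention):
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate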
The required Python libraries are geopandas, shapely, duckdb, matplotlib, datashader, moviepy, and pillow. You can install them by running the following command in your terminal:
pip install geopandas shapely duckdb datashader matplotlib moviepy pillow
You can check whether the libraries have been correctly installed by running the 'pip list' command in your terminal and verifying that the output contains the libraries above.
Importing the required libraries and creating a DuckDB database
Given the large file size of probe datasets, we will use DuckDB, a popular in-process SQL OLAP database management system for preprocessing. DuckDB comes with a native Python client API, which provides an easy way to interact with our database through Python.
In a previous blog post, we showed how to convert probe data protobuf files into csv files with attributes such as latitude, longitude, speed, heading, timestamp, and traceID.
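For reference, the converted csv files look roughly like this; the column names match what we will see in the database summary below, and the sample row is made up purely for illustration:
column0,heading,latitude,longitude,traceid,sampledate,speed
0,176.0,48.8603,2.3382,0008ZGZRHiQ2GOqkHXwg,2023-03-15 08:30:00,42.0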
Let's import the libraries we have installed in the previous step and copy the probe data into a new DuckDB database.
import geopandas as gpd
import duckdb
import datashader as ds
import pandas as pd
import glob
import os
from PIL import Image, ImageDraw, ImageFont
from matplotlib.cm import inferno, viridis, Greys
import moviepy.editor as mp
import moviepy.video.fx.all as mpfx
# Create a DuckDB database, establish a connection, and install the spatial extension
conn = duckdb.connect("ParisProbeData")
conn.execute("INSTALL spatial;")
conn.execute("LOAD spatial;")
# Ingest all probe data csv files into the database
conn.execute(r"""
CREATE TABLE probe_data
AS SELECT *,
FROM 'probe_dataset/*.csv';""")
# Print summary statistics of the table
print(conn.sql("SUMMARIZE probe_data;"))
The print statement's output should provide a basic overview of our dataset:
┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬────────────────────┬────────────────────┬────────────────────┬──────────┬─────────────────┐
│ column_name │ column_type │ min │ max │ approx_unique │ … │ q25 │ q50 │ q75 │ count │ null_percentage │
│ varchar │ varchar │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │ int64 │ varchar │
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼────────────────────┼────────────────────┼────────────────────┼──────────┼─────────────────┤
│ column0 │ BIGINT │ 0 │ 667034 │ 652568 │ … │ 81622 │ 178417 │ 314751 │ 16454620 │ 0.0% │
│ heading │ DOUBLE │ 0.0 │ 359.0 │ 1200 │ … │ 83.10102004138378 │ 176.0372118490491 │ 262.46800697623644 │ 16454620 │ 0.0% │
│ latitude │ DOUBLE │ 48.812255859375 │ 48.9111328 │ 7058689 │ … │ 48.83452853162131 │ 48.8603472899127 │ 48.8811803960392 │ 16454620 │ 0.0% │
│ longitude │ DOUBLE │ 2.2192383 │ 2.47192363 │ 8810441 │ … │ 2.2906945866110604 │ 2.3382695703643774 │ 2.3898880023723112 │ 16454620 │ 0.0% │
│ traceid │ VARCHAR │ 0008ZGZRHiQ2GOqkHXwg │ zzzuwbFQScm2ATRGz9… │ 308539 │ … │ NULL │ NULL │ NULL │ 16454620 │ 0.0% │
│ sampledate │ TIMESTAMP │ 2023-03-15 00:00:00 │ 2023-03-16 23:59:59 │ 170097 │ … │ NULL │ NULL │ NULL │ 16454620 │ 0.0% │
│ speed │ DOUBLE │ 0.0 │ 240.0 │ 205 │ … │ 8.935393671301886 │ 21.67506148307818 │ 39.01931355193009 │ 16454620 │ 0.0% │
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴────────────────────┴────────────────────┴────────────────────┴──────────┴─────────────────┤
│ 7 rows 12 columns (10 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
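Before preprocessing, you can also spot-check a few raw rows; DuckDB relations print as readable tables, just like the SUMMARIZE call above (the LIMIT value is arbitrary):
# Optional: inspect a handful of raw rows
print(conn.sql("SELECT * FROM probe_data LIMIT 5;"))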
Data preprocessing
Our next step is to amend the data table and prepare it for visualization.
First up, we will create a bounding box based on the minimum and maximum lat/lon values. This will be needed to define the canvas boundaries in our visualization.
We will also convert the 'speed' variable into categorical values and verify the result:
# Get bounding box of dataset
df = conn.execute("SELECT * FROM probe_data;").fetch_df()
lat_min = df['latitude'].min()
lat_max = df['latitude'].max()
lon_min = df['longitude'].min()
lon_max = df['longitude'].max()
bbox = (lon_min, lat_min, lon_max, lat_max)
print(f"Bounding box: {bbox}")
# Create speed_category column
conn.execute("""
ALTER TABLE probe_data
ADD COLUMN speedcat VARCHAR;
""")
conn.execute("""
UPDATE probe_data
SET speedcat = (CASE
WHEN speed > 50 THEN 'high_speed_above_50'
WHEN speed BETWEEN 30 AND 50 THEN 'medium_speed_30_50'
WHEN speed < 30 THEN 'low_speed_sub_30'
ELSE 'unknown_speed'
END
);
""")
print(conn.execute("""
SELECT speedcat, COUNT(*) AS counts
FROM probe_data
GROUP BY speedcat;""").fetch_df())
If successful, the terminal will print out our bounding box and the probe point counts per speed category:
Bounding box: (2.2192383, 48.812255859375, 2.47192363, 48.9111328)
speedcat counts
0 low_speed_sub_30 10389883
1 medium_speed_30_50 3347051
2 high_speed_above_50 2717686
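If you also want each category's share of the total, DuckDB supports window functions over grouped aggregates, so a small variation of the query above does the trick:
print(conn.execute("""
    SELECT speedcat,
           COUNT(*) AS counts,
           ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
    FROM probe_data
    GROUP BY speedcat;""").fetch_df())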
Plotting the dataset
For this step we will be using Datashader, a highly optimized rendering pipeline which can handle massive datasets.
We will start by preparing the canvas size and location, using the bounding box values we computed from our data. Once our canvas is ready, we will plot our probe points by passing the data as a pandas dataframe and defining the color map.
Finally, we'll save our plot to a png image in our project folder.
# Datashader canvas preparation
cvs = ds.Canvas(plot_width=1000,
plot_height=600,
x_range=[bbox[0], bbox[2]],
y_range=[bbox[1], bbox[3]]
)
# Pull the probe dataset from duckdb into a pandas dataframe
df = conn.execute("SELECT column0, longitude, latitude, speedcat, sampledate FROM probe_data;").fetch_df()
# plot the probe points on the canvas
agg = cvs.points(df, x='longitude', y='latitude')
# Use datashader's transfer_functions to adjust the visualization parameters, then save to file
img = ds.transfer_functions.shade(agg, cmap=viridis, how='log').to_pil()
img.save("frame.png")
If successful, your newly created probe data plot with more than 16 million points should be waiting for you in your project folder!
Feel free to experiment by tweaking the color map and other variables in the Datashader functions; you can achieve quite striking results! Here are some of our favorites:
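For example, here is a minimal variation using the Greys color map we imported earlier, with histogram-equalized shading and a black background (all standard Datashader options; the output file name is arbitrary):
# Re-shade the same aggregation with a different look
img = ds.transfer_functions.set_background(
    ds.transfer_functions.shade(agg, cmap=Greys, how='eq_hist'),
    'black').to_pil()
img.save("frame_greys.png")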
Timelapse visualization
As a final exercise, we will create a timelapse video from our dataset to understand where and at what time high speed traffic occurs in Paris.
# Filter on high speed points and resample the dataframe into thirty-minute splits
df = df[df['speedcat'] == 'high_speed_above_50'].copy()
df['sampledate'] = pd.to_datetime(df['sampledate'])
resampled = df.resample("30min", on='sampledate')
# Iterate over the resampler object and store the sliced dataframes in a dictionary
df_dict = {}
for i, (timestamp, group) in enumerate(resampled):
    df_dict[i] = group
# Loop through the sampled dataframes and create frames
os.makedirs("frames", exist_ok=True)
for i, df_frame in df_dict.items():
    # skip empty thirty-minute slices
    if df_frame.empty:
        continue
    # get day value of sample
    day = df_frame['sampledate'].iloc[0].day_name()
    # create datashader canvas
    cvs = ds.Canvas(plot_width=1000,
                    plot_height=600,
                    x_range=[bbox[0], bbox[2]],
                    y_range=[bbox[1], bbox[3]]
                    )
    agg = cvs.points(df_frame, x='longitude', y='latitude')
    img = ds.transfer_functions.shade(agg, cmap=inferno, how='log').to_pil()
    # add time of day to frame image (verdana.ttf must be available on your system; substitute any ttf font)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("verdana.ttf", 24)
    draw.text((10, 10), '{:d}:{:02d}'.format(df_frame['sampledate'].iloc[0].hour, df_frame['sampledate'].iloc[0].minute), 'white', font=font)
    # add number of probe points to frame image
    font = ImageFont.truetype("verdana.ttf", 18)
    draw.text((10, 40), '# of probe points: {:,}'.format(len(df_frame.index)), 'grey', font=font)
    img.save(f"frames/frame{i}{day}.png")
# Compile individual frames into video file
imgs = glob.glob("frames/*.png")
imgs = sorted(imgs, key=lambda t: os.stat(t).st_mtime)
clips = [mp.ImageClip(m).set_duration(0.10) for m in imgs]
concat_clip = mp.concatenate_videoclips(clips, method="compose")
concat_clip.write_videofile("probe_animation.mp4", fps=60)
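The moviepy.video.fx.all module we imported as mpfx can also post-process the finished clip. As a quick, optional sketch, here is how you might speed the animation up and export an animated GIF (the speed factor and file names are just examples):
# Double the playback speed, then also export an animated GIF version
fast_clip = mpfx.speedx(concat_clip, factor=2)
fast_clip.write_videofile("probe_animation_fast.mp4", fps=60)
concat_clip.write_gif("probe_animation.gif", fps=15)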
Conclusion
We hope this quick tutorial will be useful in getting you started with HERE's extensive Probe Data library.
Stay tuned for future updates, tutorials, and use cases showcasing HERE Probe Data. For more information about HERE Probe Data, visit the HERE Probe Data User Guide.