Exploratory analysis of reported road incidents in Mexico City

In January 2019, Mexico City's government made several datasets about the city's public transport, services, and security available to the public (https://datos.cdmx.gob.mx/pages/home/). This makes it possible to explore those datasets and extract interesting insights. I decided to give it a try and analyze the road incidents dataset to help formulate strategies that could potentially improve the quality of life in the city. For this I considered two aspects: which regions of the city are more prone to vehicular accidents, and when these accidents occur.

Lately, I've been using holoviews and geoviews for my projects and I love how easy they make it to produce interactive plots. I can't recommend those libraries enough. I will use them throughout this blog post; if you want to learn more, check http://holoviz.org/ for a detailed tutorial.
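To give a flavor of how little code an interactive plot takes, here is a toy holoviews sketch (the data is made up and not part of this analysis):

import holoviews as hv
hv.extension('bokeh')

# An interactive curve with hover tooltips in three lines
hv.Curve([(0, 0), (1, 1), (2, 4), (3, 9)], 'x', 'y').opts(tools=['hover'])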

In [2]:
import hdbscan
import numpy as np
import pandas as pd
import holoviews as hv
import geoviews as gv
from bokeh.resources import CDN
from bokeh.embed import file_html
from IPython.display import HTML

hv.extension('bokeh')

I downloaded the dataset as a CSV file (incidentes-viales-c5.csv) and use the following function to load and clean the table:

In [3]:
def load_incidents(path, keep_confirmed: bool = True) -> pd.DataFrame:
    """Load incidents data and cleans it using pandas"""
    # Load only interesting columns for the analysis
    cols = ['folio', 'fecha_creacion', 'hora_creacion',
            'dia_semana', 'codigo_cierre', 'fecha_cierre',
            'incidente_c4', 'latitud', 'longitud',
            'clas_con_f_alarma', 'tipo_entrada']
    df = pd.read_csv(path, sep=';', usecols=cols, 
                     dtype={'tipo_entrada': 'category',
                            'clas_con_f_alarma': 'category'})
    fecha_creacion_str = df.fecha_creacion.str.cat(df.hora_creacion, sep=' ')
    # In some records the years 2018 and 2019 appear as 18 and 19
    fecha_creacion_str = fecha_creacion_str.str.replace('/19', '/2019')
    fecha_creacion_str = fecha_creacion_str.str.replace('/18', '/2018')
    fecha_cierre_str = df.fecha_cierre.str.replace('/19', '/2019')
    df = df.assign(
        dia_semana=pd.Categorical(df.dia_semana,
                                  categories=['Lunes', 'Martes', 'Miércoles',
                                              'Jueves', 'Viernes', 'Sábado', 'Domingo'],
                                  ordered=True),
        fecha_creacion=pd.to_datetime(fecha_creacion_str, format='%d/%m/%Y %H:%M:%S'),
        fecha_cierre=pd.to_datetime(fecha_cierre_str, format='%d/%m/%Y'),
    ).drop('hora_creacion', axis=1)
    df = df.assign(hora=df.fecha_creacion.dt.hour)
    # There are some records without latitude or longitude
    df = df.loc[~df.latitud.isna() & ~df.longitud.isna()]
    if keep_confirmed:
        df = df.loc[df.codigo_cierre.str.startswith('(A)')
                    | df.codigo_cierre.str.startswith('(I)')]
    df = df.reset_index(drop=True)
    return df


df_incidents = load_incidents('../data/incidentes-viales-c5.csv')
print(df_incidents.shape)
df_incidents.head()
(517196, 11)
Out[3]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora
0 C4/150220/03416 2015-02-20 17:09:02 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 lesionado-atropellado 19.486470 -99.155960 URGENCIAS MEDICAS LLAMADA DEL 066 17
1 C4/150220/01020 2015-02-20 08:16:01 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque sin lesionados 19.339200 -99.043190 EMERGENCIA LLAMADA DEL 066 8
2 C4/150220/01357 2015-02-20 09:57:51 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque con lesionados 19.276110 -99.166080 URGENCIAS MEDICAS LLAMADA DEL 066 9
3 CH/150220/00015 2015-02-20 00:03:33 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque con lesionados 19.405648 -99.128962 URGENCIAS MEDICAS BOTÓN DE AUXILIO 0
4 C4/150220/04687 2015-02-20 20:58:35 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque sin lesionados 19.501250 -99.123670 EMERGENCIA LLAMADA DEL 066 20

There are some incidents in the table that were not confirmed by the authorities or by any witnesses. I decided to get rid of these by keeping only the rows whose codigo_cierre column starts with (A) or (I), which indicates that the report was verified. You can also check the table documentation on the website.
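If you are curious about how the reports split across closure codes, a quick sketch like the following counts the reports per code prefix (it reloads the table with the confirmation filter disabled; which codes exist besides (A) and (I) depends on the data):

df_all = load_incidents('../data/incidentes-viales-c5.csv', keep_confirmed=False)

# Each codigo_cierre value starts with a short code such as (A) or (I);
# extract that prefix and count the reports under each one
code_prefix = df_all.codigo_cierre.str.extract(r'^(\([A-Z]\))', expand=False)
print(code_prefix.value_counts(dropna=False))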

Let's start the analysis with a time series of incidents. The plot below shows the number of verified reports per day since 2014; as you can see, the number decreased heavily during 2015 and 2016. From then onwards it has remained almost constant, with some seasonal patterns due to holiday periods and weekends. Moreover, you can see a strange spike of reports during Christmas Eve of 2014. This may be due to the combination of the shopping frenzy and bad weather that caused several jams in the streets that day.

In [4]:
def incidents_time_series(df: pd.DataFrame, width: int, height: int):
    time_series = (df.resample('D', on='fecha_creacion').folio.count()
                     .reset_index()
                     .rename(columns={'folio': 'incidents'}))
    curve_plot = hv.Curve(time_series, kdims=['fecha_creacion'], vdims=['incidents']).opts(alpha=0.6)
    scatter_plot = hv.Scatter(time_series, kdims=['fecha_creacion'], vdims=['incidents']).opts(tools=['hover'])
    vline = hv.VLine(pd.to_datetime('2016-01-01')).opts(color='red', alpha=0.4)
    time_series_plot = (curve_plot * scatter_plot * vline).opts(show_grid=True, width=width, height=height)
    time_series_plot = time_series_plot.redim.range(incidents=(0, 800))
    return time_series_plot


incidents_ts = incidents_time_series(df_incidents, 1000, 450)
HTML(file_html(hv.render(incidents_ts), CDN))
Out[4]:
[Interactive Bokeh plot: daily number of verified incident reports]

For the rest of the analysis I will only consider the data corresponding to the 2016-2019 period. The reason is that the series behaves consistently within this time frame.

In [5]:
df_incidents = df_incidents.loc[df_incidents.fecha_creacion.dt.year >= 2016].reset_index(drop=True)
print(df_incidents.shape)
df_incidents.head()
(265152, 11)
Out[5]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora
0 BJ/160104/00916 2016-01-04 09:20:50 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.378201 -99.141777 EMERGENCIA BOTÓN DE AUXILIO 9
1 C4/160104/03382 2016-01-04 19:15:29 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.338720 -99.190420 EMERGENCIA LLAMADA DEL 066 19
2 C4/160104/03499 2016-01-04 19:42:46 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.426940 -99.092650 EMERGENCIA LLAMADA DEL 066 19
3 C4/160104/03750 2016-01-04 20:34:41 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.356260 -99.134500 EMERGENCIA LLAMADA DEL 066 20
4 C4/160104/02601 2016-01-04 16:30:52 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.295830 -99.227230 EMERGENCIA LLAMADA DEL 066 16

The following plot shows the number of incidents per type (incidente_c4). As we can see, most of the incidents in this dataset are car crashes without people injured. However, the values of this column also need to be cleaned, since the same categories appear written in different ways.
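As a side note, a simple way to consolidate those near-duplicate categories is to normalize the strings before grouping. This is only a sketch; the exact rules depend on the variants actually present in the column:

import unicodedata

def normalize_category(value: str) -> str:
    """Lowercase, trim and strip accents so that variants of the
    same category collapse into a single label"""
    value = value.strip().lower()
    decomposed = unicodedata.normalize('NFKD', value)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# na_action='ignore' leaves missing values untouched
incidente_c4_clean = df_incidents.incidente_c4.map(normalize_category,
                                                   na_action='ignore')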

In [6]:
def incidents_by_type(df, width: int, height: int):
    df_types = (df.groupby('incidente_c4', as_index=False).folio.count()
                  .rename(columns={'folio': 'incidents'})
                  .sort_values('incidents'))
    bars = hv.Bars(df_types, kdims=['incidente_c4'], vdims=['incidents'])
    bars = bars.opts(width=width, height=height, invert_axes=True)
    return bars

incidents_types = incidents_by_type(df_incidents, 900, 400)
HTML(file_html(hv.render(incidents_types), CDN))
Out[6]:
[Interactive Bokeh plot: number of incidents per type]

Now let's explore when incidents occur. For this we can use a heatmap. In the following plot the x axis represents the hour of the day and the y axis the day of the week. We can see that most of the accidents occur on Friday nights; a possible cause may be people who decide to drive after drinking. Also, there is a clear intensity pattern during the mornings and afternoons of working days, probably related to the number of people who commute to work in the morning and return home in the afternoon.

In [7]:
def incidents_heatmap(df: pd.DataFrame, width: int, height: int):
    """Returns a heatmap where the x axis corresponds to the hour,
    the y axis to the day of the week and the color represents the
    number of incidentes"""
    df = (df.groupby(['dia_semana', 'hora'], as_index=False).folio.count()
            .rename(columns={'folio': 'incidents'})
            .sort_values(['dia_semana', 'hora']))
    heatmap = hv.HeatMap(df, kdims=['hora', 'dia_semana'], vdims=['incidents'])
    heatmap = heatmap.opts(cmap='viridis', width=width, height=height, colorbar=True).opts(active_tools=['reset'])
    labels = hv.Labels(heatmap).opts(text_font_size='9pt')
    heatmap = (heatmap * labels)
    return heatmap


incidents_hm = incidents_heatmap(df_incidents, 1000, 400)
HTML(file_html(hv.render(incidents_hm), CDN))
Out[7]:
[Interactive Bokeh plot: heatmap of incidents by hour and day of the week]

Another useful plot is the spatial distribution of the reports (a bivariate histogram). This allows us to observe which regions of the city have the most accidents. Using geoviews it is easy to overlay that distribution on top of a map, as shown below.

We can observe that the majority of incidents happen near downtown, where most companies and businesses are located. The high density of people and cars around that area creates a more chaotic environment, which makes accidents more likely.

In [8]:
def hextile_map(df: pd.DataFrame, width: int, height: int):
    hextiles = gv.HexTiles(df, kdims=['longitud', 'latitud'])
    hextiles = hextiles.opts(alpha=0.5, colorbar=True, tools=['hover'])
    tiles = gv.tile_sources.Wikipedia.opts(width=width, height=height, 
                                           xaxis=None, yaxis=None)
    accidents_map = hextiles * tiles
    return accidents_map


incidents_map = hextile_map(df_incidents, 1000, 700)
HTML(file_html(hv.render(incidents_map), CDN))
Out[8]:
[Interactive Bokeh plot: hexagonally binned map of incident locations]

Finally, I want to determine which places in the city (street crossings, avenues, or highways) have a high concentration of incidents; that is to say, which parts of the city would benefit the most from additional attention from the local authorities in order to decrease the total number of accidents. To achieve this I will use a clustering algorithm called HDBSCAN on the coordinates of the incidents. Unlike other clustering algorithms, HDBSCAN doesn't require you to set the number of clusters; the algorithm estimates it from the data. If you want to learn more about HDBSCAN, I highly recommend watching this talk from PyData New York 2018, where the presenter explains the logic behind it and some of the implementation details. It is a really cool talk.

In [9]:
def calculate_clusters(df: pd.DataFrame, min_cluster_size: int = 15):
    """Calculate clusters using the coordinates of the incidents"""
    # Haversine distance expects latitude, longitude in radians
    coords_radians = np.deg2rad(df.loc[:, ['latitud', 'longitud']].to_numpy())
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                core_dist_n_jobs=1,
                                metric='haversine',
                                approx_min_span_tree=False)
    clusterer.fit(coords_radians)
    return clusterer


clusterer = calculate_clusters(df_incidents)
df_incidents = df_incidents.assign(cluster_number=clusterer.labels_)
print(df_incidents.shape)
df_incidents.head(3)
(265152, 12)
Out[9]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora cluster_number
0 BJ/160104/00916 2016-01-04 09:20:50 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.378201 -99.141777 EMERGENCIA BOTÓN DE AUXILIO 9 6049
1 C4/160104/03382 2016-01-04 19:15:29 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.338720 -99.190420 EMERGENCIA LLAMADA DEL 066 19 814
2 C4/160104/03499 2016-01-04 19:42:46 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.426940 -99.092650 EMERGENCIA LLAMADA DEL 066 19 4048

Once every coordinate has been assigned to a cluster, or marked as noise, we can group the points by cluster to find the biggest ones. For this analysis I will only keep the one hundred largest clusters. In order to plot them on a map, we take the average of the coordinates of each cluster to get a centroid, and then scale its size and color according to the number of its members.

In [10]:
def get_biggest_clusters(df, n_clusters: int = 100) -> pd.DataFrame:
    """Returns a dataframe with the clusters with higher number of points"""
    df = (df.loc[df.cluster_number != -1]
            .groupby('cluster_number', as_index=False).folio.count()
            .rename(columns={'folio': 'points_in_cluster'})
            .nlargest(n_clusters, columns='points_in_cluster', keep='all')
            .reset_index(drop=True))
    return df


def clusters_of_incidents(df: pd.DataFrame, n_clusters: int = 100, width: int = 1000, height: int = 800):
    df_clusters = get_biggest_clusters(df, n_clusters)
    df_points = df.loc[df.cluster_number.isin(df_clusters.cluster_number), ['longitud', 'latitud', 'cluster_number']]
    df_centroids = df_points.groupby('cluster_number', as_index=False).mean()
    df_clusters = pd.merge(df_centroids, df_clusters, on='cluster_number')
    
    tiles = gv.tile_sources.Wikipedia.opts(width=width, height=height, xaxis=None, yaxis=None)
    clusters = gv.Points(df_clusters, kdims=['longitud', 'latitud'], vdims=['points_in_cluster', 'cluster_number'])
    clusters = clusters.opts(color='points_in_cluster',
                             tools=['hover'],
                             alpha=0.7,
                             colorbar=True, 
                             cmap='viridis',
                             size=gv.dim('points_in_cluster') / 5,
                             logz=True)
    clusters_on_map = tiles * clusters
    return clusters_on_map


clusters_map = clusters_of_incidents(df_incidents, n_clusters=100, width=1000, height=700)
HTML(file_html(hv.render(clusters_map), CDN))
Out[10]:
[Interactive Bokeh plot: map of the 100 largest incident clusters]

This way, by analyzing the information made available by the city, we can say that the three largest clusters of accidents within the city are (see the sketch after the list for how to extract them):

  1. A street crossing near Central de Abasto.
  2. The roundabout between Miguel Ángel de Quevedo and Universidad.
  3. The crossing of Periférico Sur and Canal de Chalco.
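For reference, here is a minimal sketch of how these locations can be read off the data, reusing the get_biggest_clusters function from above to print the centroids of the three largest clusters:

# Centroids of the three largest clusters, for looking up on a map
top3 = get_biggest_clusters(df_incidents, n_clusters=3)
points = df_incidents.loc[df_incidents.cluster_number.isin(top3.cluster_number),
                          ['latitud', 'longitud', 'cluster_number']]
centroids = points.groupby('cluster_number', as_index=False).mean()
print(pd.merge(centroids, top3, on='cluster_number'))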

With this knowledge, the local government could implement new measures, such as sending more police officers to those clusters to improve the control of vehicular flow, or try different strategies that could reduce the overall number of incidents.

Acknowledgements

I want to thank José Manuel Guevara Vela (@manuel_bot) for his great feedback on this post.
