Solving the Monty Hall problem with Pyro

Monty Hall problem

In this notebook we will explore the Monty Hall problem and solve it using Pyro. You can run the code in this notebook with Binder using this link.

Watch this video if you don't know the problem. You can also read more about its history on its Wikipedia page.

In [1]:
import daft
import torch
import pyro
import math
import pyro.distributions as dist
import matplotlib.pyplot as plt
from pyro.infer import Importance, EmpiricalMarginal
from typing import Dict, List
from collections import Counter

pyro.set_rng_seed(101)
pyro.enable_validation(True)

At first glance it is not obvious that the best strategy is always to switch to the other door, but from the video we learned that this calculation can be done with pen and paper. Here we are going to solve it with Pyro to show that we can get to the same answer using a probabilistic programming tool.

Pyro requires that we program the generative process of the problem (the model) using simple annotations. We then condition the model on the observed variables (in this case the contestant's door and the host's door) with special functions to get a new, constrained model, which becomes the input to an inference algorithm that calculates a posterior distribution over the unobserved variable (the prize's door).

The generative process of this problem is the following:

  1. The prize is placed behind one of the 3 doors at random. This information is only available to the host.
  2. The contestant selects one of the doors without opening it. The host observes the selected door.
  3. The host opens a door that doesn't contain the prize and that is different from the contestant's door.
  4. Finally, the contestant has to choose between keeping her initial choice or picking the remaining door.

Let's start by programming a function that calculates, for each door, the probability that the host will open it.

In [2]:
def host_conditional_probs(prize_door: int,
                           contestant_door: int) -> List[float]:
    """Given the location of the prize and the contestant's door, 
    the function returns the probabilities over the choices 
    of the host."""
    if prize_door == contestant_door:
        probs = [0.5, 0.5, 0.5]
        probs[contestant_door] = 0.
        return probs
    probs = [1., 1., 1.]
    probs[contestant_door] = 0.
    probs[prize_door] = 0.
    return probs

Now let's test this function in two different scenarios in order to see the possible choices of the host given the inputs:

  • the contestant selects the door with the prize
  • the contestant selects a door without the prize

As you can see, when the prize's door and the contestant's door are different the host has only one possible choice.

In [3]:
# door ∈ {0, 1, 2}
prize_door = 0
contestant_door = 0
assert host_conditional_probs(prize_door, contestant_door) == [0, 0.5, 0.5]

prize_door = 0
contestant_door = 1
assert host_conditional_probs(prize_door, contestant_door) == [0, 0, 1]

We can also describe the problem as a probabilistic graphical model (PGM) where the location of the prize and the choices made by the contestant and the host are random variables, represented by nodes in the graph below. The edges represent dependencies between these variables; as you can see from the graph, the choice of the host depends on the location of the prize and on the contestant's choice. The shaded nodes indicate variables that are observed (by the contestant, since she is the one interested in solving the problem).

In [4]:
pgm = daft.PGM()
scale = 3
pgm.add_node("prize", "prize_door", 0.5, 2, scale=scale)
pgm.add_node("contestant", "contestant_door", 2.5, 2, scale=scale, observed=True)
pgm.add_node("host", "host_door", 1.5, 0.25, scale=scale, observed=True)
pgm.add_edge("prize", "host")
pgm.add_edge("contestant", "host")
pgm.render();

If you're familiar with PGMs you will notice that the graph is a v-structure (collider), which means that if the child node is observed the parent nodes become dependent. So it is through the host's choice that we can learn something about the location of the prize.

Now, let's code the model using Pyro.

In [5]:
def monty_hall_model(contestant_door: int):
    """Monty Hall model"""
    probs_loc_prize = torch.tensor([1/3, 1/3, 1/3])
    # assign the prize to a door with equal probability
    prize_door = pyro.sample('prize_door', dist.Categorical(probs_loc_prize))
    host_probs = host_conditional_probs(int(prize_door), contestant_door)
    host_door = pyro.sample(
        'host_door', dist.Categorical(torch.tensor(host_probs))
    )
    return {'prize_door': prize_door, 'host_door': host_door}


contestant_door = 0
monty_hall_model(contestant_door)
Out[5]:
{'prize_door': tensor(1), 'host_door': tensor(2)}

If we run the model multiple times, like a normal function, we will get feasible scenarios of the problem. Here the probability distribution of the prize's door is uniform among the 3 choices. Let's simulate the model 1000 times and check that this is the case:

In [6]:
n_samples = 1000
samples = [
    monty_hall_model(contestant_door)['prize_door'].item()
    for _ in range(n_samples)
]
counter = Counter(samples)
{f'prob_door_{door}': c/n_samples for door, c in counter.items()}
Out[6]:
{'prob_door_0': 0.338, 'prob_door_1': 0.315, 'prob_door_2': 0.347}

But what happens if we condition our model on knowing the value of the host's choice? We get a new model with additional constraints. In Pyro we can condition a model on new data using the pyro.condition poutine.

In [7]:
host_door = 1

# Conditioned model on host's door
cond_model = pyro.condition(
    monty_hall_model, data={"host_door": torch.tensor(host_door)}
)
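
As a side note, the same conditioning can also be written inside the model itself by passing the observed value through the obs keyword of pyro.sample. The sketch below is an equivalent alternative to pyro.condition (it is not used in the rest of this notebook):

def monty_hall_model_obs(contestant_door: int, host_door: int):
    """Variant of the model with the host's door passed in as an observation."""
    probs_loc_prize = torch.tensor([1/3, 1/3, 1/3])
    prize_door = pyro.sample('prize_door', dist.Categorical(probs_loc_prize))
    host_probs = host_conditional_probs(int(prize_door), contestant_door)
    # observing the host's door constrains the model, just like pyro.condition
    pyro.sample('host_door', dist.Categorical(torch.tensor(host_probs)),
                obs=torch.tensor(host_door))
    return prize_door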

Now we only need to run an (approximate) inference algorithm to estimate the posterior distribution of the unobserved variable. We will use importance sampling, which is one of the algorithms available in the library.

In [8]:
contestant_door = 0
# the doors have to be different
assert contestant_door != host_door
# approximate inference
n_samples = 1000
traces = Importance(cond_model, num_samples=n_samples).run(contestant_door)
prize_marginal = EmpiricalMarginal(traces, ["prize_door"])
samples = [prize_marginal().detach().item() for _ in range(n_samples)]

Finally, let's plot the posterior distribution of the location of the prize.

In [9]:
def calculate_posterior_probs(samples: List[int]) -> Dict[int, float]:
    doors = [0, 1, 2]
    counter = Counter(samples)
    probs = {d: counter.get(d, 0) / len(samples) for d in doors}
    return probs


def posterior_plot(probs: Dict[int, float]):
    fig, ax = plt.subplots(1, 1)
    doors = sorted(probs.keys())
    heights = [probs[d] for d in doors]
    ax.bar(doors, heights)
    ax.set_title('Probability of the location of the prize')
    ax.set_ylabel('Probability')
    ax.set_ylim(0, 1)
    plt.xticks(doors, ('#0', '#1', '#2'))


posterior = calculate_posterior_probs(samples)
assert sum(posterior.values()) == 1
assert posterior[1] == 0
assert math.isclose(posterior[2], 2/3, abs_tol=0.03)
posterior_plot(posterior)

From the plot we can see that the probability is higher for door #2, so the contestant should switch in order to have a better chance of finding the prize. Also note that the probability for door #1 is zero, which makes sense since it is the door chosen by the host.
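
As a sanity check (this is not part of the Pyro workflow above), we can recover the same numbers exactly by enumerating the generative process and applying Bayes' rule, reusing the host_conditional_probs function defined earlier:

def exact_posterior(contestant_door: int, host_door: int) -> Dict[int, float]:
    """Exact posterior over the prize location obtained by enumeration."""
    # unnormalized weight of each door: P(prize) * P(host's choice | prize, contestant)
    weights = [
        (1 / 3) * host_conditional_probs(prize, contestant_door)[host_door]
        for prize in range(3)
    ]
    total = sum(weights)
    return {door: weight / total for door, weight in enumerate(weights)}


exact_posterior(contestant_door=0, host_door=1)
# {0: 0.333..., 1: 0.0, 2: 0.666...}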

Exploratory analysis of reported road incidents in Mexico City

In January 2019, Mexico City's government made several datasets about the city's public transport, services, and security available to the public (https://datos.cdmx.gob.mx/pages/home/). This makes it possible to explore those datasets and extract interesting insights. I decided to give it a try and analyze the road incidents dataset in order to help formulate strategies that could potentially improve the quality of life in the city. For this I considered two aspects: which regions of the city are more prone to vehicular accidents, and when these accidents occur.

Lately, I've been using holoviews and geoviews for my projects and I love how easy they make it to produce interactive plots. I can't recommend those libraries enough. I will use them throughout this blog post; if you want to learn more, check http://holoviz.org/ for a detailed tutorial.

In [2]:
import hdbscan
import numpy as np
import pandas as pd
import holoviews as hv
import geoviews as gv
from bokeh.resources import CDN
from bokeh.embed import file_html
from IPython.display import HTML

hv.extension('bokeh')

I downloaded the dataset as a CSV file (incidentes-viales-c5.csv) and used the following function to load and clean the table:

In [3]:
def load_incidents(path, keep_confirmed: bool = True) -> pd.DataFrame:
    """Load incidents data and cleans it using pandas"""
    # Load only interesting columns for the analysis
    cols = ['folio', 'fecha_creacion', 'hora_creacion',
            'dia_semana', 'codigo_cierre', 'fecha_cierre',
            'incidente_c4', 'latitud', 'longitud',
            'clas_con_f_alarma', 'tipo_entrada']
    df = pd.read_csv(path, sep=';', usecols=cols, 
                     dtype={'tipo_entrada': 'category',
                            'clas_con_f_alarma': 'category'})
    fecha_creacion_str = df.fecha_creacion.str.cat(df.hora_creacion, sep=' ')
    # in some records the years 2018 and 2019 appeared as 18 and 19
    fecha_creacion_str = fecha_creacion_str.str.replace('/19', '/2019')
    fecha_creacion_str = fecha_creacion_str.str.replace('/18', '/2018')
    fecha_cierre_str = df.fecha_cierre.str.replace('/19', '/2019')
    df = df.assign(
        dia_semana=pd.Categorical(df.dia_semana,
                                  categories=['Lunes', 'Martes', 'Miércoles',
                                              'Jueves', 'Viernes', 'Sábado', 'Domingo'],
                                  ordered=True),
        fecha_creacion=pd.to_datetime(fecha_creacion_str, format='%d/%m/%Y %H:%M:%S'),
        fecha_cierre=pd.to_datetime(fecha_cierre_str, format='%d/%m/%Y'),
    ).drop('hora_creacion', axis=1)
    df = df.assign(hora=df.fecha_creacion.dt.hour)
    # There are some records without latitude or longitude
    df = df.loc[~df.latitud.isna() & ~df.longitud.isna()]
    if keep_confirmed:
        df = df.loc[df.codigo_cierre.str.startswith('(A)') | df.codigo_cierre.str.startswith('(I)')]
    df = df.reset_index(drop=True)
    return df


df_incidents = load_incidents('../data/incidentes-viales-c5.csv')
print(df_incidents.shape)
df_incidents.head()
(517196, 11)
Out[3]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora
0 C4/150220/03416 2015-02-20 17:09:02 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 lesionado-atropellado 19.486470 -99.155960 URGENCIAS MEDICAS LLAMADA DEL 066 17
1 C4/150220/01020 2015-02-20 08:16:01 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque sin lesionados 19.339200 -99.043190 EMERGENCIA LLAMADA DEL 066 8
2 C4/150220/01357 2015-02-20 09:57:51 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque con lesionados 19.276110 -99.166080 URGENCIAS MEDICAS LLAMADA DEL 066 9
3 CH/150220/00015 2015-02-20 00:03:33 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque con lesionados 19.405648 -99.128962 URGENCIAS MEDICAS BOTÓN DE AUXILIO 0
4 C4/150220/04687 2015-02-20 20:58:35 Viernes (A) La unidad de atención a emergencias fue de... 2015-02-20 accidente-choque sin lesionados 19.501250 -99.123670 EMERGENCIA LLAMADA DEL 066 20

There are some incidents in the table that were not confirmed by the authorities or by any witnesses. I decided to get rid of these by keeping only the rows that have an (A) or (I) in the codigo_cierre column, which indicates that the report was verified. You can also check the table documentation on the website.
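
If you are curious about how much data this filter removes, a quick check (a sketch only; the exact proportions depend on your copy of the CSV) is to reload the table with keep_confirmed=False and count the closure codes:

# share of each closure code before filtering (hypothetical check)
df_all = load_incidents('../data/incidentes-viales-c5.csv', keep_confirmed=False)
df_all.codigo_cierre.str[:3].value_counts(normalize=True)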

Let's start the analysis with a time series of incidents. The plot below shows the number of verified reports per day since 2014; as you can see, the number decreased heavily during 2015 and 2016. From then onwards it has remained almost constant, with some seasonal patterns due to holiday periods and weekends. Moreover, you can see a strange spike of reports during Christmas Eve of 2014. This may be due to the combination of the shopping frenzy and bad weather that caused several jams in the streets that day.

In [4]:
def incidents_time_series(df: pd.DataFrame, width: int, height: int):
    time_series = (df.resample('D', on='fecha_creacion').folio.count()
                     .reset_index()
                     .rename(columns={'folio': 'incidents'}))
    curve_plot = hv.Curve(time_series, kdims=['fecha_creacion'], vdims=['incidents']).opts(alpha=0.6)
    scatter_plot = hv.Scatter(time_series, kdims=['fecha_creacion'], vdims=['incidents']).opts(tools=['hover'])
    vline = hv.VLine(pd.to_datetime('2016-01-01')).opts(color='red', alpha=0.4)
    time_series_plot = (curve_plot * scatter_plot * vline).opts(show_grid=True, width=width, height=height)
    time_series_plot = time_series_plot.redim.range(incidents=(0, 800))
    return time_series_plot


incidents_ts = incidents_time_series(df_incidents, 1000, 450)
HTML(file_html(hv.render(incidents_ts), CDN))
Out[4]:
Bokeh Application

For the rest of the analysis I will only consider the data corresponding to the 2016-2019 period, since within this time frame the trend seems to remain stable.

In [5]:
df_incidents = df_incidents.loc[df_incidents.fecha_creacion.dt.year >= 2016].reset_index(drop=True)
print(df_incidents.shape)
df_incidents.head()
(265152, 11)
Out[5]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora
0 BJ/160104/00916 2016-01-04 09:20:50 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.378201 -99.141777 EMERGENCIA BOTÓN DE AUXILIO 9
1 C4/160104/03382 2016-01-04 19:15:29 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.338720 -99.190420 EMERGENCIA LLAMADA DEL 066 19
2 C4/160104/03499 2016-01-04 19:42:46 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.426940 -99.092650 EMERGENCIA LLAMADA DEL 066 19
3 C4/160104/03750 2016-01-04 20:34:41 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.356260 -99.134500 EMERGENCIA LLAMADA DEL 066 20
4 C4/160104/02601 2016-01-04 16:30:52 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.295830 -99.227230 EMERGENCIA LLAMADA DEL 066 16

The following plot shows the number of incidents per type (incidente_c4). As we can see, most of the incidents in this dataset are car crashes without people injured. However, the values of this column also need to be cleaned, since some categories are repeated with slightly different spellings (a possible normalization is sketched after the plot).

In [6]:
def incidents_by_type(df, width: int, height: int):
    df_types = (df.groupby('incidente_c4', as_index=False).folio.count()
                  .rename(columns={'folio': 'incidents'})
                  .sort_values('incidents'))
    bars = hv.Bars(df_types, kdims=['incidente_c4'], vdims=['incidents'])
    bars = bars.opts(width=width, height=height, invert_axes=True)
    return bars

incidents_types = incidents_by_type(df_incidents, 900, 400)
HTML(file_html(hv.render(incidents_types), CDN))
Out[6]:
Bokeh Application
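
As mentioned above, some incidente_c4 values are the same category written in slightly different ways. A minimal normalization sketch (hypothetical, and not applied in the rest of this analysis) could collapse those duplicates before counting:

# hypothetical normalization of the incident type labels
normalized_types = (df_incidents.incidente_c4
                    .str.lower()
                    .str.strip()
                    .str.replace(r'\s+', ' ', regex=True))
normalized_types.value_counts().head(10)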

Now let's explore when incidents occur. For this we can use a heatmap: in the following plot the x axis represents the hour of the day and the y axis the day of the week. We can see that most of the accidents occur during Friday nights; a possible cause may be people driving under the influence of alcohol. Also, there is a clear intensity pattern during the mornings and afternoons of working days, probably related to the number of people commuting to work in the morning and returning home in the afternoon.

In [7]:
def incidents_heatmap(df: pd.DataFrame, width: int, height: int):
    """Returns a heatmap where the x axis corresponds to the hour,
    the y axis to the day of the week and the color represents the
    number of incidentes"""
    df = (df.groupby(['dia_semana', 'hora'], as_index=False).folio.count()
            .rename(columns={'folio': 'incidents'})
            .sort_values(['dia_semana', 'hora']))
    heatmap = hv.HeatMap(df, kdims=['hora', 'dia_semana'], vdims=['incidents'])
    heatmap = heatmap.opts(cmap='viridis', width=width, height=height, colorbar=True).opts(active_tools=['reset'])
    labels = hv.Labels(heatmap).opts(text_font_size='9pt')
    heatmap = (heatmap * labels)
    return heatmap


incidents_hm = incidents_heatmap(df_incidents, 1000, 400)
HTML(file_html(hv.render(incidents_hm), CDN))
Out[7]:
Bokeh Application

Another useful plot is the spatial distribution of the reports (a bivariate histogram). This allows us to observe which regions of the city have the most accidents. Using geoviews it is easy to overlay that distribution on top of a map, as shown below.

We can observe that the majority of incidents happen near downtown, where most of the companies and businesses are located. The high density of people and cars around that area creates a more chaotic environment, which makes accidents more probable.

In [8]:
def hextile_map(df: pd.DataFrame, width: int, height: int):
    hextiles = gv.HexTiles(df, kdims=['longitud', 'latitud'])
    hextiles = hextiles.opts(alpha=0.5, colorbar=True, tools=['hover'])
    tiles = gv.tile_sources.Wikipedia.opts(width=width, height=height, 
                                           xaxis=None, yaxis=None)
    accidents_map = hextiles * tiles
    return accidents_map


incidents_map = hextile_map(df_incidents, 1000, 700)
HTML(file_html(hv.render(incidents_map), CDN))
Out[8]:
Bokeh Application

Finally, I want to determine which places in the city (street crossings, avenues or highways) have a high concentration of incidents; that is, which parts of the city would benefit most from extra attention from the local authorities in order to decrease the total number of accidents. To achieve this I will run a clustering algorithm called HDBSCAN on the coordinates of the incidents. HDBSCAN, unlike other clustering algorithms, doesn't require you to set the number of clusters; the algorithm estimates this from the data itself. If you want to learn more about HDBSCAN, I highly recommend watching this talk from PyData New York 2018, where the presenter explains the logic behind it and some of the implementation details. It is a really cool talk.

In [9]:
def calculate_clusters(df: pd.DataFrame, min_cluster_size: int = 15):
    """Calculate clusters using the coordinates of the incidents"""
    # Haversine distance expects latitude, longitude in radians
    coords_radians = np.deg2rad(df.loc[:, ['latitud', 'longitud']].to_numpy())
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                core_dist_n_jobs=1,
                                metric='haversine',
                                approx_min_span_tree=False)
    clusterer.fit(coords_radians)
    return clusterer


clusterer = calculate_clusters(df_incidents)
df_incidents = df_incidents.assign(cluster_number=clusterer.labels_)
print(df_incidents.shape)
df_incidents.head(3)
(265152, 12)
Out[9]:
folio fecha_creacion dia_semana codigo_cierre fecha_cierre incidente_c4 latitud longitud clas_con_f_alarma tipo_entrada hora cluster_number
0 BJ/160104/00916 2016-01-04 09:20:50 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.378201 -99.141777 EMERGENCIA BOTÓN DE AUXILIO 9 6049
1 C4/160104/03382 2016-01-04 19:15:29 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.338720 -99.190420 EMERGENCIA LLAMADA DEL 066 19 814
2 C4/160104/03499 2016-01-04 19:42:46 Lunes (A) La unidad de atención a emergencias fue de... 2016-01-04 accidente-choque sin lesionados 19.426940 -99.092650 EMERGENCIA LLAMADA DEL 066 19 4048

Once we have every coordinate assigned to a cluster, or marked as noise, we can group the points by their assigned cluster to find the biggest ones. For this analysis I will only keep the one hundred largest clusters. In order to plot them on a map, we take the average of the coordinates of each cluster as its centroid and then scale its size and color according to the number of its members.

In [10]:
def get_biggest_clusters(df, n_clusters: int = 100) -> pd.DataFrame:
    """Returns a dataframe with the clusters with higher number of points"""
    df = (df.loc[df.cluster_number != -1]
            .groupby('cluster_number', as_index=False).folio.count()
            .rename(columns={'folio': 'points_in_cluster'})
            .nlargest(n_clusters, columns='points_in_cluster', keep='all')
            .reset_index(drop=True))
    return df


def clusters_of_incidents(df: pd.DataFrame, n_clusters: int = 100, width: int = 1000, height: int = 800):
    df_clusters = get_biggest_clusters(df, n_clusters)
    df_points = df.loc[df.cluster_number.isin(df_clusters.cluster_number), ['longitud', 'latitud', 'cluster_number']]
    df_centroids = df_points.groupby('cluster_number', as_index=False).mean()
    df_clusters = pd.merge(df_centroids, df_clusters, on='cluster_number')
    
    tiles = gv.tile_sources.Wikipedia.opts(width=width, height=height, xaxis=None, yaxis=None)
    clusters = gv.Points(df_clusters, kdims=['longitud', 'latitud'], vdims=['points_in_cluster', 'cluster_number'])
    clusters = clusters.opts(color='points_in_cluster',
                             tools=['hover'],
                             alpha=0.7,
                             colorbar=True, 
                             cmap='viridis',
                             size=gv.dim('points_in_cluster') / 5,
                             logz=True)
    clusters_on_map = tiles * clusters
    return clusters_on_map


clusters_map = clusters_of_incidents(df_incidents, n_clusters=100, width=1000, height=700)
HTML(file_html(hv.render(clusters_map), CDN))
Out[10]:
Bokeh Application

This way, by analyzing the information made available by the city, we can say that the three largest clusters of accidents within the city are:

  1. A street crossing near Central de Abasto.
  2. The roundabout between Miguel Ángel de Quevedo and Universidad.
  3. The crossing of Periférico Sur and Canal de Chalco.

With this knowledge, the local government could implement new measures, such as sending more police officers to those locations to improve the control of vehicular flow, or try different strategies that could reduce the overall number of incidents.

Acknowledgements

I want to thank José Manuel Guevara Vela (@manuel_bot) for his great feedback on this post.