Anomaly Detection - Kmeans

In [1]:

import turicreate as tc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

Databases and datasets are often published in web repositories. The most common way to access them is through an API; in many other cases, the data can be downloaded directly from a URL.

We use the Dutch portal for the official distribution of data sources, part of the Ministry of the Interior and Kingdom Relations website, which offers a large number of datasets in different formats for many different applications.

It is important to automate the collection of the data. A good method is creating a function to make the web service request, as shown below.
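A minimal sketch of such a function, using only the standard library, could look like the following. The name `fetch_table` and its parameters are hypothetical and not part of the original notebook; it assumes the URL serves a delimited text file with a header row.

```python
import csv
import io
from urllib.request import urlopen


def fetch_table(url, delimiter=",", encoding="utf-8", timeout=30):
    """Download a delimited text file and return its rows as a list of dicts.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode(encoding)
    # csv.DictReader takes the column names from the first line of the file.
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
```

Every value comes back as a string, so numeric columns still need converting. In this notebook, pd.read_csv accepts a URL directly and handles the download, the parsing, and the type conversion in one call.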

For convenience, we are going to focus on crime data in this example.

Deaths; murder, crime scene in The Netherlands

This table contains the number of persons who died as a result of murder or manslaughter where the crime scene is located in the Netherlands. The victims can be residents or non-residents. The data can be split by location of the crime, method, age and sex. The date of death is the criterion; the act itself may have taken place in the previous year. The ICD-10 codes that belong to murder and manslaughter are X85-Y09.

Open Data Source License CC-BY 4.0

In [2]:

df_crimes = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/crimes-netherlands.csv', sep=',')

The municipality names, regions, and populations are missing, so we need an additional source.

In [3]:

# Sometimes you just need a little bit of imagination: a list of municipalities
df_population = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/population-netherlands.tsv', sep='\t')

Explore Distribution

How are crime and the types of crime related to the population?

Try .head() and .describe().

In [4]:

plt.figure(figsize=[16, 8])
sns.kdeplot(df_population['Population'], shade=True, color="r", label='Population')
plt.title('Distribution of Population The Netherlands'); plt.legend()

[Figure: kernel density estimate of Population, titled "Distribution of Population The Netherlands"]

Datasets Merge

In [5]:

# Merge the crimes table and population table
df_crime_pop = pd.merge(df_crimes, df_population, on='CBScode', how='left')
df_crime_pop = df_crime_pop.replace([np.inf, -np.inf], np.nan)
df_crime_pop.fillna(0, inplace=True)
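Because the left join above is followed by fillna(0), any municipality without a match in the population table ends up with a population of 0. A small sketch of how to check for such rows before filling, using toy stand-ins for the two tables (the code GM9999 is made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_crimes and df_population; GM9999 is a made-up code.
crimes = pd.DataFrame({"CBScode": ["GM0003", "GM0005", "GM9999"],
                       "Arms Trade": [1.0, 0.0, 2.0]})
population = pd.DataFrame({"CBScode": ["GM0003", "GM0005"],
                           "Population": [12049.0, 10475.0]})

# indicator=True adds a "_merge" column; in a left join it is either
# "both" (key found in both tables) or "left_only" (no match on the right).
merged = pd.merge(crimes, population, on="CBScode", how="left", indicator=True)

# Keys from the crimes table that found no population row.
unmatched = merged.loc[merged["_merge"] == "left_only", "CBScode"].tolist()
print(unmatched)
```

If the list is not empty, it may be better to drop or repair those rows than to treat their missing population as zero.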

In [6]:

df_crime_pop.sort_values('CBScode').head(10)

| | CBScode | Perioden | HIC: Theft / burglary dwelling, complete | HIC: Theft / burglary dwelling, attempts | HIC: Violent Crime | HIC: Street Roof | HIC: Robberies | Undermining public order | Threat | Fire / Explosion | ... | mistreatment | Overt violence (person) | Or destruction. cause damage | Arms Trade | pickpocketing | morals Felony | Municipality | Province | Population | Population_density(p/km) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GM0003 | 2016JJ00 | 24.0 | 12.0 | 67.0 | 0.0 | 1.0 | 3.0 | 27.0 | 0.0 | ... | 40.0 | 0.0 | 54.0 | 1.0 | 3.0 | 3.0 | Appingedam | Groningen | 12049.0 | 507.0 |
| 1 | GM0005 | 2016JJ00 | 10.0 | 2.0 | 19.0 | 0.0 | 0.0 | 0.0 | 5.0 | 1.0 | ... | 14.0 | 0.0 | 44.0 | 0.0 | 1.0 | 6.0 | Bedum | Groningen | 10475.0 | 236.0 |
| 2 | GM0007 | 2016JJ00 | 13.0 | 4.0 | 30.0 | 0.0 | 0.0 | 3.0 | 12.0 | 3.0 | ... | 18.0 | 0.0 | 35.0 | 1.0 | 0.0 | 0.0 | Bellingwedde | Groningen | 8908.0 | 86.0 |
| 3 | GM0009 | 2016JJ00 | 7.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 5.0 | 0.0 | 24.0 | 0.0 | 1.0 | 0.0 | Boer, TenTen Boer | Groningen | 7465.0 | 165.0 |
| 4 | GM0010 | 2016JJ00 | 49.0 | 27.0 | 158.0 | 1.0 | 1.0 | 6.0 | 49.0 | 9.0 | ... | 100.0 | 9.0 | 151.0 | 7.0 | 9.0 | 18.0 | Delfzijl | Groningen | 25686.0 | 198.0 |
| 5 | GM0014 | 2016JJ00 | 696.0 | 181.0 | 1454.0 | 46.0 | 16.0 | 74.0 | 502.0 | 52.0 | ... | 906.0 | 46.0 | 1506.0 | 50.0 | 574.0 | 171.0 | Groningen | Groningen | 198108.0 | 2474.0 |
| 6 | GM0015 | 2016JJ00 | 12.0 | 2.0 | 18.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | ... | 12.0 | 2.0 | 34.0 | 0.0 | 0.0 | 2.0 | Grootegast | Groningen | 12193.0 | 141.0 |
| 7 | GM0017 | 2016JJ00 | 63.0 | 20.0 | 42.0 | 1.0 | 0.0 | 1.0 | 24.0 | 3.0 | ... | 18.0 | 0.0 | 70.0 | 1.0 | 8.0 | 5.0 | Haren | Groningen | 18790.0 | 405.0 |
| 8 | GM0018 | 2016JJ00 | 72.0 | 21.0 | 163.0 | 1.0 | 3.0 | 4.0 | 63.0 | 6.0 | ... | 98.0 | 2.0 | 220.0 | 3.0 | 9.0 | 30.0 | Hoogezand-Sappemeer | Groningen | 34360.0 | 521.0 |
| 9 | GM0022 | 2016JJ00 | 33.0 | 5.0 | 41.0 | 0.0 | 2.0 | 0.0 | 15.0 | 3.0 | ... | 23.0 | 3.0 | 91.0 | 2.0 | 5.0 | 11.0 | Leek | Groningen | 19607.0 | 307.0 |

10 rows × 27 columns

In [7]:

# 'crime_population.csv' is a placeholder output path; the original call passed an empty string, which raises FileNotFoundError
df_crime_pop.to_csv('crime_population.csv', index=True)

In [None]:

df_crime_pop.describe()

Next, let’s create a scatterplot matrix. A scatterplot matrix plots the distribution of each column along the diagonal and a scatterplot for every pair of variables off the diagonal. It is an efficient tool for spotting errors in our data.

We can even have the plotting package color each entry by its class to look for trends within the classes.

In [None]:

df_plot = df_crime_pop[['HIC: Violent Crime','HIC: Street Roof', 'HIC: Robberies', 'Population_density(p/km)', 'Province']]
# pairplot builds its own figure and legend, so set the title on that figure
g = sns.pairplot(df_plot, hue="Province")
g.fig.suptitle('Correlation of High Impact Crime', y=1.02)

Correlation Violent Crime vs Arms Trade

In [None]:

df_plot = df_crime_pop[['HIC: Violent Crime','Arms Trade']]

In [None]:

df_plot = df_plot[(df_plot['Arms Trade'] < 10 ) & (df_plot['HIC: Violent Crime'] < 200) ]

In [None]:

plt.figure(figsize=[14, 7])
sns.kdeplot(df_plot['Arms Trade'], df_plot['HIC: Violent Crime'], cmap="Reds", shade=True, shade_lowest=False)
plt.title('Relation')

K-means Method

The most basic usage of K-means clustering requires only a choice for the number of clusters, $K$. We rarely know the correct number of clusters a priori, but the following simple heuristic sometimes works well:

$$K \approx \sqrt{n/2}$$

where $n$ is the number of rows in your dataset. By default, the maximum number of iterations is 10, and all features in the input dataset are used.

In [None]:

# Keep only the numeric feature columns, leaving out the raw population count
sf = tc.SFrame(data=df_crime_pop.drop(columns='Population').select_dtypes(include=[np.number]))

Write the formula for k
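One possible answer is sketched below, assuming the commonly cited rule of thumb that K is roughly the square root of n/2, with n the number of rows. The function name `heuristic_k`, the rounding, and the lower bound of 1 are assumptions made for this sketch, and the row count 200 is made up.

```python
import math


def heuristic_k(n_rows):
    """Rule-of-thumb number of clusters: sqrt(n / 2), rounded, at least 1.

    Hypothetical helper; the rounding and lower bound are assumptions.
    """
    return max(1, int(round(math.sqrt(n_rows / 2.0))))


# 200 is a made-up row count used only for illustration.
K = heuristic_k(200)
print('Number of Clusters K: {}'.format(K))
```

In the notebook, the row count would come from the SFrame itself, for example K = heuristic_k(sf.num_rows()).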

In [None]:

print('Number of Clusters K: {}'.format(K))

In [None]:

kmeans_model = tc.kmeans.create(sf, num_clusters=K)

In [None]:

kmeans_model.summary()

The model summary shows the usual fields about model schema, training time, and training iterations. It also shows that the K-means results are returned in two SFrames contained in the model: cluster_id and cluster_info. The cluster_info SFrame indicates the final cluster centers, one per row, in terms of the same features used to create the model.

The last three columns of the cluster_info SFrame indicate metadata about the corresponding cluster: ID number, number of points in the cluster, and the within-cluster sum of squared distances to the center.
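To make the size and sum-of-squared-distances columns concrete, here is a minimal sketch that computes both by hand for a toy two-dimensional example. The points and centers are made up; this only illustrates the definitions and is not how Turi Create computes them internally.

```python
# Toy 2-D data (made up): each point is paired with its assigned cluster id.
centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
points = [((1.0, 0.0), 0), ((0.0, 2.0), 0), ((10.0, 13.0), 1)]

summary = {cid: {"size": 0, "sum_squared_distance": 0.0} for cid in centers}
for point, cid in points:
    # Squared Euclidean distance from the point to its cluster center.
    squared = sum((p - c) ** 2 for p, c in zip(point, centers[cid]))
    summary[cid]["size"] += 1
    summary[cid]["sum_squared_distance"] += squared

for cid, stats in sorted(summary.items()):
    print(cid, stats["size"], stats["sum_squared_distance"])
```

Cluster 0 holds two points at squared distances 1 and 4 from its center, so its sum is 5; cluster 1 holds a single point at squared distance 9. A cluster with very few points, or with a large sum relative to its size, is a natural place to look for anomalies.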

In [None]:

kmeans_model.cluster_info[['cluster_id', 'size', 'sum_squared_distance']].print_rows(num_rows=14, num_columns=3)

The cluster_id field of the model shows the cluster assignment for each input data point, along with the Euclidean distance from the point to its assigned cluster’s center.

In [None]:

clusters = kmeans_model.cluster_id

Which are the anomalous points?

What do they have in common within anomalous clusters?
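One way to start answering these questions is sketched below. It flags a point as anomalous if it sits in a very small cluster or lies unusually far from its cluster center. Both criteria and their thresholds (a minimum cluster size of 2 and three times the median distance) are arbitrary assumptions for illustration, and the table is a toy stand-in with the same row_id, cluster_id, and distance columns as the cluster_id SFrame.

```python
import pandas as pd

# Toy stand-in for the cluster assignments (values are made up).
clusters = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4, 5],
    "cluster_id": [0, 0, 0, 0, 1, 0],
    "distance": [1.0, 1.2, 0.9, 1.1, 0.0, 9.0],
})

# Criterion 1: the point belongs to a cluster with very few members.
min_cluster_size = 2  # assumed threshold
cluster_sizes = clusters.groupby("cluster_id")["row_id"].transform("count")
in_small_cluster = cluster_sizes < min_cluster_size

# Criterion 2: the point is far from its center compared with a typical point.
distance_cutoff = 3 * clusters["distance"].median()  # assumed threshold
far_from_center = clusters["distance"] > distance_cutoff

anomalies = clusters[in_small_cluster | far_from_center]
print(anomalies)
```

In the notebook, the real assignments could be converted with kmeans_model.cluster_id.to_dataframe() and the flagged row_id values joined back to df_crime_pop to see which municipalities they are and what they have in common.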

Written on August 16, 2018