Anomaly Detection - Kmeans

1. Regression with Python

2. Simple Linear Regression

3. Multiple Regression

4. Local Regression

5. Anomaly Detection - K means

6. Anomaly Detection - Outliers

In [1]:

import turicreate as tc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

Sometimes is normal to find databases or datasets in web repositories, in general, the most common way to access is via API, in many other cases, the data can be accessed through the link URL.

We utilize the Dutch portal for official distribution of data sources, part of the Ministry of the Interior and Kingdom Relations web page, where we can find tons of datasets in different formats, for many distinct applications.

It is important to automate the collection of the data. A good method is creating a function to make the web service request, as shown below.

For convenience, we are going to focus on crime data in this example.

Deaths; murder, crime scene in The Netherlands

This table contains the number of persons died as a result of murder or manslaughter, where the crime scene is located in the Netherlands. The victims can be residents or non-residents. The data can be split by location of the crime, method, age and sex. The date of death is the criterion, the date of the act can be in the previous year. The ICD10 codes that belong to murder and manslaughter are X85-Y09.

Open Data Source License CC-BY 4.0

In [2]:

df_crimes = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/crimes-netherlands.csv', sep=',')

The municipalities names, region and population are missing, we might go for an additional source

In [3]:

# Sometime you just need a little bit of imagination, List of municipalities
df_population = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/population-netherlands.tsv', sep='\t')

Explore Distribution

How is related the crime and the types of crime with the population?

Try .head() .describe()

In [4]:

plt.figure(figsize=[16, 8])
sns.kdeplot(df_population['Population'], shade=True, color="r", label='Population')
plt.title('Distribution of Population The Netherlands'); plt.legend()

/Users/pedrohserrano/anaconda3/envs/py35r/lib/python3.5/site-packages/statsmodels/nonparametric/kde.py:475: DeprecationWarning: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
  grid,delta = np.linspace(a,b,gridsize,retstep=True)

<matplotlib.legend.Legend at 0x1a11280d68>

png

Datasets Merge

In [5]:

# Merge the crimes table and population table
df_crime_pop = pd.merge(df_crimes, df_population, on='CBScode', how='left')
df_crime_pop = df_crime_pop.replace([np.inf, -np.inf], np.nan)
df_crime_pop.fillna(0, inplace=True)

In [6]:

df_crime_pop.sort_values('CBScode').head(10)

	CBScode	Perioden	HIC: Theft / burglary dwelling, complete	HIC: Theft / burglary dwelling, attempts	HIC: Violent Crime	HIC: Street Roof	HIC: Robberies	Undermining public order	Threat	Fire / Explosion	...	mistreatment	Overt violence (person)	Or destruction. cause damage	Arms Trade	pickpocketing	morals Felony	Municipality	Province	Population	Population_density(p/km)
0	GM0003	2016JJ00	24.0	12.0	67.0	0.0	1.0	3.0	27.0	0.0	...	40.0	0.0	54.0	1.0	3.0	3.0	Appingedam	Groningen	12049.0	507.0
1	GM0005	2016JJ00	10.0	2.0	19.0	0.0	0.0	0.0	5.0	1.0	...	14.0	0.0	44.0	0.0	1.0	6.0	Bedum	Groningen	10475.0	236.0
2	GM0007	2016JJ00	13.0	4.0	30.0	0.0	0.0	3.0	12.0	3.0	...	18.0	0.0	35.0	1.0	0.0	0.0	Bellingwedde	Groningen	8908.0	86.0
3	GM0009	2016JJ00	7.0	0.0	9.0	0.0	0.0	0.0	4.0	0.0	...	5.0	0.0	24.0	0.0	1.0	0.0	Boer, TenTen Boer	Groningen	7465.0	165.0
4	GM0010	2016JJ00	49.0	27.0	158.0	1.0	1.0	6.0	49.0	9.0	...	100.0	9.0	151.0	7.0	9.0	18.0	Delfzijl	Groningen	25686.0	198.0
5	GM0014	2016JJ00	696.0	181.0	1454.0	46.0	16.0	74.0	502.0	52.0	...	906.0	46.0	1506.0	50.0	574.0	171.0	Groningen	Groningen	198108.0	2474.0
6	GM0015	2016JJ00	12.0	2.0	18.0	0.0	0.0	0.0	4.0	3.0	...	12.0	2.0	34.0	0.0	0.0	2.0	Grootegast	Groningen	12193.0	141.0
7	GM0017	2016JJ00	63.0	20.0	42.0	1.0	0.0	1.0	24.0	3.0	...	18.0	0.0	70.0	1.0	8.0	5.0	Haren	Groningen	18790.0	405.0
8	GM0018	2016JJ00	72.0	21.0	163.0	1.0	3.0	4.0	63.0	6.0	...	98.0	2.0	220.0	3.0	9.0	30.0	Hoogezand-Sappemeer	Groningen	34360.0	521.0
9	GM0022	2016JJ00	33.0	5.0	41.0	0.0	2.0	0.0	15.0	3.0	...	23.0	3.0	91.0	2.0	5.0	11.0	Leek	Groningen	19607.0	307.0

10 rows × 27 columns

In [7]:

df_crime_pop.to_csv('', index=True)

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-7-1a695f49106f> in <module>()
----> 1 df_crime_pop.to_csv('', index=True)


~/anaconda3/envs/py35r/lib/python3.5/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1522                                      doublequote=doublequote,
   1523                                      escapechar=escapechar, decimal=decimal)
-> 1524         formatter.save()
   1525 
   1526         if path_or_buf is None:


~/anaconda3/envs/py35r/lib/python3.5/site-packages/pandas/io/formats/format.py in save(self)
   1635             f, handles = _get_handle(self.path_or_buf, self.mode,
   1636                                      encoding=encoding,
-> 1637                                      compression=self.compression)
   1638             close = True
   1639 


~/anaconda3/envs/py35r/lib/python3.5/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    388         elif encoding:
    389             # Python 3 and encoding
--> 390             f = open(path_or_buf, mode, encoding=encoding)
    391         elif is_text:
    392             # Python 3 and no explicit encoding


FileNotFoundError: [Errno 2] No such file or directory: ''

In [None]:

df_crime_pop.describe()

Next, let’s create a scatterplot matrix. Scatterplot matrices plot the distribution of each column along the diagonal, and then plot a scatterplot matrix for the combination of each variable. They make for an efficient tool to look for errors in our data.

We can even have the plotting package color each entry by its class to look for trends within the classes.

In [None]:

df_plot = df_crime_pop[['HIC: Violent Crime','HIC: Street Roof', 'HIC: Robberies', 'Population_density(p/km)', 'Province']]
sns.pairplot(df_plot, hue="Province")
plt.title('Correlation of High Impact Crime')
plt.legend()

Correlation Violent crimes vs Population Density

In [None]:

df_plot = df_crime_pop[['HIC: Violent Crime','Arms Trade']]

In [None]:

df_plot = df_plot[(df_plot['Arms Trade'] < 10 ) & (df_plot['HIC: Violent Crime'] < 200) ]

In [None]:

plt.figure(figsize=[14, 7])
sns.kdeplot(df_plot['Arms Trade'], df_plot['HIC: Violent Crime'], cmap="Reds", shade=True, shade_lowest=False)
plt.title('Relation')

K-means Method

The most basic usage of K-means clustering requires only a choice for the number of clusters, . We rarely know the correct number of clusters a priori, but the following simple heuristic sometimes works well:

where is the number of rows in your dataset. By default, the maximum number of iterations is 10, and all features in the input dataset are used

In [None]:

sf = tc.SFrame(data=df_crime_pop.loc[:, df_crime_pop.columns != 'Population']._get_numeric_data())

Write the formula for k

In [None]:

print('Number of Clusters K: {}'.format(K))

In [None]:

kmeans_model = tc.kmeans.create(sf, num_clusters=K)

In [None]:

kmeans_model.summary

The model summary shows the usual fields about model schema, training time, and training iterations. It also shows that the K-means results are returned in two SFrames contained in the model: cluster_id and cluster_info. The cluster_info SFrame indicates the final cluster centers, one per row, in terms of the same features used to create the model.

The last three columns of the cluster_info SFrame indicate metadata about the corresponding cluster: ID number, number of points in the cluster, and the within-cluster sum of squared distances to the center.

In [None]:

kmeans_model.cluster_info[['cluster_id', 'size', 'sum_squared_distance']].print_rows(num_rows=14, num_columns=3)

The cluster_id field of the model shows the cluster assignment for each input data point, along with the Euclidean distance from the point to its assigned cluster’s center.

In [None]:

clusters = kmeans_model.cluster_id

Which are the anomalous points?

What do they have in common within anomalous clusters?

Written on August 16, 2018