Anomaly Detection - K-means
1. Regression with Python
2. Simple Linear Regression
3. Multiple Regression
4. Local Regression
5. Anomaly Detection - K-means
6. Anomaly Detection - Outliers
In [1]:
Databases and datasets are often published in web repositories. The most common way to access them is through an API; in many other cases the data can be downloaded directly from a URL.
We use the Dutch portal for the official distribution of data sources, part of the Ministry of the Interior and Kingdom Relations website, which offers many datasets in different formats for a wide range of applications.
It is important to automate the collection of the data. A good approach is to wrap the web service request in a function.
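The idea of wrapping the request in a function can be sketched as follows. This is a minimal sketch using only the standard library; the endpoint in BASE_URL, the function names build_url and get_data, and the assumption that the service returns JSON are all hypothetical and need to be adapted to the real web service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration; replace it with the
# URL of the web service you actually want to query.
BASE_URL = "https://example.org/api/datasets"


def build_url(base_url, table_id, **params):
    """Compose the request URL for one table plus optional query parameters."""
    url = "{}/{}".format(base_url.rstrip("/"), table_id)
    if params:
        # Sort the parameters so the same request always yields the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url


def get_data(table_id, base_url=BASE_URL, opener=urlopen, **params):
    """Request a table from the web service and return the decoded JSON.

    Assumes the service answers with a UTF-8 encoded JSON document.
    `opener` is injectable so the function can be tried without a network.
    """
    url = build_url(base_url, table_id, **params)
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the URL construction in its own function makes it easy to inspect the request before sending it, and the same get_data call can then be reused for every table you need.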
For convenience, we are going to focus on crime data in this example.
Deaths; murder, crime scene in The Netherlands
This table contains the number of persons who died as a result of murder or manslaughter where the crime scene is located in the Netherlands. The victims can be residents or non-residents. The data can be broken down by location of the crime, method, age, and sex. Cases are counted by date of death, so the act itself may have taken place in the previous year. The ICD-10 codes that cover murder and manslaughter are X85-Y09.
Open Data Source License CC-BY 4.0
In [2]:
The municipality names, regions, and populations are missing, so we bring them in from an additional source.
In [3]:
Explore Distribution
How are crime and the types of crime related to population?
Try .head() and .describe()
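As a quick illustration of those two methods, here is a small sketch on a toy DataFrame. The column names and values are invented for the example and are not taken from the real dataset.

```python
import pandas as pd

# Toy values invented for illustration; they do not come from the real data.
df = pd.DataFrame({
    "Population": [12000, 10500, 8900, 198000],
    "Violent_crime": [67, 19, 30, 1454],
})

# head(n) returns the first n rows, which is handy for a quick sanity check.
first_rows = df.head(3)
print(first_rows)

# describe() summarises each numeric column: count, mean, std, min,
# quartiles, and max.
summary = df.describe()
print(summary)
```

A large gap between the median and the maximum in the summary is often the first hint that a few rows dominate the distribution.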
In [4]:
Datasets Merge
In [5]:
In [6]:
CBScode | Perioden | HIC: Theft / burglary dwelling, complete | HIC: Theft / burglary dwelling, attempts | HIC: Violent Crime | HIC: Street Roof | HIC: Robberies | Undermining public order | Threat | Fire / Explosion | ... | mistreatment | Overt violence (person) | Or destruction. cause damage | Arms Trade | pickpocketing | morals Felony | Municipality | Province | Population | Population_density(p/km) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GM0003 | 2016JJ00 | 24.0 | 12.0 | 67.0 | 0.0 | 1.0 | 3.0 | 27.0 | 0.0 | ... | 40.0 | 0.0 | 54.0 | 1.0 | 3.0 | 3.0 | Appingedam | Groningen | 12049.0 | 507.0 |
1 | GM0005 | 2016JJ00 | 10.0 | 2.0 | 19.0 | 0.0 | 0.0 | 0.0 | 5.0 | 1.0 | ... | 14.0 | 0.0 | 44.0 | 0.0 | 1.0 | 6.0 | Bedum | Groningen | 10475.0 | 236.0 |
2 | GM0007 | 2016JJ00 | 13.0 | 4.0 | 30.0 | 0.0 | 0.0 | 3.0 | 12.0 | 3.0 | ... | 18.0 | 0.0 | 35.0 | 1.0 | 0.0 | 0.0 | Bellingwedde | Groningen | 8908.0 | 86.0 |
3 | GM0009 | 2016JJ00 | 7.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 5.0 | 0.0 | 24.0 | 0.0 | 1.0 | 0.0 | Boer, TenTen Boer | Groningen | 7465.0 | 165.0 |
4 | GM0010 | 2016JJ00 | 49.0 | 27.0 | 158.0 | 1.0 | 1.0 | 6.0 | 49.0 | 9.0 | ... | 100.0 | 9.0 | 151.0 | 7.0 | 9.0 | 18.0 | Delfzijl | Groningen | 25686.0 | 198.0 |
5 | GM0014 | 2016JJ00 | 696.0 | 181.0 | 1454.0 | 46.0 | 16.0 | 74.0 | 502.0 | 52.0 | ... | 906.0 | 46.0 | 1506.0 | 50.0 | 574.0 | 171.0 | Groningen | Groningen | 198108.0 | 2474.0 |
6 | GM0015 | 2016JJ00 | 12.0 | 2.0 | 18.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | ... | 12.0 | 2.0 | 34.0 | 0.0 | 0.0 | 2.0 | Grootegast | Groningen | 12193.0 | 141.0 |
7 | GM0017 | 2016JJ00 | 63.0 | 20.0 | 42.0 | 1.0 | 0.0 | 1.0 | 24.0 | 3.0 | ... | 18.0 | 0.0 | 70.0 | 1.0 | 8.0 | 5.0 | Haren | Groningen | 18790.0 | 405.0 |
8 | GM0018 | 2016JJ00 | 72.0 | 21.0 | 163.0 | 1.0 | 3.0 | 4.0 | 63.0 | 6.0 | ... | 98.0 | 2.0 | 220.0 | 3.0 | 9.0 | 30.0 | Hoogezand-Sappemeer | Groningen | 34360.0 | 521.0 |
9 | GM0022 | 2016JJ00 | 33.0 | 5.0 | 41.0 | 0.0 | 2.0 | 0.0 | 15.0 | 3.0 | ... | 23.0 | 3.0 | 91.0 | 2.0 | 5.0 | 11.0 | Leek | Groningen | 19607.0 | 307.0 |
10 rows × 27 columns
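The merged table above combines crime counts with municipality, province, and population columns. A minimal sketch of such a merge with pandas is shown here, assuming the two sources share the CBScode key. The frames df_crime and df_pop and their toy rows are hypothetical; only the column names and the name df_crime_pop come from the notebook output.

```python
import pandas as pd

# Toy frames for illustration; "GM9999" is a made-up code with no match.
df_crime = pd.DataFrame({
    "CBScode": ["GM0003", "GM0005", "GM9999"],
    "HIC: Violent Crime": [67.0, 19.0, 5.0],
})
df_pop = pd.DataFrame({
    "CBScode": ["GM0003", "GM0005"],
    "Municipality": ["Appingedam", "Bedum"],
    "Population": [12049.0, 10475.0],
})

# A left join keeps every crime row; indicator=True adds a "_merge" column
# that records whether a matching population row was found.
df_crime_pop = pd.merge(df_crime, df_pop, on="CBScode", how="left", indicator=True)

# Codes that found no match in the population table deserve a manual check.
unmatched = df_crime_pop.loc[df_crime_pop["_merge"] == "left_only", "CBScode"].tolist()
print(unmatched)
```

Checking the unmatched codes before dropping the indicator column helps catch municipalities that were renamed or merged between the two sources.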
In [7]:
FileNotFoundError: [Errno 2] No such file or directory: ''
The call df_crime_pop.to_csv('', index=True) fails because the output path is an empty string. Pass a valid file path to save the merged table.
In [None]:
Next, let’s create a scatterplot matrix. A scatterplot matrix shows the distribution of each column along the diagonal and a scatterplot for each pair of variables in the off-diagonal cells. It is an efficient tool for spotting errors in our data.
We can even have the plotting package color each entry by its class to look for trends within the classes.
In [None]:
Correlation Violent crimes vs Population Density
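One way to quantify this relationship is the Pearson correlation coefficient. The sketch below computes it in plain Python for the ten rows displayed in the merged table above (the "HIC: Violent Crime" and "Population_density(p/km)" columns). The function name pearson is an assumption, and ten rows from a single province are far too few to draw conclusions from; a single large city such as Groningen can dominate the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the ten rows shown in the merged table.
violent_crime = [67, 19, 30, 9, 158, 1454, 18, 42, 163, 41]
population_density = [507, 236, 86, 165, 198, 2474, 141, 405, 521, 307]

r = pearson(violent_crime, population_density)
print("Pearson r = {:.3f}".format(r))
```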
In [None]:
In [None]:
In [None]:
K-means Method
The most basic usage of K-means clustering requires only a choice for the number of clusters, $K$. We rarely know the correct number of clusters a priori, but the following simple heuristic sometimes works well:
$$K \approx \sqrt{n/2}$$
where $n$ is the number of rows in your dataset. By default, the maximum number of iterations is 10, and all features in the input dataset are used.
In [None]:
Write the formula for k
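A rule of thumb often quoted for choosing the number of clusters is K ≈ sqrt(n / 2), where n is the number of rows. Assuming that is the heuristic meant here, a minimal sketch looks like this; the function name suggest_num_clusters and the rounding choice are assumptions.

```python
import math


def suggest_num_clusters(n_rows):
    """Rule-of-thumb number of clusters, K ~ sqrt(n / 2), never below 1."""
    if n_rows < 1:
        raise ValueError("n_rows must be a positive integer")
    # Round to the nearest integer because K must be a whole number.
    return max(1, int(round(math.sqrt(n_rows / 2))))


print(suggest_num_clusters(200))
```

In practice you would pass the number of rows of your own table, for example len(df_crime_pop), and treat the result as a starting point rather than a final answer.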
In [None]:
In [None]:
In [None]:
The model summary shows the usual fields about model schema, training time, and training iterations. It also shows that the K-means results are returned in two SFrames contained in the model: cluster_id and cluster_info. The cluster_info SFrame indicates the final cluster centers, one per row, in terms of the same features used to create the model.
The last three columns of the cluster_info SFrame indicate metadata about the corresponding cluster: ID number, number of points in the cluster, and the within-cluster sum of squared distances to the center.
In [None]:
The cluster_id field of the model shows the cluster assignment for each input data point, along with the Euclidean distance from the point to its assigned cluster’s center.
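The model code itself is not shown here. As a rough stand-in, the sketch below runs plain Lloyd's K-means in pure Python and returns two lists that mirror the cluster_id and cluster_info results described above. The function name kmeans, the dictionary keys, and the toy points are assumptions made for illustration and are not the library's API.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def kmeans(points, k, max_iterations=10, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k points drawn at random from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the center to the mean of its members.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old center if the cluster ended up empty.
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers

    # Per-point result: assigned cluster and distance to its center.
    cluster_id = []
    for i, p in enumerate(points):
        c = nearest(p, centers)
        cluster_id.append({"row_id": i, "cluster_id": c,
                           "distance": math.dist(p, centers[c])})

    # Per-cluster result: center, size, within-cluster sum of squares.
    cluster_info = []
    for c in range(k):
        dists = [r["distance"] for r in cluster_id if r["cluster_id"] == c]
        cluster_info.append({"center": centers[c], "cluster_id": c,
                             "size": len(dists),
                             "sum_squared_distance": sum(d * d for d in dists)})
    return cluster_id, cluster_info


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cluster_id, cluster_info = kmeans(points, k=2)
for row in cluster_info:
    print(row)
```

On real data the features should usually be put on a comparable scale first, since Euclidean distance lets the columns with the largest numbers dominate.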
In [None]:
Which points are anomalous?
What do the points within anomalous clusters have in common?
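The notebook does not define what counts as anomalous, so the sketch below shows one possible reading: a point is flagged if it belongs to a very small cluster or if its distance to its cluster center is unusually large. The function name flag_anomalies, both thresholds, the row layout, and the toy distances are assumptions for illustration only.

```python
import statistics
from collections import Counter


def flag_anomalies(rows, min_cluster_size=2, z_threshold=2.0):
    """Return row ids that sit in tiny clusters or far from their center.

    `rows` is a list of dicts with the keys "row_id", "cluster_id", and
    "distance", i.e. one entry per data point.
    """
    sizes = Counter(r["cluster_id"] for r in rows)
    distances = [r["distance"] for r in rows]
    mean = statistics.mean(distances)
    std = statistics.pstdev(distances)

    flagged = []
    for r in rows:
        in_small_cluster = sizes[r["cluster_id"]] < min_cluster_size
        # z-score of the distance; skipped when all distances are equal.
        far_from_center = std > 0 and (r["distance"] - mean) / std > z_threshold
        if in_small_cluster or far_from_center:
            flagged.append(r["row_id"])
    return flagged


# Toy assignments invented for the example.
rows = [
    {"row_id": 0, "cluster_id": 0, "distance": 1.0},
    {"row_id": 1, "cluster_id": 0, "distance": 1.1},
    {"row_id": 2, "cluster_id": 0, "distance": 0.9},
    {"row_id": 3, "cluster_id": 0, "distance": 1.0},
    {"row_id": 4, "cluster_id": 0, "distance": 1.2},
    {"row_id": 5, "cluster_id": 0, "distance": 9.0},
    {"row_id": 6, "cluster_id": 1, "distance": 0.0},
]
anomalous_ids = flag_anomalies(rows)
print(anomalous_ids)
```

Once the flagged rows are known, joining them back to the original table by row id lets you compare their province, population, and crime counts to see what they share.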