Anomaly Detection - Outliers
Point anomalies in test responses
It is important to analyze all the features involved in every test a person takes. It is reasonable to assume that not every task is completed with full concentration: because the tool is an app, participants may get distracted for any number of reasons.
The main objective here is to analyze any pattern in the time, in milliseconds, that a participant spends responding to each visual task.
In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [None]:
color_ms = '#386cb0'  # blue: colour used for patients with Multiple Sclerosis (MS)
color_hc = 'red'  # red: colour used for Healthy Control (HC) participants
In [None]:
df_measures_users = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/df_measures_users.csv', encoding="utf-8")
In [None]:
df_symbols = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/df_symbols.csv', encoding="utf-8")
In [None]:
score_variable = 'correct.answers'
Population of Study
In [None]:
print('{} Data points distributed among {} Participants'.format(len(df_symbols), len(df_measures_users)))
We use the number of correct answers on each test as the score variable.
In [None]:
# Split the participants into MS (Multiple Sclerosis) and HC (Healthy Control)
df_ms = df_measures_users[df_measures_users['ms']==1]
df_hc = df_measures_users[df_measures_users['ms']==0]
In [None]:
print('Patients on MS group: {} ({}%)\nPatients on HC group: {} ({}%) '.format(
len(df_ms), round(len(df_ms)/len(df_measures_users)*100, 0),
len(df_hc), round(len(df_hc)/len(df_measures_users)*100, 0)))
In [None]:
plt.figure(figsize=[16, 8])
# Use the group colours defined above so MS and HC look the same in every plot
sns.kdeplot(df_symbols[df_symbols['ms']==1][score_variable], color=color_ms, label='MS')
sns.kdeplot(df_symbols[df_symbols['ms']==0][score_variable], color=color_hc, label='HC')
plt.legend()
plt.title('Scores Distribution MS and HC Groups')
Defining Boundary
The first thing we have to consider is the time.
So we define $\Delta_{tr}$ as the time elapsed between two trials, where a trial is the event of pressing a button to select a digit on the test.
Every person performs $n$ tests and every test has $m$ trials, so the time taken to complete each trial is $\Delta_{tr}$.
We define $T$ as the vector of response times for an event (a single test); from it we can calculate the median response time.
Finally, we calculate the standard deviation ($SD$) of the response times separately for the MS and Healthy Control groups.
Distraction Points
$T = (\Delta_{tr_1}, \Delta_{tr_2}, \ldots, \Delta_{tr_m})$, where $\Delta_{tr_i}$ is the response time on trial $i$.
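The definitions above can be sketched on a toy table. This is only an illustration: the values are hypothetical, the column names are borrowed from those used later in the notebook (`response_ms`, `response_ms_med`, `response_ms_sd`), and taking the group SD over all trial response times within each group is an assumption, since the original preprocessing is not shown.

```python
import pandas as pd

# Toy trial-level data (hypothetical values): one event per participant,
# participant 'a' in the MS group (ms=1) and 'b' in the HC group (ms=0).
trials = pd.DataFrame({
    'userId':      ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'timestamp':   [1, 1, 1, 1, 2, 2, 2, 2],
    'ms':          [1, 1, 1, 1, 0, 0, 0, 0],
    'response_ms': [1500, 1700, 1600, 4000, 1200, 1300, 1250, 1350],
})

# Median of T = (Delta_tr_1, ..., Delta_tr_m) for each event (user + timestamp).
trials['response_ms_med'] = (
    trials.groupby(['userId', 'timestamp'])['response_ms'].transform('median'))

# Standard deviation of the response times, computed separately per group
# (assumption: pooled over all trials of the group).
trials['response_ms_sd'] = trials.groupby('ms')['response_ms'].transform('std')

print(trials[['userId', 'ms', 'response_ms_med', 'response_ms_sd']].drop_duplicates())
```

Using `transform` keeps one row per trial, so the median and SD can later be compared directly with each individual response time.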
In [None]:
print('Standard Deviation of every delta per group \n MS: {} \n HC: {}'.format(
    df_symbols['response_ms_sd'].unique()[0], df_symbols['response_ms_sd'].unique()[1]))
Detecting Outliers
As an example, we take a single event from one randomly chosen participant (eobt3CosDzEtxWW5P) and plot the response time in milliseconds over the 90 seconds of the test, showing the symbol chosen on each task.
In [None]:
df_user = df_symbols[df_symbols['userId']=='eobt3CosDzEtxWW5P']
stamps = df_user['timestamp'].unique()
plt.figure(figsize=(16, 4))
df_user_ts = df_user[df_user['timestamp']==stamps[0]]
plt.scatter(df_user_ts['trial'], df_user_ts['response_ms'], color=color_ms)
plt.hlines(y=df_user_ts['sup_line'], xmin=0, xmax=max(df_user_ts['trial']), linestyles='--', color=color_ms)
plt.hlines(y=df_user_ts['inf_line'], xmin=0, xmax=max(df_user_ts['trial']), linestyles='--', color=color_ms)
plt.hlines(y=df_user_ts['response_ms_med'], xmin=0, xmax=max(df_user_ts['trial']), linestyles=':', color=color_ms)
plt.fill_between(range(len(df_user_ts['trial'])+1), df_user_ts['sup_line'].mean(), df_user_ts['inf_line'].mean(), alpha=0.2)
plt.xticks(df_user_ts['trial'], df_user_ts['symbol'], rotation=90)
grouped = df_symbols.groupby(['userId','timestamp'])['distract_points'].sum().reset_index()
d_point = grouped[(grouped['userId']=='eobt3CosDzEtxWW5P') & (grouped['timestamp']==stamps[0])]['distract_points'].values
plt.title('Event: 1 Score: '+str(max(df_user_ts['correct.answers']))+' Distraction Points: '+str(d_point[0])+' Median Milliseconds: '+str(df_user_ts['response_ms_med'].mean())+' Timestamp: '+str(stamps[0]))
plt.show()
As the plot shows, the points outside the boundaries are counted as distraction points because they lie more than 2 SD from the median response time, where the SD is the standard deviation of the participant's group.
The plot below shows the distribution of response times in milliseconds across all events of this same participant.
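A minimal sketch of this boundary rule, assuming the boundaries are the event median plus and minus 2 group SDs. The data and the `group_sd` value are hypothetical; `sup_line`, `inf_line` and `distract_points` reuse the column names from the dataset, but how those columns were originally computed is not shown in this notebook.

```python
import pandas as pd

# Toy single event (hypothetical values).
event = pd.DataFrame({
    'trial':       [1, 2, 3, 4, 5, 6],
    'response_ms': [1500, 1600, 1550, 5200, 1650, 1450],
})
group_sd = 900.0  # assumed group-level SD in milliseconds

# Boundaries: median response time of the event +/- 2 group SDs.
median_ms = event['response_ms'].median()
event['sup_line'] = median_ms + 2 * group_sd
event['inf_line'] = median_ms - 2 * group_sd

# A trial is flagged as a distraction point when it falls outside the band.
outside = ((event['response_ms'] > event['sup_line']) |
           (event['response_ms'] < event['inf_line']))
event['distract_points'] = outside.astype(int)

print(event[['trial', 'response_ms', 'distract_points']])
print('Distraction points in this event:', event['distract_points'].sum())
```

Here only trial 4 is flagged. The median is used as the centre rather than the mean, so a single very slow response does not shift the band.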
In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(1,2,1)
df_user['response_ms'].hist(bins=16)
plt.subplot(1,2,2)
sns.kdeplot(df_user['response_ms'])
Distraction Points Correlated
The first thing to check is how distraction points relate to the score each individual achieves. Since every test has both a number of distraction points and a score, we can aggregate them per group and compare whether the number of distractions differs between MS and HC participants.
In [None]:
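# numeric_only=True skips non-numeric columns such as userId (required in recent pandas versions)
df_measures_users.groupby('ms').mean(numeric_only=True)
The table below shows the average score and the average number of distractions for each group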
| Group | Average Score | Average Distractions |
|---|---|---|
| MS | 52.91 | 1.81 |
| HC | 46.22 | 1.49 |
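A table of this shape can be produced directly from per-participant measures. The sketch below uses hypothetical values, not the study data, and assumes the two relevant columns are `correct.answers` and `distract_points`.

```python
import pandas as pd

# Toy per-participant measures (hypothetical values).
measures = pd.DataFrame({
    'userId':          ['a', 'b', 'c', 'd'],
    'ms':              [1, 1, 0, 0],
    'correct.answers': [50.0, 54.0, 45.0, 47.0],
    'distract_points': [2.0, 1.5, 1.0, 2.0],
})

# Average the two columns per group and give rows and columns readable labels.
summary = (measures
           .groupby('ms')[['correct.answers', 'distract_points']]
           .mean()
           .rename(index={1: 'MS', 0: 'HC'},
                   columns={'correct.answers': 'Average Score',
                            'distract_points': 'Average Distractions'})
           .round(2))
print(summary)
```

Selecting the two columns before calling `mean()` avoids averaging identifiers or other unrelated fields.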
In [None]:
df_ms = df_measures_users[df_measures_users['ms']==1]
df_hc = df_measures_users[df_measures_users['ms']==0]
In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(x="distract_points", y="correct.answers", data=df_ms, kind="reg", color=color_ms)  # MS correlation
sns.jointplot(x="distract_points", y="correct.answers", data=df_hc, kind="reg", color=color_hc)  # HC correlation
As the plots above show, there is a negative and significant correlation, which is clearer in the MS group: the more distractions, the lower the score.
The number of distractions a person has is a clear feature that helps to classify MS participants; it does not depend on demographic variables, only on behaviour within the test.
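One way to check a correlation claim like this numerically is to compute Pearson's r per group together with a p-value. The sketch below is an illustration under stated assumptions, not the analysis used in this notebook: the data are hypothetical, `permutation_corr_test` is a helper written for this example, and a permutation test is used so that only NumPy and pandas are needed.

```python
import numpy as np
import pandas as pd

def permutation_corr_test(x, y, n_perm=2000, seed=0):
    """Return Pearson's r and a two-sided permutation p-value (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real association between x and y.
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy per-participant data (hypothetical values).
toy = pd.DataFrame({
    'ms':              [1] * 6 + [0] * 6,
    'distract_points': [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5],
    'correct.answers': [60, 57, 52, 50, 44, 41, 55, 56, 50, 51, 47, 45],
})

results = {}
for ms_flag, group in toy.groupby('ms'):
    r, p = permutation_corr_test(group['distract_points'], group['correct.answers'])
    results[ms_flag] = (r, p)
    print('ms={}: r={:.2f}, permutation p={:.4f}'.format(ms_flag, r, p))
```

A negative r with a small p-value in both groups would support the reading of the joint plots. With the real data, `toy` would be replaced by `df_measures_users`.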
Plotting All participants, All events
In [None]:
# Pair every participant with the colour of their group
users = list(zip(df_measures_users['userId'], [color_ms if ms==1 else color_hc for ms in df_measures_users['ms']]))
# Distraction points per event, computed once instead of inside the loops
grouped = df_symbols.groupby(['userId','timestamp'])['distract_points'].sum().reset_index()
for user in users:
    df_user = df_symbols[df_symbols['userId']==user[0]]
    stamps = df_user['timestamp'].unique()
    # Pick the smallest square grid that fits all events of this participant
    if len(stamps) <= 2**2:
        plot_size = (2,2); fig_size = (12*1,8*1)
    elif len(stamps) <= 3**2:
        plot_size = (3,3); fig_size = (12*2,8*2)
    elif len(stamps) <= 4**2:
        plot_size = (4,4); fig_size = (12*3,8*3)
    elif len(stamps) <= 5**2:
        plot_size = (5,5); fig_size = (12*4,8*4)
    else:
        # The original size (3*2**10, 2**10) inches exceeds matplotlib's pixel limit;
        # follow the same scaling pattern as the smaller grids instead
        plot_size = (10,10); fig_size = (12*9,8*9)
    plt.figure(figsize=(fig_size[0], fig_size[1]))
    for idx, stamp in enumerate(stamps):
        plt.subplot(plot_size[0], plot_size[1], idx+1)
        df_user_ts = df_user[df_user['timestamp']==stamp]
        plt.scatter(df_user_ts['trial'], df_user_ts['response_ms'], color=user[1])
        # Upper and lower boundaries (dashed) and median response time (dotted)
        plt.hlines(y=df_user_ts['sup_line'],
                   xmin=0, xmax=max(df_user_ts['trial']), linestyles='--', color=user[1])
        plt.hlines(y=df_user_ts['inf_line'],
                   xmin=0, xmax=max(df_user_ts['trial']), linestyles='--', color=user[1])
        plt.hlines(y=df_user_ts['response_ms_med'],
                   xmin=0, xmax=max(df_user_ts['trial']), linestyles=':', color=user[1])
        plt.fill_between(range(len(df_user_ts['trial'])+1),
                         df_user_ts['sup_line'].mean(), df_user_ts['inf_line'].mean(), color=user[1], alpha=0.2)
        plt.xticks(df_user_ts['trial'], df_user_ts['symbol'], rotation=90)
        plt.ylabel('Response in Milliseconds')
        distractions = grouped[(grouped['userId']==user[0]) &
                               (grouped['timestamp']==stamp)]['distract_points'].values
        plt.title('# Test: '+str(idx)+' Score: '+str(
            max(df_user_ts['correct.answers']))+' Distraction Points: '+str(
            distractions[0])+' Median Time: '+str(df_user_ts['response_ms_med'].mean()))
    plt.tight_layout()
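The chain of size checks could also be replaced by computing the grid from the number of events. The helper below is a hypothetical sketch, not part of the original notebook; `grid_for` and its `cell_size` default of 6 by 4 inches per subplot are assumptions, chosen so that a 2 by 2 grid gives a 12 by 8 inch figure as in the code above.

```python
import math

def grid_for(n_events, cell_size=(6, 4)):
    """Return ((rows, cols), (width, height)) for n_events subplots (hypothetical helper)."""
    # Number of columns: smallest integer whose square fits all events.
    cols = max(1, math.ceil(math.sqrt(n_events)))
    # Only as many rows as are actually needed for that many columns.
    rows = max(1, math.ceil(n_events / cols))
    return (rows, cols), (cell_size[0] * cols, cell_size[1] * rows)

for n in (1, 4, 7, 16, 30):
    print(n, grid_for(n))
```

The returned values could be passed to `plt.subplot(rows, cols, idx+1)` and `plt.figure(figsize=...)`, which keeps the figure size proportional to the grid for any number of events.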