# Anomaly Detection - Outliers

## Point anomalies in test responses

It is important to analyze all the features involved in every test a person takes. We cannot assume that every task is performed with full concentration: because the tool is an app, the people taking the test may get distracted for any number of reasons.

The main objective here is to analyze any pattern in the time, in milliseconds, that a participant spends responding to each visual task.

### Population of Study

We define the score variable as the number of correct answers on each test.

### Defining Boundary

The first thing we have to consider is the time.

We define $ \Delta_{tr} $ as the time elapsed between two trials, where a trial is the event of pushing the button and selecting a digit on the test.

With this in mind, each person performs $n$ tests and each test has $m$ trials, so the time elapsed to perform each trial is $ \Delta_{tr} $.

We define $T$ as the vector of response times in an event. From this vector we can calculate the median of those times.

Finally, we calculate the $SD$ across all the people in the MS group and in the Healthy Control (HC) group separately.
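
The definitions above can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's original code: the function names, the sample timestamps, and the choice to pool every response time in a group before taking the sample standard deviation are all assumptions made for illustration.

```python
from statistics import median, stdev


def inter_trial_times(timestamps_ms):
    """Return Delta_tr: the elapsed time between consecutive trials."""
    return [later - earlier
            for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]


def group_sd(events):
    """Sample SD of all response times in a group.

    `events` is a list of T vectors, one per test event in the group.
    Pooling all events before computing the SD is an assumption.
    """
    pooled = [dt for T in events for dt in T]
    return stdev(pooled)


# Hypothetical button-press timestamps (ms) for one event, not study data.
timestamps = [0, 1500, 3100, 4500, 9000, 10400]
T = inter_trial_times(timestamps)
print("T =", T)
print("median(T) =", median(T))
```

The same `group_sd` call would be made once with the MS events and once with the HC events, so that each group gets its own $SD$.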

##### Distraction Points

$T = (\Delta_{tr_1}, \Delta_{tr_2}, \dots, \Delta_{tr_m})$, where $tr$ is the time of response on a trial.
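
One way to write the set of distraction points for an event, assuming the boundary is a symmetric band of two group standard deviations around the event's median response time, is:

$$ D = \{\, i : |\Delta_{tr_i} - \operatorname{median}(T)| > 2\,SD_g \,\} $$

Here $D$ and $SD_g$ are symbols introduced only for this illustration: $D$ holds the indices of the flagged trials and $SD_g$ is the $SD$ of the participant's group (MS or HC).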

### Detecting Outliers

We can take a single event from a randomly chosen person (eobt3CosDzEtxWW5P), for instance, and plot the response time in milliseconds across the 90 seconds of the test, showing the symbol chosen on each task.

As we can see, the points outside the boundaries are distraction points: they lie more than 2 SD away from the median response time, where the SD is the standard deviation of the participant's group.
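
The rule described above can be sketched as a small function. This is an illustrative version under stated assumptions, not the original notebook code: `distraction_points`, the parameter `k`, the two-sided boundary, and the sample values (including the group SD of 600 ms) are all hypothetical.

```python
from statistics import median


def distraction_points(T, sd_group, k=2.0):
    """Return the indices of trials outside median(T) +/- k * sd_group."""
    center = median(T)
    lower = center - k * sd_group
    upper = center + k * sd_group
    return [i for i, dt in enumerate(T) if dt < lower or dt > upper]


# Hypothetical response times (ms) for one event and a made-up group SD.
T = [1500, 1600, 1400, 4500, 1400]
flagged = distraction_points(T, sd_group=600.0)
print("distraction trial indices:", flagged)
```

The median is used as the center because it is barely moved by the slow trials that the rule is trying to detect, whereas a mean would be pulled toward them.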

The plot below shows the distribution of the time of response in milliseconds on this same event.

### Distraction Points Correlated

The first thing to check is the relationship between distraction points and the score each individual achieves. Since each test has both distraction points and a score, we can aggregate by group and then compare whether the number of distractions differs between MS and HC participants.
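
The per-group aggregation could look like the sketch below. The record layout `(group, score, n_distractions)`, the function name, and the sample records are assumptions for illustration only and are not the study data.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_group(tests):
    """Return {group: (average score, average distractions)}.

    `tests` is an iterable of (group, score, n_distractions) tuples,
    one tuple per test event.
    """
    scores = defaultdict(list)
    distractions = defaultdict(list)
    for group, score, n_distractions in tests:
        scores[group].append(score)
        distractions[group].append(n_distractions)
    return {g: (mean(scores[g]), mean(distractions[g])) for g in scores}


# Hypothetical records, not the study data.
tests = [("MS", 50, 2), ("MS", 54, 1), ("HC", 48, 1), ("HC", 44, 3)]
summary = summarize_by_group(tests)
for group, (avg_score, avg_distractions) in summary.items():
    print(group, avg_score, avg_distractions)
```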

The table below shows the average score and the average number of distractions for each group.

| Group | Average Score | Average Distractions |
|---|---|---|
| MS | 52.91 | 1.81 |
| HC | 46.22 | 1.49 |

As the plots above show, there is a negative and significant correlation between the number of distractions and the score, and it is clearer in the MS group.
**The more distractions, the lower the score.**
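
A quick way to check the direction and strength of such a relationship is the Pearson correlation coefficient. The sketch below is illustrative only: `pearson_r` is a hand-written helper, and the sample values are hypothetical, not the study data. It measures strength and sign; whether the correlation is statistically significant would need a separate test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-test values, not the study data.
distractions = [0, 1, 2, 3, 4]
scores = [60, 55, 52, 47, 40]
r = pearson_r(distractions, scores)
print(f"r = {r:.2f}")
```

Running it separately on the MS and HC tests would show whether the negative relationship is stronger in one group.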

The number of distractions a person has is a clear feature that helps to classify people with MS. This feature does not depend on demographic variables, only on behaviour within the test.

### Plotting All Participants, All Events
