k-nearest Neighbors Regression
In this notebook, you will implement k-nearest neighbors regression. You will:
- Find the k-nearest neighbors of a given query input
- Predict the output for the query input using the k-nearest neighbors
- Choose the best value of k using a validation set
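The three steps above can be sketched in plain NumPy. This is a minimal illustration rather than the notebook's own solution: the function names (compute_distances, k_nearest_neighbors, predict_knn, choose_k) and the toy data are assumptions made up for this example, and Euclidean distance with a simple average of neighbor outputs is assumed.

```python
import numpy as np

def compute_distances(features, query):
    """Euclidean distance from one query point to every row of `features`."""
    return np.sqrt(np.sum((features - query) ** 2, axis=1))

def k_nearest_neighbors(k, features, query):
    """Indices of the k training rows closest to the query, nearest first."""
    distances = compute_distances(features, query)
    return np.argsort(distances, kind="stable")[:k]

def predict_knn(k, features, outputs, query):
    """Predict by averaging the outputs of the k nearest neighbors."""
    neighbors = k_nearest_neighbors(k, features, query)
    return float(np.mean(outputs[neighbors]))

def choose_k(candidate_ks, train_X, train_y, valid_X, valid_y):
    """Return the k with the lowest residual sum of squares on the validation set."""
    rss = {}
    for k in candidate_ks:
        predictions = np.array([predict_knn(k, train_X, train_y, q) for q in valid_X])
        rss[k] = float(np.sum((predictions - valid_y) ** 2))
    best_k = min(rss, key=rss.get)
    return best_k, rss

# Toy data (made up for illustration): one feature, output equal to the feature.
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0.0, 1.0, 2.0, 3.0])
valid_X = np.array([[0.9], [2.1]])
valid_y = np.array([1.0, 2.0])

query = np.array([1.2])
neighbors = k_nearest_neighbors(2, train_X, query)
prediction = predict_knn(2, train_X, train_y, query)
best_k, rss = choose_k([1, 2], train_X, train_y, valid_X, valid_y)
print(neighbors, prediction, best_k, rss)
```

The validation loop only ever predicts from the training rows, so the validation set stays untouched until it is used to score each candidate k.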
Because the features in this dataset have very different scales (e.g. price is in the hundreds of thousands while the number of bedrooms is in the single digits), it is important to normalize the features before computing distances.
To efficiently compute pairwise distances among data points, we will convert the
SFrame into a 2D Numpy array. First import the numpy library, then copy and
paste get_numpy_data() from the second notebook of Week 2.
Using all of the numerical inputs listed in feature_list, transform the
training, test, and validation SFrames into Numpy arrays:
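The original get_numpy_data() is not reproduced here, so the following is only a stand-in sketch. It assumes the helper takes a table, a list of feature names, and an output column name, and that it prepends a constant column; both the signature and the constant column are assumptions. Plain dictionaries with made-up rows stand in for the SFrames so that the example runs without Turi Create.

```python
import numpy as np

def get_numpy_data(data, features, output):
    """Build a 2D feature matrix and a 1D output array from a column-oriented table.

    `data` only needs to support data[column_name]; a dict of lists is used here
    as a stand-in for an SFrame (assumption).
    """
    output_array = np.asarray(data[output], dtype=float)
    # Assumed behavior: a leading constant column, followed by the named features.
    columns = [np.ones(len(output_array))]
    columns += [np.asarray(data[name], dtype=float) for name in features]
    feature_matrix = np.column_stack(columns)
    return feature_matrix, output_array

feature_list = ["sqft_living", "bedrooms"]

# Made-up rows, for illustration only.
train = {"sqft_living": [1180, 2570, 770], "bedrooms": [3, 3, 2], "price": [221900, 538000, 180000]}
test = {"sqft_living": [1960], "bedrooms": [4], "price": [604000]}
valid = {"sqft_living": [1680], "bedrooms": [3], "price": [510000]}

features_train, output_train = get_numpy_data(train, feature_list, "price")
features_test, output_test = get_numpy_data(test, feature_list, "price")
features_valid, output_valid = get_numpy_data(valid, feature_list, "price")
print(features_train.shape, features_test.shape, features_valid.shape)
```

A constant column contributes nothing to the distance between two rows, so keeping or dropping it does not change which neighbors are found.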
In computing distances, it is crucial to normalize features. Otherwise, for
example, the sqft_living feature (typically on the order of thousands) would
exert a much larger influence on distance than the bedrooms feature (typically
on the order of ones). We divide each column of the training feature matrix by
its 2-norm, so that the transformed column has unit norm.
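A minimal sketch of this column normalization follows. The function name normalize_features and the small matrices are assumptions chosen for illustration. The sketch also reuses the training norms on the test rows, which is the usual practice so that all splits are scaled identically.

```python
import numpy as np

def normalize_features(feature_matrix):
    """Divide each column by its 2-norm; return the scaled matrix and the norms."""
    norms = np.linalg.norm(feature_matrix, axis=0)
    return feature_matrix / norms, norms

# Made-up values: the column norms work out to 5 and 13.
train_features = np.array([[3.0, 5.0],
                           [4.0, 12.0]])
test_features = np.array([[6.0, 13.0]])

train_normalized, norms = normalize_features(train_features)
# Scale the test rows by the norms computed on the training set.
test_normalized = test_features / norms

print(norms)
print(np.linalg.norm(train_normalized, axis=0))
```

A column that is entirely zero has norm 0 and would cause a division by zero, so such columns need to be dropped or handled separately.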
Find the (error) way to call the features
The most critical choice in computing nearest neighbors is the distance function that measures the dissimilarity between any pair of observations.
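To see why this choice matters, the sketch below compares Euclidean and Manhattan distance on three made-up points. The helper functions are plain NumPy written for this illustration, not Turi Create's implementations. The nearest neighbor of the query differs depending on which distance is used.

```python
import numpy as np

def euclidean(x, y):
    """Straight-line (2-norm) distance between two points."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (1-norm distance)."""
    return float(np.sum(np.abs(x - y)))

query = np.array([0.0, 0.0])
a = np.array([3.0, 3.0])
b = np.array([0.0, 5.0])

# Euclidean: a is about 4.24 away and b is 5 away, so a is nearer.
nearest_euclidean = "a" if euclidean(query, a) < euclidean(query, b) else "b"
# Manhattan: a is 6 away and b is 5 away, so b is nearer.
nearest_manhattan = "a" if manhattan(query, a) < manhattan(query, b) else "b"

print(nearest_euclidean, nearest_manhattan)
```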
Distance functions are also exposed in the turicreate.distances module. This allows us not only to specify the distance argument for a nearest neighbors model as a distance function (rather than a string), but also to use that function for any other purpose.
In the following snippet we use a nearest neighbors model to find the closest reference points to the first three rows of our dataset, then confirm the results by computing a couple of the distances manually with the Manhattan distance function.
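The original snippet is not reproduced here. As a library-independent stand-in, the sketch below performs the same kind of check in plain NumPy: a hypothetical query_nearest helper takes the distance as a function argument, finds the three closest reference points for the first three rows, and a couple of the reported distances are then recomputed by hand. The data, the helper names, and the (query, reference, distance, rank) result layout are assumptions for illustration.

```python
import numpy as np

def manhattan(x, y):
    """Sum of absolute coordinate differences between two points."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

def query_nearest(reference, queries, distance, k):
    """Return (query index, reference index, distance, rank) for the k closest rows."""
    results = []
    for q_idx, q in enumerate(queries):
        dists = np.array([distance(q, r) for r in reference])
        order = np.argsort(dists, kind="stable")[:k]
        for rank, r_idx in enumerate(order, start=1):
            results.append((q_idx, int(r_idx), float(dists[r_idx]), rank))
    return results

# Made-up reference points.
reference = np.array([[0.0, 0.0],
                      [1.0, 2.0],
                      [4.0, 1.0],
                      [2.0, 2.0],
                      [5.0, 5.0]])

# Query with the first three rows; each row's closest point is itself (distance 0).
results = query_nearest(reference, reference[:3], manhattan, k=3)
for row in results:
    print(row)

# Manual confirmation of two reported distances.
check_1 = manhattan(reference[0], reference[3])
check_2 = manhattan(reference[2], reference[1])
print(check_1, check_2)
```

Because the distance is passed in as an ordinary function, the same manhattan helper serves both the query and the manual check.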
Search methods
Another important choice in model creation is the method. The brute_force method computes the distance between a query point and each of the reference points, with a run time linear in the number of reference points. Creating a model with the ball_tree method takes more time, but leads to much faster queries by partitioning the reference data into successively smaller balls and searching only those that are relatively close to the query. The default method is auto which chooses a reasonable method based on both the feature types and the selected distance function. The method parameter is also specified when the model is created. The third row of the model summary confirms our choice to use the ball tree in the next example.
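To make the difference between the two search strategies concrete, here is a small self-contained sketch in NumPy. It is not Turi Create's implementation: the BallNode class, the median split along the widest dimension, the leaf size, and the random data are all assumptions chosen for brevity, and only a single nearest neighbor under Euclidean distance is returned. The brute-force search scans every reference point, while the ball-tree search skips any ball whose closest possible point cannot beat the best distance found so far.

```python
import numpy as np

class BallNode:
    """A ball (center and radius) covering a subset of the points, split recursively."""
    def __init__(self, points, indices, leaf_size=8):
        subset = points[indices]
        self.indices = indices
        self.center = subset.mean(axis=0)
        self.radius = float(np.max(np.linalg.norm(subset - self.center, axis=1)))
        self.left = self.right = None
        if len(indices) > leaf_size:
            # Split at the median along the dimension with the greatest spread.
            dim = int(np.argmax(subset.max(axis=0) - subset.min(axis=0)))
            order = indices[np.argsort(subset[:, dim])]
            mid = len(order) // 2
            self.left = BallNode(points, order[:mid], leaf_size)
            self.right = BallNode(points, order[mid:], leaf_size)

def brute_force_nearest(points, query):
    """Compute the distance to every reference point and keep the smallest."""
    dists = np.linalg.norm(points - query, axis=1)
    i = int(np.argmin(dists))
    return float(dists[i]), i

def ball_tree_nearest(node, points, query, best=(np.inf, -1)):
    """Search the tree, pruning balls that cannot contain a closer point."""
    # Every point in the ball is at least (distance to center - radius) away.
    if np.linalg.norm(query - node.center) - node.radius >= best[0]:
        return best
    if node.left is None:
        dists = np.linalg.norm(points[node.indices] - query, axis=1)
        i = int(np.argmin(dists))
        if dists[i] < best[0]:
            best = (float(dists[i]), int(node.indices[i]))
        return best
    # Visit the child whose center is closer first, so pruning kicks in sooner.
    children = sorted((node.left, node.right),
                      key=lambda c: np.linalg.norm(query - c.center))
    for child in children:
        best = ball_tree_nearest(child, points, query, best)
    return best

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))   # reference data (random, for illustration)
queries = rng.normal(size=(20, 3))

tree = BallNode(points, np.arange(len(points)))  # slower to build than no index at all
brute = [brute_force_nearest(points, q) for q in queries]
balls = [ball_tree_nearest(tree, points, q) for q in queries]
print(all(b[1] == t[1] for b, t in zip(brute, balls)))
```

Both searches return the same neighbor; the tree only pays off when the reference set is large enough that the pruned balls outweigh the cost of building the tree.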