Local Regression

k-Nearest Neighbors Regression

In this notebook, you will implement k-nearest neighbors regression. You will:

  • Find the k-nearest neighbors of a given query input
  • Predict the output for the query input using the k-nearest neighbors
  • Choose the best value of k using a validation set


In [None]:

import turicreate as tc
import pandas as pd
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:

df_sales = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/houses-data.csv')

In [None]:

df_sales.head()

Because the features in this dataset have very different scales (e.g. price is in the hundreds of thousands while the number of bedrooms is in the single digits), it is important to normalize the features.

To efficiently compute pairwise distances among data points, we will convert the SFrame into a 2D Numpy array. Numpy was already imported above; get_numpy_data() below is copied from the second notebook of Week 2.

In [None]:

def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1  # add a constant column for the intercept
    features = ['constant'] + features  # prepend the constant to the feature list
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()  # 2D array of features
    output_sarray = data_sframe[output]
    output_array = output_sarray.to_numpy()  # 1D array of outputs
    return feature_matrix, output_array

In [None]:

(example_features, example_output) = get_numpy_data(df_sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print(example_features[0, :])  # the first row of the data; ':' means all columns
print(example_output[0])       # and the corresponding output

In [None]:

train_and_validation, test = df_sales.random_split(.8, seed=1) # initial train/test split
train, validation = train_and_validation.random_split(.8, seed=1) # split training set into training and validation sets

In [None]:

feature_list = df_sales.column_names()[3:-1]  # keep only the numerical input columns

Using all of the numerical inputs listed in feature_list, transform the training, test, and validation SFrames into Numpy arrays:

In [None]:

features_train, output_train = get_numpy_data(train, feature_list, 'price')
features_test, output_test = get_numpy_data(test, feature_list, 'price')
features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')

In computing distances, it is crucial to normalize features. Otherwise, for example, the sqft_living feature (typically on the order of thousands) would exert a much larger influence on distance than the bedrooms feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.

In [None]:

def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0)  # 2-norm of each column
    normalized_features = feature_matrix / norms    # each column now has unit norm
    return normalized_features, norms

In [None]:

features_train, norms = normalize_features(features_train) # normalize training set features (columns)
features_train = tc.SFrame(data=pd.DataFrame(features_train))
features_test = features_test / norms # normalize test set by training set norms
features_test = tc.SFrame(data=pd.DataFrame(features_test))
features_valid = features_valid / norms # normalize validation set by training set norms
features_valid = tc.SFrame(data=pd.DataFrame(features_valid))
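
As a quick sanity check (not in the original notebook), every column of the normalized training matrix should now have unit 2-norm:

In [None]:

print(np.linalg.norm(features_train.to_numpy(), axis=0))  # should print (approximately) all ones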

Fitting KNN

Create the model by passing the list of feature column names to the features argument:

In [None]:

knn_model = tc.nearest_neighbors.create(features_train, features=features_train.column_names())

In [None]:

knn_model.summary()
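
With the model fitted, we can query it for nearest neighbors; the choices below (the first 3 rows as queries, k=5) are illustrative, not from the original notebook:

In [None]:

knn_model.query(features_train.head(3), k=5)  # 5 nearest training rows for each query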

Distance functions

The most critical choice in computing nearest neighbors is the distance function that measures the dissimilarity between any pair of observations.
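
For intuition, here is a toy comparison of the Euclidean and Manhattan metrics on a single pair of points (plain numpy; not part of the original notebook):

In [None]:

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
print(np.sqrt(np.sum((a - b) ** 2)))  # Euclidean distance: sqrt(9 + 4 + 0) ≈ 3.606
print(np.sum(np.abs(a - b)))          # Manhattan distance: 3 + 2 + 0 = 5.0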

In [None]:

model = tc.nearest_neighbors.create(features_train, features=features_train.column_names(),
                                    distance='manhattan')
model.summary()

Distance functions are also exposed in the turicreate.distances module. This allows us not only to specify the distance argument for a nearest neighbors model as a distance function (rather than a string), but also to use that function for any other purpose.

In the following snippet we build a nearest neighbors model on just the first three feature columns (named '0', '1' and '2' after the pandas round trip), passing the distance as a function rather than a string, and then confirm the results by computing a couple of the row-to-row distances manually with the Manhattan distance function.

In [None]:

model = tc.nearest_neighbors.create(features_train, features=['0', '1', '2'],
                                    distance=tc.distances.manhattan)

In [None]:

sf_check = features_train[['0', '1', '2']]  # the same three feature columns
print("distance check 1:", tc.distances.manhattan(sf_check[2], sf_check[10]))
print("distance check 2:", tc.distances.manhattan(sf_check[2], sf_check[14]))

Search methods

Another important choice in model creation is the method. The brute_force method computes the distance between a query point and each of the reference points, with a run time linear in the number of reference points. Creating a model with the ball_tree method takes more time, but leads to much faster queries by partitioning the reference data into successively smaller balls and searching only those that are relatively close to the query. The default method is auto, which chooses a reasonable method based on both the feature types and the selected distance function. The method parameter is also specified when the model is created. The third row of the model summary confirms our choice to use the ball tree in the next example.

In [None]:

model = tc.nearest_neighbors.create(features_train, features=features_train.column_names(),
                                    method='ball_tree', leaf_size=5)
model.summary()
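
To see the trade-off described above, a rough timing comparison can be run. This is a sketch assuming the same features_train SFrame; the absolute numbers will vary with machine and dataset size:

In [None]:

import time

brute = tc.nearest_neighbors.create(features_train, features=features_train.column_names(),
                                    method='brute_force', verbose=False)
tree = tc.nearest_neighbors.create(features_train, features=features_train.column_names(),
                                   method='ball_tree', leaf_size=5, verbose=False)

start = time.time()
brute.query(features_train.head(5), k=5, verbose=False)  # linear scan over all reference points
print('brute_force query: %.3f s' % (time.time() - start))

start = time.time()
tree.query(features_train.head(5), k=5, verbose=False)   # searches only nearby balls
print('ball_tree query: %.3f s' % (time.time() - start))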
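
Finally, the goals stated at the top — predicting the output for a query from its k nearest neighbors, and choosing k on a validation set — can be sketched directly in numpy. The helper knn_predict, the 500-query slice, and the range of k values below are illustrative choices, not part of turicreate:

In [None]:

# back to numpy arrays (the SFrames above were built from these matrices)
features_train_np = features_train.to_numpy()
features_valid_np = features_valid.to_numpy()

def knn_predict(k, neighbor_order, output):
    # predict by averaging the outputs of the k nearest training points
    return output[neighbor_order[:k]].mean()

# for each validation query, sort the training points by Euclidean distance once
orders = [np.argsort(np.sqrt(np.sum((features_train_np - q) ** 2, axis=1)))
          for q in features_valid_np[:500]]  # first 500 queries, to keep this quick

# choose k by residual sum of squares (RSS) on the validation queries
rss_per_k = {}
for k in range(1, 16):
    predictions = np.array([knn_predict(k, order, output_train) for order in orders])
    rss_per_k[k] = np.sum((predictions - output_valid[:500]) ** 2)

best_k = min(rss_per_k, key=rss_per_k.get)
print('best k on the validation set:', best_k)

The chosen k would then be evaluated once on the held-out test set.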