Regression with Python

We will use the Turi Create library and the open Airbnb Belgium dataset to predict listing prices with a boosted trees regression model.

In [1]:

import turicreate as tc
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

df_rooms = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv')
Downloading https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv to /var/tmp/turicreate-pedrohserrano/13091/91587361-c28e-4d95-aae9-d4fd4c4d6e04.csv
Finished parsing file https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv
Parsing completed. Parsed 100 lines in 0.060946 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,int,float,int,float,float,str,float,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv
Parsing completed. Parsed 15711 lines in 0.054176 secs.

In [3]:

df_rooms.head()
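+----------+-----------+-------------+-----------+--------------+---------+----------------------+--------------+----------+
| room_id  | host_id   | room_type   | borough   | neighborhood | reviews | overall_satisfaction | accommodates | bedrooms |
+----------+-----------+-------------+-----------+--------------+---------+----------------------+--------------+----------+
| 14054734 | 33267800  | Shared room | Brussel   | Brussel      | 1       | 0.0                  | 2            | 1.0      |
| 16151530 | 105088596 | Shared room | Brussel   | Brussel      | 1       | 0.0                  | 1            | 1.0      |
| 14678546 | 30043608  | Shared room | Brussel   | Brussel      | 14      | 4.5                  | 2            | 1.0      |
| 8305401  | 43788729  | Shared room | Namur     | Namur        | 12      | 4.5                  | 2            | 1.0      |
| 14904339 | 15277691  | Shared room | Namur     | Gembloux     | 1       | 0.0                  | 6            | 1.0      |
| 16228753 | 61781546  | Shared room | Antwerpen | Antwerpen    | 3       | 4.5                  | 2            | 1.0      |
| 643309   | 3216639   | Shared room | Roeselare | Roeselare    | 6       | 4.0                  | 6            | 1.0      |
| 3879691  | 19998594  | Shared room | Brugge    | Knokke-Heist | 1       | 0.0                  | 12           | 1.0      |
| 3710876  | 18917692  | Shared room | Antwerpen | Antwerpen    | 11      | 3.0                  | 3            | 1.0      |
| 5141135  | 20676997  | Shared room | Gent      | Gent         | 9       | 4.5                  | 2            | 1.0      |
+----------+-----------+-------------+-----------+--------------+---------+----------------------+--------------+----------+
+-------+---------+-----------+-----------+--------------------------------+
| price | minstay | latitude  | longitude | last_modified                  |
+-------+---------+-----------+-----------+--------------------------------+
| 55.0  |         | 50.847703 | 4.379786  | 2016-12-31 14:49:05.125349 ... |
| 42.0  |         | 50.821832 | 4.366557  | 2016-12-31 14:49:05.112730 ... |
| 43.0  |         | 50.847657 | 4.348675  | 2016-12-31 14:49:05.110143 ... |
| 48.0  |         | 50.462592 | 4.818974  | 2016-12-31 14:49:05.107436 ... |
| 59.0  |         | 50.562263 | 4.693185  | 2016-12-31 14:49:05.101899 ... |
| 53.0  |         | 51.203401 | 4.392493  | 2016-12-31 14:49:05.096266 ... |
| 22.0  |         | 50.941016 | 3.123627  | 2016-12-31 14:49:03.811667 ... |
| 33.0  |         | 51.339016 | 3.273554  | 2016-12-31 14:49:02.743608 ... |
| 33.0  |         | 51.232425 | 4.424612  | 2016-12-31 14:49:02.710383 ... |
| 38.0  |         | 51.034197 | 3.714149  | 2016-12-31 14:49:02.705108 ... |
+-------+---------+-----------+-----------+--------------------------------+
[10 rows x 14 columns]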
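
The next cell starts with an 80/20 train-test split. As a rough sketch of what a random split like this does, the pure-Python helper below sends each row to the first list with probability `fraction`. The helper, its name, and the stand-in rows are illustrative assumptions, not Turicreate's implementation; the point is that the resulting sizes are only approximately 80% and 20%, and that fixing a seed makes the split repeatable.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction` (illustrative helper)."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        # Each row is assigned independently, so the sizes are only approximate.
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for table rows, not the Airbnb data
train, test = random_split(rows, 0.8, seed=42)
print(len(train), len(test))  # close to 800 and 200, but not exactly
```

`SFrame.random_split` also takes an optional `seed` argument, which is worth setting when the reported numbers need to be reproducible across runs.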

In [4]:

# Make a train-test split
train_data, test_data = df_rooms.random_split(0.8)

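# Train a boosted trees regression model to predict the price.
# (boosted_trees_regression.create always builds this model type;
# tc.regression.create is the call that selects a model automatically.)
model = tc.boosted_trees_regression.create(train_data, target='price',
                                           features=['room_type',
                                                     'borough',
                                                     'neighborhood',
                                                     'reviews',
                                                     'overall_satisfaction',
                                                     'accommodates',
                                                     'bedrooms'],
                                           max_iterations=10)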

# Save predictions to an SArray
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
Boosted trees regression:
--------------------------------------------------------
Number of examples          : 11873
Number of features          : 7
Number of unpacked features : 7
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 0.012864     | 1653.675049        | 684.772400           | 99.976921     | 92.150543       |
| 2         | 0.022861     | 1405.623779        | 577.419556           | 83.090813     | 76.248520       |
| 3         | 0.033853     | 1194.780151        | 532.765564           | 72.806427     | 66.368752       |
| 4         | 0.049375     | 1061.806641        | 529.710327           | 66.236198     | 60.575546       |
| 5         | 0.064744     | 944.184692         | 520.996094           | 61.931919     | 58.937630       |
| 6         | 0.078520     | 915.749634         | 518.556152           | 59.524872     | 57.983086       |
+-----------+--------------+--------------------+----------------------+---------------+-----------------+
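
The progress table reports two metrics per iteration, rmse and max_error, and the dictionary returned by model.evaluate should carry the same two for the test set. As a minimal sketch of how they are defined, here they are computed by hand in plain Python. The function names and the price values are made up for illustration and do not come from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def max_error(actual, predicted):
    """Largest absolute residual over all rows."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical prices, invented for this example.
actual = [50.0, 80.0, 120.0, 300.0]
predicted = [55.0, 70.0, 125.0, 200.0]

print(rmse(actual, predicted))       # about 50.4
print(max_error(actual, predicted))  # 100.0
```

A single badly predicted expensive listing sets max_error and also pulls rmse up, which is consistent with the log above, where max_error stays in the hundreds while rmse falls to about 58.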

In [5]:

plt.scatter(test_data['price'], predictions)
<matplotlib.collections.PathCollection at 0x10cd11828>

[Figure: scatter plot of actual prices (x-axis) against predicted prices (y-axis) for the test set]

In [6]:

test_data[0]
{'accommodates': 2,
 'bedrooms': 1.0,
 'borough': 'Brussel',
 'host_id': 33267800,
 'last_modified': '2016-12-31 14:49:05.125349',
 'latitude': 50.847703,
 'longitude': 4.379786,
 'minstay': '',
 'neighborhood': 'Brussel',
 'overall_satisfaction': 0.0,
 'price': 55.0,
 'reviews': 1,
 'room_id': 14054734,
 'room_type': 'Shared room'}

In [7]:

predict = model.predict(test_data[0])

In [8]:

print('Prediction of the room: {}'.format(predict[0]))
Prediction of the room: 51.40476608276367
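
To put this single prediction in context, it can be compared with the listing's actual price of 55.0 shown in test_data[0]. The short check below hardcodes the two numbers printed above; the variable names are illustrative.

```python
actual_price = 55.0                  # 'price' of test_data[0] shown above
predicted_price = 51.40476608276367  # value printed by the previous cell

residual = actual_price - predicted_price
relative_error = abs(residual) / actual_price

print('Residual: {:.2f}'.format(residual))              # Residual: 3.60
print('Relative error: {:.1%}'.format(relative_error))  # Relative error: 6.5%
```

For this row the model underestimates the price by about 3.6, well below the validation rmse of roughly 58, so this listing is one of the easier cases.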
Written on August 16, 2018