Data Science House Price Prediction Project

Screen Shot 2017-10-22 at 12.23.39 AM

This is the first project I completed at WeCloudData. Its purpose is to explore the relationship between house prices in Toronto (GTA) and other features such as location, size of the house, number of bedrooms and number of bathrooms.

We start by scraping data from Kijiji (kijiji.ca) through URL requests.

Screen Shot 2017-10-22 at 9.15.41 AM.png

Then we parse the page source using Beautiful Soup.

Screen Shot 2017-10-22 at 9.21.23 AM.png
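The two steps above can be sketched roughly as follows. This is only an illustration, not our actual scraper: the CSS selectors (`div.listing`, `.title`, `.price`) and the sample HTML are made up, and the real page markup will differ. The download function is defined but not called, so the parsing part runs offline.

```python
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download one results page (defined for illustration, not called here)."""
    import requests  # imported lazily so the parsing demo needs no network

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_listings(html):
    """Pull the title and price text out of each listing block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "div.listing", ".title" and ".price" are hypothetical selectors.
    for item in soup.select("div.listing"):
        title = item.select_one(".title")
        price = item.select_one(".price")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return rows


# Made-up HTML standing in for a scraped page.
SAMPLE_HTML = """
<div class="listing"><a class="title">3 bedroom house</a><span class="price">$750,000</span></div>
<div class="listing"><a class="title">Condo downtown</a></div>
"""

listings = parse_listings(SAMPLE_HTML)
print(listings)
```

Returning `None` for a missing field, instead of failing, keeps incomplete advertisements in the raw data so that missing values can be counted later.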

After that, we turn the raw data into a cleaned dataset. Our data look like the table below; only the first 5 rows are shown here:

Screen Shot 2017-10-21 at 8.59.37 PM.png

Price is the house price in Toronto; postcode is the corresponding postal code; FSA is the first three characters of the postcode; bedroom is the number of bedrooms in the house; bath is the number of bathrooms; sqrt is the size of the house in square feet (not a square root, despite the name); city is the location; and the last two columns are the latitude and longitude of the house.
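As a small illustration of how the FSA column can be derived from the postcode, here is a minimal sketch. The function name `fsa_from_postcode` is hypothetical and the clean-up rules (strip spaces, upper-case) are assumptions, not necessarily what our code did.

```python
def fsa_from_postcode(postcode):
    """Return the first three characters of a postcode, or None if unusable."""
    if not postcode:
        return None
    cleaned = postcode.replace(" ", "").upper()
    if len(cleaned) < 3:
        return None
    return cleaned[:3]


print(fsa_from_postcode("m5v 2t6"))
```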

As you can see, there are many missing values, so the next step is to handle missing data and outliers. We first compute the percentage of missing data in each feature. Apart from sqrt, the features have only a few missing values; sqrt, however, is missing in more than 60 percent of the rows. We therefore consider this variable (sqrt) unusable, drop it, and delete only the rows with missing values in the remaining features. The new dataset is shown below; again only five rows are shown:

Screen Shot 2017-10-21 at 9.12.40 PM.png
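The missing-data step described above can be sketched with pandas as follows. The numbers in this toy table are invented for illustration; only the column names and the 60 percent cut-off come from the text.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the scraped table.
df = pd.DataFrame({
    "price": [500000, 720000, np.nan, 650000, 810000],
    "bedroom": [3, 2, 4, np.nan, 3],
    "bath": [2, 1, 2, 2, 3],
    "sqrt": [np.nan, 1200, np.nan, np.nan, np.nan],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Drop columns missing in more than 60% of rows, then drop incomplete rows.
threshold = 60
sparse_cols = missing_pct[missing_pct > threshold].index.tolist()
cleaned = df.drop(columns=sparse_cols).dropna()
print(cleaned)
```

Dropping the sparse column first matters: calling `dropna()` on the full table would also discard every row where sqrt is missing, which is most of the data.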

Next, since we want to find the relationship between the price and each feature, we cannot feed the city name into a model directly. We therefore replace each city with the mean house price of that city, which turns the categorical variable into a quantitative one. Once this is done, we can draw a box plot and delete the outliers. We first use scipy to draw a box plot of the house price, where you can see an extremely large outlier.

Screen Shot 2017-10-21 at 9.38.18 PM.png
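Replacing each city with its mean price, as described above, can be sketched like this. The rows are invented and the new column name `city_mean_price` is a hypothetical choice.

```python
import pandas as pd

# Toy data standing in for the cleaned table.
df = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Mississauga", "Markham", "Mississauga"],
    "price": [800000, 700000, 600000, 900000, 500000],
})

# Mean price per city, then map it back onto every row.
city_mean = df.groupby("city")["price"].mean()
df["city_mean_price"] = df["city"].map(city_mean)
print(df)
```

One caveat with this kind of encoding: if the means are computed on the full dataset, the feature already contains information about the prices in the test rows. Computing the means on the training split only is the safer variant.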

Then we delete the outliers and use plotly to draw a nicer box plot.

newplot (1).png
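The text does not say which rule was used to delete the outliers. A common choice that matches the box plot is the 1.5 × IQR rule, sketched below as an assumption; the helper name `drop_iqr_outliers` and the prices are made up.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Toy prices with one extreme value, as in the first box plot.
prices = pd.DataFrame(
    {"price": [450000, 520000, 610000, 700000, 780000, 850000, 25000000]}
)
trimmed = drop_iqr_outliers(prices, "price")
print(trimmed)
```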

This is the box plot of the house price. As you can see, the minimum price is around 0, the maximum is approximately 1.6M, the median is approximately 0.7M, and there are no outliers left since we have already deleted them. After that, we also drew a scatter plot of the house price.

Scatter Plot of House Price

newplot.png

In the scatter plot above, the x-axis is the index of the house and the y-axis is the price of the house.

Screen Shot 2017-10-21 at 9.32.29 PM.png

We also drew a histogram of the house price; as you can see, the shape is almost normal. We also produced other descriptive statistics, such as pie charts.

Screen Shot 2017-10-21 at 9.32.42 PM.png

This is the pie chart of the number of bedrooms in the advertisements. The pie chart of the number of bathrooms is shown below:

Screen Shot 2017-10-21 at 9.32.54 PM.png

We also drew bar charts for the number of bathrooms and the number of bedrooms. The x-axis is the count and the y-axis is the house price in Toronto.

Bar Chart of Number of Bathrooms

Screen Shot 2017-10-21 at 9.33.01 PM.png

Bar Chart of Number of Bedrooms

Screen Shot 2017-10-21 at 9.33.10 PM.png

Then we plot the location of each advertisement. Since our project focuses mainly on Toronto, you can see that most of the advertisements are located in the Greater Toronto Area.

Screen Shot 2017-10-21 at 9.34.08 PM.png

Screen Shot 2017-10-21 at 9.33.54 PM.png

After that, we also drew a QQ plot of the house price to check whether it follows a normal distribution. The answer is yes, because most of the points lie along the red line.

Screen Shot 2017-10-21 at 9.34.59 PM.png
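A QQ plot like the one above can be produced with `scipy.stats.probplot`. The sketch below uses simulated prices (the mean and spread are invented) and prints the correlation with the reference line instead of drawing the chart; whether our original plot was made this way is an assumption.

```python
import numpy as np
from scipy import stats

# Simulated prices standing in for the real column.
rng = np.random.default_rng(0)
prices = rng.normal(loc=700_000, scale=200_000, size=500)

# probplot returns the plotted points and the least-squares reference line.
(theoretical_q, ordered_prices), (slope, intercept, r) = stats.probplot(
    prices, dist="norm"
)
print(f"correlation with the reference line: r = {r:.3f}")

# To draw the chart, pass a matplotlib module or axes object:
#   import matplotlib.pyplot as plt
#   stats.probplot(prices, dist="norm", plot=plt)
```

A value of `r` close to 1 corresponds to points lying along the line, which is the visual check described in the text.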

The last step is to find the relationship between the house price and the other features, excluding sqrt. We pick four regression models to compare and test: Linear Regression, Lasso Regression, SVM, and Decision Tree Regression. After evaluating the four models on our data with cross-validation, the answer was unfortunately disappointing: all of them scored terribly, which means none of these models fits the data well. We also drew scatter plots comparing the test data with the predicted data for the four methods. The x-axis is the test house price in Toronto and the y-axis is the predicted house price in Toronto. As you can see, none of them performs well.

Screen Shot 2017-10-21 at 9.35.14 PM.png

Scatter Plot of Quantitative Comparison

Screen Shot 2017-10-21 at 9.36.08 PM.png        Screen Shot 2017-10-21 at 9.36.19 PM.png
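The comparison of the four models with cross-validation can be sketched with scikit-learn as below. Everything about the data is an assumption: the features and prices are randomly generated so that the example runs on its own, and the 5 folds, the R² score and the default hyperparameters are illustrative choices, not necessarily the ones we used.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic features: bedroom, bath, mean price of the city.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(1, 6, n),
    rng.integers(1, 4, n),
    rng.normal(700_000, 100_000, n),
])
# Synthetic price: a linear signal plus noise.
y = 100_000 * X[:, 0] + 50_000 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 50_000, n)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0, max_iter=10_000),
    "SVM": SVR(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
}

# Mean R^2 over 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

For regression models the score is R² rather than accuracy, and it can be negative when a model does worse than always predicting the mean price. SVR in particular is sensitive to the scale of the features and the target, so a poor score there may partly reflect unscaled inputs.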

The likely reason for this result is that we did not take sqrt into account, since the size of the house should be one of the leading factors affecting house prices in Toronto. To improve our results and reach a reasonable conclusion, next time we need to choose a more informative website and scrape more features as well as more data.
