Part 2: Sample Machine Learning Project
In this part, you will work under the assumption that you have just been hired as a lead data scientist at a real estate company. You will go through the whole data science process, from gathering the data to launching the system. P.S. If you are scared of code, you are about to have the most amazing experience of your life 😉
To work with data, you need some good data sources. Some examples are:
- Popular open data repositories:
  - UC Irvine Machine Learning Repository
  - Amazon’s AWS datasets
- Meta portals (they list open data repositories):
- Other pages listing many popular open data repositories:
  - Wikipedia’s list of Machine Learning datasets
We will focus on the California Housing Prices dataset from the StatLib repository; the data can also be found with a quick search on Kaggle. The dataset is based on data from the 1990 California census.
The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California (called a district here for short). Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
The following checklist can guide you through your Machine Learning projects. There are eight main steps:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.
First, you need to frame the problem: is it supervised, unsupervised, or reinforcement learning? Is it a classification task or a regression task? Should you use batch learning or online learning techniques?
This is a supervised learning problem, since your training data is labeled. It is also a regression task, since you are asked to predict a value. More specifically, it is a multiple regression problem, since the system will use multiple features to make a prediction. (It is also univariate, since only one value is predicted per district; predicting several values at once would make it multivariate.)
Next, you need to select a performance measure for your model. A typical performance measure for regression problems is the Root Mean Squared Error (RMSE), which gives an idea of how large the system's prediction errors typically are, with a higher weight given to large errors. RMSE is the square root of the mean of the squared differences between the predicted and the actual values.
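In standard notation, RMSE can be written as shown here. The symbols are notation assumed for this write-up: m is the number of instances, x^(i) is the feature vector of the i-th instance, y^(i) is its label (the actual median housing price), and h is the model's prediction function.

```latex
\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^{2}}
```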
RMSE is a good measure to use if we want to estimate the standard deviation σ of a typical observed value from our model’s prediction. However, in some contexts you may prefer to use another function. For instance, if there are many outliers, you may consider using the Mean Absolute Error (MAE, also called the Average Absolute Deviation), which is less sensitive to large individual errors.
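To make the difference between the two measures concrete, here is a minimal sketch using NumPy. The function names rmse and mae and the toy prices are made up for illustration; they are not part of the housing dataset or of any library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Hypothetical prices: three predictions are off by 10, one is off by 100.
y_true = [200, 300, 400, 500]
y_pred = [210, 290, 410, 400]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

Because RMSE squares each error before averaging, the single large error dominates it, while MAE grows only in proportion to that error. This is why MAE can be preferable when the data contains many outliers.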
Finally, you need to understand that the output of your model will be fed into another downstream Machine Learning system; chaining components in this way is called pipelining. Now that you understand the problem, let’s get our hands dirty with code.
Get the Data
As mentioned in part one, we will not go through the basics of setting up your environment. There are tons of resources online to guide you, depending on your system. Note, however, that we will mostly be coding in Python, using Jupyter notebooks.
You can get Python from https://www.python.org/. You will also need a number of Python modules: NumPy, Pandas, Matplotlib, and Scikit-Learn, all of which can be easily installed using pip or conda.
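If you are not sure whether these modules are already installed, a quick check along the following lines can help. This is only a sketch: the helper name find_missing_modules is made up here, and the list uses the import names of the packages (Scikit-Learn is imported as sklearn).

```python
import importlib.util

# Import names of the four packages mentioned above.
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "sklearn"]

def find_missing_modules(module_names):
    """Return the names from module_names that Python cannot locate."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Anything reported as missing can then be installed with pip, for example `pip install numpy pandas matplotlib scikit-learn`.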
You can find the California housing data on Kaggle. Now let’s load the data using Pandas. We will use a small function that loads the data into a Pandas DataFrame.
Calling the head() method on the result shows the first five rows.
There are 10 attributes, and you can get a quick description of the data using the info() method, i.e. housing.info().
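A loading function could look like the sketch below. It assumes the data was downloaded as a CSV file named housing.csv; both the file name and the housing_path parameter are assumptions, so adjust them to match wherever you saved the file.

```python
import os
import pandas as pd

def load_housing_data(housing_path="."):
    """Read housing.csv from the housing_path directory into a DataFrame."""
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

# Usage, assuming housing.csv sits in the current working directory:
# housing = load_housing_data()
# housing.head()
```

Keeping the loading step in a function makes it easy to reload a fresh copy of the data whenever you need one.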
There are 20,640 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it’s perfect to get started. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
All attributes are numerical, except the ocean_proximity field. Its data type is object.
Let’s look at the other fields. The describe() method shows a summary of the numerical attributes.
The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored (so, for example, count of total_bedrooms is 20,433, not 20,640). The std row shows the standard deviation (which measures how dispersed the values are). The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, 25% of the districts have a housing_median_age lower than 18, while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).
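The behaviour of describe() is easy to verify on a tiny example. The sketch below uses a made-up DataFrame (the values are not taken from the housing dataset) with two missing entries, so you can see that count skips null values and that the 50% row is the median.

```python
import numpy as np
import pandas as pd

# Made-up toy values, not taken from the housing dataset.
toy = pd.DataFrame({
    "housing_median_age": [10.0, 20.0, 30.0, 40.0, 50.0],
    "total_bedrooms": [100.0, np.nan, 300.0, 400.0, np.nan],
})

summary = toy.describe()
print(summary)

# count ignores the two missing total_bedrooms values
bedrooms_count = summary.loc["count", "total_bedrooms"]
# the 50% row is the median of the column
median_age = summary.loc["50%", "housing_median_age"]
print("non-null total_bedrooms:", bedrooms_count)
print("median housing_median_age:", median_age)
```

The same calls work unchanged on the full housing DataFrame.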
Hopefully you now have a better understanding of the kind of data you are dealing with. We now need to create a test set and set it aside. It will be used to evaluate our model on data it has never seen before.
Creating a Test Set
Typically, 20% of the data is set aside for testing, and the remaining 80% is used for training.
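One simple way to make such a split is to shuffle the row positions and cut them at the 20% mark. The sketch below is one possible approach rather than the only one: the function name split_train_test, the seed parameter, and the toy DataFrame are all made up for illustration.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio=0.2, seed=42):
    """Randomly split a DataFrame into a training set and a test set."""
    rng = np.random.default_rng(seed)       # fixed seed: same split on every run
    shuffled = rng.permutation(len(data))   # shuffled row positions
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Toy data standing in for the housing DataFrame.
toy = pd.DataFrame({"value": range(100)})
train_set, test_set = split_train_test(toy)
print(len(train_set), "train +", len(test_set), "test")
```

Fixing the seed matters: without it, each run would produce a different test set, and over time your model would get to see the whole dataset. Scikit-Learn also provides a train_test_split function in sklearn.model_selection that serves the same purpose.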