Machine learning is one of the hot topics of 2020, and everyone wants a slice of the pie. The biggest problem is that enthusiastic newcomers tend to jump the gun and start with advanced topics such as convolutional neural networks and deep learning, without any regard for the basics. Studying machine learning is like climbing a ladder: you have to start from the bottom rung.
In this machine learning beginner to advanced series, we will be looking at the most basic concepts of machine learning and gradually proceed to the advanced things.
Whether you are a beginner or an experienced data scientist, this tutorial series is curated for you. By the end of it, you should have a solid grounding in the field, whether your goal is to launch your own startup or to ace that interview.
However, there are some basic requirements if you want to follow along smoothly. These are:
- Be familiar with the basics of object-oriented programming in Python.
- Be familiar with the basic functionality of Jupyter Notebook.
I assume you are familiar with the items listed above, because I will not cover them in this series. Note that this series is a summary and simplification of Aurélien Géron’s book, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, so most of the examples and explanations are derived from that book. I hope you are excited to learn new things, so buckle up and let’s get started.
What is Machine Learning?
In short, machine learning is the art and science of programming computers so that they can learn from data. From an engineering perspective, a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. One of the earliest widely used machine learning programs is the spam filter, which dates from the 1990s. Such a program learns from a training set of example emails. In this case, the task T is to flag spam among new emails, the experience E is the training data, and the performance measure P needs to be defined. For example, you can use the ratio of correctly classified emails. This performance measure is called accuracy and is often used in classification tasks.
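To make the accuracy measure concrete, here is a minimal sketch in plain Python. The five email labels are made up for illustration, and the `accuracy` helper is a hypothetical name, not something taken from the book.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for five emails: True means "spam".
actual_labels = [True, False, True, False, False]
predicted_labels = [True, False, False, False, True]

score = accuracy(predicted_labels, actual_labels)
print(f"accuracy = {score:.2f}")  # 3 of the 5 emails are classified correctly
```

If the filter improves as it sees more training emails, this number goes up, which is what "learning from experience" means here.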
Machine learning can also be used for problems that are either too complex for traditional approaches or have no known algorithm. An example of this is speech recognition. The best solution nowadays for speech recognition problems is to write an algorithm that learns by itself, given many example recordings.
Finally, Machine Learning can help humans learn by inspecting the patterns learnt by the algorithms. For instance, once the spam filter has been trained on enough spam emails, it can be inspected to reveal the combination of words that it believes are the best predictors of spam. Sometimes this will reveal new trends, and thereby lead to a better understanding of the problem. This process of applying ML techniques to discover patterns from large amounts of data is called data mining.
In summary, machine learning is great for:
- Problems for which existing solutions require a lot of hand-tuning or long lists of rules
- Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
- Fluctuating environments: a Machine Learning system can adapt to new data.
- Getting insights about complex problems and large amounts of data.
Types of Machine Learning
1. Supervised vs Unsupervised Learning
This classification is based on the amount and type of supervision a system gets during training. There are four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
In supervised learning, the training data that is fed to the algorithm includes the desired solutions, known as labels. An example is the spam filtering problem explained above, which is a classification task.
Another example is predicting a numeric target value, such as the price of a car, given a set of features (age, mileage, etc.) called predictors. This type of task is called regression.
To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
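As a rough sketch of the regression idea, the snippet below fits a straight line to a tiny labeled training set using ordinary least squares. The ages and prices are invented, only one predictor (age) is used to keep things short, and `fit_line` is a hypothetical helper. Later parts of a series like this would normally rely on a library such as Scikit-Learn instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training set: car age in years (predictor) and price (label).
ages = [1, 2, 3, 4, 5]
prices = [18000, 16000, 14000, 12000, 10000]

slope, intercept = fit_line(ages, prices)

# Use the fitted line to predict the price of an unseen 6-year-old car.
predicted_price = slope * 6 + intercept
print(f"price drops by about {-slope:.0f} per year; predicted price: {predicted_price:.0f}")
```

The labels (prices) are what make this supervised: the algorithm is told the right answer for every training example.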
Note that some regression algorithms can also be used for classification, e.g. the Logistic Regression algorithm, which we will cover later. Some of the supervised learning algorithms that we shall look at are:
- Logistic regression
- Linear Regression
- k-Nearest Neighbors
- Support Vector Machines (SVMs)
- Neural networks
- Decision Trees and Random Forests
In unsupervised learning, the training data is not labeled and the system tries to learn on its own. Some of the unsupervised learning algorithms that we shall look at are:
- Clustering (K-means)
- Visualization and dimensionality reduction (Principal Component Analysis: PCA)
- Association rule learning (Apriori)
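To give a feel for clustering, here is a small, simplified sketch of the K-means loop in plain Python. The customer data, the starting centroids, and the `k_means` function name are all assumptions made for illustration. A real project would more likely use a library implementation with smarter initialization.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if not members:
                new_centroids.append(old)  # leave an empty cluster where it is
                continue
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        centroids = new_centroids
    return centroids, assignments


# Made-up customers: (visits per month, average basket size). No labels.
customers = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignments = k_means(customers, centroids=[(0, 0), (10, 10)])
print(assignments)  # cluster index for each customer
```

Nobody told the algorithm which customers belong together. It groups them purely by how close they are to each other.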
One of the practical tasks we shall look at is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
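The car example can be sketched with a special case of PCA. For exactly two standardized, positively correlated features, the first principal component is proportional to their sum, so no matrix library is needed. The car data and the `wear_and_tear` name are made up, and this shortcut only holds for the two-feature case.

```python
import math
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up cars: age in years and mileage in kilometres (strongly correlated).
ages = [1, 3, 5, 7, 9]
mileages = [12000, 40000, 61000, 90000, 115000]

z_age = standardize(ages)
z_mileage = standardize(mileages)

# Project each car onto the direction (1, 1) / sqrt(2): two features become one.
wear_and_tear = [(a + m) / math.sqrt(2) for a, m in zip(z_age, z_mileage)]
print([round(w, 2) for w in wear_and_tear])
```

Each car is now described by a single number instead of two, and older, higher-mileage cars get a higher score.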
Other typical usages of unsupervised learning are in anomaly detection such as detecting credit card fraud and manufacturing flaws. It can also be used in the automatic removal of outliers, which are points that differ significantly from other points.
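As a very simple illustration of outlier detection, the sketch below flags values that lie far from the mean of the data. This is a basic statistical rule (a z-score threshold), not one of the learned anomaly detection models, and the transaction amounts and the threshold of 2.5 are assumptions chosen for the example.

```python
import statistics


def find_outliers(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


# Made-up card transaction amounts; one of them is suspiciously large.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]

outliers = find_outliers(amounts)
print(outliers)
```

No transaction was ever labeled as fraud here. The unusual one stands out only because it differs so much from the rest.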
Another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket.
Running an association rule learning algorithm on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
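The supermarket rule can be checked with two standard measures, support and confidence. The sketch below computes them for a handful of invented shopping baskets. It is not the full Apriori algorithm, which also searches for candidate rules, and the helper names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up sales log: each set is one customer's basket.
transactions = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"barbecue sauce", "potato chips"},
    {"bread", "milk"},
    {"steak", "bread"},
]

rule_support = support(transactions, {"barbecue sauce", "potato chips", "steak"})
rule_confidence = confidence(
    transactions, {"barbecue sauce", "potato chips"}, {"steak"}
)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule with high confidence, like this one, is a candidate for placing the items near each other.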
2. Semi-supervised learning
In this type of learning, the training data consists of a lot of unlabeled data and a small amount of labeled data. A photo-hosting service, such as Google Photos, is a good example of this.
Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Give it just one label per person and it is able to name everyone in every photo, which is useful for searching photos.
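The labeling step can be sketched as follows, assuming the clustering step has already grouped the detected faces. The face IDs, cluster numbers, names, and the `propagate_labels` helper are all invented for illustration. How a real photo service does this is not described in the source.

```python
def propagate_labels(cluster_of, known_labels):
    """Give every item the name attached to a labeled item in its cluster."""
    cluster_name = {cluster_of[item]: name for item, name in known_labels.items()}
    return {item: cluster_name.get(cluster) for item, cluster in cluster_of.items()}


# Assumed output of the unsupervised step: detected face -> cluster id.
cluster_of = {"face_1": 0, "face_2": 1, "face_3": 0, "face_4": 1, "face_5": 0}

# The only supervision: the user names one face per person.
known_labels = {"face_1": "Alice", "face_2": "Bob"}

names = propagate_labels(cluster_of, known_labels)
print(names)
```

Two labels were enough to name all five faces, because the clusters carry the rest of the information.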
3. Reinforcement Learning
In reinforcement learning, the learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy is, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
For example, many robots use reinforcement learning algorithms to learn how to walk.
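The agent, reward, and policy ideas can be sketched with Q-learning, one common reinforcement learning algorithm, chosen here only for illustration. The environment is a made-up corridor of five positions with a reward at the far right end, and all names and hyperparameters are assumptions.

```python
import random

ACTIONS = {"left": -1, "right": +1}
GOAL = 4  # positions are 0..4; reaching position 4 ends the episode


def step(state, action):
    """Environment: move the agent, reward 1.0 only when the goal is reached."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # explore
            else:
                best = max(q[state].values())       # exploit, ties broken randomly
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()

# The learned policy: the best-valued action in each non-goal position.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

The agent is never told that moving right is correct. It discovers this policy from the rewards alone.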
Main Challenges of Machine Learning
There are two major challenges in ML: “bad data” and “bad algorithms”.
Bad data can mean an insufficient quantity of training data, poor-quality data (containing noise, outliers, and errors), or irrelevant features. Coming up with a good set of features to train on, a process called feature engineering, involves:
- Feature selection, which is selecting the most useful features to train on among existing features.
- Feature extraction, which is combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
- Creating new features by gathering new data.
A bad algorithm shows up as either overfitting or underfitting. Overfitting means the model performs well on the training set but poorly on the test set, because it has learned the noise in the training data rather than the underlying pattern. Underfitting is the opposite: the model is too simple to learn the underlying structure of the data, so it performs poorly even on the training set.
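The difference can be seen by comparing training and test error for three toy models on invented data that roughly follows y = 2x. The models are deliberate caricatures chosen for this sketch: a constant (too simple), a straight line (about right), and a memorizer that returns the label of the nearest training point (too flexible).

```python
def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the training labels are roughly 2 * x plus some noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.5, 1.6, 4.4, 5.7, 8.5, 9.6]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [0.8, 2.8, 4.8, 6.8, 8.8]

# Underfitting model: always predicts the mean training label.
mean_y = sum(train_y) / len(train_y)
constant = lambda x: mean_y

# Reasonable model: least-squares straight line.
mean_x = sum(train_x) / len(train_x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x
line = lambda x: slope * x + intercept

# Overfitting model: memorizes the training set, noise included.
memorizer = lambda x: min(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[1]

for name, model in [("constant", constant), ("line", line), ("memorizer", memorizer)]:
    print(f"{name:9s} train={mse(model, train_x, train_y):.3f} "
          f"test={mse(model, test_x, test_y):.3f}")
```

The memorizer has zero training error but does worse than the line on the test set, while the constant model is poor on both.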
We have just covered the fundamentals of Machine Learning. The series continues on the next page.
To test what you have learnt, make sure you can comfortably answer the following questions:
1. How would you define Machine Learning?
2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
7. What type of algorithm would you use to segment your customers into multiple groups?
8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?