First step towards machine learning
A. Introduction
This article is written for absolute beginners, but it also has material that can help experienced practitioners take a fresh look at some familiar concepts.
Any machine learning project involves many technical terms, such as supervised learning/training, inference, training and testing data, optimisation, loss function, hyper-parameters etc., all of which can be explained with a very simple example of linear regression, as is done in this article. Here I take the example of fitting a straight line to a set of noisy data points, following the standard linear regression approach.
I call this problem/algorithm the first step towards machine learning because it is one of the most common and oldest problems/algorithms, having its roots in the work of Legendre and Gauss in the early 1800s. The regression approach was made famous by none other than the founder of modern statistical science, R.A. Fisher, in the 1920s.
Before I start explaining the problem, let me give some definitions.
1. Learning/training : Finding a model (possibly in the form of a mathematical function) which maps a set of input data points to output data points.
2. Supervised learning : Any learning procedure which involves input as well as output data.
3. Model : A map between input and output data points, which we obtain from step (1).
4. Inference : Using a trained model to predict the output for new, unseen inputs.
5. Machine Learning : An algorithmic approach which lets a computer program find a map between the input and output data. Here we do not use any hard-coded rules to find the map between input and output. Put another way, we can say that machine learning assumes all the intelligence is in the data itself.
6. Loss function : A mathematical function to quantify the mismatch between the actual output and the output predicted by a model.
7. Optimisation : The procedure to find the values of the model parameters which minimise the mismatch between the actual and predicted output (the loss function).
8. Accuracy : A measure to quantify how accurate the predictions of a trained model are. Apart from accuracy, there are many other measures, and you can check here for details.
9. Training and testing data : The data used to find the map (the parameters of the model) between input and output by minimising the loss function is called the training data, and the data used to test the trained model is called the testing data. The training and testing data must have the same statistical properties. It is a common practice to split the data we have randomly into training and testing sets. In some cases we also use validation data to tune a set of parameters called hyper-parameters, which are discussed below.
10. Over-fitting : Every data set has noise, so when we are matching input to output during training we must take care not to fit that noise by driving the loss too low; otherwise we will be over-fitting.
11. Hyper-parameters : It is not true that in machine learning we learn everything from the data. There is a set of parameters, called hyper-parameters, which we must tune by carrying out different tests. For example, in the example discussed below, the learning rate (alpha) is a hyper-parameter. In most cases the values of the hyper-parameters are crucial for the performance as well as the accuracy.
12. Regularisation : Learning in machine learning is all about finding the parameters (weights) of the model which map the input to the output most accurately. In many situations we may not want these parameters to take values which are unphysical, so we may want to impose some constraints on them by a scheme called regularisation. There are many regularisation schemes, which you can read about elsewhere or in one of my forthcoming articles.
In general, when we decide to use machine learning, we must make decisions about the following choices:
a) Model/Algorithm
b) Optimisation algorithm
c) Loss function
d) Performance measures
For the rest of the article we use the following choices:
Model — Linear Regression
Optimisation — Gradient descent/ Steepest descent
Loss function — Mean Square Error
Performance measure — R2 or the coefficient of determination
B. Tutorial
In the rest of the article we will fit a set of data points (x, y) having Gaussian noise to a line, parameterised by two parameters: the slope (w) and the intercept (b), which in the language of machine learning are also called the “weight” and the “bias” respectively.
A) Model : As mentioned above, we are considering a linear model for training. Please note that a linear model does not necessarily mean that the mapping between the input and output is linear. Linear means the model output is linear with respect to the model parameters ‘w’ and ‘b’, as is explained below.
Defining a linear model:
y = f(x) = w * x + b
Even if we consider ‘y’ as a polynomial, it will still be a linear fit:
f(x) = u * x³ + v * x² + w * x + b
If you are not sure what a linear function is, it can be defined in the following way:
f(x_1 + x_2) = f(x_1) + f(x_2)
You can verify that both the models given above satisfy this relation with respect to their parameters, e.g., taking x_1, x_2 to be w_1, w_2 or b_1, b_2.
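As a quick numerical check (a sketch of my own, not from the original program), we can verify that the first model is linear in its parameters:

import numpy as np

def model(w, b, x):
    # the linear model y = w * x + b
    return w * x + b

x = np.linspace(0, 10, 5)
w1, w2, b1, b2 = 0.5, 1.3, 2.0, -1.0

# linearity in the parameters: f(w1 + w2, b1 + b2) = f(w1, b1) + f(w2, b2)
lhs = model(w1 + w2, b1 + b2, x)
rhs = model(w1, b1, x) + model(w2, b2, x)
print(np.allclose(lhs, rhs))  # prints True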
B) Optimisation :
As has been mentioned, optimisation is a process by which we find a point, called the global minimum, at which a (loss) function has its lowest value. For example, x = 2 is the global minimum of y = (x-2)**2.
Here we will use an algorithm called steepest descent or gradient descent, which was given by the famous mathematician Cauchy in 1847. This algorithm starts with a guess for the global minimum and improves it iteratively.
Note that here we are trying to find the global minimum of the loss function which will be defined below.
The loss function L(X), in an N-dimensional space (X has N components), is a scalar quantity and its gradient is an N-dimensional vector. In the gradient descent method we use the direction provided by the gradient of the loss function to find the next best point when searching for the global minimum.
The key idea of gradient descent is to move in the direction opposite to the gradient, as shown in Figure (1). We use the gradient not only to find the direction in which to move; we also set a step size that is proportional to the gradient. The values of the parameters are updated according to equations (4) and (5) as shown in Figure (2).
In the figure we have shown two points P and Q with their tangents T1, T2 and projections on the x and y axes. From the figure it is clear that the gradient is large at P and small at Q. Since the gradient at both points is positive, we move in the direction of the negative x-axis. Imagine if we were on the other side of the minimum: the gradients would have been negative and we would have been moving in the direction of the positive x-axis.
The gradient descent algorithm clearly says that we should take small steps when the gradient is small, which means we should move slowly when we are close to the global minimum and faster when we are far away, and that makes sense.
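As a minimal sketch of these ideas (the starting point and learning rate below are illustrative values of my own), here is gradient descent applied to the toy function y = (x-2)**2 mentioned above:

def grad(x):
    # derivative of (x - 2)**2 with respect to x
    return 2.0 * (x - 2.0)

x = 10.0      # initial guess
alpha = 0.1   # learning rate (a hyper-parameter)
for i in range(100):
    # move opposite to the gradient, with step size proportional to it
    x = x - alpha * grad(x)
print(x)  # very close to the global minimum x = 2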
C) Loss function : In the present case we consider the mean square error, as shown in Figure (2), as our loss function. The figure also shows the gradients as well as the update equations.
Note that in the present case the dimensionality of the input vector (x) is one but our parameter space is two-dimensional: w and b. If the input vector x were n-dimensional, we would require an (n+1)-dimensional parameter space: n slope parameters (weights) ‘w’ and one intercept (bias) ‘b’.
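As a small sketch (the variable names are my own), the n-dimensional case can be written with a design matrix X of shape (m, n), m being the number of data points:

import numpy as np

m, n = 100, 3
X = np.random.rand(m, n)  # m samples, each with n input features
w = np.zeros(n)           # n slope parameters (weights)
b = 0.0                   # one intercept (bias)
y_pred = X @ w + b        # predicted output, shape (m,)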
The mismatch between the actual output y_i and the predicted output y(x_i) = w x_i + b is called the error term e_i, as shown by the equation in the figure.
Note that these error terms e_i are in general random (Gaussian) variables with mean 0, so we square them, sum them, and use the mean of that as our loss function, called the Mean Square Error or MSE, as shown by equation (1).
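Since the equations themselves appear in the figures, here is a reconstruction in LaTeX, consistent with the definitions above (the numbering may differ from the original figures; alpha denotes the learning rate):

\begin{align}
e_i &= y_i - (w x_i + b) \\
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} e_i^2 \\
\frac{\partial L}{\partial w} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \, e_i \\
\frac{\partial L}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} e_i \\
w &\leftarrow w - \alpha \frac{\partial L}{\partial w} \, , \qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}
\end{align}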
There are many options for the loss function, but MSE is the most common when the output variable ‘y’ is a continuous variable. One of the best features of MSE is that it is quadratic and differentiable.
D) Performance measures : Coefficient of determination or R2
There are many performance measures, and here we will use one called R2, or the coefficient of determination. Basically, it quantifies the fraction of the variance of the output variable ‘y’ which can be attributed to the independent variable ‘x’. The expression for R2 is shown in Figure (3).
It is also common to define R2 without subtracting the fraction from 1, as we have done above; in that case an R2 close to 0 is considered a good fit, while with the subtraction a good fit corresponds to R2 close to 1. In case of confusion, please refer to another reference.
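As a sketch, the fraction described above can be computed as follows (the function name is my own); subtracting the result from 1 gives the other convention:

import numpy as np

def r2_fraction(y, y_pred):
    # fraction of the variance of y left unexplained by the model;
    # close to 0 means a good fit (use 1 - r2_fraction(...) for the
    # convention in which a good fit is close to 1)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return ss_res / ss_tot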
C. Results
In this section we show some results of an exercise we carried out to fit a line to a noisy data set.
We create the data with the following code:
import numpy as np

def line(w, b, x):
    return w * x + b

def get_data():
    n = 100  # number of data points
    weight = 1.81
    bias = 34.0
    # for noise
    mu = 0.0
    sigma = 2.0
    # we add Gaussian noise to the data
    x = np.linspace(0, 10, n)
    y = line(weight, bias, x) + np.random.normal(mu, sigma, n)
    return weight, bias, mu, sigma, x, y
We also add Gaussian noise to the data as the above program shows. You can see the full program here.
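Since only the data-generation code is shown here, the following is a minimal sketch of what the gradient-descent fitting loop could look like, using the update equations above; the learning rate and iteration count are illustrative, not necessarily the values used to produce the figures.

def fit_line(x, y, alpha=0.01, n_iter=5000):
    # fit y = w * x + b by gradient descent on the MSE loss
    w, b = 0.0, 0.0
    n = len(x)
    for i in range(n_iter):
        e = y - (w * x + b)                # error terms e_i
        grad_w = -2.0 * np.sum(x * e) / n  # dL/dw
        grad_b = -2.0 * np.sum(e) / n      # dL/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

weight, bias, mu, sigma, x, y = get_data()
w_fit, b_fit = fit_line(x, y)
print(w_fit, b_fit)  # should come out close to 1.81 and 34.0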
In Figure (4) we show the fit of the model to the data. The parameters used to create the data, as well as the fitted parameters, are shown in the labels.
Figure (5) shows how the values of the fitting parameters change with the iterations. Note that since the problem we consider here is really simple, the values of the parameters change smoothly; however, that may not be the case in general.
Now we show the performance metrics, the loss function and the R2 values, in Figure (6) below.
From the above figure we can see that the loss function drops smoothly as we iterate; finally it reaches a minimum value, after which it drops very slowly, so we can stop there.
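As a sketch of such a stopping criterion (the tolerance below is an illustrative value of my own), the loop in fit_line above can be terminated once the loss stops dropping appreciably:

def fit_line_early_stop(x, y, alpha=0.01, max_iter=100000, tol=1e-10):
    # same as fit_line, but stop when the MSE loss plateaus
    w, b = 0.0, 0.0
    n = len(x)
    prev_loss = np.inf
    for i in range(max_iter):
        e = y - (w * x + b)
        loss = np.mean(e ** 2)      # MSE loss
        if prev_loss - loss < tol:  # loss is dropping very slowly: stop
            break
        prev_loss = loss
        w -= alpha * (-2.0 / n) * np.sum(x * e)
        b -= alpha * (-2.0 / n) * np.sum(e)
    return w, b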
Summary —
In this article I have explained linear regression, which is one of the most basic machine learning algorithms but can be used to understand many important concepts of machine learning. I have tried to explain many ideas in detail, but there is still a lot of room to expand further. In particular, I would like to explain inference and gradient descent in detail and show examples for the cases when the input vector ‘x’ is multi-dimensional. You can find the code for this article here.
If you find this article useful, please like (clap) and share, and post your comments in case you have any. Keep checking my page for new articles.