First step towards deep learning
Before reading this article you may want to read another article, titled ‘First step towards machine learning’, that I wrote some time back, although the two articles can be read independently.
Deep learning has become a synonym for machine learning, or even artificial intelligence, despite being just one family of machine learning algorithms. Artificial intelligence is much more than deep learning, or even machine learning for that matter, and involves areas such as natural language understanding, knowledge representation, robotics etc. One of the reasons deep learning has become so popular in mainstream culture may have something to do with the word ‘deep’: people associate it with something deeper in terms of meaning or philosophy. As I will explain here, that is not the case, and the ‘deep’ in deep learning is as informative and profound as the ‘deep’ in the phrase “deep sea”.
In this article I will introduce deep learning from the point of view of transformation functions, which take a variable as an input and transform it into some other variable as an output. The title of this article could easily have been ‘Build your first neural network from scratch using Python’; however, I avoided that since there are already many such articles on Medium.
This is a long article and you will need some patience to get to the end. I believe that there are certain ideas about deep learning which everyone must be familiar with. At present it is possible to build very complex networks using high level libraries such as Keras, TensorFlow, Torch, etc., without actually understanding the basic concepts involved; however, it helps in the long run if we understand the fundamental concepts as well.
Here the plan is to discuss a set of fundamental ideas which form the backbone of modern deep learning, while giving an actual implementation in Python using only the mathematical library NumPy, along with some examples. The code I give here could be written without NumPy as well, but that would make it much longer.
This article has the following parts :
- Background & Motivation
- Building a neural network from scratch
- Summary & Conclusions
— — — — — — —
(1) Background & Motivation
In order to motivate the case for deep learning let us look at the following matrix multiplication.
Y = A * X
In the above equation ‘X’ and ‘Y’ are vectors of dimensionality ‘N’ and ‘M’ respectively, and so ‘A’ has dimensionality ‘M x N’. We can notice the following important properties of this matrix transformation.
(i). It can transform a vector of dimension N to another vector of dimension M.
(ii). The transformation or ‘map’ may or may not be invertible, in the sense that we may or may not be able to recover ‘X’ from a given ‘A’ and ‘Y’. In fact ‘A’ can be invertible only when it is square (M = N), and even then only if it is non-singular. In any case, this matrix operation is useful and is used in linear regression.
Now let us generalise this matrix multiplication in the following way:
Y = A * B * X
Here we can consider ‘A’ with dimension ‘M x L’ and ‘B’ with dimension ‘L x N’, while the dimensions of ‘X’ and ‘Y’ remain the same. We can make the relation between ‘X’ and ‘Y’ as ‘deep’ as we wish by introducing multiple intermediate matrices such as A, B, C, etc. The meaning of ‘deep’ in ‘deep learning’ should now be clear from this example.
Note that in machine learning (at least in supervised machine learning) the aim is to find a ‘map’ or relation between the input vector ‘X’ and output vector ‘Y’, which can be as trivial as a linear transformation. For most practical cases this relation is non-linear. In case you want to know what a non-linear relationship is, let me first define a linear relationship.
For two variables ‘X_1’ and ‘X_2’ of the same dimensionality, if f (X_1 + X_2) = f (X_1) + f (X_2), and f (c X_1) = c f (X_1) for any constant ‘c’, then this relationship, or the function ‘f’, is called a linear relationship. You can check that the above matrix relation is a linear transformation.
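The check suggested above can be done numerically. The following is a minimal sketch, assuming random matrices and illustrative dimensions of my own choosing (N, L, M and the random seed are not from the article). It also shows that the product of the two matrices collapses into a single matrix, which is one way to see why stacking matrices alone is not enough.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, M = 4, 3, 2          # input, intermediate and output dimensions (illustrative)

A = rng.normal(size=(M, L))
B = rng.normal(size=(L, N))

def f(X):
    """The 'deep' linear map Y = A * B * X."""
    return A @ (B @ X)

X1 = rng.normal(size=N)
X2 = rng.normal(size=N)

# Additivity: f(X1 + X2) equals f(X1) + f(X2) up to rounding error.
additive = np.allclose(f(X1 + X2), f(X1) + f(X2))

# Homogeneity: f(c * X1) equals c * f(X1).
homogeneous = np.allclose(f(3.0 * X1), 3.0 * f(X1))

# The two matrices collapse into one M x N matrix C, so the stacked
# map is still a single linear transformation.
C = A @ B
collapses = np.allclose(f(X1), C @ X1)

print(f(X1).shape, additive, homogeneous, collapses)
```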
The relationship between the input and output in the real world is hardly ever linear. In other words, the change in the output is not proportional to the change in the input.
If we have a set of uniform grid points and we apply a non-linear transformation on it then in the new space the grid points will no longer remain uniformly distributed and may also get clustered. In fact this is one of the common tasks in machine learning.
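To make the grid picture concrete, here is a small one-dimensional sketch. The choice of np.tanh as the non-linear transformation and the factor 2 for the linear one are assumptions made only for illustration.

```python
import numpy as np

grid = np.linspace(-3.0, 3.0, 7)       # uniformly spaced grid points

linear = 2.0 * grid                    # a linear transformation
nonlinear = np.tanh(grid)              # a non-linear transformation (illustrative choice)

# Spacing between neighbouring points after each transformation
linear_gaps = np.diff(linear)
nonlinear_gaps = np.diff(nonlinear)

print(linear_gaps)      # all gaps are equal: the grid stays uniform
print(nonlinear_gaps)   # gaps shrink away from zero: points cluster near -1 and +1
```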
When we go from linear transformation, as is represented by the matrix multiplication shown above, to deep learning we need to add the following two elements.
(a) Non-linear activation function : A linear mapping, which can be represented by matrix multiplication, cannot give us the non-linear mapping between the input and output that we observe in the real world, so we need to incorporate a non-linear function, called the activation function. One of the most commonly used activation functions in machine learning is the sigmoid or logistic function, which is defined in the following way:
sigma (x) = 1 / (1 + exp (-x))
The sigmoid function and its derivative are shown in the following figure.
From the plot of sigma (x) we can see that if two points are separated by a large distance in x-space then they need not be separated by a large distance in y-space. In fact this function maps all the points between minus infinity and plus infinity into the range (0, 1). This is a remarkable property since this function can be used to convert the values of real variables into probabilities.
Another interesting property of the sigmoid function concerns its derivative, which is also shown in the figure. The derivative is appreciably different from zero only in a small range around x=0. This means that when we are training our network not all the data points contribute equally to the learning process (updating the weights, which depends on the gradient of the activation function). The gradient of the sigmoid can be written in terms of the sigmoid itself, sigma' (x) = sigma (x) * (1 - sigma (x)), and this is also a very important feature which makes the computation quite easy.
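The properties described above can be checked with a few lines of NumPy. This is a minimal sketch: the helper name dsigma and the sample points are my own choices, and the finite-difference comparison is only a sanity check.

```python
import numpy as np

def sigma(x):
    """Sigmoid (logistic) function."""
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigma(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Central finite difference as an independent check of the derivative
h = 1e-5
numeric = (sigma(x + h) - sigma(x - h)) / (2.0 * h)

print(sigma(x))    # values stay strictly between 0 and 1
print(dsigma(x))   # largest at x = 0, tiny for large |x|
print(np.allclose(dsigma(x), numeric))
```

Points far from x = 0 have a derivative close to zero, which is why they contribute very little to the weight updates.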
There are many other activation functions apart from the sigmoid function, and they have their own unique features which make them suitable for other problems.
(b) Loss functions
Machine learning (supervised) is all about mapping/matching the input data to the output data. In order to quantify the mismatch between the actual output and the predicted output we use a function called the loss function. There are many choices for the loss function but generally they fall into one of the following two categories.
(i) Regression losses : When the output, actual as well as predicted, is a continuous variable, like the price of something, we use these losses, which take two real vectors as inputs and give one real number as an output. Two examples are as follows:
(*) Mean squared error: We take the differences of the two real input vectors component-wise, square them, and take the mean. This is the most common loss function, but there are situations in which it may not be the best choice, for example if the errors are not distributed in a Gaussian way. One of the remarkable features of this loss function is that its gradients with respect to the weights can be written in a way that depends only on the input vector and the difference between the actual and predicted output. This loss function and its gradient (written here for a linear output y_pred = w . x + b) are as follows:
L = (1/n) * sum_i (y_i - y_pred_i)^2
dL/dw = -(2/n) * sum_i (y_i - y_pred_i) * x_i
(*) Cosine distance: This is one minus the cosine of the angle between the two input vectors, where the cosine is the dot product of the unit vectors representing them.
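The two regression losses can be sketched as follows. The function names, the random test data, and the linear model y_pred = X @ w + b used for the gradient are assumptions made for illustration, and the cosine distance is taken here as one minus the cosine of the angle, so that it is zero for parallel vectors.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two real vectors."""
    return np.mean((y_true - y_pred) ** 2)

def mse_grad_linear(X, y_true, w, b):
    """Gradient of the MSE w.r.t. w and b for a linear output X @ w + b."""
    residual = y_true - (X @ w + b)          # actual minus predicted
    grad_w = -2.0 * X.T @ residual / len(y_true)
    grad_b = -2.0 * np.mean(residual)
    return grad_w, grad_b

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = mse_grad_linear(X, y, w, b)

# Central finite-difference check on the first weight
h = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += h
w_minus[0] -= h
numeric = (mse(y, X @ w_plus + b) - mse(y, X @ w_minus + b)) / (2.0 * h)

print(grad_w[0], numeric)
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```

Note that the gradient only needs the inputs and the residual (actual minus predicted output), which is the property mentioned above.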
(ii) Classification losses:
There are many examples in which the output is not a real vector, such as when the task is to find the class label for a given input. For example, we can decide whether a person is normal or overweight on the basis of his or her body mass index, which depends on the height and weight of the person, both of which are continuous variables. In this case the output (normal or overweight) is a discrete or class variable.
We may label classes with numbers like 0, 1, 2 etc., but these numbers do not have much significance apart from representing the class. Such discrete variables are called ‘nominal’ variables; however, there are other discrete variables, called ‘ordinal’ variables, which have some ordering. For example, we can divide a class of students into three categories, short, medium and tall, on the basis of their height. Here these classes are ordinal variables.
In the present article we will restrict ourselves to binary classes only, where the output can belong to one of two classes, which can be represented by ‘0’ and ‘1’.
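When the true output is a class, the predicted output is in the form of probabilities for the different classes, which add up to one. Binary cross entropy, as shown in the figure, is the most common loss function to quantify the mismatch between the true output and predicted output. This loss function also has the same property which mean squared error has, that the gradients with respect to the weights can be written in a very simple way, as shown in the figure below. For reference, with p_i the predicted probability of class ‘1’ for data point i:
L = -(1/n) * sum_i [ y_i * log (p_i) + (1 - y_i) * log (1 - p_i) ]
When p_i comes from a sigmoid, the gradient of this loss with respect to the input of the sigmoid is simply (p_i - y_i) / n.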
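A minimal sketch of binary cross entropy and its gradient is given below, assuming a single sigmoid output p = sigma(X @ w + b). The function names, the clipping constant eps and the toy data are my own choices for illustration.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross entropy; p is the predicted probability of class 1."""
    p = np.clip(p, eps, 1.0 - eps)           # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def bce_grad(X, y_true, w, b):
    """Gradient of the loss w.r.t. w and b when p = sigma(X @ w + b)."""
    p = sigma(X @ w + b)
    error = p - y_true                       # predicted minus actual
    return X.T @ error / len(y_true), np.mean(error)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
w = np.array([0.3, -0.2])
b = 0.1

grad_w, grad_b = bce_grad(X, y, w, b)

# Central finite-difference check on the first weight
h = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += h
w_minus[0] -= h
numeric = (binary_cross_entropy(y, sigma(X @ w_plus + b))
           - binary_cross_entropy(y, sigma(X @ w_minus + b))) / (2.0 * h)

print(grad_w[0], numeric)
```

As with the mean squared error, the gradient depends only on the inputs and the difference between the predicted and actual output.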
(2) Building a neural network from scratch
As I mentioned, the title of this article could have been ‘Build your neural network in Python from scratch’, but I avoided that since there are already similar articles available on Medium. Here, my aim is to motivate the case for neural networks, which will be further expanded in future articles.
Like many natural and artificial systems, neural networks, which form the backbone of deep learning, have two main aspects: the ‘form’ and the ‘function’. Here the form means the architecture and the function means the actual mathematical functions involved.
The fundamental building blocks of neural networks are neurons, which work like parallel processing units. Every neuron takes a set of inputs, multiplies them by a set of weights, which are associated with the connections the neuron makes, sums the results, and passes the sum to a non-linear activation function, as shown in the figure below. Every neuron also adds a ‘bias’ to the weighted sum before applying the activation function. This bias term does not depend on the input and determines the output when there is no input.
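The computation done by a single neuron can be sketched in a few lines. The function name neuron and the numbers used for the inputs, weights and bias are made up for illustration.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    return sigma(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])     # inputs (illustrative values)
w = np.array([0.1, 0.4, -0.2])     # one weight per input connection
b = 0.3                            # bias

out = neuron(x, w, b)

# With no input (all zeros) the output is determined by the bias alone.
no_input = neuron(np.zeros(3), w, b)

print(out, no_input)
```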
Before discussing the code and example sections let me summarise some of the important features of neural networks.
i) A neural network is formed by a set of neurons stacked together in the form of layers. A layer is just a higher level abstraction and may or may not be present in the actual neural networks in the human brain.
ii) In an artificial neural network, as used in machine learning, the connections between the neurons are permanent and every connection is represented by a weight. This means that the total number of weights in a neural network depends on the number of layers used and the number of neurons used in each layer.
iii) It is common to consider a fully connected network in which all the neurons of two adjacent layers are connected in all possible ways; however, we may consider other arrangements as well. In fact, once in a while during training we can set some of the weights to zero or to some random values, which has been found useful in addressing some of the issues related to over-fitting.
iv) The type of neurons considered here do not have any memory, in the sense that the current output of a neuron depends only on its current input. There are, however, types of neurons used in Recurrent Neural Networks which do have memory, and I will cover these in one of the future articles.
v) The universal approximation theorem says that a neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function, on a bounded region of inputs, to any desired accuracy.
vi) In many situations it is advised to have a large number of stacked layers with few neurons each, rather than a small number of layers with many neurons.
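Point (ii) above can be made concrete by counting the weights of a fully connected network from its layer sizes. The helper count_parameters and the two example architectures are hypothetical and chosen only to compare a wide, shallow network with a narrow, deep one as in point (vi).

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, input first, output last.
    """
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])      # one bias per non-input neuron
    return weights, biases

wide = count_parameters([10, 64, 1])         # one wide hidden layer
deep = count_parameters([10, 8, 8, 8, 1])    # three narrow hidden layers

print(wide, deep)
```

For these particular sizes the deeper network has fewer weights than the wider one, although that depends entirely on the layer sizes chosen.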
Here I will show some sections of the code I have written from scratch for building neural networks for regression and classification problems. Links to the full code will be given at the end of the article.
The activation and loss functions and their gradients are shown below.
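import numpy as np

# The activation/sigmoid function
def sigma(x):
    return 1 / (1.0 + np.exp(-x))

# Gradients of sigmoid(x.w + b) with respect to w and b.
# Note the leading minus sign: the NEGATIVE of the derivative is returned.
def grad_sigma(w, b, x):
    s = sigma(x.dot(w) + b)
    grad_w = -s * (1.0 - s) * x
    grad_b = -s * (1.0 - s)
    return grad_w, grad_b

# Mean squared loss function
def loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2) / y_true.shape[0]

# Gradients of loss function for one data point.
# The two minus signs cancel: the result is (y - y_pred) * s * (1 - s) * x,
# i.e. the descent direction, so the weights are later updated with a plus sign.
def grad_loss(w, b, x, y, y_pred):
    grad_w, grad_b = grad_sigma(w, b, x)
    grad_w = -(y - y_pred) * grad_w
    grad_b = -(y - y_pred) * grad_b
    return grad_w, grad_b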
A dense layer of neurons can be written in terms of the following class:
class Dense:
    def __init__(self, dim_in, dim_out):
        # Weights and biases start as uniform random numbers in [0, 1)
        self.W = np.random.random([dim_in, dim_out])
        self.b = np.random.random([dim_out])

    def out(self, x):
        return sigma(x.dot(self.W) + self.b)
Here I am showing a network with just one input & one output layer, and this is just for demonstration purposes. There is another example, the link for which is given at the end of the article, that has a hidden layer as well. The scheme to build the neural network will remain the same. The scheme in which I use gradients explicitly to update the weights does not scale well beyond 2–3 hidden layers, and in that case it is advisable to use standard libraries like Keras.
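As a rough illustration of how a hidden layer fits into the same scheme, the sketch below chains two dense layers and runs only the forward pass. It is not the linked example: the layer sizes are arbitrary, and passing a random generator rng into the layer is my own addition so that the block is reproducible.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

class Dense:
    """Same idea as the Dense class above, with an explicit random generator."""
    def __init__(self, dim_in, dim_out, rng):
        self.W = rng.random((dim_in, dim_out))
        self.b = rng.random(dim_out)

    def out(self, x):
        return sigma(x.dot(self.W) + self.b)

rng = np.random.default_rng(0)
hidden = Dense(4, 3, rng)    # 4 inputs -> 3 hidden neurons
output = Dense(3, 1, rng)    # 3 hidden neurons -> 1 output

x = rng.random(4)            # one input vector
h = hidden.out(x)            # output of the hidden layer
y_pred = output.out(h)       # the hidden output is the input of the next layer

print(h.shape, y_pred.shape)
```

Training such a network needs the gradients to be passed back through both layers, which is the part that becomes tedious to write by hand as more layers are added.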
Example :
Data : The data for the example is created in the following way:
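def input_func(X):
    return np.sqrt(np.mean(X))

def get_data(n):
    # n = number of data points
    # dim_input = dimensions of the feature vector
    # (dim_input is not defined in this snippet; it is assumed to be set in the full code)
    X = np.random.random([n, dim_input])
    # return y as an array so that y.shape can be used later
    y = np.array([input_func(X[i, :]) for i in range(0, X.shape[0])])
    return X, y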
Once the data is created it is split into two parts: training data & testing data. Inside the training module a fraction of the training data is used for validation, and the loss is calculated for both the training and validation data. Validation data is not used for updating the parameters. The parameters/weights are updated for every data point, which is not how it is done in real cases, where weights are updated in batches of data points. The plot of the loss for the training and validation data is below.
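The split helper itself is not shown in this article. The following is a hypothetical stand-in, assuming the call train_test_split(val_split, X, y) used in the training module returns the training inputs and outputs followed by the validation inputs and outputs. The shuffling and the seed argument are my own assumptions, and this function is unrelated to the scikit-learn function of the same name.

```python
import numpy as np

def train_test_split(val_split, X, y, seed=0):
    """Hypothetical helper: shuffle, then hold out a fraction val_split."""
    X = np.asarray(X)
    y = np.asarray(y)
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    n_val = int(round(val_split * X.shape[0]))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: row i of X is [2i, 2i + 1] and y[i] = i
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

x_train, y_train, x_val, y_val = train_test_split(0.2, X, y)
print(x_train.shape, x_val.shape)
```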
The main module for the Network and training is as follows:
import matplotlib.pyplot as plt

class Network:
    def __init__(self, dim_in, dim_out):
        self.L = Dense(dim_in, dim_out)

    def fit(self, X, y, niter, learning_rate, val_split):
        # train_test_split is presumably the helper from the full code,
        # not the scikit-learn function of the same name
        x_train, y_train, x_val, y_val = \
            train_test_split(val_split, X, y)
        tr_loss = []
        vl_loss = []
        for i in range(0, niter):
            y_train_pred = np.zeros(y_train.shape[0])
            y_val_pred = np.zeros(y_val.shape[0])
            # This loop is for training data
            for j in range(0, x_train.shape[0]):
                # predict y (the layer returns a one-element array when dim_out = 1)
                y_train_pred[j] = self.L.out(x_train[j, :])[0]
                # get the gradients
                grad_w, grad_b = grad_loss(self.L.W, self.L.b,
                                           x_train[j, :], y_train[j], y_train_pred[j])
                grad_w = grad_w.reshape([grad_w.shape[0], 1])
                # update the weights; grad_loss returns the descent
                # direction, hence the plus sign
                self.L.W = self.L.W + learning_rate * grad_w
                self.L.b = self.L.b + learning_rate * grad_b
            # get the loss for training data
            train_loss = loss(y_train, y_train_pred)
            tr_loss.append(train_loss)
            # This loop is for validation data
            for j in range(0, x_val.shape[0]):
                y_val_pred[j] = self.L.out(x_val[j, :])[0]
            val_loss = loss(y_val, y_val_pred)
            vl_loss.append(val_loss)
        plt.plot(tr_loss, label="Train data")
        plt.plot(vl_loss, label="Validation data")
        plt.ylabel("Loss")
        plt.xlabel("Iterations")
        plt.legend()
        plt.show()
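As mentioned earlier, in real cases the weights are updated in batches of data points rather than for every point. The sketch below shows one way this could look for a single sigmoid output with the mean squared loss. The function minibatch_epoch, the batch size, the learning rate and the zero initialisation are all assumptions of mine and are not part of the full code.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_epoch(W, b, X, y, learning_rate, batch_size, rng):
    """One pass over the data with one weight update per mini-batch."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        pred = sigma(xb @ W + b)
        # Gradient of the batch-averaged squared error w.r.t. the sigmoid input
        delta = -2.0 * (yb - pred) * pred * (1.0 - pred) / len(idx)
        W = W - learning_rate * (xb.T @ delta)
        b = b - learning_rate * delta.sum()
    return W, b

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = np.sqrt(X.mean(axis=1))     # same target as input_func in the article
W = np.zeros(5)                 # zero start, a simplification for this sketch
b = 0.0

loss_before = np.mean((y - sigma(X @ W + b)) ** 2)
for _ in range(200):
    W, b = minibatch_epoch(W, b, X, y, learning_rate=0.5,
                           batch_size=20, rng=rng)
loss_after = np.mean((y - sigma(X @ W + b)) ** 2)

print(loss_before, loss_after)
```

Averaging the gradient over a batch makes each update less noisy than the per-point updates and lets NumPy do the work on whole arrays at once.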
The main program is given in the full code which can be found here.
Once the training is done we can use the trained model for making predictions on the test data. The following figure shows the actual and predicted test data points.
(3) Summary and Conclusions
This article introduces some of the fundamental ideas of deep learning and gives some concrete examples. These ideas will be further elaborated in future articles about deep learning. The example shown here has just one input & one output layer. However, an example with a hidden layer is given here.
This is the first article on deep learning and will be updated regularly. In case you find it useful please like (clap), share & comment. You may also want to read a companion article here.