### A Guide to quickly immerse yourself in Deep Learning

This post introduces the basics of neural networks through a case study: using only 10 lines of Python code, we create and train a neural network that recognises handwritten digits, in 3 basic steps:

**1- Load and Preprocess the Data**

**2- Define the Model**

**3- Train the Model**

To do this, we will use the TensorFlow Keras API, one of the most popular libraries in the Deep Learning community. Let’s go for it!

### Handwritten digits

As a case study, we will create a model that identifies handwritten digits such as the following ones:

The goal is to create a mathematical model that, given an image, identifies the number it represents. For example, if we feed the model the first image, we would expect it to answer that it is a 5; for the next one a 0, for the next one a 4, and so on.

#### Classification problem

Actually, we are dealing with a classification problem: given an image, the model classifies it as one of the digits from 0 to 9. But sometimes even we can have doubts: for example, does the first image represent a 5 or a 3?

For this purpose, the neural network that we will create returns a vector with 10 positions indicating the likelihood of each of the ten possible digits:

#### Only 10 lines of code

Yes, in just 10 lines of Python code you can create and train a neural network model that classifies handwritten digits:

    import tensorflow as tf                                                                   # line 1
    from tensorflow.keras.utils import to_categorical                                         # line 2
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()                               # line 3
    x_train = x_train.reshape(60000, 784).astype('float32')/255                               # line 4
    y_train = to_categorical(y_train, num_classes=10)                                         # line 5
    model = tf.keras.Sequential()                                                             # line 6
    model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))            # line 7
    model.add(tf.keras.layers.Dense(10, activation='softmax'))                                # line 8
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])   # line 9
    model.fit(x_train, y_train, epochs=10, verbose=0)                                         # line 10

We used the Keras API of TensorFlow. It is a recommended library for beginners, since its learning curve is very smooth compared to others, and at the moment it is one of the most popular middleware options for implementing neural networks. Keras was developed and is maintained by François Chollet, an engineer at Google, and it is currently included in the TensorFlow library.

#### Environment set up

I suggest using *Colaboratory* (Colab), offered by Google, if you want to execute the code described in this post.

It is a Google research project created to help disseminate Machine Learning education and research. It is a Jupyter notebook environment that requires no configuration and runs completely in the cloud, allowing the use of different Deep Learning libraries such as TensorFlow and PyTorch. The most important feature that distinguishes Colab from other free cloud services is that it provides a GPU (or TPU) at no cost. Detailed information about the service can be found on the FAQ page.

By default, Colab notebooks run on CPU. You can switch your notebook to run with a GPU (or TPU): open the Runtime tab and select “Change runtime type”, as shown in the following figure:

When the pop-up window appears, set “Hardware accelerator” to GPU (the default is CPU).

Afterwards, ensure that you are connected to the runtime (there is a green check next to “connected” in the menu ribbon):

Now you are able to run the code presented in this post. I suggest copying and pasting the code into a Colab notebook so that you can see it execute while you read.

Ready? Let’s do it!

### 1. Load and Preprocess the Data

First of all, we need to import the Python libraries required to program our neural network in TensorFlow:

    import tensorflow as tf
    from tensorflow.keras.utils import to_categorical

The next step is to load the data that will be used to train our neural network. We will use the MNIST dataset, which can be downloaded from *The MNIST database* page. This dataset contains 60,000 images of handwritten digits to train the model. It is ideal for a first contact with pattern recognition techniques, because it spares us much of the preprocessing and formatting of data, both very important and expensive steps in data analysis and especially complex when working with images.

With TensorFlow this can be done using this line of code (line 3):

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

**Optional step:** If you want, you can verify the data loaded using this code:

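    import numpy as np
    import matplotlib.pyplot as plt

    # Show the first 20 training images with their labels in a 2 x 10 grid
    fig = plt.figure(figsize=(25, 4))
    for idx in np.arange(20):
        # add_subplot expects integer grid sizes, hence 20 // 2 rather than 20 / 2
        ax = fig.add_subplot(2, 20 // 2, idx + 1, xticks=[], yticks=[])
        ax.imshow(x_train[idx], cmap=plt.cm.binary)
        ax.set_title(str(y_train[idx]))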

This dataset of black-and-white images (they actually contain gray levels) has been normalized to 20×20 pixels while preserving the aspect ratio. The images were then centered by computing their center of mass and translating the image so that this point lies at the center of the 28×28 field.

These MNIST images of 28×28 pixels are represented as arrays of numbers of type `uint8` whose values lie in the range [0, 255]. But it is usual to scale the input values of neural networks to certain ranges. In the example of this post, the input values are scaled to values of type `float32` within the interval [0, 1].

On the other hand, to feed the data into our neural network we must transform the input (image) from 2 dimensions (2D) to a vector of 1 dimension (1D). That is, the matrix of 28×28 numbers can be represented by a vector (array) of 784 numbers (concatenating row by row), which is the input format accepted by a densely connected neural network like the one we will see in this post.

We can achieve these transformations with the following line of code (line 4):

    x_train = x_train.reshape(60000, 784).astype('float32')/255
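To make the two transformations concrete, here is a minimal sketch in plain Python on a toy 2×3 "image" (a real MNIST image is 28×28). The names `image` and `flatten_and_scale` are hypothetical and only used for illustration; in the post itself NumPy's `reshape` and the division by 255 do the same job on the whole dataset at once.

```python
# Toy 2x3 "image" with uint8-style gray levels in [0, 255].
image = [
    [0, 128, 255],
    [64, 32, 16],
]

def flatten_and_scale(img, max_value=255):
    """Concatenate the rows into one vector and scale each value to [0, 1]."""
    return [pixel / max_value for row in img for pixel in row]

vector = flatten_and_scale(image)
print(len(vector))            # 6 values: 2 rows x 3 columns
print(vector[0], vector[2])   # 0.0 1.0
```

The rows are concatenated one after another, which is the same row-by-row order described above for the 784-value vector.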

Furthermore, the dataset has a label for each image that indicates which digit it represents (loaded into `y_train`). In our case the labels are numbers between 0 and 9 that indicate which digit the image represents, that is, to which class it belongs.

We need to represent each label with a vector of 10 positions as we presented before, where the position corresponding to the digit that the image represents contains a 1 and the rest contain 0s. This process of transforming a label into a vector with as many positions as there are different labels, filled with zeros except for a 1 at the index corresponding to the label, is known as *one-hot encoding*. For example, the number 7 will be encoded as `[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]`.

We can achieve this transformation with the following line of code (line 5):

    y_train = to_categorical(y_train, num_classes=10)
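As an illustration of what one-hot encoding does to a single label, here is a minimal sketch in plain Python. The function name `one_hot` is hypothetical; `to_categorical` applies the same idea to the whole array of labels and returns a NumPy array rather than a Python list.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector

print(one_hot(7))  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```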

### 2. Define the Model

To define the model with the Keras API we only need these lines of code (lines 6–8):

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

However, before explaining these lines of code, let me introduce some basic neural network concepts.

#### A plain artificial neuron

To show what a basic neuron looks like, let’s take a simple example where we have a set of points in a two-dimensional plane and each point is already labeled “square” or “circle”:

Given a new point “*X*”, we want to know which label corresponds to it:

A common approach is to draw a line that separates the two groups and use this line as a classifier:

In this case, the input data will be represented by vectors of the form (*x1, x2*) that indicate their coordinates in this two-dimensional space, and our function will return ‘0’ or ‘1’ (above or below the line) to indicate whether the point should be classified as “square” or “circle”. It can be defined by:

More generally, we can express the line as:

To classify input elements *X*, which in our case are two-dimensional, we must learn a weight vector *W* of the same dimension as the input vectors, that is, the vector (*w1, w2*), and a bias *b*.

With these calculated values, we can now construct an artificial neuron to classify a new element *X*. Basically, the neuron applies this vector *W* of calculated weights to the values in each dimension of the input element *X*, and at the end adds the bias *b*. The result is passed through a non-linear “activation” function to produce a result of ‘0’ or ‘1’. The function of this artificial neuron can be expressed more formally as:

$$z = b + \sum_{i} w_i x_i$$
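As a minimal sketch of the neuron just described: a weighted sum of the inputs plus the bias, followed by an activation. The weights, the bias and the hard-threshold `step` activation are assumptions chosen only for illustration (the post goes on to use the sigmoid); in practice the weights and bias are learned from data.

```python
def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus the bias, passed through `activation`."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

def step(z):
    """Threshold activation: 1 if z is non-negative, otherwise 0."""
    return 1 if z >= 0 else 0

# Hypothetical weights and bias defining the separating line x1 - x2 = 0.
w = [1.0, -1.0]
b = 0.0

print(neuron([3.0, 1.0], w, b, step))  # 1  (z = 3 - 1 = 2)
print(neuron([1.0, 3.0], w, b, step))  # 0  (z = 1 - 3 = -2)
```

Points on one side of the line give a positive weighted sum and are assigned to one class; points on the other side give a negative sum and are assigned to the other.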

Now we need a function that transforms the variable *z* so that it becomes ‘0’ or ‘1’. Although there are several such functions (“activation functions”), for this example we will use one known as the *sigmoid* function, which returns a real output value between 0 and 1 for any input value:

$$y = \frac{1}{1 + e^{-z}}$$

If we analyze the previous formula, we can see that it tends to give values close to 0 or 1 when *z* is far from zero. If the input *z* is reasonably large and positive, *e* raised to minus *z* is close to zero and, therefore, *y* takes a value close to 1. If *z* is large and negative, *e* is raised to a large positive number, so the denominator of the formula becomes a large number and the value of *y* is close to 0. Graphically, the sigmoid function has this form:

So far we have presented how to define an artificial neuron, the simplest architecture that a neural network can have. In the literature this architecture is known as the Perceptron (also called *linear threshold unit*, LTU), invented in 1957 by Frank Rosenblatt, and it is visually summarized in a general way with the following scheme:
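The same behaviour can also be checked numerically. The following is a minimal sketch of the sigmoid in plain Python, written from its standard definition 1 / (1 + e^(-z)); the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 sits exactly in the middle.
for z in (-10, -1, 0, 1, 10):
    print(z, sigmoid(z))

print(sigmoid(0))  # 0.5
```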

#### Multi-Layer Perceptron

But before moving forward with the example, we will briefly introduce the form that neural networks usually take when they are constructed from perceptrons like the one we have just presented.

In the literature we speak of a Multi-Layer Perceptron (MLP) when a neural network has an *input layer*, one or more layers composed of perceptrons, called *hidden layers*, and a final layer with several perceptrons called the *output layer*. In general we refer to *Deep Learning* when the model based on neural networks is composed of multiple hidden layers. Visually it can be presented with the following scheme:

MLPs are often used for classification, and specifically when classes are exclusive, as in the case of the classification of digit images (in classes from 0 to 9). In this case, the output layer returns the probability of belonging to each one of the classes, thanks to a function called softmax. Visually we could represent it in the following way:

As we mentioned, there are several activation functions in addition to the *sigmoid*, each with different properties. One of them is the one we just mentioned, the *softmax* activation function, which will be useful to present an example of a simple neural network that classifies into more than two classes. For the moment we can consider the *softmax* function as a generalization of the *sigmoid* function that allows us to classify into more than two classes.

#### Softmax activation function

We will solve the problem in such a way that, given an input image, we obtain the probabilities that it is each of the 10 possible digits. In this way, we will have a model that, for example, could predict a five in an image, but be only 70% sure that it is a five. Due to the stroke of the upper part of the number in this image, it could be a three with a 20% chance, and the model could even give a certain probability to any other number. Although in this particular case we will consider that the prediction of our model is a five, since it is the one with the highest probability, this approach of using a probability distribution gives us a better idea of how confident we are in our prediction. This is useful in this case, where the digits are written by hand and in many of them we surely cannot recognize the digit with 100% certainty.

Therefore, for this example of classification we will obtain, for each input example, an output vector with the probability distribution over a set of mutually exclusive labels. That is, a vector of 10 probabilities, each corresponding to a digit, whose sum is 1 (the probabilities are expressed between 0 and 1).

As we have already mentioned, this is achieved by using an output layer with the *softmax* activation function, in which the output of each neuron depends on the outputs of all the other neurons in the layer, since the sum of all their outputs must be 1.

But how does the *softmax* activation function work? The *softmax* function is based on calculating “the evidence” that a certain image belongs to a particular class and then converting this evidence into probabilities that it belongs to each of the possible classes.

One approach to measuring the evidence that a certain image belongs to a particular class is to compute a weighted sum of the evidence that each of its pixels belongs to that class. To explain the idea I will use a visual example.

Let’s suppose that we already have the model learned for the number zero. For the moment, we can consider a model as “something” that contains information to know if a number is of a certain class. In this case, for the number zero, suppose we have a model like the one presented below:

In this case we have a matrix of 28×28 pixels, where the pixels in red represent negative weights (i.e., they reduce the evidence that the image belongs to the class), while the pixels in blue represent positive weights (they increase the evidence). White represents the neutral value.

Let’s imagine that we trace a zero over it. In general, the trace of our zero would fall on the blue zone (remember that we are talking about images that have been normalized to 20×20 pixels and later centered on a 28×28 image). It is quite evident that if our stroke goes over the red zone, it is most likely that we are not writing a zero; therefore, using a metric based on adding if we pass through the blue zone and subtracting if we pass through the red zone seems reasonable.

To confirm that it is a good metric, let’s imagine now that we draw a three; it is clear that the red zone in the center of the previous model that we used for the zero will penalize the aforementioned metric since, as we can see in the left part of the following figure, when writing a three we pass over red zones:

But on the other hand, if the reference model is the one corresponding to number 3 as shown in the right part of the previous figure, we can see that, in general, the different possible traces that represent a three are mostly maintained in the blue zone.

I hope that the reader, seeing this visual example, already has an intuition of how the weighting described above allows us to estimate which number it is.
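The weighted-sum idea can be sketched on a toy 3×3 grid instead of 28×28 pixels. The weight template and the two "strokes" below are made-up values for illustration only: positive weights play the role of the blue zone and the negative weight plays the role of the red zone.

```python
# Hypothetical 3x3 weight template for a ring-like shape:
# positive weights (blue) on the border, a negative weight (red) in the center.
weights = [
    [1, 1, 1],
    [1, -4, 1],
    [1, 1, 1],
]

def evidence(image, weights):
    """Weighted sum of pixel intensities: sum of pixel * weight over all pixels."""
    return sum(
        pixel * weight
        for image_row, weight_row in zip(image, weights)
        for pixel, weight in zip(image_row, weight_row)
    )

ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]    # stroke stays on the border
cross = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]   # stroke passes through the center

print(evidence(ring, weights))   # 8
print(evidence(cross, weights))  # 0
```

The stroke that stays in the positive zone accumulates evidence, while the stroke that crosses the negative zone is penalized.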

Once the evidence of belonging to each of the 10 classes has been calculated, it must be converted into probabilities whose components sum to 1. For this, softmax takes the exponential of each calculated evidence value and then normalizes them so that the sum equals one, forming a probability distribution. Writing $x_i$ for the evidence calculated for class *i*, the probability of belonging to class *i* is:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

Intuitively, the effect obtained with the use of exponentials is that one more unit of evidence has a multiplier effect and one unit less has the inverse effect. The interesting thing about this function is that a good prediction will have a single entry in the vector with a value close to 1, while the remaining entries will be close to 0. In a weak prediction, there will be several possible labels, which will have more or less the same probability.
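A minimal sketch of softmax in plain Python illustrates the difference between a confident and a weak prediction. The evidence values are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not mentioned in the text and does not change the result.

```python
import math

def softmax(evidence_values):
    """Exponentiate each value and normalize so the results sum to 1."""
    # Subtracting the maximum avoids overflow without changing the output.
    m = max(evidence_values)
    exps = [math.exp(e - m) for e in evidence_values]
    total = sum(exps)
    return [e / total for e in exps]

confident = softmax([8.0, 1.0, 0.5])   # one class has much more evidence
weak = softmax([1.0, 1.1, 0.9])        # similar evidence for every class

print([round(p, 3) for p in confident])
print([round(p, 3) for p in weak])
```

In the first case almost all the probability mass goes to one class; in the second it is spread almost evenly.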

#### Sequential class in Keras

The main data structure in Keras is the *Sequential* class, which allows the creation of a basic neural network. Keras also offers an API that allows implementing more complex models in the form of a graph that can have multiple inputs, multiple outputs, with arbitrary connections in between, but it is beyond the scope of this post.

The *Sequential* class of the Keras library is a wrapper for the sequential neural network model that Keras offers and can be created in the following way:

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

In this case, the model in Keras is considered a sequence of layers, each of which gradually “distills” the input data to obtain the desired output. In Keras we can find all the required types of layers, which can be easily added to the model through the `add()` method.

Here, the neural network has been defined as a sequence of two layers that are densely connected (or fully connected), meaning that all the neurons in each layer are connected to all the neurons in the next layer. Visually we could represent it in the following way:

In the previous code, the *input_shape* argument of the first layer states explicitly what the input data looks like: a tensor indicating that the model has 784 features.

A very interesting characteristic of the Keras library is that it will automatically deduce the shape of the tensors between layers after the first one. This means that the programmer only has to establish this information for the first of them. Also, for each layer we indicate the number of nodes that it has and the activation function that we will apply in it (in this example, *sigmoid*).

The second layer in this example is a *softmax* layer of 10 neurons, which means that it will return a vector of 10 probability values representing the 10 possible digits (in general, the output layer of a classification network will have as many neurons as classes, except in a binary classification, where only one neuron is needed). Each value will be the probability that the image of the current digit belongs to the corresponding class.

**Optional step:** A very useful method that Keras provides to check the architecture of our model is `summary()`:

    model.summary()

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_1 (Dense)              (None, 10)                7850
    _________________________________________________________________
    dense_2 (Dense)              (None, 10)                110
    =================================================================
    Total params: 7,960
    Trainable params: 7,960
    Non-trainable params: 0

For our simple example, we see that 7,960 parameters are required (column *Param #*), which correspond to 7,850 parameters for the first layer and 110 for the second.

In the first layer, for each neuron *i* (between 0 and 9) we require 784 parameters for the weights *wij*, and therefore 10×784 parameters to store the weights of the 10 neurons, plus 10 additional parameters for the 10 biases *bj*, one per neuron. In the second layer, each of its 10 neurons is connected to the 10 neurons of the previous layer; therefore 10×10 weights *wi* are required, plus 10 biases *bj*, one per node.
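The arithmetic behind these numbers can be checked with a few lines of plain Python. The helper name `dense_params` is hypothetical; it simply counts one weight per input-neuron pair plus one bias per neuron, as described above.

```python
def dense_params(n_inputs, n_neurons):
    """One weight per (input, neuron) pair plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

first = dense_params(784, 10)   # 784*10 weights + 10 biases
second = dense_params(10, 10)   # 10*10 weights + 10 biases

print(first, second, first + second)  # 7850 110 7960
```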

The details of the arguments that we can pass to the `Dense` layer can be found in the Keras manual. The most relevant ones appear in our example: the first argument indicates the number of neurons in the layer, and the following one is the activation function that we will use in it. Other activation functions, beyond the two presented here (*sigmoid* and *softmax*), are discussed in more detail in a separate post.

### 3. Train the Model

We’re almost done; we just have to explain the last two lines of code:

    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])

    model.fit(x_train, y_train, epochs=10, verbose=0)

#### Learning process

The way the neural network learns the weights *W* and the biases *b* of the neurons is an iterative process over all the known labeled input examples, comparing the label estimated by the model with the expected label of each element. After each iteration, the parameter values are adjusted so that the discordance (error) between the estimated value for the image and the actual value becomes smaller. The following scheme visually summarizes the learning process of one perceptron in a general way:

#### Configuration of the learning process

We can configure this learning process with the `compile()` method, which lets us specify some properties through its arguments.

The first of these arguments is the *loss function* that we will use to evaluate the degree of error between the calculated outputs and the desired outputs of the training data. On the other hand, we specify an *optimizer*, which is how we indicate the optimization algorithm that allows the neural network to calculate the values of the parameters from the input data and the defined loss function.

And finally we must indicate the metric that we will use to monitor the learning process of our neural network. In this first example we will only consider the *accuracy* (fraction of images that are correctly classified). For example, in our case we can specify the following arguments in the *compile()* method:

    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])

In this example we specify that the loss function is *categorical_crossentropy*, the optimizer is *stochastic gradient descent (sgd)* and the metric is *accuracy*, with which we will evaluate the percentage of correct guesses.
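To give an idea of what the loss measures, here is a minimal sketch of categorical cross-entropy for a single example, written from its standard definition (minus the sum of the true label times the logarithm of the predicted probability). The three-class vectors are made up, and the `eps` clipping is an assumption added to avoid taking the logarithm of zero; the Keras implementation handles further details, such as averaging over a batch, that are not shown here.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: -sum(true_i * log(pred_i))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

label = [0, 0, 1]             # one-hot label: the true class is 2
good = [0.05, 0.05, 0.90]     # most probability on the right class
bad = [0.70, 0.20, 0.10]      # most probability on a wrong class

print(round(categorical_crossentropy(label, good), 4))  # 0.1054
print(round(categorical_crossentropy(label, bad), 4))   # 2.3026
```

The loss is small when the model puts most of the probability on the correct class and grows as that probability shrinks.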

A separate post goes into the learning process in more detail.

#### Model training

Once our model has been defined and the learning method configured, it is ready to be trained. We can train or “fit” the model to the available training data by invoking the *fit()* method of the model:

    model.fit(x_train, y_train, epochs=10, verbose=0)

The first two arguments are the data with which we will train the model, in the form of NumPy arrays. With *epochs* we indicate the number of times all the data is used in the learning process. The optional *batch_size* argument, which is not set in our call (Keras then uses its default of 32), indicates the number of examples used for each update of the model parameters.

This method finds the values of the network parameters through the iterative training algorithm that we mentioned. Roughly, in each iteration the algorithm takes training data from *x_train*, passes them through the neural network (with the values that its parameters have at that moment), compares the obtained result with the expected one (indicated in *y_train*) and calculates the *loss* to guide the adjustment of the model parameters. Intuitively, this adjustment consists of applying the optimizer specified in the *compile()* method to calculate a new value for each of the model parameters (weights and biases) in each iteration, in such a way that the loss is reduced.
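As a rough illustration of how these two arguments interact, the sketch below counts parameter updates. It assumes the Keras default batch size of 32, which applies when `batch_size` is not passed to `fit()`; the helper name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to go through the data once."""
    return math.ceil(n_samples / batch_size)

n_samples = 60000   # MNIST training images
batch_size = 32     # Keras default when batch_size is not passed to fit()
epochs = 10

per_epoch = updates_per_epoch(n_samples, batch_size)
print(per_epoch)           # 1875
print(per_epoch * epochs)  # 18750
```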
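The iterative adjustment can be sketched on a much smaller problem than MNIST. The following is a simplified illustration, not what Keras does internally: a single sigmoid neuron with one input, trained with stochastic gradient descent and a binary cross-entropy loss on four made-up data points. The data, the learning rate and the number of epochs are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (hypothetical): one feature, label 1 when the feature is positive.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0        # initial parameter values
learning_rate = 0.5

def mean_loss(w, b):
    """Average binary cross-entropy over the toy data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

loss_before = mean_loss(w, b)
for epoch in range(20):                 # 20 passes over the data
    for x, y in zip(xs, ys):            # one update per example
        p = sigmoid(w * x + b)          # forward pass with current parameters
        error = p - y                   # gradient of the loss with respect to z
        w -= learning_rate * error * x  # move each parameter against its gradient
        b -= learning_rate * error
loss_after = mean_loss(w, b)

print(loss_before > loss_after)  # True
```

Each update nudges the weight and the bias in the direction that reduces the loss for the current example, which is the same idea that `fit()` applies to all the parameters of the network.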

This is the method that, as we will see, may take the longest. Keras allows us to see its progress through the *verbose* argument (by default, equal to 1), which also gives an estimate of how long each *epoch* takes. Our call passes `verbose=0`, which suppresses this output; a run with 5 epochs and the default verbosity printed the following:

    Epoch 1/5
    60000/60000 [========] - 1s 15us/step - loss: 2.1822 - acc: 0.2916
    Epoch 2/5
    60000/60000 [========] - 1s 12us/step - loss: 1.9180 - acc: 0.5283
    Epoch 3/5
    60000/60000 [========] - 1s 13us/step - loss: 1.6978 - acc: 0.5937
    Epoch 4/5
    60000/60000 [========] - 1s 14us/step - loss: 1.5102 - acc: 0.6537
    Epoch 5/5
    60000/60000 [========] - 1s 13us/step - loss: 1.3526 - acc: 0.7034
    10000/10000 [========] - 0s 22us/step

### Using the model

To use the model we can load another set of images (different from the training images) and preprocess it in the same way, with the following code:

    _, (x_test_, y_test_) = tf.keras.datasets.mnist.load_data()
    x_test = x_test_.reshape(10000, 784).astype('float32')/255
    y_test = to_categorical(y_test_, num_classes=10)

#### Optional step: Model evaluation

At this point, the neural network has been trained and its behavior with new test data can now be evaluated using the `evaluate()` method. This method returns two values:

    test_loss, test_acc = model.evaluate(x_test, y_test)

These values indicate how well or badly our model behaves with new data that it has never seen. These data were stored in *x_test* and *y_test* when we called *mnist.load_data()*, and we pass them to the method as arguments. In the scope of this post we will only look at one of them, the accuracy:

    print('Test accuracy:', test_acc)

    Test accuracy: 0.9018

The accuracy tells us that the model we have created in this post, applied to data it has never seen before, classifies 90% of them correctly.
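To make the meaning of accuracy concrete, here is a minimal sketch in plain Python that computes it from a handful of made-up probability vectors. The names `argmax` and `accuracy` and the data are hypothetical; `evaluate()` computes the metric for us on the real test set.

```python
def argmax(values):
    """Index of the largest value (the predicted class)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, labels):
    """Fraction of examples whose most probable class equals the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if argmax(p) == y)
    return correct / len(labels)

# Hypothetical softmax outputs for four images over three classes.
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
labels = [0, 1, 2, 1]   # the last image is misclassified

print(accuracy(predictions, labels))  # 0.75
```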

#### Generate predictions

Finally, readers need to know how to use the model trained in the previous section to make predictions. In our example, this consists of predicting which digit an image represents. To do this, Keras supplies the `predict()` method.

Let’s choose one image (and plot it) in order to predict the number:

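    image = 5

    _ = plt.imshow(x_test_[image], cmap=plt.cm.binary)

and in order to predict the number we can use the following code:

    import numpy as np

    # Predict on the preprocessed test data (flattened to 784 values and scaled to [0, 1]),
    # not on the raw 28x28 images in x_test_, which do not match the model's input shape
    prediction = model.predict(x_test)
    print("Model prediction: ", np.argmax(prediction[image]))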

And that’s all!