Deep Learning is a vast field, and covering every topic in a single blog post is impossible. The goal of this blog is to introduce the key concepts you need to be familiar with when starting out in Deep Learning. One thing to keep in mind: it will not go in-depth into these topics. Consider it a checklist of key Deep Learning concepts as you begin your journey. I also encourage you to delve deeper into the bold keywords throughout the blog.
This blog assumes you are familiar with general Machine Learning concepts, so topics such as different types of learning (supervised, unsupervised), ground truth, features, targets, classification, and regression will not be discussed. This blog will only focus on Deep Learning concepts.
Mathematics
Linear Algebra
- Vectors and matrices: Understanding operations such as addition, multiplication, and transpose.
- Matrix multiplication: Including dot product, Hadamard product, and matrix-vector multiplication.
- Determinants and inverses.
- Eigenvalues and eigenvectors.
- Singular Value Decomposition (SVD) and its applications.
Calculus
- Differentiation and integration: Knowing how to compute derivatives and integrals.
- Chain rule and partial derivatives: Crucial for understanding backpropagation, a fundamental algorithm in deep learning.
Probability and Statistics
- Probability distributions: Understanding concepts like Gaussian (normal), Bernoulli, binomial, and multinomial distributions.
- Expected value, variance, and standard deviation.
- Bayes' theorem and conditional probability.
- Maximum Likelihood Estimation (MLE).
- Statistical inference: Hypothesis testing, confidence intervals, and p-values.
Neural Network Structure
Neuron
A neuron in deep learning is the fundamental building block of a neural network. You can consider it a function that takes input and, after performing some operations, outputs a result. By combining multiple neurons, we build neural networks.
Perceptron
A perceptron is a single neuron that takes one or more inputs and, after performing some operations, produces an output.
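To make this concrete, here is a minimal NumPy sketch of a perceptron; the inputs, weights, bias, and step activation are illustrative values chosen for the example, not fixed parts of the definition.

```python
import numpy as np

def perceptron(x, w, b):
    """A single neuron: weighted sum of the inputs plus a bias,
    passed through a step activation."""
    z = np.dot(w, x) + b          # weighted sum of inputs
    return 1 if z > 0 else 0      # step activation: fire or stay silent

# Illustrative values: two inputs, two weights, one bias
x = np.array([2.0, 4.0])
w = np.array([0.9, 0.1])
b = -1.0
print(perceptron(x, w, b))        # -> 1, because 0.9*2 + 0.1*4 - 1 = 1.2 > 0
```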
Input
The data you provide to your neuron, perceptron, or neural network, usually denoted by x. The first layer of a neural network is called the input layer, which contains only the inputs of the network.
Weights
Weights are associated with the inputs of a neuron and determine how impactful each input is. For example, if the first input x1 is 2 and the second input x2 is 4, and their respective weights are w1 = 0.9 and w2 = 0.1, then w1·x1 = 1.8 while w2·x2 = 0.4.
This means the first input is more impactful than x2, even though its raw value is smaller.
Layers
A layer consists of multiple neurons:
- Input Layer: The input layer is the first layer of a neural network, containing the network's initial data.
- Hidden Layer: The hidden layer is an intermediate layer between the input and output layers.
- Output Layer: The output layer produces the final result of the neural network.
Bias
You can think of biases as adjustable knobs that you can tweak according to your needs. A bias b is added to the weighted sum of the inputs and weights (w·x + b), allowing a neuron to activate even if its inputs are zero. This helps the network learn better and make accurate predictions.
Activation Function
An activation function is a mathematical function that determines whether a neuron should be activated or not. It takes the weighted sum computed by the neuron and applies a transformation that decides how much of the signal is sent further. Some of the most commonly used activation functions are listed below (with a small code sketch after the list):
- Sigmoid: It allows some information to pass through (values between 0 and 1) but not everything.
- Tanh: Similar to sigmoid, but the output ranges from -1 to 1.
- ReLU: It allows all positive information to pass through and blocks negative information.
- Leaky ReLU: Similar to ReLU, but allows a small amount of negative information to pass through.
- Softmax: Converts any set of numbers into probabilities that always sum to 1.
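Here is a minimal NumPy sketch of the activation functions listed above; the 0.01 slope for Leaky ReLU is a common default, used here as an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                       # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)               # passes positives, blocks negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # lets a small negative signal through

def softmax(x):
    e = np.exp(x - np.max(x))               # subtract max for numerical stability
    return e / e.sum()                      # probabilities that sum to 1

print(softmax(np.array([1.0, 2.0, 3.0])))   # roughly [0.09, 0.24, 0.67]
```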
Inner Workings of Neural Networks
Forward Pass
The forward pass is the process of sending data from the input layer to the output layer. Here are the steps involved in a forward pass (see the code sketch after this list):
- Start with the input layer: This layer receives the initial data you want the network to process.
- Move to the next layer: Each neuron in this layer takes the values from the previous layer, multiplies them by their respective weights, and sums them up.
- Apply an activation function: This function adds non-linearity to the network, allowing it to learn complex patterns. The output of this function becomes the input for the next layer.
- Repeat steps 2 and 3: This process continues through all the hidden layers of the network.
- Reach the output layer: The final layer uses the information from the last hidden layer to produce the final prediction or output.
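A minimal sketch of these steps for a tiny fully connected network, assuming random weights, a ReLU activation in the hidden layer, and a sigmoid at the output (illustrative choices, not requirements).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative architecture: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    z1 = W1 @ x + b1        # step 2: weighted sums in the hidden layer
    a1 = relu(z1)           # step 3: non-linearity; its output feeds the next layer
    z2 = W2 @ a1 + b2       # step 5: the output layer produces the final prediction
    return sigmoid(z2)

x = np.array([0.5, -1.2, 3.0])   # step 1: initial data for the input layer
print(forward(x))
```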
Backpropagation
Backpropagation is a crucial algorithm for understanding how neural networks learn. It calculates the gradient of the loss function with respect to the network's weights, enabling the adjustment of weights to minimize the error between the predicted output and the actual target output. Here's how backpropagation works:
1. Loss Calculation
After the forward pass, the neural network's output is compared to the actual target values (ground truth) from the training data. This comparison is done using a loss function, which quantifies the difference between the predicted output and the actual target values. Here are some commonly used loss functions (sketched in code after the list):
For Regression Problems:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
For Classification Problems:
- Binary Cross-Entropy (BCE)
- Categorical Cross-Entropy (CCE)
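Minimal NumPy versions of these loss functions; the clipping inside the cross-entropy losses is a small numerical safeguard added here as an assumption.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    p = np.clip(y_pred, eps, 1 - eps)                  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded, y_pred holds per-class probabilities
    p = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))  # -> 0.25
```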
2. Backpropagation of Errors
Once the loss has been computed, the next step is to propagate this error backward through the network using the backpropagation algorithm. The goal of backpropagation is to calculate the gradient of the loss function with respect to each weight and bias in the network. This is achieved by applying the chain rule of calculus recursively, starting from the output layer and moving backward through the network.
3. Gradient Calculation
As backpropagation progresses backward through the network, the gradients of the loss function with respect to the weights and biases of each layer are computed. These gradients represent the direction and magnitude of the change needed to minimize the error.
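As a worked example, here is the chain rule applied by hand to a single linear neuron with a squared-error loss; the data point and parameter values are made up for illustration.

```python
# One linear neuron: prediction = w * x + b, loss = (prediction - y)^2
x, y = 3.0, 10.0            # one training example (illustrative)
w, b = 1.0, 0.0             # current parameters

y_pred = w * x + b          # forward pass: 3.0
loss = (y_pred - y) ** 2    # loss: 49.0

# Chain rule: dL/dw = dL/dy_pred * dy_pred/dw, dL/db = dL/dy_pred * dy_pred/db
dL_dy_pred = 2 * (y_pred - y)    # -14.0
dL_dw = dL_dy_pred * x           # -42.0 -> increase w to reduce the loss
dL_db = dL_dy_pred * 1.0         # -14.0 -> increase b to reduce the loss

print(loss, dL_dw, dL_db)
```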
4. Weight Update
Once the gradients have been calculated, an optimization algorithm is applied to update the network's weights. Common optimization algorithms include stochastic gradient descent (SGD) and its variants like mini-batch gradient descent, Adam, RMSProp, Adagrad, etc. Each optimizer has its own update rule and hyperparameters that influence the training process. The optimizer uses the computed gradients to adjust the weights in a way that reduces the loss function. The learning rate, which determines the size of the steps taken during optimization, is an important hyperparameter that affects the convergence and performance of the training process.
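A minimal sketch of the vanilla SGD update rule, continuing the single-neuron example above; the learning rate of 0.01 is an illustrative choice.

```python
def sgd_step(params, grads, learning_rate=0.01):
    """Vanilla SGD: move each parameter a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Parameters w=1.0, b=0.0 with gradients -42.0 and -14.0 from the example above
w, b = sgd_step([1.0, 0.0], [-42.0, -14.0], learning_rate=0.01)
print(w, b)   # -> 1.42, 0.14
```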
5. Iteration
After the weights have been updated by the optimizer, steps 1-4 are repeated for multiple iterations or epochs. Each iteration consists of a forward pass, loss calculation, backpropagation of errors, and weight update. The number of iterations depends on factors such as the complexity of the problem, the size of the dataset, and the convergence criteria defined by the user.
6. Evaluation
Periodically during training, the model's performance is evaluated on a separate validation dataset to monitor its progress and prevent overfitting. This evaluation helps determine whether the model is generalizing well to unseen data or if adjustments to the training process are necessary.
Tweaking Neural Networks
Regularization Techniques
Regularization techniques in deep learning are methods used to prevent overfitting. Here are some common regularization techniques (two of them are sketched in code after this list):
- L1 and L2 Regularization:
- L1 regularization (also known as Lasso regularization) shrinks coefficients towards zero, potentially setting some to zero entirely, leading to sparse models and feature selection.
- L2 regularization (also known as Ridge regularization) penalizes the square of the magnitude of the coefficients, driving them towards zero but not exactly zero. It helps prevent large weights and promotes simpler models.
- Dropout:
- Dropout is a technique where randomly selected neurons are ignored during training with a certain probability (typically 0.5).
- Batch Normalization:
- Batch normalization normalizes the activations of each layer to have zero mean and unit variance. It helps stabilize the training process, speeds up convergence, and reduces the sensitivity to the choice of initialization parameters.
- Early Stopping:
- Early stopping involves monitoring the performance of the model on a validation set during training. Training is terminated when the performance stops improving or starts to deteriorate, preventing the model from overfitting the training data.
- Data Augmentation:
- Data augmentation involves generating new training samples by applying transformations such as rotation, translation, scaling, and flipping to the existing data. This increases the diversity of the training set and helps the model generalize better to unseen data.
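A minimal NumPy sketch of two of these ideas, an L2 weight penalty and (inverted) dropout; the penalty strength and drop probability are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))

# L2 regularization: add a penalty on the squared weights to the data loss
data_loss = 0.8                       # pretend loss from the forward pass
l2_lambda = 1e-3
total_loss = data_loss + l2_lambda * np.sum(weights ** 2)

# Dropout: during training, randomly zero activations with probability p
def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) > p    # keep each unit with probability 1 - p
    return activations * mask / (1 - p)         # rescale so the expected value is unchanged

a = np.array([0.2, 1.5, 0.0, 0.7])
print(total_loss, dropout(a))
```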
Hyperparameter Tuning
Hyperparameter tuning is the process of finding the optimal combination of settings that control the learning process of your model, ultimately leading to better performance. Examples of hyperparameters in deep learning include the following (a simple grid search is sketched after the list):
- Learning Rate: Controls the step size in optimization, impacting convergence speed and overshooting risk.
- Batch Size: Specifies the number of training examples processed before updating, affecting convergence and memory usage.
- Number of Layers and Neurons: Defines complexity, influencing pattern capture and over/underfitting risk.
- Activation Functions: Transform neuron outputs, crucial for capturing data complexity.
- Regularization Strength: Controls overfitting prevention, enhancing generalization.
- Optimizer and Parameters: Determines the parameter update method, impacting convergence and performance.
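A minimal sketch of a grid search over two of these hyperparameters; train_and_evaluate is a hypothetical stand-in for your own training and validation routine.

```python
import itertools

def train_and_evaluate(learning_rate, batch_size):
    """Hypothetical placeholder: train a model with these settings and
    return its validation score."""
    return 0.9 - abs(learning_rate - 0.01) - batch_size / 10_000  # fake score

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]

best_score, best_config = -float("inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    score = train_and_evaluate(lr, bs)
    if score > best_score:
        best_score, best_config = score, (lr, bs)

print(best_config, best_score)
```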
Coding Optimization Techniques
Vectorization
Vectorization in deep learning refers to the technique of performing operations on entire arrays of data simultaneously instead of iterating through individual elements. This significantly improves computational efficiency and simplifies code compared to traditional loop-based approaches. It is crucial in deep learning due to the massive datasets and complex calculations involved.
- Example: Consider adding two vectors a and b. Instead of looping over each element and adding them individually, vectorization allows us to perform the addition directly on the entire arrays, resulting in c = a + b.
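A quick NumPy comparison of the loop-based and vectorized versions of this addition:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Loop-based: add one element at a time
c_loop = np.empty_like(a)
for i in range(len(a)):
    c_loop[i] = a[i] + b[i]

# Vectorized: one operation over the whole arrays
c_vec = a + b

print(c_loop, c_vec)   # both [5. 7. 9.]
```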
Broadcasting
Broadcasting is a mechanism that allows arrays of different shapes to be combined in arithmetic operations. It automatically aligns arrays with compatible shapes and dimensions, eliminating the need for explicit copying or reshaping of arrays.
- Example: Consider multiplying a 2D array A of shape (m, n) by a scalar value s. Broadcasting allows the scalar to be automatically expanded to match the shape of A, resulting in element-wise multiplication between A and s without the need for explicit duplication or reshaping of the scalar.
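A minimal NumPy sketch of this scalar broadcasting (the shapes and values are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # shape (2, 3)
s = 10.0                           # scalar

# The scalar is broadcast across every element of A
print(A * s)                       # shape (2, 3)

# Broadcasting also aligns a row vector of shape (3,) with each row of A
row = np.array([1.0, 0.0, -1.0])
print(A + row)                     # shape (2, 3)
```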
While this blog provides an overview and a starting point for deep learning, continuous learning, experimentation, and real-world application are key to mastering deep learning and making impactful advancements in the field.