Gradient descent is an iterative algorithm used to find a global or local minimum of a given function.
The gradient descent method is widely used in machine learning and deep learning. Its purpose is to provide an optimization procedure that makes the error or loss function as small as possible.
Types of gradient descent
There are different types of gradient descent that can be used in Big Data, and they differ considerably from one another. These are:
Batch Gradient Descent
It calculates the error for every example in the training data set and updates the model only after all of them have been evaluated. Its greatest advantage is the stability of the error gradient it produces and of its convergence.
Stochastic gradient descent
It updates the parameters for each training example, one by one. One of its greatest advantages lies in its frequent updates, which in turn also make it a computationally expensive option; it is a double-edged sword.
Mini-batch gradient descent
It splits the training data set into small batches and performs an update for each of them, combining the best of stochastic gradient descent and batch gradient descent. A minimal sketch of the three update schemes follows below.
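To make the difference concrete, here is a minimal sketch in R (not taken from the article's own code) that applies the three variants to a toy one-parameter least-squares problem, y ≈ w·x. The toy data, the grad_w helper and the values of lr, epochs and batch_size are all illustrative choices.

# Illustrative sketch: batch, stochastic and mini-batch updates on a toy problem
set.seed(1)
x <- runif(100); y <- 2 * x + rnorm(100, sd = 0.1)           # toy data, true w = 2
grad_w <- function(w, xb, yb) mean(2 * (w * xb - yb) * xb)    # gradient of the mean squared error

lr <- 0.5; epochs <- 20

# Batch: one update per epoch, using all examples at once
w <- 0
for (e in 1:epochs) w <- w - lr * grad_w(w, x, y)

# Stochastic: one update per training example
w_sgd <- 0
for (e in 1:epochs) for (i in sample(length(x))) w_sgd <- w_sgd - lr * grad_w(w_sgd, x[i], y[i])

# Mini-batch: one update per batch of 10 examples
w_mb <- 0; batch_size <- 10
for (e in 1:epochs) {
  idx <- sample(length(x))
  for (b in split(idx, ceiling(seq_along(idx) / batch_size)))
    w_mb <- w_mb - lr * grad_w(w_mb, x[b], y[b])
}
c(batch = w, stochastic = w_sgd, minibatch = w_mb)            # all three approach the true value 2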
Vector space
Most of the functions to be optimized do not take a scalar as input, but rather a vector. In this case, the vector defines a space in which we have to find the point that produces the minimum value.
Up to this point we have seen the derivative of a function. Now, let’s introduce the concept of partial derivative.
In a function ƒ : ℝⁿ → ℝ, the partial derivative ∂ƒ(x)/∂xᵢ measures how ƒ(x) changes as a function of the variable xᵢ, the i-th component of the vector x.
For example, the partial derivative with respect to x₁ is:
∂ƒ/∂x₁ = lim (h → 0) [ƒ(x₁ + h, x₂) − ƒ(x₁, x₂)] / h
The partial derivative with respect to x₂ is:
∂ƒ/∂x₂ = lim (h → 0) [ƒ(x₁, x₂ + h) − ƒ(x₁, x₂)] / h
The gradient is the derivative with respect to the vector x, a multivariate generalization of the derivative. It is denoted by the symbol ∇ƒ. The result is a vector, unlike the ordinary derivative, whose result is a scalar. For a function of two variables it can be written as:
∇ƒ = (∂ƒ/∂x₁) u + (∂ƒ/∂x₂) v
Here u and v are the unit vectors of the coordinates given by x₁ and x₂, respectively.
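As a concrete illustration, and anticipating the function x₁² + 3x₂² used in the example later in this article, base R's D() can compute these partial derivatives symbolically. The snippet below is only a sketch; the name f_expr is made up for it.

# Symbolic partial derivatives of f(x1, x2) = x1^2 + 3*x2^2 using base R's D()
f_expr <- expression(x1^2 + 3 * x2^2)
D(f_expr[[1]], "x1")                                # 2 * x1
D(f_expr[[1]], "x2")                                # 3 * (2 * x2), i.e. 6 * x2

# Evaluating the gradient at the point (4, 3) gives (8, 18)
eval(D(f_expr[[1]], "x1"), list(x1 = 4, x2 = 3))    # 8
eval(D(f_expr[[1]], "x2"), list(x1 = 4, x2 = 3))    # 18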
The gradient indicates the direction of maximum growth of the function and its negative value indicates the direction of maximum decrease.
Thus, the gradient descent method is based on finding the direction of maximum decrease of the function and updating the value of x accordingly:
x ← x − ε ∇ƒ(x)
ε is a constant, a hyperparameter that must be defined before iterating; it is known as the learning rate. A small value can mean that many iterations are needed to converge, while a value that is too large can make us overshoot the minimum of the function.
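A minimal sketch of this effect, using the one-variable function f(x) = x² (whose gradient is 2x) and learning rates chosen purely for the illustration:

# Effect of the learning rate on f(x) = x^2, starting from x = 5
step <- function(x, lr) x - lr * 2 * x      # one gradient descent update
x_small <- 5; x_large <- 5
for (i in 1:20) { x_small <- step(x_small, 0.01); x_large <- step(x_large, 1.1) }
x_small   # still far from 0: with such a small rate many more iterations are needed
x_large   # grows in magnitude: the large rate overshoots the minimum and diverges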
The gradient descent method converges when all elements of the gradient are zero (or very close to zero).
The gradient descent method is considered a first-order algorithm, because it uses information only from the first derivative. It never uses information from the second derivative.
As an example
We will create a function that calculates the gradient:
In [77]:
myfuncVector <- function(x) { x[1]^2 + 3*x[2]^2 }   # function to minimize: f(x1, x2) = x1^2 + 3*x2^2

mygradient <- function(x, h, func) {                 # numerical gradient by finite differences
  c((func(c(x[1] + h, x[2])) - func(x)) / h,
    (func(c(x[1], x[2] + h)) - func(x)) / h)
}

mygradient(c(4, 3), 0.001, myfuncVector)
The results of this formula would be 8.00100000000725 and 18.0030000000002, which approximate the analytical gradient (2x₁, 6x₂) = (8, 18) at the point (4, 3).
In [ ]:
library(plotrix)                      # used for drawing on the contour plot
x_0 <- c(-9, -9)                      # starting point
delta <- 0.15                         # step size (learning rate)
h <- 0.001                            # increment for the numerical gradient
convergence_criterion <- 0.05
x_historical <- data.frame(x = x_0[1], y = x_0[2])
for (i in 1:100) {
  g <- mygradient(x_0, h, myfuncVector)
  if (abs(sum(g)) < convergence_criterion) break     # stop when the gradient is (almost) zero
  x_0 <- x_0 - delta * g                             # move against the gradient
  x_historical <- rbind(x_historical, data.frame(x = x_0[1], y = x_0[2]))
}
The minimum of the function is at 0. The values 8 and 18 are the gradient at the point (4, 3); this is the direction in which the arrow of maximum growth points:
mygradient(c(4,3),0.001,myfuncVector)
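The contour plot itself is not reproduced here. Below is a rough sketch (not the article's own plotting code) of how a similar figure could be drawn with base R's contour(), points() and arrows(); the grid range and the arrow scaling factor are arbitrary choices for the illustration.

# Sketch of a contour plot of f(x1, x2) = x1^2 + 3*x2^2 with the point (4, 3) and its gradient
x1 <- seq(-10, 10, length.out = 100)
x2 <- seq(-10, 10, length.out = 100)
z  <- outer(x1, x2, function(a, b) a^2 + 3 * b^2)
contour(x1, x2, z, col = "black")                             # contour lines of the function
points(4, 3, col = "red", pch = 19)                           # point where the gradient is evaluated
g <- mygradient(c(4, 3), 0.001, myfuncVector)
arrows(4, 3, 4 + 0.2 * g[1], 3 + 0.2 * g[2], col = "blue")    # scaled gradient direction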
The graph that is plotted is that of x[1]^2 + 3*x[2]^2.
The curves shown in black are the contour lines. The graph tells us that, at the point (4, 3) (marked in red), the gradient points towards where the function grows (marked in blue).
Thus, the gradient grows by 8 units along x (horizontal) and 18 units along y (vertical). When we subtract this gradient (scaled by the learning rate) from the current value of x, we obtain a point whose function value is closer to 0 than before.
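As a quick numerical check of this step, here is a sketch that reuses the learning rate delta = 0.15 from the loop above:

# One illustrative update step from the point (4, 3), with delta = 0.15
x_old <- c(4, 3)
g <- c(8, 18)                               # gradient of x1^2 + 3*x2^2 at (4, 3)
x_new <- x_old - 0.15 * g                   # gives c(2.8, 0.3)
myfuncVector(x_new) < myfuncVector(x_old)   # TRUE: the function value has decreased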
What's Next?
In this article we have seen how to implement the gradient descent method in predictive analysis, but there is much more to see and learn. Keep checking our posts to continue learning. And if what you are looking for is intensive training that opens the doors to the job market in the IT sector, don't miss our Big Data, Artificial Intelligence & Machine Learning Full Stack Bootcamp. You will quickly and easily learn everything you need to become a great professional. Ask for information and change your life!