
The training process of this neural network must determine these 11,935 parameters (all of the network's weights and biases).

The goal of training can be roughly summarized as follows: for each training sample, the output corresponding to its correct digit should be as close to 1 as possible, while all other outputs should be as close to 0 as possible.
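As a minimal sketch of this target encoding (assuming NumPy; the helper name one_hot is illustrative and not from the original article), each digit label maps to a 10-element vector with a single 1:

```python
import numpy as np

def one_hot(digit):
    """Return the desired 10-element output for a digit label (0-9):
    1.0 at the digit's position and 0.0 everywhere else."""
    target = np.zeros(10)
    target[digit] = 1.0
    return target

# For a training image of the digit 7, the desired output is:
# [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
print(one_hot(7))
```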

According to the experimental results given by Michael Nielsen, a recognition accuracy of about 95% can easily be achieved with this network structure, without any tuning. And the core code is only 74 lines long!

After adopting deep learning ideas and convolutional networks, a recognition accuracy of 99.67% was finally achieved. The best result achieved on the MNIST data set up to that point was a recognition rate of 99.79%, set by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus in 2013.

Considering that the dataset also contains some barely legible digits, this result is quite amazing! It even surpasses what real human eyes can recognize.

To adjust the weight and bias values step by step during this process, the gradient descent algorithm must be introduced.

During training, the neural network needs a practical and feasible learning algorithm that can adjust the parameters gradually.

The ultimate goal is to make the actual output of the network as close as possible to the expected output. We need an expression that characterizes this closeness; this expression is called the cost function.
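A common concrete form of such a cost function, assuming the quadratic (mean squared error) cost used in Michael Nielsen's book and using the symbols defined below, is:

```latex
C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^{2}
```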

Here x denotes a training sample, i.e., the input to the network. In fact, each x consists of 784 input values.

y(x) denotes the expected output when the input is x, and a denotes the network's actual output when the input is x. Both y(x) and a stand for 10 output values (expressed as vectors). The squared length of their difference characterizes how close the actual output is to the expected output: the closer they are, the smaller this quantity.

n is the number of training samples. If there are 50,000 training samples, then n is 50,000. Because the sum runs over all training samples, we divide by n to take the average.

Writing the cost as C(w, b) means regarding it as a function of all the weights w and biases b in the network. Why view it this way? During training, the input x is fixed (it comes from the training samples) and does not change. With the input regarded as constant, the formula can be seen as a function of w and b. So where are w and b on the right-hand side of the formula? They are hidden inside a: y(x) is a fixed value, but a is a function of w and b.

In summary, C(w, b) characterizes how close the network's actual output is to the expected output; the closer they are, the smaller C(w, b). The learning process is therefore the process of reducing C(w, b). Whatever its exact form, C(w, b) is a function of w and b, so learning becomes an optimization problem: finding the minimum of this function.

Since C(w, b) has a relatively complex form and a very large number of parameters, solving for its minimum directly by mathematical means is very difficult.

To solve this problem with computer algorithms, computer scientists proposed the gradient descent algorithm.

In essence, this algorithm takes a small step downhill at each iteration, along the slope of each dimension of the multi-dimensional space, so that it eventually reaches a minimum.
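A minimal sketch of this update rule, using a toy two-parameter cost with a known minimum as a stand-in for the real C(w, b) (the names grad_C and eta are illustrative):

```python
# Toy cost C(w, b) = (w - 3)^2 + (b + 1)^2, whose minimum sits at w = 3, b = -1.
def grad_C(w, b):
    """Partial derivatives of the toy cost with respect to w and b."""
    return 2 * (w - 3), 2 * (b + 1)

def gradient_descent_step(w, b, eta=0.1):
    """One small step 'downhill': move each parameter against its partial
    derivative of the cost, scaled by the learning rate eta."""
    dC_dw, dC_db = grad_C(w, b)
    return w - eta * dC_dw, b - eta * dC_db

w, b = 0.0, 0.0
for _ in range(200):
    w, b = gradient_descent_step(w, b)
print(w, b)  # approaches (3, -1), the bottom of the 'valley'
```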

Because a high-dimensional space cannot be visualized directly, people usually fall back to three-dimensional space as an analogy: when C(w, b) has only two parameters, its graph can be drawn in three-dimensional space.

It is like a small ball rolling down the slope of a valley and eventually reaching the bottom. This intuition still largely holds when extended to higher-dimensional spaces.

However, because the number of training samples is large (tens of thousands, hundreds of thousands, or even more), computing C(w, b) directly as defined above requires a great deal of computation and makes the learning process slow.

So the stochastic gradient descent algorithm appeared; it is an approximation to gradient descent.

In this algorithm, each learning step no longer uses the entire training set. Instead, it randomly selects a small part of the training set to compute C(w, b), then randomly selects another part from the remainder for the next step, until the whole training set has been used up. The process is then repeated over and over.
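A minimal sketch of this loop, assuming the data has already been loaded into a Python list; update_from_batch is a hypothetical stand-in for the step that estimates the gradient of C(w, b) on one mini-batch and adjusts w and b:

```python
import random

def sgd(training_data, epochs, mini_batch_size, update_from_batch):
    """Stochastic gradient descent: shuffle the training set, split it into
    mini-batches, and update the parameters from each batch in turn,
    repeating the whole pass (an 'epoch') several times."""
    for _ in range(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k + mini_batch_size]
            for k in range(0, len(training_data), mini_batch_size)
        ]
        for batch in mini_batches:
            update_from_batch(batch)  # gradient step using only this batch
```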

Deep neural networks (those with multiple hidden layers) have a structural advantage over shallow neural networks: they can build abstractions at multiple levels.

Since the 1980s and 1990s, researchers have been trying to apply the stochastic gradient descent algorithm to the training of deep neural networks, but they ran into the problems of vanishing and exploding gradients, which made learning extremely slow and left deep neural networks largely unusable.

However, since 2006, people have begun to use new techniques to train deep networks, and breakthroughs have followed one after another. These techniques include, but are not limited to:

Using convolutional networks;

Dropout (sketched after this list);

Rectified linear units (ReLU, also sketched after this list);

Using GPUs to obtain greater computing power; and so on.
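A minimal NumPy sketch of two of these techniques, the rectified linear unit and (inverted) dropout; the function names and the keep_prob parameter are illustrative, not from the original article:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return np.maximum(0.0, z)

def dropout(activations, keep_prob=0.5):
    """Inverted dropout at training time: randomly zero each activation with
    probability 1 - keep_prob, then rescale the survivors so the layer's
    expected output stays the same."""
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))           # [0.  0.  0.  1.5 3. ]
print(dropout(relu(z)))  # roughly half of the activations zeroed at random
```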

The advantages of deep learning are obvious: it is a new way of programming. It does not require us to design an algorithm directly for the problem to be solved; instead, we program the training process.

The network can learn the correct way to solve the problem by itself during the training process, which allows us to use simple algorithms to solve complex problems and outperform traditional methods in many fields.

And training data plays a more important role in this process: a simple algorithm plus complex data may be far better than a complex algorithm plus simple data.

Deep networks often contain a huge number of parameters, which runs counter to the philosophical principle of Occam's razor, and people usually spend a great deal of effort tuning these parameters;

Training a deep network requires a lot of computing power and computing time;

The problem of overfitting always accompanies the training of a neural network, and the problem of slow learning has long plagued people. These issues easily create a sense of losing control, and they have also hindered the further application of this technology in some important settings.

The story of BetaCat tells of an artificial intelligence program that gradually comes to rule the world through self-learning.

So, will the current development of artificial intelligence technology make this happen? I am afraid that is not yet possible, for roughly two important reasons:

First, current artificial intelligence can only learn in the ways that people specify, and it can only learn to solve specific problems; it is still not general intelligence.

Second, to train an artificial intelligence, people must feed it standardized training data. The system's inputs and outputs still have strict format requirements, which means that even if an artificial intelligence program were connected to the Internet, it could not learn from the massive amount of unstructured data there the way BetaCat does.

However, this only applies to ordinary artificial intelligence; a true networked intelligent life form like Origin would have to be fully capable of both of the above.
