To start understanding Machine Learning algorithms, you need to understand the fundamental concept of mathematical optimization and why it is useful. In this article, we will solve a simple Machine Learning problem step by step. We will see why and how it always comes down to an optimization problem, which parameters are optimized, and how we compute the optimal values in the end.
This dataset can represent whatever we want, for example x1 = age of your computer and x2 = time you need to train a Neural Network. Every red dot on our plot represents a measured data point. This plot represents the ground truth:
All these points are correct and known data entries. The problem is that the ground truth is often limited: we know the corresponding NN training time for 11 computer ages (x1). But what about your computer? If you are lucky, one computer in the dataset had exactly the same age as yours, but that's highly unlikely. For your computer, you know the age x1, but you don't know the NN training time x2.
This principle is known as data approximation: We want to find a function, in our case a linear function describing a line, that fits our data as well as possible. We can also say that our function should approximate our data.
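In our linear case, the approximation function is a straight line of the form:

y = ax + b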
In this equation, a defines the slope of our line (higher a = steeper line), and b defines the point where the line crosses the y axis. (Note that the axes in our graphs are called (x1, x2) and not (x, y) as you are used to from school. Don't be bothered by that too much; we will use the (x, y) notation for the linear case now, but will later come back to the (x1, x2) notation for higher order approximations). To find a line that fits our data as well as possible, we have to find the optimal values for both a and b.
For our example data here, we have optimal values a=0.8 and b=20. But how should we find these values a and b?
Well, as we said earlier, we want to find a and b such that the line y = ax + b fits our data as well as possible. Or, mathematically speaking, the error, i.e. the distance between the points in our dataset and the line, should be minimal.
The error for a single point (marked in green) is the difference between the point's real y value and the y value our grey approximation line predicts, f(x). It can be calculated as follows:
Here, f is the function f(x) = ax + b representing our approximation line, xi is the point's x1 coordinate, and yi is the point's x2 coordinate. Remember the parameters a=0.8 and b=20? Let's plug them into our function and calculate the error for the green point at coordinates (x1, x2) = (100, 120):
Error = f(xi) - yi
Error = f(100) - 120
Error = a*100 + b - 120
Error = 0.8*100 + 20 - 120
Error = -20
By squaring the errors, we get rid of the sign (-20 squared gives 400). Squaring also gives a much higher value for points that are far away from the approximation line. Therefore, if our approximation line misses some points by a large distance, the resulting error will be quite large.
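To make this concrete, here is a tiny Python sketch that computes the squared error for a single point; the second, far-away point is made up purely for illustration:

```python
# Squared error of a single point with respect to the line f(x) = a*x + b
a, b = 0.8, 20

def f(x):
    return a * x + b

# The green point from the example above
x_i, y_i = 100, 120
error = f(x_i) - y_i        # -20
squared_error = error ** 2  # 400

# A made-up point far away from the line contributes a much larger squared error
x_far, y_far = 100, 200
far_squared = (f(x_far) - y_far) ** 2  # (-100)**2 = 10000

print(squared_error, far_squared)
```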
Then, let's sum up the squared errors of all points to get an estimate of the overall error. The resulting formula is called the "Sum of Squared Errors", and it is really popular in both Machine Learning and Statistics:
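Error = SUM [f(xi) - yi]²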
How is this useful? Well, let's remember our original problem definition: We want to find a and b such that the linear approximation line y = ax + b fits our data best. Let's say this in other words: We want to find a and b such that the sum of squared errors is minimized.
Tadaa, we have a minimization problem definition. We want to find values for a and b such that the squared error is minimized.
If we find the minimum of this function f(a, b), we have found our optimal a and b values:
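f(a, b) = SUM [axi + b - yi]²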
Before we get into actual calculations, let's give a graphical impression of what our optimization function f(a, b) looks like:
Note that the graph on the left is not the actual representation of our function f(a, b), but it looks similar. The height of the landscape represents the squared error.
The higher the mountains, the worse the error. So the minimum squared error is right where our green arrow points. When we read off the values for a and b at this point, we get a-optimal and b-optimal.
Going further in the direction of a (i.e. choosing higher values for a) would give us a steeper slope, and therefore a worse error.
If we went in the direction of b (i.e. changed the value of b), we would shift our line upwards or downwards, giving us worse squared errors as well.
So the optimal point indeed is the minimum of f(a,b). But how do we calculate it? Well, we know that a global minimum has to fulfill two conditions:
f'(a, b) = 0 (the first derivative must be zero)
f''(a, b) > 0 (the second derivative must be positive)
Since our function f(a, b) depends on the two variables a and b, "first derivative equals zero" means that both partial derivatives have to be zero:

∂f(a, b) / ∂a = 0
∂f(a, b) / ∂b = 0
Let's rewrite f(a, b) = SUM [axi + b - yi]² by expanding the square. This leaves us with f(a, b) = SUM [a²xi² + b² + yi² + 2abxi - 2axiyi - 2byi]. Let's plug that into our derivatives:
∂/∂a SUM [a²xi² + b² + yi² + 2abxi - 2axiyi - 2byi] = 0
∂/∂b SUM [a²xi² + b² + yi² + 2abxi - 2axiyi - 2byi] = 0

Computing the partial derivatives term by term gives:

∂f/∂a = SUM [2axi² + 2bxi - 2xiyi] = 0
∂f/∂b = SUM [2b + 2axi - 2yi] = 0
We can now solve one equation for a, then substitute this result into the other equation, which then depends on b alone, and solve for b. Finally, we plug the value for b back into one of our two equations to get a.
Why don't we do that by hand here? Well, remember that we have a sum in our equations and many known values xi and yi. Even for our eleven datapoints, the equations get quite long. A computer can solve them with no problem, but doing it by hand quickly becomes tedious.
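As a sketch of what this looks like in practice, the following Python snippet builds the two equations above for some made-up example data and lets NumPy solve them; the dataset values are invented for illustration and only roughly follow y = 0.8x + 20:

```python
import numpy as np

# Made-up example data: computer age (x) and NN training time (y)
x = np.array([10, 25, 40, 55, 70, 85, 100, 115, 130, 145, 160], dtype=float)
y = np.array([30, 42, 49, 66, 73, 90, 98, 115, 121, 138, 147], dtype=float)

n = x.size

# The two equations from setting the partial derivatives to zero:
#   a*SUM[xi**2] + b*SUM[xi] = SUM[xi*yi]
#   a*SUM[xi]    + b*n       = SUM[yi]
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n]])
rhs = np.array([np.sum(x * y), np.sum(y)])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.2f}, b = {b:.2f}")  # should come out close to a = 0.8, b = 20

# Sum of squared errors for the fitted line
sse = np.sum((a * x + b - y) ** 2)
print(f"sum of squared errors: {sse:.1f}")
```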
Congratulations! You now understand how linear regression works and could, in theory, calculate a linear approximation line yourself without the help of a calculator! But wait, there's more.
Sometimes a straight line simply cannot describe our data well, and we need a curved approximation line instead. These approximation lines are then not linear approximations, but polynomial approximations, meaning that we deal with a quadratic (squared) function, a cubic function, or an even higher order polynomial.
The principle for calculating these is exactly the same, so let me go over it quickly using a squared approximation function. First, we again state our problem definition: We want a squared function y = ax² + bx + c that fits our data best.
As you can see, we now have three values to find: a, b and c. Therefore, our minimization problem changes slightly as well. The sum of squared errors is still defined in the same way, only with the new approximation function f(x) = ax² + bx + c plugged in.
Writing it out shows that we now have an optimization function in three variables, a, b and c:
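f(a, b, c) = SUM [axi² + bxi + c - yi]²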
One question remains: For a linear problem, we could also have used a squared approximation function. Why? Well, with the approximation function y = ax² + bx + c and a value a=0, we are left with y = bx + c, which defines a line that could fit our data just as well. So why not just take a very high order approximation function for our data to get the best result?
Well, we actually could. The higher the order of the approximation function we choose, the smaller the squared error becomes.
In fact, if we choose the order of the approximation function to be one less than the total number of datapoints we have, our approximation function will even go through every single one of our points, making the squared error zero. Perfect, right?
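A quick way to see this effect is the following sketch, again with made-up data, and with NumPy's polyfit used for convenience instead of deriving the equations by hand:

```python
import numpy as np

# Made-up dataset with 11 points that roughly follow a line plus some noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 11)
y = 0.8 * x + 20 + rng.normal(scale=0.5, size=x.size)

# Fit polynomials of increasing order and watch the squared error shrink
for degree in (1, 2, 5, len(x) - 1):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    predictions = np.polyval(coeffs, x)
    sse = np.sum((predictions - y) ** 2)
    print(f"degree {degree:2d}: sum of squared errors = {sse:.6f}")

# At degree = number of points - 1 the polynomial passes through every point,
# so the squared error becomes (numerically) zero -- but the curve oscillates
# wildly between the points and generalizes poorly to new data.
```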
Well, not so much. It is easiest to see in a picture of such a high order fit: the polynomial hits every single datapoint, but it oscillates wildly between the points and predicts nonsense for new, unseen data. So we should have a look at the data ourselves first, decide what order of polynomial will most probably fit best, and then choose an appropriate polynomial for our approximation.

If you are interested in more Machine Learning stories like this, check out my other posts!