What is Polynomial Regression?

This article is an explanation of the math used for our polynomial regression class.

Simply put polynomial regression is an attempt to create a polynomial function that approximates a set of data points. This is easier to demonstrate with a visual example.

In this graph we have 20 data points, shown as blue squares. They have been plotted on an X/Y graph. The orange line is our approximation. Such a scenario is not uncommon when data is read from a device with inaccuracies. With enough data points the inaccuracies can be marginalized. This is useful in physical experiments when modeling some phenomenon. There is often an equation and the coefficients must be determined by measurement. If the equation is a polynomial function, polynomial regression can be used.

Polynomial regression is one of several methods of curve fitting. With polynomial regression, the data is approximated using a polynomial function. A polynomial is a function that takes the form fx ) = c0 + cx + c2 x2 ⋯ cn xn where n is the degree of the polynomial and c is a set of coefficients.

Most people have done polynomial regression but haven't called it by this name. A polynomial of degree 0 is just a constant because fx ) = c0  x0 = c0. Likewise preforming polynomial regression with a degree of 0 on a set of data returns a single constant value. It is the same as the mean average of that data. This makes sense because the average is an approximation of all the data points.

Here we have a graph of 11 data points and the average is highlighted at 0 with a thick blue line. The average line mostly follows the path of the data points. Thus the mean average is a form of curve fitting and likely the most basic.

Linear regression is polynomial regression of degree 1, and generally takes the form y = m + b where m is the slope, and b is the y-intercept. It could just as easily be written fx ) = c0 + c1 x with c1 being the slope and c0 the y-intercept.

Here we can see the linear regression line running along the data points approximating the data. Mean average and linear regression are the most common forms of polynomial regression, but not the only.

Quadratic regression is a 2nd degree polynomial and not nearly as common. Now the regression becomes non-linear and the data is not restricted to straight lines.

Here we can see data with a quadratic regression trend line. So the idea is simple: find a line that best fits the data. More specifically, find the coefficients to a polynomial that best fits the data. To understand how this works, we need to explore the math used for computation.

The Math Behind Polynomial Regression

This section will attempt to explain the math behind polynomial regression. It requires a solid understanding of algebra, basic linear algebra (using matrices to solve systems of equations), and some calculus (partial derivatives and summations). The algebra is worked out, but the computation of matrices and derivatives are not. Someone with access to a computer algebra system (like WolframAlpha) should therefore be able to follow the math knowing only algebra and treating the higher level math functions like a black box.

Polynomial regression is an overdetermined system of equations that uses least squares as a method of approximating an answer. To understand this let us first look at a system of equations that is not overdetermined. We can start by constructing a line function that intercepts two points: (0.1, 0.1) and (0.6, 0.8). First let us solve the problem algebraically. Remember the equation for a line:

There are two unknowns in this equation: m and b. With two unknowns we need two solutions. These we have from the two given data points. So let us create two equations from this data:

Now to solve for the system, we can start with the top equation and solve for one of the unknowns. Let us choose b:

With b known, let us substitute the solution into the bottom equation:

Simplify this a bit:

Now we have a single equation with one unknown, which we can solve for:

Now that m is known, we can choose either of our original equations and solve for b. Let us choose the top equation and solve for b.

So with m and b known, our original equation can be turned into the line function:

This is a system of equations, and this is a pretty textbook example of how to solve it algebraically. Rather than use the algebraic approach we can also solve this system using matrices and linear algebra. The reason we are using matrix math is to simplify higher order polynomials, and we will get to those latter. Let us first rearrange the system of equations and add in the redundant multiplier of 1 in front of the b coefficient.

Now we can place the system into matrix form:

And using linear algebra, we can solve.

Here I use matrix inversion to solve for the constant terms. You could setup the equation using an augmented matrix and use Gaussian elimination like this:

Or solve the system using Cramer's Rule.

Or easier still use WolframAlpha to compute the result.

Regardless the results are the same, but using matrix representation will make things easier when it comes to applying the math to higher order polynomials.

Now onto the overdetermined system. Let us say we now have 3 data points rather than two, and these points are: (0.1,0.1), (0.35,0.45), and (0.6,0.8).

We end up with 3 equations and two unknowns. That is too much data. If these points are on the same line, we could simply ignore one of the equations. However, if they are not we need a way of calculating coefficients that take all our data points into consideration. This is where least squares come in.

To preform a least square approach we need to define a residual function. This is a function that specifies the amount of error between our true data, and that produced by our estimation function. As a very simple example, let us say we have the data point (10,8) and our line function is = 2 x - 12. Fill this in for = 10 and we get = 2 (10) - 12 = 6. The residual is the difference between the observed value (8) and the estimated value (6), or 8 – 6 = 2. For any point (xi, yi) on the line, the residual would be:

Where i is the index into the set of known data points. We can generalize this for any known data points:

In fact, we can generalize it for any function:

Now consider that we have a set of data points. What does the residual tell us? Well, for each data point the residual denotes the amount of error between our estimation and the true data. How about the overall error? To get this, we could just sum up the error for all data points. However error can be positive or negative. So we could end up with a lot of error uniformly distributed between negative and positive values. For example, if our error was (6, -8, 2) then our total error sum would be 0. So this doesn't give a good indication of error. We could sum the absolute value of all the error. The absolute value is equivalent to:

Then our (6, -8, 2) absolute error sum would total 16, and that's more like what we want. Since we just want an indication of error, we can drop the square root as the square function is enough to produce the always positive value we are looking for. There is an other reason to use just the square that will become clear latter.

So now that we have a function that measures residual, what do we do with it? Well, if we are trying to produce a function that models a set of data, we want the residual to be as small as possible—we want to minimize it. This would produce a function that deviates from the observed data as little as possible.

To minimize, we need to be able to modify the function's coefficients as the coefficients are the variables we have to manipulate for finding the best fit. In our line function y = m x + b, the coefficients are m and b. There is an added benefit to squaring the residuals—the square of residual forms a parabola. To see why this is useful, consider a 1st degree polynomial with three known points (10, 8, 11). Let's make the function:

You will notice x isn't used, but this is legitimate. Use this to construct the residual function:

And expand this for our data:

If we multiply everything and collect like terms we end up with:

Now graph this function, but use c0 as the variable:

The graph is a parabola. A parabola always has one and only one lowest point—it's vertex. The lowest point is the point with the lowest amount of error. So if we find coefficients that yield the vertex we have the coefficients that produce the lowest error.

Finding the vertex of a parabola for a function that has a single coefficient is easy. Take the derivative of the function and find where that function is equal to zero. This works because any time the derivative of a function is equal to zero, that point is either a local maximum or minimum. A parabola only has one such point: the minimum.

Using our example start by taking the derivative:


And set this equal to zero:

Now solve for c0:

So the coefficient that best fits this data is c0 = 9 2/3. Our function is then:

We have just solved a 0 degree polynomial using least squares, and our coefficient is the average of our data set:

As stated, the mean average is polynomial regression with a 0 degree polynomial. The reason this example was worked out was because it is easy to visualize. A function that has more than one coefficient produces a multidimensional equation. For example, the line function has two coefficients, m and b. The residual square function will produce a paraboloid.

This is the graph of a simple paraboloid. There is still an overall minimum—the bottom of the bowl shape—but there are two variables involved in finding the minimum. This holds true for functions with three or more coefficients, but we cannot graph these to make a demonstration.

To find the minimum of these problems the derivative can still be used, but we need to use partial derivatives—one with respect to each coefficient. This results in an equation for each coefficient, and that makes sense. To solve a system of equations, we need an equation for each unknown term. And using this method this is exactly what we get.

For the line function, we need the partial derivative with respect to m and with respect to b. Let us run this setup for the residual for some overdetermined function. First the line function:

Now the residual square sum when there are n known points:

Let us work through this using the three known points mention earlier: (0.1,0.1), (0.35,0.45), and (0.6,0.8)

Clean this up:

To minimize we need to take the partial derivative of the function r with respect to the two coefficients, m and b.

(Solution for r with respect to b, and r with respect to m.)

Now set these partial derivatives equal to zero:

We now have two equations with two unknowns. With a bit of rearranging we can get this into matrix form. First, the 6 goes away by dividing it into the other side:

Next we'll move the constant to the right side:

Now place the equations into a matrix:



In this example all points have the same slope and intercept. Below is a graph of the 3 data points, and the regression line using the computed slope and intercept m and b respectively.

We could have just dropped one of the equations and arrived at the same result. However, this method will work if each of the points varies slightly from the slope and intercept, and it will compute values for m and b such that they have the smallest amount of residual.

While the steps above work for our one set of points, let us try creating a more generalized version of the regression equation for the line function. We will again start with the sum of the residual squared:

Take the partial derivative of the function r with respect to the two coefficients, m and b.

(Solution for b, and solution for m.)

Set these partial derivatives equal to zero:

It looks a little messy, but we in fact have two equations with two unknowns. We can convert this into a matrix so it can be solved. Some rearranging is needed for this to happen. First, some expansion:

Break up the summations:

Move the constants out of the summations:

We can reduce the summation of the constant 1:

Move all parts dealing with y to the right side of the equation:

From here it is easy to place the equations into matrix form:

The 2 can be factored out and cancels on both sides.

What we are left with is precisely the matrix definition of linear regression. To test this, let us use the three data points from above: (0.1,0.1), (0.35,0.45), and (0.6,0.8). First we need the data calculated out:

Now fill this data into the matrix:



So both methods agree with one an other. We can repeat this process for a polynomial of any number of degrees. Let's try using a 2nd degree polynomial to get the results for quadratic regression.

First, the quadratic function:

Now the sum of squares for n known data points:

There are three partial derivatives as there are three coefficients.

(Solution for c0, Solution for c1, and Solution for c2.)

Set the partial derivatives equal to zero.

Eliminate the 2, split the summations, and move y to the right:

Reduce and pull out constants:

And translate into matrix form:

We get the results for quadratic regression in matrix form. There is a pattern forming we can seen more clearly by generalizing the partial derivative:

Where m is the degree of the polynomial, and n is the number of known data points. This, if you follow the math, leads to a generalized matrix:

This is the matrix representation of polynomial regression for any degree polynomial. And that's it. An overdetermined system is solved by creating a residual function, summing the square of the residual which forms a parabola/paraboloid, and finding the coefficients by finding the minimum of the parabola/paraboloid.


There are two issues that have an effect on the full implementation of polynomial regression. First is solving the system of equations. In software using an augmented matrix and then Gaussian elimination works well, but the row operations that are so trivial on paper get a little more complicated in software.

The second issue only effects polynomials of higher degrees. The summations will get very large. For example, a 2nd degree polynomial (quadratic) needs a sum of x4. A 3rd degree polynomial needs x6 and an 8th degree polynomial all the way to x16. These sums get large (or very small) quickly and can have result beyond what can be stored in conventional floating point numbers. To deal with this the implementation must use arbitrary-precision arithmetic.