Why Do We Even Have Neural Networks?


I have recently been writing a series of articles explaining the key concepts behind modern-day neural networks:

Neural Networks

One reason neural networks are so powerful and popular is that they satisfy the _universal approximation theorem_: a network with enough hidden units can approximate any continuous function to arbitrary accuracy, no matter how complex.

"Functions describe the world."

A function, f(x), takes some input, x, and gives an output y:

How a mathematical function works. Diagram by author.

This function defines the relationship between the input and output. In most cases, we have the inputs and their corresponding outputs, and the goal of the neural network is to learn, or approximate, the function that maps between them.

Neural networks were invented around the 1950s and 1960s. Yet, at that time there were already other known universal approximators out there. So, why do we even have neural networks …

Taylor Series

The _Taylor series_ represents a function as an infinite sum of terms calculated from the values of its derivatives at a single point. In other words, it approximates a function with an infinite sum of polynomial terms.

f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n

The above expression represents a function f(x) as an infinite sum, where f^{(n)}(a) is the n-th derivative of f evaluated at the point a, and n! denotes the factorial of n.

See here if you are interested in learning why we use Taylor Series. Long story short, they are used to make ugly functions nice to work with!
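As a quick illustration (my own example, not from the original article), here is a minimal sketch of a truncated Maclaurin series of e^x converging to the true value as we add terms:

import numpy as np
from math import factorial

# Hypothetical example: approximate e^x at x = 1.5 with a truncated
# Maclaurin series and watch the partial sums converge.
def taylor_exp(x, n_terms):
    # Sum of x^n / n! for n = 0 .. n_terms - 1
    return sum(x**n / factorial(n) for n in range(n_terms))

x = 1.5
for n_terms in (2, 4, 8):
    print(f"{n_terms} terms: {taylor_exp(x, n_terms):.6f} (true value: {np.exp(x):.6f})")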

There exists a simplification of the Taylor Series called the Maclaurin series where a = 0.

f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} x^n = a_0 + a_1 x + a_2 x^2 + \dots

Here a_0, a_1, etc. are the coefficients of the corresponding polynomial terms, with a_n = f^{(n)}(0)/n!. The goal of the Taylor and Maclaurin series is to find the best values of these coefficients to approximate a given target function.

Sound familiar?

We can even express the Maclaurin series in matrix notation.

f(x) \approx \begin{bmatrix} a_0 & a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^n \end{bmatrix}

This is pretty much a single-layer neural network: a_0 is the bias term, a_1 to a_n are the weights, and 1, x, …, x^n are our features.

I like to think of the Taylor series as (loosely) polynomial regression!

In machine learning problems, we don't actually have the whole function but rather a sample of data points. This is where we could pass the powers x, x², …, x^n as "Taylor features" to a neural network and learn the coefficients using backpropagation.
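A minimal sketch of that idea (the target function, degree, and least-squares fit are my own assumptions; the least-squares solve stands in for backpropagation on a single linear layer):

import numpy as np

# Hypothetical example: learn the coefficients of sin(x) from noisy samples
# using polynomial (Taylor-style) features x^0, x^1, ..., x^n.
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, 200)
y = np.sin(x) + rng.normal(0, 0.05, size=x.shape)  # noisy samples

degree = 7
X = np.stack([x**n for n in range(degree + 1)], axis=1)  # feature matrix

# Least-squares fit in place of gradient-based training
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coeffs, 3))  # compare with Maclaurin coefficients of sin: 0, 1, 0, -1/6, ...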

One last interesting property relating the Taylor series to machine learning is gradient descent. The general gradient descent update comes from applying the Taylor series to the loss function, as sketched below. See here for the full proof of this concept.
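A brief sketch of the standard first-order argument, included here for completeness: expanding the loss L around the current parameters \theta gives

L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^{\top} \Delta\theta

Choosing \Delta\theta = -\eta \nabla L(\theta) for a small learning rate \eta > 0 makes the first-order change -\eta \lVert \nabla L(\theta) \rVert^2 \le 0, which is exactly the gradient descent update \theta \leftarrow \theta - \eta \nabla L(\theta).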

Fourier Series

The Fourier series is similar in spirit to the Taylor series, but it uses sine and cosine waves instead of polynomials. Its core idea:

Any periodic function can be decomposed into a sum of sine and cosine waves

This is a very simple statement but its implications are significant.

For example, shown below are the functions sin(2x) and cos(3x) and their corresponding summation:

import plotly.graph_objs as go
import numpy as np

x = np.linspace(0, 3 * np.pi, 500)

# Two simple sinusoids and their pointwise sum
y1 = np.sin(2 * x)
y2 = np.cos(3 * x)
y_sum = y1 + y2

trace1 = go.Scatter(x=x, y=y1, mode='lines', name='sin(2x)', line=dict(color='blue'))
trace2 = go.Scatter(x=x, y=y2, mode='lines', name='cos(3x)', line=dict(color='green'))
trace3 = go.Scatter(x=x, y=y_sum, mode='lines', name='sum', line=dict(color='red'))

layout = go.Layout(
    title='Example Sum of Sinusoidal Waves',
    xaxis=dict(title='X'),
    yaxis=dict(title='Y')
)

fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
Example sine waves and their sum. Plot generated by author in Python.

The sin(2x) and cos(3x) functions are simple, yet their summation (red line) produces a more complex pattern. This is the main idea behind the Fourier series: using many simple functions to build a complex one.

One of the most interesting results from the Fourier series is that a _square wave_ can be constructed by summing an infinite series of sine waves (harmonics) at odd frequencies, with correspondingly decreasing amplitudes:

f(x) = \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)x)}{2k+1}

import plotly.graph_objs as go
import numpy as np

x = np.linspace(0, 3 * np.pi, 1000)

# Partial sum of the first 100 odd harmonics, scaled by 4/pi
y = np.array([np.sin((2*k + 1) * x) / (2*k + 1) for k in range(100)]).sum(axis=0) * (4 / np.pi)

trace = go.Scatter(x=x, y=y, mode='lines', name='Square Wave', line=dict(color='blue'))

layout = go.Layout(
    title='Square Wave',
    xaxis=dict(title='X'),
    yaxis=dict(title='Y', range=[-1.5, 1.5])
)

fig = go.Figure(data=[trace], layout=layout)
fig.show()
Using sine waves to create a square wave. Plot generated by author in Python.

What's amazing about this result is that we have generated a sharp, nearly straight-edged plot from smooth sine functions (the small ripples near the jumps are the well-known Gibbs phenomenon of truncated sums). This shows the true power of the Fourier series to construct essentially any periodic function.

The Fourier series is often applied to time series to model complex seasonal patterns. This is called harmonic regression.
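A minimal sketch of harmonic regression (the period, number of harmonics, and synthetic data are my own assumptions):

import numpy as np

# Hypothetical example: fit a seasonal pattern with Fourier features.
t = np.arange(365)                     # one year of daily observations
P = 365.0                              # assumed period
y = 3 * np.sin(2 * np.pi * t / P) + 1.5 * np.cos(4 * np.pi * t / P)

K = 3  # number of harmonics (a modelling choice)
features = [np.ones_like(t, dtype=float)]  # intercept plays the role of A_0
for n in range(1, K + 1):
    features.append(np.cos(2 * np.pi * n * t / P))
    features.append(np.sin(2 * np.pi * n * t / P))
X = np.stack(features, axis=1)

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coeffs, 3))  # recovers A_0, A_1, B_1, A_2, B_2, ...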

As stated earlier, the Fourier series says that any periodic function can be broken down into a sum of sine and cosine waves. Mathematically, this is written as:

f(x) = A_0 + \sum_{n=1}^{\infty} \left[ A_n \cos\left(\frac{2\pi n x}{P}\right) + B_n \sin\left(\frac{2\pi n x}{P}\right) \right]

Where:

  • A_0: the average value of the periodic function
  • A_n: the coefficients of the cosine components
  • B_n: the coefficients of the sine components
  • n: the order, which sets the frequency of the sine or cosine wave; each such term is referred to as a 'harmonic'
  • P: the period of the function

As with the Taylor series, our aim with the Fourier series is to find the coefficients A_n and B_n for our features, which in this case are the sine and cosine functions.
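For completeness, the standard formulas for these coefficients integrate f against each basis function over one period (this matters for the discussion below, since computing them requires access to f):

A_0 = \frac{1}{P} \int_{P} f(x)\, dx, \qquad A_n = \frac{2}{P} \int_{P} f(x) \cos\left(\frac{2\pi n x}{P}\right) dx, \qquad B_n = \frac{2}{P} \int_{P} f(x) \sin\left(\frac{2\pi n x}{P}\right) dx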

Why Use Neural Networks Then?

Both the Taylor and Fourier series can be viewed as universal function approximators and they both predate the neural network. So, why on earth do we have neural networks?

Well, the answer is not straightforward, as there are many subtleties among the three methods. I have been fairly liberal in describing how the Taylor and Fourier series work; otherwise this article would be very, very long.

Let's break down some reasons why the Taylor or Fourier series can't replace a neural network.

Taylor Series

The main issue with the Taylor series is that it approximates around a single point: it estimates a function over one value and its local region. But we usually want to know what the whole function looks like over a large range. As the sketch below shows, a truncated Taylor polynomial can diverge rapidly outside the region it was expanded around, which is why Taylor (polynomial) features tend to generalise poorly outside the training set.
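A quick sketch of this failure mode (my own example): a degree-7 Maclaurin polynomial of sin(x) tracks the function near 0 but blows up further out:

import numpy as np
from math import factorial

x = np.linspace(0, 3 * np.pi, 500)

# Degree-7 Maclaurin polynomial of sin(x): x - x^3/3! + x^5/5! - x^7/7!
taylor = sum((-1)**k * x**(2*k + 1) / factorial(2*k + 1) for k in range(4))

err = np.abs(taylor - np.sin(x))
print(f"max error on [0, pi]:    {err[x <= np.pi].max():.4f}")
print(f"max error on [2pi, 3pi]: {err[x >= 2 * np.pi].max():.1f}")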

Fourier Series

One issue with the Fourier series is that it needs to see the function it is going to approximate: its coefficients are computed from integrals of that function. For example, in time series it is used to find the complex seasonal pattern in data it already knows. A neural network, by contrast, aims to learn this function from samples.

However, the main problem is the complexity of the Fourier series: the number of coefficients grows exponentially with the number of variables in the function we are trying to estimate. For a neural network, this is not necessarily the case.

Let's say we have a function f(x) that we can approximate well with 100 coefficients. Now suppose we want to approximate f(x, y). Instead of 100 coefficients, we now need 100² = 10,000. For f(x, y, z), we need 100³. And this process goes on, growing exponentially.

What I am describing here is the _curse of dimensionality._
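To make the blow-up concrete (an illustrative count, assuming a full tensor-product basis with 100 terms per variable):

# Coefficients needed for a tensor-product basis with 100 terms per variable
terms_per_variable = 100
for n_vars in range(1, 5):
    print(f"{n_vars} variable(s): {terms_per_variable ** n_vars:,} coefficients")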

Neural networks, on the other hand, can accurately model (some of) these high-dimensional functions without an exponential blow-up in the number of parameters.

No Free Lunch Theorem

It is important to mention that neural networks will not always be better than the Taylor or Fourier series. Machine learning is an empirical science: you have to experiment when fitting your model to find the best one. It may well be that adding Taylor or Fourier features improves it. However, it may also make it worse. The goal is to find the best model, and that differs for every dataset.


Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
