Deep Learning vs Data Science: Who Will Win?

Author: Murphy
Source: image by author

The two opponents walk into the ring, each claiming to have the upper hand. The data scientist pulls out a silver ruler, the deep learning developer pulls out a gleaming hammer – who will build the best model?

In my previous positions, I've worked as both a data scientist and a deep learning algorithm developer. If you ask me what the differences are between the two, I've got to say that it's not clear-cut.

Both deal with data and machine learning models, and both use similar success metrics and working principles.

So what makes them different?

I think it's the attitude.

I'll be bold and generalize that from my experience, deep learning developers (especially junior ones) tend to focus more on the model, while data scientists do the opposite – they analyze and manipulate the data such that almost any model will do the trick.

Or, should I dare to simplify it even further and say that:

Deep Learning = Model Oriented

Data Science = Data Oriented

That said, these distinctions have finally started to blend in recent years. However, in my opinion, not fast enough, since the two approaches can and should be mutually beneficial.

Still, one approach is definitely more important than the other – and I can prove it!

In this post, I'll walk you through a classic story of how a machine learning project can play out, one that illustrates these different approaches (model-oriented and data-oriented) and highlights which one reigns supreme.

And, of course, I'll show you how to "SOLVE" this machine learning problem and build a high-powered and accurate model.


Problem Setup

You are a machine learning engineer working on a project with a well-known bank – "The International Bank Of Mars". They want you to develop an algorithm that distinguishes between two kinds of customers: regular customers and criminals.

These criminals are known to be doing all kinds of fishy activities in their bank accounts, so the bank thinks there is some useful information there that can give them away.

Simple enough, right?

You tell them: "This is just a simple binary classification problem – show me the data, and I'll knock it out in a single afternoon! Why don't you just do it yourselves?"

They tell you that they don't know how to train and do inference for models at scale, which is why they came to you.

Fine.

So, what does the data look like?

They give you snapshots of different customer accounts. Specifically, snapshots of the checking and savings accounts, as well as the top 2 most used credit cards – a total of 4 "accounts."

Easy peasy.

But wait, there's a catch.

The International Bank Of Mars can't just give you the data. What if someone intercepts and copies it? This would be a terrible data breach!

So, they take several steps to "anonymize" the data:

  1. No identifiers: They don't give you any identifiers for the snapshots of the accounts – you don't know if these are different accounts or different snapshots of the same accounts.
  2. Transformed data: They don't give you the raw data since this could be cross-referenced with some partial data the attacker may obtain to recover more information about their other accounts. Instead, they transform the data using reversible linear and non-linear transformations.

Think of the bank's data transformations like a secret code – without the key, the attackers will be trying to solve a puzzle with missing pieces. Luckily for you, though, they gave you the key to help you with this project.

The bank gives you the information needed to "undo" the transformations they applied. Furthermore, they say that since the transformations are reversible, "no information should be lost. Thus, it shouldn't hurt your model."


Before we get technical, I challenge you to stop and think:

How would you approach this?


Before We Start

I want to make sure you have all the information and code you will need to reproduce my results in this article. So you'll need to install a few packages, set up some imports, and define a few constants and helper functions along the way.

My goal is to give you code you can just grab, run, and play around with.

Okay, back to the project.

1. Installations

You're gonna use some cool libraries here, so you should probably install the following packages:

  1. [Fast KAN](https://github.com/ZiyaoLi/fast-kan/tree/master) – This repo implements a very powerful neural network, a variant of Kolmogorov-Arnold Networks (KAN) that uses Radial Basis Functions (RBFs). It is known to converge very fast – hence the name.
git clone https://github.com/ZiyaoLi/fast-kan
cd fast-kan
pip install .
cd ..
  2. [Umap](https://umap-learn.readthedocs.io/en/latest/) – This is a great dimensionality reduction technique! It is highly recommended when you want to visualize high-dimensional data. Your data only has 4 dimensions, but this method should still help.
  3. [tqdm](https://tqdm.github.io/) – This is a library I like to use to keep my sanity when training neural networks. It gives you a progress bar, which is sooooo essential.

And, of course, you're gonna need many of the regular libraries: [torch](https://pytorch.org/get-started/locally/), [scikit-learn](https://scikit-learn.org/1.5/install.html), [matplotlib](https://matplotlib.org/stable/install/index.html), [numpy](https://numpy.org/install/), etc.
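
By the way, if you're starting from a clean environment, a single pip command along these lines should cover the rest (note that the UMAP package is published on PyPI as umap-learn):

pip install umap-learn tqdm torch scikit-learn matplotlib numpy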


2. Imports, constants, and seeds

You wanna work in an organized manner, so you start with some imports, constants, and seeds for reproducibility.

<script src="https://gist.github.com/BjBodner/f91ae8cd936556c38a451449c376fcfa.js"></script>

3. How do we get that sweet, sweet data?

The bank didn't wanna give you their data, but I got your back.

For the purpose of this toy project, I made a few data generation helper functions, which generate the simulated data for the 4 accounts we talked about in the problem setup, as well as the linear and nonlinear transformations.

How are the transformations implemented?

The bank implemented a linear transformation, which is just a 4x4 matrix of random numbers sampled from a normal distribution. The data is multiplied by this matrix, creating 4 linear combinations of the 4 accounts.

You must be asking:

"Does the bank give me this matrix? "

Yes, of course; otherwise, you wouldn't be able to reverse the transformation.

Okay, but what is the non-linear transformation?

They use a very simple non-linear transformation: they multiply each feature by 10000 / (i + 1), where 0 <= i <= 3 is the index of the feature. After this, they apply an exponent ([np.exp](https://numpy.org/doc/stable/reference/generated/numpy.exp.html)) to make the transformation nonlinear but reversible.

Not too bad, right?

Here's the code:

<script src="https://gist.github.com/BjBodner/8b84257d31384c5781fedb91328bf5db.js"></script>

Okay, now that we got all the setup out of the way, you can start working on the project.

Ready? Let's go!


1. EDA First

Regardless of the task, even if you are a "model-first" person, it's always good to start with a good exploratory data analysis (EDA) to get to know the data.

Finally, you got your hands on the data from the bank, and you just can't wait to visualize it and start looking for patterns!

You start by plotting the different distributions of the data, as well as relationships between the features.

<script src="https://gist.github.com/BjBodner/726e10e28aef3a6ff388f6db6b681869.js"></script>

Also, since you know that these plots don't capture the richness and full structure of high-dimensional spaces, you use several dimensionality reduction methods and visualize the results in 2D. You try a bunch of them: [t-SNE](https://scikit-learn.org/1.5/modules/generated/sklearn.manifold.TSNE.html), [LDA](https://scikit-learn.org/dev/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html), [PCA](https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html), [FastICA](https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.FastICA.html), [KPCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA), [UMAP](https://umap-learn.readthedocs.io/en/latest/).

Don't know what all these acronyms mean?

No worries.

These are just all kinds of ways you can reduce the number of features in your data. Then, if you reduce them to only 2 features, you can visualize them in a regular scatter plot, which can give you a lot of insights about your data!

<script src="https://gist.github.com/BjBodner/43b9482e40e6ab0675880c6a394d30a3.js"></script>

What did you get from the EDA?

Well, you got something.

Run this code block, and you will get the data from the bank, as well as all the visualizations. It will create two figures: one for the data distributions and one for the dimensionality reduction methods:

<script src="https://gist.github.com/BjBodner/f529ffcbbd3b6ae86309e3065f4706d1.js"></script>
Source: image by author – distributions of original data from bank.

Hmm, this looks pretty bad.

Can you see any pattern that jumps out?

I don't. It seems that all the distributions overlap between the two classes – blue for customers and red for criminals.

However, what do you think? Will the dimensionality reduction methods provide better information?

Source: image by author – dimensionality reduction methods on original data from bank.

Not really…

Seems that most of the distributions overlap, though the patterns do look cooler now. But actually, the [Umap](https://umap-learn.readthedocs.io/en/latest/) is a bit encouraging; it shows some separation between the red and blue dots.

Maybe it's fine; hopefully, your model will be able to learn these separations.

2. Let's Train The Models

Finally! Let's see which model is better.

I always start with a simple model – if it works, that's great! And if not, at least you have a baseline to teach you what really matters.

You start by implementing some models in increasing order of complexity and expected performance:

  1. [LogisticRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html): a super simple model to establish a baseline.
  2. [RandomForestClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html): a stronger but simple model.
  3. [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron): a deep neural network using several layers.
  4. [KAN](https://arxiv.org/abs/2404.19756): same structure as the [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) but using [Fast-KAN](https://github.com/ZiyaoLi/fast-kan/tree/master), which is superior to MLPs in many cases.

After training all the models, you plot the error rate of each model, which is the primary metric for this use case:

error_rate = 1 - accuracy

To make things easier, you implement several helper functions for training and visualization of the results:

<script src="https://gist.github.com/BjBodner/e3db216dd49a09a5cf1e532f6b3e53da.js"></script>
<script src="https://gist.github.com/BjBodner/3c0e66a675f85c0da22abe906f9d58ca.js"></script>
<script src="https://gist.github.com/BjBodner/6a361305cb5d3aefac3a22be70c5ec35.js"></script>
<script src="https://gist.github.com/BjBodner/a23c5087a1e8c08d03f18c9553fa73db.js"></script>

Come on, let's train these babies!

Training is actually pretty simple now that you have all those helper functions!

<script src="https://gist.github.com/BjBodner/9f9159f447d530f0d816ca8d83ffa80f.js"></script>

Running the training code should give you a bar plot with these results:

Epoch 2000 - Loss: 0.6516: 100%|███████████████████████████████████████████████| 2000/2000 [00:21<00:00, 93.80it/s]
MLP Model Accuracy: 0.4350
Epoch 2000 - Loss: 0.0315: 100%|███████████████████████████████████████████████| 2000/2000 [01:10<00:00, 28.19it/s]
FastKAN Model Accuracy: 0.4800
{'MLP': 0.565, 'KAN': 0.52, 'RF': 0.475, 'LR': 0.51}
Source: Image by author – error rates when training models on original data.

Here, LR stands for the [LogisticRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) model, and RF for the [RandomForestClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Both are used from [sklearn](https://scikit-learn.org/stable/).

What do you think of these results?

I think they look bad.

I mean, I guess the random forest did a bit better, but all of our models have error rates of around 50%, meaning they are about as good as a coin toss!

Do you think a stronger model will help in this situation?

Probably not.

If this were a model-size or model-capacity issue, you would expect at least some of the models to learn something, which would lend some merit to the theory that a bigger model would help. But looking at these results, such an experiment seems hopeless.

What did you miss?


Data Science To The Rescue

Remember those reversible transformations that the client said they used?

You initially wanted to try to reverse them but decided to first try a naive approach of doing a minimal EDA and jumping right into the modeling.

It was worth a try, but look where that got you…

Reversing the transformation

Let's do some feature engineering.

Let's use the knowledge we have of how this data was created. We'll try to invert the transformations applied to the data in two steps:

  1. That 4x4 matrix of linear combinations – let's just [invert](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) that matrix.
  2. That exponentiation transformation they applied, let's take the [log](https://numpy.org/doc/stable/reference/generated/numpy.log.html).

This use of knowledge to craft meaningful features is typically known as feature engineering and is widely used in data science projects.

<script src="https://gist.github.com/BjBodner/7a135646ffb69fc161ad262ae90da327.js"></script>

If you apply these two things, you should effectively recover the data with snapshots of the different bank accounts and credit cards.

Don't believe me?

Run this test. It will generate simulated data and try to apply the transformations that the bank uses. Then, the reverse transformation is applied, and an assertion checks to see if you get back the original data.

<script src="https://gist.github.com/BjBodner/103f498502ea77d54d28e391a5fc71f0.js"></script>

EDA Take Two

Let's repeat the EDA, but this time on the recovered data, and see what we get:

<script src="https://gist.github.com/BjBodner/c54e5d066d6709ab1d8f0119134ab137.js"></script>
Source: image by author – distributions of recovered data from bank.
Source: image by author – dimensionality reduction methods on recovered data from bank.

Okay wow!

This is a whole new world!

The classes are so well separated now! At least when we look at the first feature and in about half of the dimensionality reduction plots. Now, this is something we can work with!

Even a linear model should work pretty well here! Let's try that then.


Training The Models Again

Let's retrain all the models on this recovered data and see if the results change.

<script src="https://gist.github.com/BjBodner/999cad14ca9b98f52a2298da6ecaf885.js"></script>

Running this block should give you the following outputs:

Epoch 2000 - Loss: 0.1552: 100%|███████████████████████████████████████████████| 2000/2000 [00:22<00:00, 88.04it/s]
MLP Model Accuracy: 0.8900
Epoch 2000 - Loss: 0.0411: 100%|███████████████████████████████████████████████| 2000/2000 [01:11<00:00, 28.14it/s]
FastKAN Model Accuracy: 0.8800
{'MLP': 0.10999999999999999, 'KAN': 0.12, 'RF': 0.06000000000000005, 'LR': 0.09999999999999998}
Source: Image by author – error rates when training models on recovered data.

These error rates are way lower than the previous ones. Here, we can see that the models actually learned something!

How can this be?


Why Are These Results So Different?

What happened is that the bank's transformation worked extremely well – a bit too well, actually. It helped hide the original information not only from people but from machine learning models as well!

The combinations of the different features mixed up differently scaled features, and the exponentiation skewed all the data distributions. Both of these together acted as a "double whammy" that none of our models managed to untangle – not with the amount of data we got.

When we look at the model error rates side by side – trained on the original vs. the recovered data – we see a striking trend:

<script src="https://gist.github.com/BjBodner/9698fafabe073801d09f1c7671f37ae5.js"></script>
Source: Image by author – performance of models on original (left) and recovered (right) data, marked with "R_" prefix.
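
If you want to reproduce that side-by-side plot without the gist, a small grouped bar chart like the sketch below gets the point across – the numbers are just the error rates printed above:

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["MLP", "KAN", "RF", "LR"]
original = [0.565, 0.52, 0.475, 0.51]       # error rates on the bank's transformed data
recovered = [0.11, 0.12, 0.06, 0.10]        # error rates after reversing the transformations

x = np.arange(len(models))
width = 0.35
plt.bar(x - width / 2, original, width, label="original")
plt.bar(x + width / 2, recovered, width, label="R_ (recovered)")
plt.xticks(x, models)
plt.ylabel("error rate")
plt.legend()
plt.show()
```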

The models all achieved roughly 5x lower error rates, and the [RandomForestClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html) even showed an almost 8x improvement, from 47.5% to 6.0%!

You might also notice that the neural networks, [Fast-KAN](https://github.com/ZiyaoLi/fast-kan/tree/master) and [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron), performed slightly worse than the simpler [RandomForestClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [LogisticRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) models in both data scenarios. This is actually expected, since standard deep learning models are known to perform worse on such tabular data.

That said, the differences between the models are negligible compared to the performance boost we got from the feature engineering of reversing the transformations!


What Does This All Mean?

  1. Machine Learning models aren't magic. If you can't see any patterns in the data, it is not very likely that a machine learning model will be able to see them either, especially if data is limited.

When I first started out, I was obsessed with models. I'd spend hours tweaking hyperparameters and different architectures, hoping a specific combination would unlock huge performance gains – until a simple data fix outperformed my best models. This is exactly what happened here.

  2. Don't let others touch your data! Any tampering with your data can make or break your machine learning models, even if the changes seem harmless.

  3. And this is the most important one – every time you start a new project or hit a wall in terms of performance, remember that:

Data first always wins.

This is not to say that powerful models are useless. Often, you have to use powerful deep learning models, especially when handling unstructured data such as images, video, text, and audio.

However, even in these cases, the focus in academia and industry has gradually been shifting to put a larger emphasis on high-quality data from which the model can learn useful things. The Phi series of small language models (SLM) is a good example of this.

Final thoughts

Always prioritize knowing your data – this can sometimes be boring, but it is always worth your time.

It is the machine learning version of "work smarter, not harder".


Sources and Further Reading:

[1] Cramer, J. S, The origins of logistic regression (2002), Tinbergen Institute.

[2] Ho, Tin Kam, Random decision forests (1995), In Proceedings of 3rd international conference on document analysis and recognition, vol. 1, pp. 278–282. IEEE.

[3] McCulloch, Warren S., and Walter Pitts, A logical calculus of the ideas immanent in nervous activity (1943), The Bulletin of Mathematical Biophysics 5: 115–133.

[4] Li, Ziyao, Kolmogorov-arnold networks are radial basis function networks (2024), arXiv preprint arXiv:2405.06721.

[5] Van der Maaten, Laurens, and Geoffrey Hinton, Visualizing data using t-SNE (2008), Journal of machine learning research 9, no. 11.

[6] McInnes, Leland, John Healy, and James Melville, Umap: Uniform manifold approximation and projection for dimension reduction (2018), arXiv preprint arXiv:1802.03426.

[7] Xanthopoulos, Petros, Panos M. Pardalos, and Theodore B. Trafalis, Linear discriminant analysis (2013), Robust Data Mining: 27–33.

[8] Mika, Sebastian, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch, Kernel PCA and de-noising in feature spaces (1998), Advances in neural information processing systems 11.

[9] Lee, Te-Won, Independent component analysis (1998), Springer US.

Tags: AI Data Science Deep Learning Hands On Tutorials Machine Learning
