How I Use ChatGPT As A Data Scientist

Author:Murphy  |  View: 27467  |  Time: 2025-03-22 21:27:51

The whole world has heard of ChatGPT, yet according to a recent news study, only 2% of people use it daily.

I use it every day as a data scientist for all sorts of tasks. It has honestly doubled my productivity in some areas.

In this article, I want to explain how I use ChatGPT. Hopefully, it gives you some new ideas for using it more regularly to improve your output, regardless of whether you are a data scientist or not.

Studying & Learning

One of my favourite prompts is "explain like I'm 5" (ELI5). It gets ChatGPT to explain a topic in very simple terms, often more clearly than many online resources.

For example, let's run the prompt "explain to me recurrent neural networks like I'm 5 years old."

Example output from ChatGPT.

The analogy is excellent and really builds the intuition behind recurrent neural networks.

We can then start digging deeper and ask more specifically about mathematics, worked examples, etc. as I need a more practical and in-depth understanding of implementing RNNs as a data scientist.
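
As an illustration of the kind of worked example I ask for at this stage, here is a minimal sketch of a vanilla RNN forward pass in NumPy. It assumes a tanh activation; the function name rnn_forward, the weight names and the dimensions are illustrative choices of mine, not ChatGPT's actual answer.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    inputs: array of shape (timesteps, input_dim)
    W_xh:   input-to-hidden weights, shape (input_dim, hidden_dim)
    W_hh:   hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    b_h:    hidden bias, shape (hidden_dim,)
    """
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)  # the "memory" starts out empty
    hidden_states = []

    for x_t in inputs:
        # The new state mixes the current input with the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        hidden_states.append(h)

    return np.stack(hidden_states)


# Illustrative sizes: 4 timesteps, 3 input features, 2 hidden units
rng = np.random.default_rng(seed=0)
inputs = rng.normal(size=(4, 3))
W_xh = rng.normal(size=(3, 2))
W_hh = rng.normal(size=(2, 2))
b_h = np.zeros(2)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # one hidden state per timestep
```

The loop is the whole idea: each hidden state depends on the current input and on the state before it, which is how the network "remembers" earlier parts of the sequence.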

The principal value here is that I use ChatGPT like a personal tutor who can explain things in many different ways, so I can pick the explanation that works for me. This is really powerful and may save you a lot of time searching the internet for that one explanation that clicks. Try it out and let me know how you get on!

Unit Tests

It's no secret that most tech professionals don't like writing unit tests. They are very tedious but need to be done as they ensure your code is robust and build confidence when deploying it to production.

ChatGPT can greatly expedite this process by providing a testing template that you can tweak to test your code's edge cases.

Let's consider a function that returns the first n numbers of the Fibonacci sequence:

def fibonacci_sequence(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]

    for i in range(2, n):
        next_fibonacci_number = sequence[-1] + sequence[-2]
        sequence.append(next_fibonacci_number)

    return sequence

Let's input this into ChatGPT, asking it to write unit tests for our function:

Asking ChatGPT to write unit tests for us.

This is the full output for the tests generated from ChatGPT:

import unittest

def fibonacci_sequence(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]

    for i in range(2, n):
        next_fibonacci_number = sequence[-1] + sequence[-2]
        sequence.append(next_fibonacci_number)

    return sequence

class TestFibonacciSequence(unittest.TestCase):
    def test_negative_input(self):
        self.assertEqual(fibonacci_sequence(-1), [])

    def test_zero_input(self):
        self.assertEqual(fibonacci_sequence(0), [])

    def test_one_input(self):
        self.assertEqual(fibonacci_sequence(1), [0])

    def test_two_input(self):
        self.assertEqual(fibonacci_sequence(2), [0, 1])

    def test_five_input(self):
        self.assertEqual(fibonacci_sequence(5), [0, 1, 1, 2, 3])

    def test_ten_input(self):
        self.assertEqual(fibonacci_sequence(10), [0, 1, 1, 2, 3, 5, 8, 13, 21, 34])

if __name__ == '__main__':
    unittest.main()

I can confirm this works after running it in my PyCharm IDE.

PyCharm terminal output.

I can't tell you how much time this saves me. Just setting up the test alone takes a good proportion of time. This approach lets me focus on the essential parts: finding those edge cases that may break my function.
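
As a sketch of that tweaking step, here are a few extra checks I might add on top of the generated template. The test class and method names are my own illustrative choices, and the function is repeated only so the snippet runs on its own. Rather than comparing against hard-coded lists, these tests check the properties the output should always have.

```python
import unittest


def fibonacci_sequence(n):
    """Return the first n Fibonacci numbers (same logic as the function above)."""
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]

    for i in range(2, n):
        sequence.append(sequence[-1] + sequence[-2])

    return sequence


class TestFibonacciEdgeCases(unittest.TestCase):
    def test_length_matches_n(self):
        # The function should return exactly n numbers for any positive n
        for n in range(1, 30):
            with self.subTest(n=n):
                self.assertEqual(len(fibonacci_sequence(n)), n)

    def test_recurrence_holds(self):
        # Every term from the third onwards is the sum of the previous two
        sequence = fibonacci_sequence(50)
        for i in range(2, len(sequence)):
            self.assertEqual(sequence[i], sequence[i - 1] + sequence[i - 2])

    def test_large_negative_input(self):
        self.assertEqual(fibonacci_sequence(-1000), [])

    def test_float_input_raises(self):
        # range() rejects floats, so a non-integer n currently fails loudly
        with self.assertRaises(TypeError):
            fibonacci_sequence(5.5)
```

Saved in a test file, these run with python -m unittest. The last test documents current behaviour rather than a design decision; if the function should accept or explicitly reject floats, that is exactly the kind of edge case worth deciding on.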

If you want to learn more about unit tests, check out my previous post.

Pytest Tutorial: An Introduction To Unit Testing

Creating Visualisations

Creating plots is often quite lengthy, particularly if you are adding multiple lines and labels and trying to make the graph look nice overall. I have spent hours smashing my keyboard, getting Matplotlib to give me what I want. It's not fun, and I don't recommend it.

Before GPT-4 and GPT-4o, I would use GPT-3.5 to generate the Python code for my graph and then run that code in an IDE. Now it's even easier, thanks to a feature called Advanced Data Analysis. You hand it your data, and it creates the plot, displays it in the chat window, and shows the associated code.

For example, let's use a dataset I got from Kaggle, which contains Netflix Movies and TV Shows (CC0 licence). All I do is drag the file into ChatGPT, and I can ask it to explain the data for me:

The input data
Output from ChatGPT.

It also provides the Python code it used to load the data, so I can confirm what it is doing under the hood. However, in this case, it has identified the data quite well, so I don't need to dig in further.

I can now ask it to plot something using this data. Let's say we want a bar chart counting the number of show_id entries for each country in which a title was filmed.

Input to create the plot.

It writes the code out for us, then generates the plot in the chat window!

Note that the plot in the window will look different from the one the code will generate in matplotlib.

Show_id by country bar chart from ChatGPT.

As we can see, it's pretty messy because it has plotted every single country. Let's say we want only the top 10, with the rest grouped into an "others" category.

Updated prompt and code output from ChatGPT.
Show_id by country bar chart showing the top 10 and the rest in "others."

This looks much better, and the plot is of high quality, too!
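
For reference, the "top 10 plus others" grouping can be sketched in pandas roughly as follows. This is not the exact code ChatGPT produced; the helper name top_n_with_others, the tiny stand-in DataFrame, and the assumption that each row holds a single value in a country column are all mine.

```python
import pandas as pd


def top_n_with_others(df, column, n=10, others_label="Others"):
    """Count rows per value of `column`, keeping the n most common values
    and summing everything else into a single `others_label` bucket."""
    counts = df[column].value_counts()  # sorted descending, NaN dropped
    top = counts.iloc[:n]
    remainder = counts.iloc[n:].sum()

    if remainder > 0:
        top = pd.concat([top, pd.Series({others_label: remainder})])

    return top


# Tiny stand-in for the Netflix data (illustrative values only)
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "country": ["United States", "India", "United States",
                "Japan", "India", "France"],
})

summary = top_n_with_others(df, "country", n=2)
print(summary)

# To draw the bar chart (requires matplotlib):
# summary.plot(kind="bar", xlabel="Country", ylabel="Number of titles")
```

Doing the aggregation before plotting keeps the chart readable and makes the numbers easy to check by eye before anything is drawn.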

This is a straightforward example, but it shows you how powerful ChatGPT can be for data analysis and visuals. You also don't even need to be able to code to do this; it helps, but it is certainly not a requirement, which makes it really accessible to a wide audience.

Refactoring Code

Sometimes, I am lazy when I write code, and it's not as lovely and concise as it can be. This can lead to bugs and make it difficult to understand what's going on (a nightmare for most developers).

I usually ask ChatGPT to make my code "production standard," a prompt I found works well to clean it up and make it look more professional and overall nicer.

For example, consider this code I wrote in one of my previous posts that makes a time series stationary:

# Import packages
import plotly.express as px
import pandas as pd
import os
import numpy as np

# Read in the data
data = pd.read_csv('../../Software Engineering /make-example/AirPassengers.csv')

def plotting(title, data, x, y, save_file_path, x_label, y_label):
    """General function to plot the passenger data."""
    fig = px.line(data, x=data[x], y=data[y], labels={x: x_label, y: y_label})

    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)

    if not os.path.exists("../images"):
        os.mkdir("../images")

    fig.write_image("../images/" + str(save_file_path))

    fig.show()

# Plot the airline passenger data
plotting(title='Airline Passengers', data=data, save_file_path='passengers.png', x='Month',
         y='#Passengers', x_label='Date', y_label='Passengers')

# Take the difference and plot it
data["Passenger_Diff"] = data["#Passengers"].diff()

plotting(title='Airline Passengers', data=data,
         save_file_path='passengers_one_difference.png', x='Month', y='Passenger_Diff',
         x_label='Date', y_label='Passengers<br>Difference Transform')

# Take the log and plot it
data["Passenger_Log"] = np.log(data["#Passengers"])

plotting(title='Airline Passengers', data=data,
         save_file_path='passenger_log.png', x='Month', y='Passenger_Log',
         x_label='Date', y_label='Passenger<br>Log Transform')

# Take the difference and log and plot it
data["Passenger_Diff_Log"] = data["Passenger_Log"].diff()

plotting(title='Airline Passengers', data=data,
         save_file_path='passenger_difference_and_log.png', x='Month', y='Passenger_Diff_Log',
         x_label='Date', y_label='Passenger<br>Log and Difference')

Let's input it into ChatGPT and ask it to make it "production standard" and "tidier."

The code it generated looks like this:

# Import packages
import plotly.express as px
import pandas as pd
import os
import numpy as np

# Read in the data
data = pd.read_csv('../../Software Engineering /make-example/AirPassengers.csv')

def plot_data(title, data, x, y, save_file_path, x_label, y_label):
    """
    General function to plot the passenger data.

    Parameters:
    - title: Title of the plot
    - data: DataFrame containing the data
    - x: Column name for the x-axis
    - y: Column name for the y-axis
    - save_file_path: File path to save the image
    - x_label: Label for the x-axis
    - y_label: Label for the y-axis
    """
    fig = px.line(data, x=x, y=y, labels={x: x_label, y: y_label})

    fig.update_layout(
        template="simple_white", 
        font=dict(size=18),
        title_text=title, 
        width=650,
        title_x=0.5, 
        height=400
    )

    output_dir = "../images"
    os.makedirs(output_dir, exist_ok=True)
    fig.write_image(os.path.join(output_dir, save_file_path))
    fig.show()

# Plot the airline passenger data
plot_data(
    title='Airline Passengers', 
    data=data, 
    x='Month', 
    y='#Passengers', 
    save_file_path='passengers.png', 
    x_label='Date', 
    y_label='Passengers'
)

# Create transformed columns and plot
transforms = [
    ("Passenger_Diff", data["#Passengers"].diff(), 'passengers_one_difference.png', 'Passengers<br>Difference Transform'),
    ("Passenger_Log", np.log(data["#Passengers"]), 'passenger_log.png', 'Passenger<br>Log Transform'),
    ("Passenger_Diff_Log", np.log(data["#Passengers"]).diff(), 'passenger_difference_and_log.png', 'Passenger<br>Log and Difference')
]

for col_name, transform, file_name, y_label in transforms:
    data[col_name] = transform
    plot_data(
        title='Airline Passengers',
        data=data,
        x='Month',
        y=col_name,
        save_file_path=file_name,
        x_label='Date',
        y_label=y_label
    )

As you can see, the formatting follows PEP 8 guidance, the function has a proper docstring, and the repeated transform-and-plot calls have been folded into a single loop, in keeping with the DRY principle. Overall, a good job!

Python does have dedicated formatting tools such as isort and Black, but I find GPT often does a quicker and better job off the bat.

If you want to learn more about code quality, check out my previous post.

A Data Scientist's Guide To Improving Python Code Quality

Summary & Further Thoughts

Using ChatGPT has boosted my productivity in many areas, such as learning new things, writing unit tests, doing analysis and refactoring code. I hope this article gave you some ideas to try in your own field of work. While I think it won't replace us, it is a powerful tool that you should try to integrate into your workflow as much as possible.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE Data Science resume!

Dishing The Data | Egor Howell | Substack
