My First Exploratory Data Analysis with ChatGPT
ChatGPT is an extraordinary tool for working more efficiently, and data analytics is no exception. In this article we'll walk through an example of exploratory data analysis (EDA) carried out by ChatGPT. We'll cover the various stages of an EDA, see some impressive outputs (word clouds!) and note where ChatGPT does well (and not so well). Finally, we'll touch on the future of LLMs in analytics and why we're excited about it.
The dataset used for the analysis is a sample from Common Crawl, which is free for anyone to access and analyse. Common Crawl is a vast collection of web crawl data comprising billions of web pages. It includes various types of web content and is regularly updated. It is a significant resource for training large language models (LLMs): a filtered version made up about 60% of the training mix for GPT-3, the model family that ChatGPT is built on. You can find the dataset sample curated by the author hosted on Kaggle here.
Outputs are truncated throughout the post, so feel free to follow along directly in the Google Colab notebook used to run this analysis.
We've broken down the analysis into five sections: