Top 30 GitHub Python Projects At The Beginning Of 2024

Happy New Year 2024! As the first post of the new year, and just as I have done before, I'm curious about which Python projects have been the most popular so far. GitHub is definitely the most suitable place to gather these statistics. Although not every open-source project is maintained there, there is no other single place better suited for this purpose.
This ranking is easy to reproduce because I'll share my code. Now, let's have a look at how we can get the ranked list from the GitHub API with a few lines of code. After that, I'll categorize these projects using my own terminology and add a short introduction to each.
The Top 30 GitHub projects are categorized as follows:
- 2 repositories: Machine Learning Frameworks
- 3 repositories: AI-driven Applications
- 8 repositories: Software Development Frameworks
- 2 repositories: Development Productivity Tools
- 3 repositories: Useful Information Catalog
- 8 repositories: Educative Content
- 4 repositories: Real-World Applications
GitHub Search API

The official API documentation can be found on this page:
So, I won't repeat the details of the API endpoint, such as its parameters, in this article. If you are interested in what else we can do with the GitHub API, please refer to that page.
The most beautiful thing is that we don't need to register or apply for an API key to use this endpoint. Of course, it has a rate limit of up to 10 requests per minute for unauthenticated calls, but that is not a problem for us, as we only want to get several top-ranked repos. A few API calls will be enough, even for debugging.
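If you ever do need more than a handful of calls, that 10-requests-per-minute budget can be respected with a tiny pacing helper. This is just a hypothetical sketch (the function name and constant are mine, not from the GitHub docs):

```python
MIN_INTERVAL = 60 / 10  # seconds between unauthenticated search requests

def seconds_to_wait(last_call_time: float, now: float) -> float:
    """Return how long to sleep before the next request is allowed."""
    elapsed = now - last_call_time
    return max(0.0, MIN_INTERVAL - elapsed)
```

Feeding it two `time.monotonic()` readings and sleeping for the returned value keeps a loop of requests safely under the limit.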
First of all, we need the requests module. It is a third-party library rather than a built-in one, but it is so widely used that most of you should already be familiar with it. Then, we need Pandas to do some transformation of the data.
import requests
import pandas as pd
The API endpoint is https://api.github.com/search/repositories based on the API documentation. Since we are only interested in Python-based projects, we need to put the qualifier language:python in the query. It seems that the API behaviour has changed slightly over the last year: it now has to take at least one search term (string) in the query; otherwise, the results are unpredictable. Therefore, let's put "python" as the only term in the query. The rationale is that, if a repo is about Python, it should mention "python" in its README.md or use it as a topic tag anyway.
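Rather than hand-encoding the URL, we can also let requests build it from a params dict. The sketch below uses urlencode to show what that encoding looks like; `requests.get(base_url, params=params)` would do the same internally:

```python
from urllib.parse import urlencode

base_url = "https://api.github.com/search/repositories"
params = {
    "q": "python language:python",  # search term plus language qualifier
    "sort": "stars",
    "order": "desc",
}

# requests percent-encodes the qualifier for us; the result is the same URL
# we would otherwise type by hand.
full_url = f"{base_url}?{urlencode(params)}"
```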
Then, we want to sort the search results by the number of stars, in descending order. That means our query string is as follows.
q=python+language:python&sort=stars&order=desc
The full URL for this GET request is as follows.
url = 'https://api.github.com/search/repositories?q=python+language:python&sort=stars&order=desc'
Then, we can use the requests module to call this API endpoint with the GET method, and convert the results to a Python dictionary.
# Call the search endpoint; fail fast on rate-limit or other HTTP errors
res = requests.get(url, timeout=30)
res.raise_for_status()
res_dict = res.json()
All the search results are in an array under the key "items". So, we can get all the repo information into a list repos as follows.
repos = res_dict['items']
len(repos)  # 30, the default page size of the search endpoint
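For reference, each element of repos is a plain dictionary. The sketch below shows only the fields we will keep later; the values are illustrative, not real API output:

```python
# An illustrative item from the "items" array (values are made up):
sample_repo = {
    "name": "pytorch",
    "full_name": "pytorch/pytorch",
    "html_url": "https://github.com/pytorch/pytorch",
    "created_at": "2016-08-13T05:26:41Z",
    "stargazers_count": 74000,
    "watchers": 74000,
    "forks": 20000,
    "open_issues": 13000,
}

# These are exactly the columns we will select into the DataFrame later:
wanted = ["name", "full_name", "html_url", "created_at",
          "stargazers_count", "watchers", "forks", "open_issues"]
```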

There is some other information in the result dictionary. If we remove the items array, the only two keys left are total_count
and incomplete_results
. The former indicates how many repos were retrieved by our query. As shown in the screenshot below, there are a total of 2,219,756 of them. That is not surprising at all, since we are searching for all Python repos on GitHub.

The incomplete_results
flag indicates whether the search timed out before GitHub could gather all matching repos. Separately, the results are paginated, so this JSON payload is just one page of them.
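Putting the two keys together, a trimmed payload looks like this (the count is the one from my run; items is emptied to keep the sketch small):

```python
# A trimmed stand-in for the search response payload:
res_dict = {
    "total_count": 2219756,       # repos matching the query across GitHub
    "incomplete_results": False,  # True if the search timed out
    "items": [],                  # the page of repos we extracted above
}

total = res_dict["total_count"]
is_partial = res_dict["incomplete_results"]
```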
Now, let's convert the items array into a Pandas DataFrame.
repo_df = pd.DataFrame(repos)
Then, I want to remove all the columns that are not of interest in our context, since we only want to know the name of the repo and the number of stars. I'll also add one more column called years_on_github
to capture how many years the project has existed on GitHub.
repo_df = repo_df[['name', 'full_name', 'html_url', 'created_at', 'stargazers_count', 'watchers', 'forks', 'open_issues']]
repo_df['created_at'] = pd.to_datetime(repo_df['created_at'])
repo_df['created_year'] = repo_df['created_at'].dt.year
repo_df['years_on_github'] = 2023 - repo_df['created_year']  # full years as of the end of 2023
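To see the transformation without hitting the API, here is the same logic applied to a tiny hand-made frame (two rows using the real creation dates of PyTorch and Scikit-Learn; star counts are rounded):

```python
import pandas as pd

# A minimal stand-in for the API items (illustrative star counts):
repos = [
    {"name": "pytorch", "created_at": "2016-08-13T05:26:41Z", "stargazers_count": 74000},
    {"name": "scikit-learn", "created_at": "2010-08-17T09:43:38Z", "stargazers_count": 57000},
]
repo_df = pd.DataFrame(repos)
repo_df["created_at"] = pd.to_datetime(repo_df["created_at"])
repo_df["created_year"] = repo_df["created_at"].dt.year
repo_df["years_on_github"] = 2023 - repo_df["created_year"]
```

PyTorch, created in 2016, gets 7 years on GitHub; Scikit-Learn, created in 2010, gets 13.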
Here is the full list of the Top 30 Repos:

Now, let's look into these repos. What are they? How can they help?
Machine Learning Frameworks

Machine Learning Frameworks refers to the essential tools and libraries for developing and training machine learning models. They are used by Data Scientists, Machine Learning Engineers, and Researchers every day.
This year, there are 2 of them listed in the Top 30.
1. PyTorch (7th ranked, 74k stars)

PyTorch is one of the most popular Machine Learning Frameworks, developed by Facebook's AI Research lab. It is very commonly used in all kinds of deep-learning workloads. Compared to other popular frameworks such as TensorFlow, it is generally more flexible and easier to use.
Another strength of PyTorch is its GPU acceleration, which can significantly reduce the training time for large models.
Perhaps you are wondering why TensorFlow is not listed here. Indeed, it has even more stars. However, the core of TensorFlow is written in C++ rather than Python, so it is out of the scope of this survey.
2. Scikit-Learn (14th ranked, 57k stars)

Scikit-Learn, which is commonly known as sklearn
, is famous for its foundational capabilities such as classification, regression, clustering, and dimensionality reduction. It is primarily used in classical machine learning scenarios. It is highly recommended that newcomers learn the classical algorithms before diving into deep learning, in order to build a solid fundamental understanding of this domain.
In my opinion, "traditional" doesn't necessarily mean "outdated". In ML applications that require high reproducibility and explainability, these algorithms are still indispensable.
AI-Driven Applications

These repos are innovative projects that make use of recent AI breakthroughs. If you had told me two years ago about any one of these repos and what it could do, I would probably have laughed at such a sci-fi joke, but now they have come true.
1. Real-Time-Voice-Cloning (18th ranked, 49k stars)
This project is an implementation of the SV2TTS (Speaker Verification to Multispeaker Text-To-Speech Synthesis) model with a real-time vocoder, originally developed by the author as a master's thesis (really impressive!).
It first takes a few seconds of a human voice to create a digital representation of it. Then, it can generate speech from any given text very quickly; the speech can even be generated in real time.
2. gpt-engineer (19th ranked, 48k stars)
This impressive project is the only one in the Top 30 that was created within the last year. That means, within only a few months, it made its way into the top tier!
GPT-Engineer allows users to specify what they want the program to do in natural language. Then, it leverages AI to build the software gradually, understanding the problem space and asking follow-up questions to clarify the requirements. Therefore, it streamlines almost the entire software development lifecycle without any manual coding. Although it can't build comprehensive software right now, it can still build relatively simple software impressively well. Maybe you can try dividing your software into modules and using this tool on each of them.