Python Meets Pawn 2: Clustering Chess Grandmasters based on their Openings

Author:Murphy  |  View: 25322  |  Time: 2025-03-22 23:37:58
photo created by Midjourney

What questions am I answering

My passion for chess is no secret, and here, I've shared analyses of my own game openings. But today, I venture into new territory: the world of Grandmasters. What openings do they commonly use? How varied are their choices? I'm interested in how these openings are distributed among different Grandmasters. Do top players favor similar openings? Is it possible to group them based on their preferences? I do not know, so let's find out!


Part 1: Getting data

A great aspect of chess is the accessibility of its data. There are many sources, including pgnmentor, where you can view and download data about openings and players for free. This data, updated several times a year, includes games in Portable Game Notation (PGN), the most popular format for chess games. Since games are downloaded one player at a time, I chose 11 well-known Grandmasters and downloaded their games to analyze their openings. Please note, this list is subjective and includes some of my favorite Grandmasters:

  1. Shakhriyar Mamedyarov
  2. Teimour Radjabov
  3. Hikaru Nakamura
  4. Magnus Carlsen
  5. Fabiano Caruana
  6. Ding Liren
  7. Ian Nepomniachtchi
  8. Viswanathan Anand
  9. Anish Giri
  10. Vugar Gashimov
  11. Vladimir Kramnik

The complete code will be provided at the end of the blog. For parsing PGN files, I used the pgn module from the Python chess library (python-chess, imported as chess).

The function that I have used for parsing data looks like this:

import chess.pgn
import pandas as pd


def parse_pgn_file(file_path):
    """
    Parses a PGN (Portable Game Notation) file containing chess games.

    Args:
        file_path (str): Path to the PGN file.

    Returns:
        pd.DataFrame: A DataFrame containing game information.
    """
    games = []  # Initialize an empty list to store parsed games.
    with open(file_path, "r") as pgn_file:
        while True:
            game = chess.pgn.read_game(pgn_file)  # Read a game from the PGN file.
            if game is None:
                break  # Exit the loop when no more games are found.
            games.append(game)  # Append the parsed game to the list.

    data = []  # Initialize an empty list to store game data.
    for game in games:
        data.append({
            "Event": game.headers.get("Event", ""),
            "Date": game.headers.get("Date", ""),
            "Result": game.headers.get("Result", ""),
            "White": game.headers.get("White", ""),
            "Black": game.headers.get("Black", ""),
            "Moves": " ".join(str(move) for move in game.mainline_moves()),
            "ECO": game.headers.get("ECO", "")
        })  # Extract relevant information from game headers and moves.

    df = pd.DataFrame(data)  # Create a DataFrame from the extracted data.
    return df  # Return the DataFrame containing game information.

The table below shows how my parsed and combined data appears. I will use the existing "ECO" column, which indicates the opening played in each game. The ECO code in chess refers to the "Encyclopaedia of Chess Openings," a classification system used to categorize the various openings in chess. Each code, consisting of a letter followed by two digits, like B12 or E97, identifies a specific opening or variation.

Parsed dataset (Image by the author)
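The ECO format described above is easy to check in code. The snippet below is a small sketch, not part of the original notebook: the names `is_valid_eco` and `ALL_ECO_CODES` are made up for illustration. It relies on the fact that the classification spans the five letters A through E, each followed by two digits.

```python
import re

# An ECO code is one letter from A to E followed by two digits (A00 through E99).
ECO_PATTERN = re.compile(r"[A-E][0-9]{2}")


def is_valid_eco(code: str) -> bool:
    """Return True if `code` is a well-formed ECO code such as 'B12' or 'E97'."""
    return ECO_PATTERN.fullmatch(code) is not None


# Enumerate every possible code: 5 letters x 100 two-digit numbers.
ALL_ECO_CODES = [f"{letter}{number:02d}" for letter in "ABCDE" for number in range(100)]

print(is_valid_eco("B12"), is_valid_eco("F01"), len(ALL_ECO_CODES))
```

A check like this can be handy for filtering out games whose "ECO" header is empty or malformed before counting openings.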

Combined, the Grandmasters' thousands of games feature 484 unique ECO codes. Given that there are 500 ECO codes in total, these 11 Grandmasters have used almost the entire range in their careers. However, how many unique openings has each one played? Let's examine the following graph:

Unique openings graph (Image by the author)

These numbers are highly correlated with the number of games each Grandmaster has in my dataset, but in general, the graph indicates that Grandmasters employ a wide variety of openings in their games.
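Counts like these can be obtained with a pandas group-by. The sketch below uses a tiny made-up dataset, and it assumes the combined data has a column identifying whose file each game came from; the column name `Grandmaster` is hypothetical, since the parsed table only has "White" and "Black".

```python
import pandas as pd

# Toy stand-in for the combined dataset; the rows are invented for illustration.
games = pd.DataFrame({
    "Grandmaster": ["Carlsen", "Carlsen", "Carlsen", "Anand", "Anand", "Giri"],
    "ECO": ["D37", "C65", "D37", "B90", "C42", "B90"],
})

# Number of distinct ECO codes each Grandmaster has played.
unique_per_gm = games.groupby("Grandmaster")["ECO"].nunique()

# Number of games per Grandmaster, to compare against the unique counts.
games_per_gm = games["Grandmaster"].value_counts()

# Number of distinct ECO codes across all players combined.
total_unique = games["ECO"].nunique()

print(unique_per_gm)
print(games_per_gm)
print("Unique ECO codes overall:", total_unique)
```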


Part 2: Feature Engineering

Let's begin by looking at the most favored openings for each Grandmaster:

  • B90 – Sicilian Defense, Najdorf variation : Anand, Giri, Nepomniachtchi
  • D37 – Queen's Gambit Declined : Carlsen, Mamedyarov, Radjabov
  • C42 – Russian Game : Gashimov, Kramnik
  • A05 – King's Indian Attack : Nakamura
  • C65 – Spanish Game, Berlin Defense : Caruana
  • E60 – Gruenfeld and Indian Game : Ding

I guess it's unsurprising to see a Russian Grandmaster favor the Russian Game. Gashimov also favored the Russian Game, indicating the Soviet chess school's strong influence in Azerbaijan. It is intriguing to notice some patterns based on favorite openings alone. However, to achieve a more detailed and fine-grained grouping, I will apply clustering techniques, considering a range of other openings as well.

Let's examine the distribution of openings for each Grandmaster. I pivoted the dataset so that each Grandmaster is a row (the index), each unique ECO code is a column, and the values are the number of games. The graph below shows the example for Magnus Carlsen:

Distribution of openings for Magnus (Image by the author)
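The pivot described above can be sketched with `pd.crosstab`. This is an illustration on invented rows, not the original notebook code, and the `Grandmaster` column name is an assumption. Passing `normalize="index"` turns the counts into each opening's share of a Grandmaster's games, which is one plausible reading of a "proportion" table.

```python
import pandas as pd

# Toy stand-in for the combined dataset; the rows are invented for illustration.
games = pd.DataFrame({
    "Grandmaster": ["Carlsen", "Carlsen", "Carlsen", "Anand", "Anand"],
    "ECO": ["D37", "D37", "C65", "B90", "C65"],
})

# Rows: Grandmasters, columns: ECO codes, values: number of games.
counts = pd.crosstab(games["Grandmaster"], games["ECO"])

# Same table, but each row is divided by its total so that it sums to 1.
proportions = pd.crosstab(games["Grandmaster"], games["ECO"], normalize="index")

print(counts)
print(proportions.round(2))
```

Working with proportions rather than raw counts keeps players with many games in the dataset from dominating the comparison.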

Despite the variety of openings played by the Grandmasters, it's evident that some openings are played far more often than others. Most Grandmasters seem to favor about five particular openings, which influenced my decision to focus on a dataframe featuring the top 5 openings.

For clustering, I chose to test two dataframes: the pivoted proportion and the top 5 openings. The best results were achieved using the latter one, which I'll explain in detail below. For more options and detailed insights, please refer to the complete code provided at the end. In the top 5 openings dataframe, I employed one-hot encoding. Among the 11 Grandmasters, there were 24 unique ECO codes in the top 5 selections. The binary values in this dataframe indicate whether a specific ECO code is among the top 5 for each Grandmaster:

Top5 dataframe (Image by the author)
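One way to build such a binary table from a counts table is sketched below. The helper name `top_n_one_hot` and the numbers are invented for illustration, and the toy example uses the top 2 instead of the top 5 to stay small.

```python
import pandas as pd


def top_n_one_hot(counts: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Mark with 1 the ECO codes that are among each player's n most played."""
    top_codes = {}
    for gm, row in counts.iterrows():
        played = row[row > 0]  # ignore openings the player never used
        top_codes[gm] = played.nlargest(n).index.tolist()

    # Columns are the union of everybody's top-n codes.
    columns = sorted({code for codes in top_codes.values() for code in codes})
    one_hot = pd.DataFrame(0, index=counts.index, columns=columns)
    for gm, codes in top_codes.items():
        one_hot.loc[gm, codes] = 1
    return one_hot


# Toy counts table (games per Grandmaster and ECO code); numbers are made up.
counts = pd.DataFrame(
    {"B90": [10, 2], "C42": [6, 1], "C65": [4, 7], "D37": [1, 9]},
    index=["Anand", "Carlsen"],
)
one_hot = top_n_one_hot(counts, n=2)
print(one_hot)
```

On the full dataset, the number of columns in the result corresponds to the number of unique ECO codes appearing in anyone's top 5.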

The table below shows the top 5 ECOs for each Grandmaster. We can already see some patterns, but clustering will help us distinguish them more effectively.

Top 5 openings result for each GM (Image by the author)

Part 3: Clustering

The top 5 favorite openings dataset contained 24 columns. To simplify it, I applied PCA (Principal Component Analysis). This method helps in reducing data dimensions while preserving crucial information. While the first principal component provided good results, I opted for two components. Why? They offered nearly the same insight and, importantly, made visualization easier.
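As a rough sketch of this step, assuming scikit-learn is the library in use (the text does not name it), a binary matrix can be reduced to two components as follows. The small matrix is invented and merely stands in for the 11 x 24 top 5 table.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy binary matrix: 6 players (rows) x 5 openings (columns), invented values.
one_hot = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
])

# Project every player onto the two directions of largest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(one_hot)

print(coords.shape)  # one (x, y) pair per player
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The `explained_variance_ratio_` attribute is a quick way to see how much information the two components retain compared with one.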

For grouping Grandmasters, I used K-means clustering. It's like sorting books into genres. First, I chose a number of clusters, or 'genres'. Each Grandmaster's opening style is then matched to the closest cluster, like assigning books to the most fitting genre. The process keeps adjusting: cluster centers, representing the common style of each group, are recalculated and Grandmasters are reassigned accordingly. This repeats until the assignments stop changing. Through K-means, distinct patterns in chess openings emerged, highlighting varied strategies among the Grandmasters.
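To make the assign-and-recalculate loop concrete, here is a from-scratch sketch in NumPy. It is only an illustration of the idea on invented 2D points; the analysis itself presumably relied on a library implementation, and the function name `kmeans` is my own.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New center = mean of its members; an empty cluster keeps its old center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers


# Two clearly separated groups of invented points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the starting centers are random, K-means can settle on different groupings across runs on less clear-cut data, which is why library implementations usually try several initializations.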

Choosing the right number of clusters is key in any clustering project. For this, I used the elbow method. It's a straightforward approach to determine the ideal number of clusters for grouping data. You calculate the "within-cluster sum of squares" (WCSS) for a range of cluster counts and plot a graph where each point represents a different number of clusters. WCSS measures how close the data points in a cluster are to the cluster center. On the graph, there's a point beyond which adding more clusters doesn't significantly reduce WCSS. This point, resembling an elbow, indicates the best number of clusters. It ensures a balance between a manageable number of clusters and closely grouped data points. The graph below demonstrates that the optimal number is 4 in our case.

Elbow method to decide the best number of clusters (Image by the author)
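The WCSS values behind such a plot can be computed as sketched below, assuming scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squares. The data is randomly generated around three invented centers, so the elbow in this toy case sits at 3 rather than 4.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs of 10 points each (invented data).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in blob_centers])


def wcss_curve(points, max_k):
    """Fit K-means for k = 1..max_k and collect the WCSS of each fit."""
    wcss = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        wcss.append(model.inertia_)
    return wcss


wcss = wcss_curve(points, max_k=6)
for k, value in enumerate(wcss, start=1):
    print(f"k={k}: WCSS={value:.2f}")
```

Plotting `wcss` against `k` gives the elbow curve; the drop flattens once the number of clusters reaches the number of natural groups in the data.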

With the number of clusters determined, I clustered the grandmasters. To assess the effectiveness of my clustering, I used the silhouette score. This score measures how similar an object is to its own cluster compared to other clusters. A high silhouette score indicates well-clustered data. The score ranges between -1 and 1, and I achieved a score of 0.69, indicating effective clustering.
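For reference, a silhouette score can be computed as in the sketch below, assuming scikit-learn's `silhouette_score`. The points are invented and deliberately well separated, so the score here is much higher than one should expect on real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two clearly separated groups of invented points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Cluster first, then score how well each point fits its assigned cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
score = silhouette_score(points, labels)

print(f"Silhouette score: {score:.2f}")
```

For each point, the score compares the average distance to members of its own cluster with the average distance to the nearest other cluster, then averages over all points.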

Finally, I visualized the clustered data and the centroids (the ‘center' of each cluster) in a two-dimensional space. This step turned complex data into an easily understandable and visually appealing format, perfect for seeing patterns and differences at a glance:


Results and interesting facts

My analysis revealed that Grandmasters exhibit a broad repertoire in chess openings, yet they have certain preferences that differ among them. Clustering them based on these openings was not only feasible but also yielded intriguing insights. For instance, Azerbaijani chess legends Mamedyarov and Radjabov were grouped together. Interestingly, Anand, Giri, and Caruana were also closely clustered. A closer look at their top 5 favorite openings confirms these results. Remarkably, Anand and Giri share the exact same top 5 openings. Could this suggest Giri's admiration for Anand? Indeed, after researching on the internet, I discovered that Giri greatly admired Anand and learned from his games. Below are those openings:

  • B90 – Sicilian Defense, Najdorf variation
  • C50 – Italian Game
  • C42 – Russian Game
  • C65 – Spanish Game, Berlin Defense
  • C67 – Spanish Game, Berlin Defense, other variations

The complete code, along with the Jupyter notebook file, can be found here.

