How to Test Graph Quality to Improve Graph Machine Learning Performance
This article will show you how to test the quality of your topological graphs.
Graphs are data structures capable of representing a large amount of information. In addition to representing data samples individually as nodes, a graph also captures the relationships between those samples, encapsulating more of the information stored in your dataset. When creating a graph, however, it is important to verify its quality, which is what I will show you how to do in this article.

Motivation
The motivation for this article is a project I am working on in which I create graphs. Later in my pipeline, the graphs are used to perform clustering, as seen in the pipeline image below. To ensure the correctness of my graphs, I want a test that can output the quality of each graph I create. When working on machine-learning projects, verifying your results and data quality is vital, both for saving time on bug fixing and for ensuring that your data pipeline is working correctly. The verification result can act as a sanity check, so you can be sure the graph is not the issue if your machine-learning algorithm is not performing as expected.

Furthermore, I also want to narrow the scope of what I will be talking about. First of all, when referring to a graph, I mean a graph defined purely by its topological structure, meaning I am only referring to the relationships between the data. A graph purely defined by its topological structure can be represented with two lists: one list of all node indices, and one list of all edges (which could also include edge weights), a 2D list with each row containing (source, destination, weight). If your graph is unweighted, you can ignore the weight or set all weights to 1. Secondly, I am using my graph to separate different classes from each other, which will be reflected in the types of tests I use on my graph. Graphs defined by topological information only can also be used for several other purposes, like shortest-path problems or link prediction, which I will cover in a later article.
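To make this concrete, here is a tiny sketch of the two-list representation in Python (the values are made up):
# a tiny sketch of the two-list representation (made-up values)
nodes = [0, 1, 2, 3]  # one index per data sample
edges = [
    (0, 1, 0.9),  # each row is (source, destination, weight)
    (1, 2, 0.7),
    (2, 3, 1.0),
]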

Table of Contents
· Motivation
· Table of Contents
· How to create a graph
∘ Finding a dataset
∘ Converting your dataset to a graph
· Test 1: Downstream tasks
∘ Use downstream tasks when
∘ Do not use downstream tasks when
· Test 2: Graph metrics
∘ Use graph metrics when
∘ Do not use graph metrics when
· Test 3: Visual inspection
∘ Use visual inspection when
∘ Do not use visual inspection when
· Test 4: Node2Vec + Clustering
∘ Use Node2Vec + Clustering when
∘ Do not use Node2Vec + Clustering when
· What makes a good graph?
· Conclusion
How to create a graph
First, you need a graph you can test on. To do this, you need a dataset to make a graph out of, and if the dataset is not already a graph in itself, you need to make a graph from the data.
Finding a dataset
Kaggle has a lot of free datasets that people create and upload to the website. On Kaggle you can find both premade graph datasets and datasets you can create a graph out of yourself, like the famous MNIST dataset (Kaggle is not the original source of MNIST, but this shows that Kaggle contains all sorts of datasets freely available for you to download). Another option is Stanford University's data website, which has a lot of quality datasets, for example a Twitter social graph. You can also use sites such as PyTorch, HuggingFace, PapersWithCode, and GitHub to find open-source datasets. To learn more about creating data, you can read my article on creating powerful embeddings below:
How to Create Powerful Embeddings from Your Data to Feed into Your AI
Converting your dataset to a graph
After you have obtained your dataset, you then have to convert it to a graph defined by topological information. To do this, you first need to define different data samples. The data samples can be images like in MNIST, or an item from a supermarket defined by different attributes like price, producer, and name.
When you have your data samples, you can convert them to a graph with a few methods. One is to calculate the similarity between sample vectors with cosine similarity. The similarity between two samples then defines an edge between them, with the weight of the edge being the similarity. Additionally, you can apply thresholding, so you only add an edge if the similarity is above a certain threshold; for example, two data samples must have at least a 0.9 similarity before you create an edge between them. You could also use a percentile instead, by, for example, only giving the top 20% most similar pairs of data samples an edge. An important note here, however, is that with percentiles the edges depend on your data, since you are taking a top percentage of all pairs, while a fixed threshold is independent of your data. Also be careful not to mix up similarity and distance when calculating edges: with cosine similarity, higher values mean more similar, so you keep pairs above the threshold, whereas with a distance measure the logic flips.
# code to create a graph based on a matrix of embeddings
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

num_samples = 10
num_features = 10
embeddings = np.random.rand(num_samples, num_features)  # 10 data samples, each with a length of 10
similarity_matrix = cosine_similarity(embeddings)

edges = []
weights = []
# only keep pairs above the 50th percentile of similarity
threshold = np.percentile(similarity_matrix, 50)
for i in range(len(similarity_matrix)):
    for j in range(i + 1, len(similarity_matrix)):  # we only need the upper triangle of the matrix
        if similarity_matrix[i, j] >= threshold:
            edges.append((i, j))  # assuming we are looking at an undirected graph
            weights.append(similarity_matrix[i, j])

# make a networkx graph with the edges and weights
G = nx.Graph()
G.add_nodes_from(range(num_samples))
for (source, destination), weight in zip(edges, weights):
    G.add_edge(source, destination, weight=weight)
nx.draw(G)
plt.show()
Which will draw a graph like below:

Lastly, you could also create edges with a clustering method like KMeans, for example by connecting samples that are assigned to the same cluster. For my use case, however, this is not recommended, since the goal of my graph is to apply clustering to it, and I therefore do not want my graph to depend on another clustering method.

Test 1: Downstream tasks
The first test you can use for the quality of your graph is applying downstream tasks and measuring their performance with different metrics. For example, you can use a community detection downstream task, with metrics like AMI, NMI, and the Rand Index. The downstream task you can use will depend on what you are using the graph for, a topic I went more in-depth on in another Towards Data Science article. Since I am using my graph to separate different classes from each other, my downstream task will be community detection (clustering applied to a graph defined by topological information).
I also took further information about my dataset into account when choosing a downstream task. Two main properties define the dataset I am using:
- It is weighted, meaning I need a community detection algorithm that utilizes weighted edges
- I have a specific number of communities to detect, so my community detection algorithm needs an option to set the exact number of communities
A perfect candidate for my requirements was the spectral clustering approach, which Scikit-Learn implements. I could then use the following code to detect communities in my graph:
# code to perform spectral clustering
import numpy as np
from sklearn.cluster import SpectralClustering

n_communities = 3
num_nodes = 10  # number of nodes in your graph
# random placeholder; in practice, use your graph's weighted adjacency matrix
similarity_matrix = np.random.rand(num_nodes, num_nodes)
# the similarity matrix should be symmetric, so we symmetrize it
similarity_matrix = (similarity_matrix + similarity_matrix.T) / 2
np.fill_diagonal(similarity_matrix, 0)

sc = SpectralClustering(n_clusters=n_communities, affinity='precomputed', assign_labels='kmeans', random_state=42)
labels = sc.fit_predict(similarity_matrix)
Which you can plot to a graph with:
# plot the graph with assigned labels
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_nodes_from(range(num_nodes))

# if you already have your own edges, you can skip the loops here
edges = []
weights = []
for i in range(len(similarity_matrix)):
    for j in range(i + 1, len(similarity_matrix)):  # we only need the upper triangle of the matrix
        edges.append((i, j))  # assuming we are looking at an undirected graph
        weights.append(similarity_matrix[i, j])

# add edges to the graph
for (source, destination), weight in zip(edges, weights):
    G.add_edge(source, destination, weight=weight)

# draw the graph, coloring each node by its detected community
pos = nx.spring_layout(G, k=1)
nx.draw(G, pos, with_labels=True, node_color=labels)
plt.show()
Which will plot a graph like this:

You then have to find metrics to measure how well the downstream task performed. Since I am using community detection, natural metrics are AMI, NMI, and the Rand Index; note that these metrics compare the detected communities against ground-truth labels, so you need labeled data to use them. The quality of your graph can then be interpreted from the metrics. You should note, however, that the metrics for a single graph do not necessarily mean that much on their own; instead, you can compare the metrics across different graphs to see which of your graphs performs best.
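As a minimal sketch, assuming ground-truth labels are available for your nodes, the three metrics can be computed with Scikit-Learn:
# score detected communities against ground-truth labels (assumed available)
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_rand_score)

true_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]  # hypothetical ground truth, one label per node
# labels come from sc.fit_predict(...) in the spectral clustering code above
print("AMI:", adjusted_mutual_info_score(true_labels, labels))
print("NMI:", normalized_mutual_info_score(true_labels, labels))
print("ARI:", adjusted_rand_score(true_labels, labels))  # adjusted variant of the Rand Index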
Note that if you are using community detection, you should make sure your graph is connected. That is, there is only one island of nodes, so that for all node pairs there exists a path between them. If a graph is not connected, community detection will often perform poorly.
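As a quick sketch, you can check this in NetworkX before running community detection:
# verify the graph consists of a single connected component
import networkx as nx

if not nx.is_connected(G):
    components = list(nx.connected_components(G))
    print(f"Graph is not connected: {len(components)} islands of nodes")
    # one option: keep only the largest island before running community detection
    G = G.subgraph(max(components, key=len)).copy()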
Use downstream tasks when
- You have an appropriate downstream task to test on
- You want a relative measure of graph quality
Do not use downstream tasks when
- You want an absolute measure of graph quality
Test 2: Graph metrics
Another test you can use for the quality of your graph is looking into different graph metrics. I have previously written two articles on analyzing graph networks: one on basic graph analysis, and one with more advanced analysis. In general, you can use metrics such as degree or connectivity to gain an understanding of your graph. How these metrics correspond to the quality of your graph will depend on what your graph is being used for. If you are using your graph for community detection, for example, high connectivity might be a desired trait.
Furthermore, I also recommend creating some metrics yourself to understand your graph. One suggestion, when you have node labels, is to calculate the percentage of nodes whose neighbors mostly share the node's own label. A higher percentage on this metric then represents a better graph, since it shows that more similar elements are connected.
You can also create your own metrics depending on the use cases you have for your graph. The advantage of this is that you can customize the metric to fit your specific needs, which can create a metric that strongly correlates with the quality of your graph. Creating such a metric will in turn allow you to quickly gain a better understanding of which graphs are better than others, which can save you a lot of time and effort.
Examples of other metrics you can use are listed below, with a code sketch computing them after the list:
- The average edge weight in the graph
- The number of isolated nodes in the graph
- The number of node islands in the graph
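As a sketch of how these could be computed with NetworkX, including the label-agreement metric suggested above, assuming node labels are stored in a hypothetical "label" node attribute:
# compute some example graph metrics with NetworkX
import networkx as nx

# the average edge weight in the graph
edge_weights = [data["weight"] for _, _, data in G.edges(data=True)]
avg_weight = sum(edge_weights) / len(edge_weights) if edge_weights else 0.0

# the number of isolated nodes and node islands (connected components)
num_isolated = len(list(nx.isolates(G)))
num_islands = nx.number_connected_components(G)

# percentage of nodes whose neighbors mostly share the node's own label
# (assumes a hypothetical "label" attribute on each node)
same_label_majority = 0
for node in G.nodes:
    neighbors = list(G.neighbors(node))
    if not neighbors:
        continue
    same = sum(G.nodes[n]["label"] == G.nodes[node]["label"] for n in neighbors)
    if same > len(neighbors) / 2:
        same_label_majority += 1
label_agreement = same_label_majority / G.number_of_nodes()

print(avg_weight, num_isolated, num_islands, label_agreement)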

Use graph metrics when
- You can find a fitting metric to represent the quality of your graph
Do not use graph metrics when
- Graph metrics cannot give meaningful information to interpret the quality of your graph
Test 3: Visual inspection
Visually inspecting your data is vital in machine learning to understand the dataset you are working with, and the same logic applies to graphs. To understand the quality of your graph, you should use visual inspection with tools like Graphviz, Gephi, or Cytoscape. These tools help you gain a better understanding of the strengths and weaknesses of your graph.

When first opening a graph visualization tool, you can feel overwhelmed by the number of options, as well as the sometimes unintuitive design. If you manage to learn the ins and outs of the tool, however, it can give you a powerful advantage. I use Cytoscape myself. To visualize a graph in Cytoscape, I typically define the graph in Python with NetworkX and then write it to the .graphml file format, which I can then open in Cytoscape.
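As a minimal example, the export is a single NetworkX call (the filename here is a placeholder):
# write the NetworkX graph to a .graphml file that Cytoscape can open
# (node and edge attributes should be simple types like str, int, or float)
import networkx as nx

nx.write_graphml(G, "my_graph.graphml")  # the filename is a placeholder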
Some important options I use in Cytoscape are:
- Analyze network. Analyzing the network is an option under the tools option in the top bar. The analyze network option gives you all sorts of information about your graph, like the connectivity and degree.
- Graph layout. Choose a layout for the graph by going to the layout menu in the top bar and trying out the different layouts, to see which one gives the best visualization.
- Selecting neighbor nodes. You can select neighbor nodes in the graph by right-clicking a node, pressing the Select option, and then the Select First Neighbors option. This will highlight all neighbor nodes in the graph, which is useful to understand what kind of nodes are connected.
- Coloring node information. To further gain an understanding of what nodes are connected, you can color each node by a node attribute, for example, the label of a node. You can do this by going to the Style option in the sidebar, selecting the Node pane, and then choosing the Fill color option. This allows you to assign a color to nodes depending on an attribute of your choosing.
Utilizing some of the options mentioned above, and testing out different options yourself in Cytoscape or any other graph visualization tool of your choosing, is a powerful way to get a deeper understanding of your graph. In turn, this is a good way to test graph quality, as you can spot problems with your graph that you can then fix to further improve its quality.

The reason visual inspection works well is that you gain an intuition for what the graph looks like, and you can suddenly detect glaring weaknesses in the graph that you might not have been aware of before. For example, by looking at the neighbors of each node, you might see nodes that should not be neighbors at all (this can happen if you mix up similarity and distance when calculating edges, as explained earlier in this article). Another issue you might detect is the number of isolated nodes or node islands: groups of nodes that are connected to each other, but not to the main island of nodes.
Use visual inspection when
- You want to understand the quality of a single graph
Do not use visual inspection when
- You need quantitative measures for graph quality
Test 4: Node2Vec + Clustering
A fourth option for testing the quality of your topological information-based graph is to use an algorithm like Node2Vec to convert it from a topological graph into node embeddings. You can then apply classical clustering techniques like KMeans or DBSCAN to the node embeddings and measure the quality of the clustering with metrics like AMI, NMI, or the Rand Index.
Converting from a graph defined by topological information to node embeddings is a popular topic in machine learning, with random walk-based methods like DeepWalk and Node2Vec being among the most used. Converting the graph into node embeddings allows you to apply typical embedding-based quality tests, which are described in more detail in my article on embedding quality. You can see the Python code for making Node2Vec embeddings for a NetworkX graph below:
# get node2vec embeddings for a graph
!pip install node2vec
from node2vec import Node2Vec

# generate random walks over the graph G and fit a Word2Vec model on them
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1, batch_words=4)
embeddings = model.wv.vectors  # one embedding row per node
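From here, a minimal sketch of the clustering step could look like the following, assuming hypothetical ground-truth labels (true_labels) in the same node order; the n_clusters value is illustrative:
# cluster the node embeddings and score the result
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

# the node2vec package stores nodes as string keys, so look each node up
# explicitly to keep the embedding rows aligned with your node order
node_embeddings = np.array([model.wv[str(node)] for node in G.nodes])

predicted = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(node_embeddings)
# true_labels is a hypothetical ground-truth label per node, in the same order
print("AMI:", adjusted_mutual_info_score(true_labels, predicted))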
When using this test, however, you should be aware that the metrics output by clustering the node embeddings also depend on the quality of the algorithm converting the topological graph to node embeddings. This is a significant weakness that you should take into account when interpreting the results. Still, this test can be an additional option when testing your topological information-based graph.

Use Node2Vec + Clustering when
- You want a test to further understand the quality of your graph
Do not use Node2Vec + Clustering when
- You need a measure not influenced by other factors (like the Node2Vec algorithm in this case)
What makes a good graph?
What a good graph is will depend on what the graph is used for. In absolute terms, a good graph can be defined as one that performs well at the job it is designed to do. This sounds vague, but it highlights an important point: the graph quality tests should reflect the tasks the graph is performing. Sometimes, creating a test that correlates perfectly with the final task is impossible, often due to a lack of available supervised data. Creating other tests, like the ones described above, can therefore be the way to go.

You should note, however, that none of these tests can be understood as a definitive measure of graph quality. A graph might perform well on a test for reasons other than pure graph quality, which illustrates the point that these tests should be used in combination with each other, and also used to compare different graphs to each other. If a specific graph outperforms other graphs on all tests, you can interpret this as a positive sign that it is of better quality, though you should always take the results with a grain of salt. In the end, the tests are made to gain a better understanding of graph quality; they unfortunately cannot give you a definitive answer.

Conclusion
In this article, I have shown you how to find graph datasets online and, if the data is not already in a graph format, how to convert it to a topological graph with methods like thresholding or percentile-based edge selection. I then described how you can gain an understanding of the graph with the following tests:
- Downstream tasks
- Graph metrics
- Visual inspection
- Node2Vec + clustering
Finally, I also mentioned how the results should be interpreted, and what makes a good graph.
I hope you learned something useful in this article.
If you want to learn more about ensuring the quality of embeddings, you can read my Towards Data Science article below:
How To Improve AI Performance By Understanding Embedding Quality
You can also read my articles on WordPress.