Automating Chemical Entity Recognition: Creating Your ChemNER Model


I've always had a strong interest in chemistry, and it has played a significant role in shaping both my academic and professional journey. As a data professional with a background in chemistry, I've found many ways to apply my scientific and research skills, such as creativity, curiosity, patience, keen observation, and analysis, to data projects. In this article, I'll walk you through the development of a simple Named Entity Recognition (NER) model that I've dubbed ChemNER. This model can identify chemical compounds within text and classify them into categories such as alkanes, alkenes, alkynes, alcohols, aldehydes, ketones, and carboxylic acids.

TL;DR

If you just want to play around with the ChemNER model and/or use the Streamlit app I made, you can access them via the links below:

HuggingFace link: https://huggingface.co/victormurcia/en_chemner

Streamlit App: https://chemner-5i7mrvyelw79tzasxwy96x.streamlit.app/

Introduction

NER approaches can generally be classified into one of the following three categories:

  • Lexicon-based: Define a dictionary of classes and terms
  • Rule-based: Define rules that describe how the terms of each class are named
  • Machine learning (ML)-based: Let the model learn the naming rules from a training corpus

Each of these approaches has its strengths and limitations and, as always, a more complicated and sophisticated model isn't always the best approach.

In this case, the lexicon-based approach would be limiting in terms of scope, since for every class of compounds we are interested in classifying we'd have to manually define ALL the compounds that fall within that category. In other words, for this approach to be all-encompassing, you'd need to manually enter every chemical compound for every compound class.

The ML approach could be the most powerful way to go; however, annotating a dataset can be quite laborious (spoiler alert: I'll end up training a model, but I want to show the entire process for educational purposes). Instead, how about we start with some predefined naming rules?

Chemical nomenclature has a well-established set of rules that allow you to readily determine what functional groups are present in a molecule. These rules were established by the International Union of Pure and Applied Chemistry (IUPAC) and can be readily accessed via the IUPAC Blue Book, a variety of websites, or any Organic Chemistry textbook. For instance, hydrocarbons are compounds composed solely of carbon and hydrogen atoms. There are three main classes of hydrocarbons, named alkanes, alkenes, and alkynes, which can be readily identified based on whether they have single, double, or triple bonds, respectively, as part of their chemical structure. Below I'm showing an example of three chemical compounds (ethane, ethene, and ethyne) depicting that.

Ethane, ethene, and ethyne. Image by author.

The important thing for us here is the ending of each name (i.e., its suffix), since that is what will allow us to differentiate between chemical compounds. For example, alkanes are identified by the suffix -ane, alkenes by the suffix -ene, and alkynes by the suffix -yne. Every class of chemical compounds (alcohols, ketones, aldehydes, carboxylic acids, etc.) has a unique naming scheme like this, and those schemes will serve as the basis for this project.

Establishing the Rules

Now that we have a bit of background to understand what's going on, I'll show how a rule-based approach can be implemented in Python using spaCy. I'll start simple by just dealing with the hydrocarbons; we'll add the other classes later. To do this, we'll first load a blank English model with spaCy and add an 'EntityRuler' component to our pipeline:

import spacy

# Load a blank English model
nlp = spacy.blank("en")

# Create the EntityRuler component
ruler = nlp.add_pipe("entity_ruler")

Next, we'll establish the rules/patterns that define each class and add them to the rules component:

# Define patterns
patterns = [
    {"label": "ALKANE", "pattern": [{"TEXT": {"REGEX": ".*ane$"}}]},
    {"label": "ALKENE", "pattern": [{"TEXT": {"REGEX": ".*ene$"}}]},
    {"label": "ALKYNE", "pattern": [{"TEXT": {"REGEX": ".*yne$"}}]}
]

ruler.add_patterns(patterns)

And that's it! Now let's make some text to feed to our model and see how it does!

text = "Ethane,  propene, and butyne are all examples of hydrocarbons."

doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The output of this is as follows:

Ethane 0 6 ALKANE
propene 9 16 ALKENE
butyne 22 28 ALKYNE

That's pretty good! However, there are two immediate limitations that you probably noticed with this initial approach:

  1. Plural forms of a compound will not be detected with the current regex.
  2. Basing the classification purely on suffixes will result in lots of incorrectly labeled entities.

Though chemical compounds are typically treated as uncountable nouns (think of words like air or music), there are still instances where the plural form is used. For instance, if you were dealing with a collection of ethane molecules, someone might refer to them as a group of ethanes. The first point can therefore be easily addressed by modifying our regex to the form below:

# Define patterns
patterns = [
    {"label": "ALKANE", "pattern": [{"TEXT": {"REGEX": ".*anes?$"}}]},
    {"label": "ALKENE", "pattern": [{"TEXT": {"REGEX": ".*enes?$"}}]},
    {"label": "ALKYNE", "pattern": [{"TEXT": {"REGEX": ".*ynes?$"}}]},
]

Now both singular and plural instances will be recognized by the entity ruler. However, the second point remains. For example, words like arcane, humane, thane, lane, and mundane, to name but a few, would be incorrectly labeled as alkanes if they appeared in the text.
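To make the problem concrete, here's a quick illustrative check of what the current ruler does with a short sentence that contains no chemistry at all:

doc = nlp("Walking down the lane felt mundane, almost arcane.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Each of these tokens ends in "-ane", so all three come back labeled ALKANE:
# lane ALKANE
# mundane ALKANE
# arcane ALKANE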

Though there are other rules that could be implemented to bolster this approach, they would require a fair amount of extra work. Because of that, there are three approaches I'm considering to deal with this limitation:

  1. Build a corpus to train a ML-based NER model for this application
  2. Use Named Entity Linking (NEL) to aid in correcting any labeling mistakes made by the model output
  3. Fine-tune a transformer model like SciBERT or PubMedBERT on a custom dataset

For this article, I'll just cover the first two approaches. However, if there is interest, I'll show how the fine-tuning process could be achieved in a future article.

Making the Dataset

There are a variety of ways to create a corpus. A quick and easy one is to have ChatGPT generate a set of sentences containing compounds from the various classes I want to extract from text. This works nicely because it lets me curate and tailor my dataset, which makes the subsequent annotation process much easier. My prompt was simply:

Give me a set of 50 unique sentences each dealing with unique alkanes

I then repeated that prompt for the other classes I was interested in (i.e., alkenes, alkynes, alcohols, ketones, aldehydes, and carboxylic acids). Since I have 7 classes, I ended up with a corpus of 350 sentences. Ideally, this corpus would be larger, but it's a good enough start since I'm primarily interested in illustrating this as a proof of concept. Plus, it is always easy to add more data as needed to improve performance. I saved my sentences into a document called chem_text.txt.

Screenshot of corpus made for ChemNER. Image by author

As a final step, I'll use a sentence tokenizer to split the document into individual sentences.

# Read in the corpus saved earlier as chem_text.txt
with open("chem_text.txt", "r", encoding="utf-8") as f:
    chem_text = f.read()

# A blank pipeline has no sentence boundaries, so add a sentencizer for doc.sents to work
nlp.add_pipe("sentencizer", first=True)

doc = nlp(chem_text)

corpus = []
for sent in doc.sents:
    corpus.append(sent.text.strip())

Now that I have this corpus, we need to start labeling it. There are a couple of ways to do this. For instance, we can use an annotation tool like Prodigy (which is amazing, and you should use it if you do any kind of NLP), or we can use the rule-based approach from earlier to help us with the initial annotation. For now, I'll use the rule-based approach since I'm not annotating a huge dataset.

DATA = []

# Iterate over the corpus
for sentence in corpus:
    doc = nlp(sentence)

    # The entities for each example live in a dictionary under the key "entities",
    # so start with an empty list
    entities = []

    # Extract entities
    for ent in doc.ents:
        # Append to entities in the expected [start, end, label] format
        entities.append([ent.start_char, ent.end_char, ent.label_])

    DATA.append([sentence, {"entities": entities}])

To include all the classes I'm interested in, the rules will need to be updated to the ones below:

# Define patterns
patterns = [
    {"label": "ALKANE", "pattern": [{"TEXT": {"REGEX": ".*anes?$"}}]},
    {"label": "ALKENE", "pattern": [{"TEXT": {"REGEX": ".*enes?$"}}]},
    {"label": "ALKYNE", "pattern": [{"TEXT": {"REGEX": ".*ynes?$"}}]},
    {"label": "ALCOHOL", "pattern": [{"TEXT": {"REGEX": ".*ols?$"}}]},
    {"label": "ALDEHYDE", "pattern": [{"TEXT": {"REGEX": ".*(al|als|aldehyde|aldehydes)$"}}]},
    {"label": "KETONE", "pattern": [{"TEXT": {"REGEX": ".*ones?$"}}]},
    {"label": "C_ACID", "pattern": [{"TEXT": {"REGEX": r"\b\w+ic\b"}}, {"TEXT": {"IN": ["acid", "acids"]}}]}
]
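As a quick check of the two-token carboxylic acid pattern, here is an illustrative example (note that the updated patterns still need to be registered with the ruler):

ruler.add_patterns(patterns)

doc = nlp("Acetic acid and benzoic acid are examples of carboxylic acids.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Acetic acid C_ACID
# benzoic acid C_ACID
# carboxylic acids C_ACID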

The result of running the rule-based approach allows us to quickly annotate our dataset as shown below.

Annotated corpus for ChemNER. Image by author.

We are almost ready to split our corpus into training and test sets; however, we need to verify the quality of our annotations before moving forward. Upon inspecting my dataset, I noticed that we ran into the mislabeling issue I alluded to earlier. Words like "essential", "crystals", "potential", and "materials", among several others, were found in the dataset and were labeled as aldehydes, which highlights the limitation of the rule-based approach. I manually removed these labels using the method below and reprocessed the annotations on the corpus:

# Set of words to be ignored (false positives from the suffix rules)
ignore_set = {"essential", "crystals", "potential", "materials", "bioorthogonal", "terminal",
              "chemicals", "spiral", "natural", "positional", "structural", "special", "yne",
              "chemical", "hormone", "functional", "animal", "agricultural", "typical", "floral",
              "pharmaceuticals", "medical", "central", "recreational"}

DATA = []

# Iterate over the corpus
for sentence in corpus:
    doc = nlp(sentence)

    entities = []

    # Extract entities
    for ent in doc.ents:
        # Check if entity is not in the ignore set
        if ent.text.lower() not in ignore_set:
            # Appending to entities in the correct format
            entities.append([ent.start_char, ent.end_char, ent.label_])

    DATA.append([sentence, {"entities": entities}])

Now we are ready to create our training and test sets. This can easily be done with the train_test_split function from scikit-learn. I used a standard 80:20 train:test split.

from sklearn.model_selection import train_test_split

# Split the data
train_data, valid_data = train_test_split(DATA, test_size=0.2, random_state=42)
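One practical note: spaCy 3 expects training data in its binary DocBin format rather than the list-of-lists used above. Below is a minimal sketch of that conversion (the file names are placeholders, and this isn't necessarily the exact setup I used):

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

def to_docbin(data, output_path):
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annotations in data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = filter_spans(spans)  # drop any overlapping spans
        db.add(doc)
    db.to_disk(output_path)

to_docbin(train_data, "train.spacy")
to_docbin(valid_data, "dev.spacy")

The resulting train.spacy and dev.spacy files can then be passed to spaCy's standard training CLI (python -m spacy train) along with a config file.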

Training the Model

We have our training data ready, so we can go ahead and start training our model. To train the model, I used the default spaCy NER training parameters, such as the Adam optimizer and a learning rate of 0.001. Training took just over an hour on a CPU in Google Colab, which could be greatly reduced by using a GPU instead. The results of the training are shown below:

Training results of ChemNER model. Image by author.

The plots above show that the F1 score, precision, recall, and overall score of the model tended to increase over the course of training, which is good. The NER loss, which corresponds to the loss of the NER component, tended toward a minimum. The final performance score of the model is 0.97, which seems promising.

The Tok2Vec loss, however, noticeably spiked at around epoch 300, which could be the result of too high a learning rate, vanishing/exploding gradients causing numerical instabilities, or overfitting, among other issues. The Tok2Vec loss reflects the effectiveness of the token-to-vector part of the model, which is responsible for converting tokens to vectors. There are a variety of ways to handle this if we so choose, but for now, I'll carry on.

Testing the Model

Let's start by doing a simple test. I'll feed it a few sentences and see how well it classifies them. You can see the result below:

Initial test of ChemNER model. Image by author.

Nice! It extracted all the relevant entities AND it labeled them all correctly! That's the cool thing about the ML approach: instead of us having to explicitly write the rules, the algorithm learns them over the course of training. As cool as this is, however, let's now put the model under a bit more stress.
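For reference, loading the trained pipeline and running this kind of quick check only takes a few lines (the output directory name here is an assumption; spaCy's training CLI saves the best checkpoint to a folder such as model-best):

import spacy

# Load the trained ChemNER pipeline from the training output directory
nlp = spacy.load("model-best")

doc = nlp("Propanol was oxidized to propanal and then separated from the hexane fraction.")
for ent in doc.ents:
    print(ent.text, ent.label_)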

Querying Wikipedia (stress testing)

I want to stress test my model a bit more, so I figured a quick and easy way to do this is to feed the model an entire Wikipedia article and see how it performs. I'll write a quick routine to accomplish this via the wikipedia-api package in Python:

import wikipediaapi

# Define your user agent
user_agent = "MyApp/1.0 (your@email)"

# Initialize the Wikipedia API client
wiki_wiki = wikipediaapi.Wikipedia(user_agent, 'en')

# Function to get Wikipedia article
def get_wikipedia_article(page_title):
    page = wiki_wiki.page(page_title)
    return page.text if page.exists() else None

# Function to perform NER on text
def perform_ner(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

With that, I will now look for the Wikipedia article on Benzene:

# Query Wikipedia for an article
article_title = "Benzene"  # Replace with your desired article title
article_content = get_wikipedia_article(article_title)

And the result of this produces:

Screenshot of query for Benzene Wikipedia article.

Neat! Now that we've verified that the querying works, let's run the ChemNER model. It extracted a total of 444 entities from the Benzene article, and the extraction took less than a second. I placed the results into a dataframe and visualized the label counts in a count plot below:
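Here is a sketch of how that could be done with pandas and seaborn (I'm calling the dataframe df_ents2 to match the merging step later on, but the exact plotting code is an assumption):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Run ChemNER over the article and collect (entity, label) pairs
entities = perform_ner(article_content)
df_ents2 = pd.DataFrame(entities, columns=["Entity", "Label"])

# Count plot of the label distribution
sns.countplot(data=df_ents2, y="Label", order=df_ents2["Label"].value_counts().index)
plt.title("Entity label counts in the Benzene article")
plt.tight_layout()
plt.show()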

ChemNER results on Benzene Wikipedia article. Image by author.

The most common class within that article was alkene, which makes sense given that benzene (and many of the related compounds discussed there) ends in -ene and is therefore mapped to that label by the model. Something that I thought was a bit surprising was that this particular article had entities belonging to every class.

This is neat; however, a quick inspection of the first few rows in the dataframe of extracted entities shows that there are issues with the model. The words 'chemical' and 'hexagonal' were labeled as aldehydes, and the word 'one' was labeled as a ketone. These are clearly not chemical compounds and should not be classified as such. I went ahead and manually marked each entity as correct or not and determined that the extraction accuracy was 70.3%. Though all the extracted entities were labeled 'correctly' based on the rules the model learned, the model has not yet truly learned the context of the words.

Comparison of correctly and incorrectly labeled entities on Benzene article by ChemNER. Image by author.

The cool thing that I noticed, though, is that the correctly labeled entities were all chemical compounds. In other words, if we had a way to determine whether an entity is a chemical compound, then we could significantly bolster the labeling performance of this application.

At this point, there are a couple of avenues we can take. One is to go back to the corpus and produce more data to give our model examples to learn from. Another is to use named entity linking (NEL) to help correct the labeling. I'll go with the latter option since it is a little less time-consuming.

Using PubChem for NEL

The ChemNER model is performing exceptionally well at labeling entities according to their chemical class, so long as the entity is a chemical compound. To better inform the model, I'll connect to PubChem via their API and query for a chemical compound. The idea here is that a query for a chemical compound will return information, whereas a query for something that is not a chemical compound will return an empty result. I can use the outcome of this query to improve the labeling performance of my application.

As an example to showcase this, let's query for Benzene to start off. The code below can be used to query the PubChem API.

import requests

def get_compound_info(compound_name):
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"
    response = requests.get(f"{base_url}/{compound_name}/JSON")
    if response.status_code == 200:
        return response.json()
    else:
        return None

compound_name = "benzene"
compound_info = get_compound_info(compound_name)

The results of this query are shown below.

Results of the benzene query via the PubChem API. Image by author.

There is A LOT of information on benzene from this query that we can use later. But for now, all that matters is that the query returned something. On the other hand, if I use this same method to query for something that is not a chemical compound, like the word 'humans' or 'giraffe', the result of the query is 'None'.

Querying for non-chemical compounds via the PubChem API. Image by author.

I can use this to my advantage to aid my application. The queries are quite fast; however, to speed up the process a bit more, I'll remove any duplicate entities from my dataframe so that only unique terms are queried. In addition, the PubChem API appears to assume that we are querying for an individual chemical compound, so a word like cinnamaldehydes, for instance, would return an empty query. This can easily be fixed by stripping the terminal 's' from any plural terms. With that, I created a new column in my dataframe called 'Chemical Compound' that classifies each entity as either a chemical compound or not based on the result of the query.
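A minimal sketch of how that column can be built looks something like this (it reuses get_compound_info from above; the dataframe names match the merge step below, but the exact code is an assumption):

# Query only unique entities to avoid redundant API calls
df_unique = df_ents2.drop_duplicates(subset="Entity").copy()

def is_chemical_compound(entity):
    # Crude singularization: strip a single trailing 's' before querying PubChem
    name = entity.lower()
    if name.endswith("s"):
        name = name[:-1]
    return 1 if get_compound_info(name) is not None else 0

# 1 if PubChem returns a record for the entity, 0 otherwise
df_unique["Chemical Compound"] = df_unique["Entity"].apply(is_chemical_compound)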

That worked quite well! One more thing I noticed, however, is that the class labels themselves result in null queries. In other words, if I query PubChem for alkane, alkene, alkyne, etc., I get an empty result because these are not specific compounds but rather classes of compounds. There is a bit of nuance here regarding how to proceed. I decided that I did want these classes to be recognized as chemical entities, since the class labels can be used independently in sentences devoid of specific compounds (e.g., "Alkanes are commonly found in petrochemical applications"). To resolve this, I added a routine that checks whether the entry in the Entity column is a singular or plural variant of any of our class labels and, if so, sets the value in the Chemical Compound column to 1; otherwise, the value from the PubChem check is kept.

# List of the chemical compound class labels
chemical_compounds = ['alkane', 'alkene', 'alkyne', 'ketone', 'aldehyde', 'alcohol', 'carboxylic acid']

# Function to update the 'Chemical Compound' column
def update_chemical_compound(row):
    entity = row['Entity'].lower()
    # Flag class labels (singular or plural) as chemical entities
    if any(compound in entity for compound in chemical_compounds + [c + 's' for c in chemical_compounds]):
        return 1
    # Otherwise keep the value determined by the PubChem query
    return row['Chemical Compound']

# Apply the function to each row
df_unique['Chemical Compound'] = df_unique.apply(update_chemical_compound, axis=1)

Cool! Now I can merge these results into the original dataframe containing all 444 entities.

df_merged = pd.merge(df_ents2, df_unique[['Entity', 'Chemical Compound']], on='Entity', how='left')

Entity dataframe after using the PubChem API to check whether an entity is a chemical compound. Image by author.

Next, I'll drop any rows that don't correspond to a chemical compound.

# Dropping rows where 'Chemical Compound' is 0
df_filtered = df_merged[df_merged['Chemical Compound'] != 0]

Resulting dataframe after removing entities that are not chemical compounds. Image by author.

And now let's see how it performed!

Results of ChemNER after performing NEL via PubChem. Image by author.

Very nice! All of the extracted entities are now correctly labeled. By combining our NER model with NEL via PubChem, we are able not only to extract the entities from the text but also to disambiguate the results and vastly improve our labeling accuracy.

Deploying the Model to HuggingFace

As a little bonus, I thought it would be cool to take all of these routines I've shown, deploy the model to HuggingFace, and showcase it in a Streamlit application. You can find the model on HuggingFace here: https://huggingface.co/victormurcia/en_chemner. The Inference API results are shown below and look pretty good:

ChemNER in action via the HuggingFace Inference API. Image by author.

Let me know if you use it or if you have any suggestions! I'm planning on expanding the model in the future and there are other functionalities I want to explore.

Connecting Everything with a Streamlit App

Now that the model is deployed, I can use it in a Streamlit app. This app allows a user to either link to a Wikipedia article or enter raw text, which is then processed by the ChemNER model. The output is a downloadable dataframe with the extracted and labeled entities, a count plot showing the counts for each label in the provided text, and a fully annotated version of the text. You can find the Streamlit app hosted here: https://chemner-5i7mrvyelw79tzasxwy96x.streamlit.app/

Screenshot of ChemNER Streamlit App. Image by author.
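For anyone curious what the skeleton of such an app looks like, here is a heavily simplified sketch (not the actual app code; it assumes the packaged ChemNER pipeline is installed locally as en_chemner):

import spacy
import streamlit as st
import pandas as pd

# Assumes the packaged ChemNER pipeline is installed in the environment
nlp = spacy.load("en_chemner")

st.title("ChemNER")
text = st.text_area("Paste some text containing chemical compounds:")

if text:
    doc = nlp(text)
    df = pd.DataFrame([(ent.text, ent.label_) for ent in doc.ents],
                      columns=["Entity", "Label"])
    st.dataframe(df)
    st.download_button("Download CSV", df.to_csv(index=False), "entities.csv")
    st.bar_chart(df["Label"].value_counts())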

As an example, I'll run a query for the Wikipedia article on Benzene using the app. The result is an annotated version of the article, shown below, where each class has been uniquely color coded.

Annotated text from ChemNER. Image by author.

The output also includes a dataframe containing the entities and their corresponding labels, which you can download as a .csv file, along with a count plot showing the counts for each label.

Output from Streamlit app. Image by author.

Conclusion

I hope you found this piece informative and that it helps you build your own NLP applications. I plan to continue working on this model and application a bit more, since there is some nifty stuff I'd like to explore further. For instance, after a bit of testing I noticed that there were still certain entities that the model extracted, and that the PubChem method classified as chemical compounds, which were not organic compounds. For example, the word 'pm' was extracted as an entity and labeled as an aldehyde. The PubChem search returned a non-empty query since 'pm' (or, more appropriately, Pm) is the chemical symbol for the element promethium. The model is not perfect, but I hope it shows that you can get a pretty powerful tool without requiring an LLM.

As always, thanks for reading!

Tags: Chemistry NLP Programming Python Science
