The Beginning of Information Extraction: Highlight Key Words and Obtain Frequencies
Introduction
With the amount of available information increasing every day, the ability to quickly gather relevant statistics about that information is important for mapping relationships and gaining a new perspective on otherwise redundant data. Today we will look at extracting text from PDFs, a first step toward information extraction, and a quick approach to forming some facts and ideas about different corpora. This article dives into the field of Natural Language Processing (NLP), which concerns a computer's ability to comprehend human language.
Information Extraction
Information Extraction (IE), as defined by Jurafsky et al., is the "process for turning unstructured information embedded in texts into structured data" [1]. A very quick form of information extraction is not only searching for whether a word appears within a body of text but also calculating how many times that word is mentioned. This rests on the assumption that the more often a word is mentioned within a body of text, the more important it is to the corpus's theme. Note that stopword removal matters for this process. Why? If you simply calculated all of the word frequencies within a corpus, the word "the" would be mentioned a lot. Does that make it important in terms of conveying what information is in the text? No, so you want to make sure you are looking at the frequencies of words that contribute to the semantic meaning of your corpora.
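To make the stopword point concrete, here is a minimal sketch (not part of the article's main code) that counts word frequencies with and without a tiny, illustrative stopword set; a real project would use a fuller list such as the one that ships with NLTK.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for"}  # illustrative only
sample = "The neural network trains the weights of the network."
tokens = re.findall(r"[a-zA-Z]+", sample.lower())

print(Counter(tokens).most_common(2))
# e.g. [('the', 3), ('network', 2)] -- "the" dominates
print(Counter(t for t in tokens if t not in STOPWORDS).most_common(2))
# e.g. [('network', 2), ('neural', 1)] -- content words surface once stopwords are removed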
IE can lead to other NLP techniques being used on a document. These techniques go beyond the code in this article, but I felt they were both interesting and important to share.
The first technique is Named Entity Recognition (NER). As detailed by Jurafsky et al., "The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type" [1]. This is similar to the idea of searching for the frequencies of words within a body of text, but NER takes it a step further by using word location as well as other linguistic rules to find different entities within a body of text.
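Although NER goes beyond the code in this article, here is a minimal sketch of what it can look like in practice, assuming spaCy and its small English model (en_core_web_sm) are installed; the sample sentence and the predicted labels are illustrative.
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("AlexNet was developed at the University of Toronto in 2012.")
# Print each detected entity span and its predicted type
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "University of Toronto" ORG, "2012" DATE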
Another technique that IE can support is Relation Extraction, which is "finding and classifying semantic relations among the text entities" [1]. The goal is not only to extract how frequently different words appear in a text and the entity label under which they fall, but also to relate these different entities to one another in order to formulate the underlying semantic meaning, patterns, or summary of a corpus of text.
The previous three tasks all relate to the goal of Event Extraction. Jurafsky et al. go on to state that "Event extraction is finding events in which these entities participate" [1]. In other words, by finding and extracting different statistics and labels from a body of text, we can begin to form hypotheses about what events are occurring and how, through these entities, different events may be related.
The Process

The goal of this process is to get a quick understanding of whether a corpus of text contains information related to what you are looking for, and to provide frequencies so you can judge whether there is enough of that information to make comparisons between different documents. Finally, it produces a visualization of the queried word highlighted in the text, which can help with drawing conclusions about relationships and about what the text covers based on the surrounding words. The entire process was implemented in Python using Google Colab and can be easily transferred to any IDE of your choice. Let's take a look at the code!
The Code
The Python libraries needed for this code are PyMuPDF (imported as fitz) and Counter from Python's built-in collections module.
import fitz # PyMuPDF
from collections import Counter
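If PyMuPDF is not already in your environment, it can be installed with pip (Counter is part of the standard library, so it needs no separate install):
pip install PyMuPDF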
Next, we will create a function called "highlight_terms_and_count", which accepts a path to an input PDF, a path to an output PDF, the terms we want highlighted, and a path to an output text file.
def highlight_terms_and_count(input_pdf_path, output_pdf_path, terms_to_highlight, output_text_file):
    """
    A function which accepts a PDF file and a list of terms as input
    and outputs a highlighted PDF file of the queried words and a text file
    with the query word frequencies.

    Arguments:
        input_pdf_path (str): Path to the input PDF file
        output_pdf_path (str): Path to the output PDF file
        terms_to_highlight (list): List of terms (str) to highlight
        output_text_file (str): Path to the output text file

    Returns:
        output_pdf_path: A PDF highlighted with the queried words.
        output_text_file: A text file containing the frequency of each queried word.
    """
    # Open the PDF file
    pdf_document = fitz.open(input_pdf_path)
    term_counter = Counter()

    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        # Get the text on the page (not strictly needed for the search below)
        text = page.get_text()

        for term in terms_to_highlight:
            # Find every occurrence of the term on this page
            term_instances = page.search_for(term)
            term_counter[term] += len(term_instances)  # Count term instances on this page

            for term_rect in term_instances:
                # Create a highlight annotation
                highlight = page.add_highlight_annot(term_rect)
                # Set the color of the highlight (e.g., yellow)
                highlight.set_colors(stroke=(1, 1, 0))
                # Set the opacity of the highlight (0 to 1)
                highlight.set_opacity(0.5)

    # Save the modified PDF
    pdf_document.save(output_pdf_path)
    pdf_document.close()

    # Save term frequencies to a text file
    with open(output_text_file, 'w') as text_file:
        for term, frequency in term_counter.items():
            text_file.write(f"{term}: {frequency}\n")
Once we have created the function, we can set up the file paths and call it.
if __name__ == "__main__":
    input_pdf_path = "/content/AlexNet Paper.pdf"  # Replace with your input PDF file
    output_pdf_path = "/content/output.pdf"  # Replace with your output PDF file
    terms_to_highlight = ["neural", "networks"]  # Add the terms you want to highlight
    output_text_file = "/content/term_frequencies.txt"  # Text file to store term frequencies

    highlight_terms_and_count(input_pdf_path, output_pdf_path, terms_to_highlight, output_text_file)
And that's it, easy as that! This can be super helpful for getting a quick sense of how relevant a certain word (or words) is in a document. In addition, when viewing the highlighted output PDF, the outlined locations draw your attention to the queried word and the words surrounding it, which makes it easier to spot relationships.
Example
In today's example, let's look at a scholarly article from Google Scholar. What if, for example, I wanted to know whether an article had adequate information about neural networks? We can query the document and then use the resulting frequencies to assess whether the document focuses on neural networks. The document we will look at is titled ImageNet Classification with Deep Convolutional Neural Networks [2].

Let's say we are interested in studying neural networks and want to see where and how many times those words appear within the text. By querying "neural" and "networks", we get a highlighted PDF, as shown below.

Additionally, a text file is created with the frequencies of "neural" and "networks", which were 21 and 23, respectively. Whether a word is mentioned often enough in a document to deem it relevant is ultimately up to the user. With both words appearing over 20 times within the document, I would argue they are relevant to its context.
This process can easily be expanded to multiple documents, allowing you to compare the frequencies of certain words across them and come to conclusions about which documents are most related to the information you are seeking.
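As a quick sketch of that comparison, the loop below runs the same counting logic over several PDFs and prints each document's term counts side by side; the file names here are hypothetical placeholders.
import fitz  # PyMuPDF
from collections import Counter

pdf_paths = ["/content/paper_a.pdf", "/content/paper_b.pdf"]  # hypothetical files
terms = ["neural", "networks"]

for pdf_path in pdf_paths:
    counts = Counter()
    pdf_document = fitz.open(pdf_path)
    for page in pdf_document:
        for term in terms:
            counts[term] += len(page.search_for(term))
    pdf_document.close()
    print(pdf_path, dict(counts))
# The document with the highest counts is a reasonable first candidate to read.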
Limitations
While the approach is simple and completes the desired task at hand, there are a few limitations worth discussing.
- The assumption that the more frequently a word appears in a corpus, the more it supports the underlying meaning, does not always hold.
- Querying the wrong words could cause you to miss the true meaning of a body of text or fail to find the information you are looking for.
- The methodology is a surface-level form of information extraction and is best treated as a starting point for a more in-depth analysis.
The first limitation concerns the assumption itself: the claim that the more frequently a word is used, the more it supports the underlying meaning of the text does not always hold true, and depends on the author publishing the work and how they wish to express their ideas. Additionally, different words can be used to describe the same concept, and by focusing only on a few words and their high frequencies you may find yourself attaching the wrong meaning to a corpus of text.
The second limitation follows from the first and concerns the word queries. If you are not querying the right words, you may miss the important underlying meaning of the corpus of text. One way to overcome this could be to extract the text from the PDF using PyMuPDF or PyPDF2 and then calculate the N most frequent words. From there, you can use those frequencies to find where those words are located in the body of text, which can help you find where the most important information sits.
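Here is a minimal sketch of that idea, assuming PyMuPDF for the text extraction; the small stopword set is again only a placeholder for a proper list.
import re
from collections import Counter
import fitz  # PyMuPDF

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for", "with", "on"}

def top_n_words(pdf_path, n=10):
    # Extract the raw text from every page
    pdf_document = fitz.open(pdf_path)
    text = " ".join(page.get_text() for page in pdf_document)
    pdf_document.close()
    # Tokenize, lowercase, and drop stopwords and very short tokens before counting
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)

# The resulting words can then be fed back into highlight_terms_and_count
# print(top_n_words("/content/AlexNet Paper.pdf"))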
The third limitation of today's methodology is that it does not go deeper into information extraction. With that being said, this code sets you up nicely to begin your text extraction project and can branch off in a few different directions. Here are some possible ways to build on this process:
- Begin comparing documents and selecting the ones where a certain word and its frequency are higher.
- Assess the highlighted regions of a document and create a script that extracts those regions. From there, perform sentiment analysis on each of the sentences.
- Take the same approach as just mentioned, except perform topic modeling instead, where any sentence containing a highlighted word is used to support the different topics within the text (see the sketch after this list for one way to pull out those sentences).
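As a starting point for either of those last two directions, here is a minimal sketch that pulls out every sentence containing one of the query terms. The naive period-based sentence split is an assumption made for brevity; a real project would likely use a proper sentence tokenizer (for example, NLTK's).
import fitz  # PyMuPDF

def sentences_with_terms(pdf_path, terms):
    pdf_document = fitz.open(pdf_path)
    text = " ".join(page.get_text() for page in pdf_document)
    pdf_document.close()
    # Naive sentence split on periods -- good enough for a first pass
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    lowered = [t.lower() for t in terms]
    # Keep only the sentences that mention at least one query term
    return [s for s in sentences if any(t in s.lower() for t in lowered)]

# These sentences could then be passed to a sentiment model or a topic model
# hits = sentences_with_terms("/content/AlexNet Paper.pdf", ["neural", "networks"])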
Benefits
While I mentioned there are limitations to this process, it still has a few benefits, and I believe you will find some utility in its use. First, it can show us how relevant a word is in different bodies of text. This can be helpful in academic research, especially when you are studying a certain topic. If I'm studying Machine Learning and one academic article mentions it more than another, I may be more inclined to look at that article first. Once I pick an article, this process will then highlight where the words sit in that article. Instead of mindlessly sifting through the literature, this process centers our attention right where the information lies within a body of text. Talk about time savings!
Another benefit of this process is that you can use it to highlight where certain sentences are, which may be useful if you wish to cite information from documents in other pieces of work. Being able to trace back where you found that information is not only important for showing the viewers of your work where the information came from, but it is also a great way to ensure you are not plagiarizing. I tutor many students and have used this process to highlight different articles for them so they can see where I was getting my material from (based on the topic we were studying) and use the highlighted documents as a reference for studying.
Finally, this process is a great kickstart to your text extraction project. It gathers preliminary statistics to begin shaping the direction of your project, and it offers a visual aid for understanding a body of text and giving your customer evidence of its underlying meaning. As a real-world example, I was developing a project for a customer who worked primarily in the food industry. They wanted to know whether a giant document (hundreds of pages) they were given mentioned any of the goods they were seeking to sell, and who those goods were associated with. I was able to query the different goods they sell and return the document with those products highlighted for quick referencing. They really enjoyed it! My only suggestion would be to also annotate page numbers, because that will cut down analysis time even more (see the sketch below for one way to record them).
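Here is a small sketch of that page-number idea, building on the function above: instead of only incrementing a counter, you can also record which pages each term appears on. The dictionary-of-lists structure is just one possible way to store it.
from collections import defaultdict
import fitz  # PyMuPDF

def term_page_numbers(pdf_path, terms):
    # Map each term to the (1-based) page numbers on which it appears
    pages_for_term = defaultdict(list)
    pdf_document = fitz.open(pdf_path)
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        for term in terms:
            if page.search_for(term):
                pages_for_term[term].append(page_number + 1)
    pdf_document.close()
    return dict(pages_for_term)

# e.g. {"neural": [1, 2, 5], "networks": [1, 3]} -- hypothetical output
# print(term_page_numbers("/content/AlexNet Paper.pdf", ["neural", "networks"]))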
Conclusion
Today we looked at how you can analyze a PDF document and find and locate words of interest within it. Additionally, we collected the frequencies of the queried words to gain a better understanding of the possible meaning of a corpus of text and whether it relates to what we want to learn more about. Information extraction is critical because it can help us not only find relationships between different entities but also gain insight into possible patterns of actions and events conducted by these related entities, across various documents. Try this code out, and I hope it helps you in your next NLP project!
If you enjoyed today's reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here (I receive a small commission when you do this)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!