Using Probabilistic Words in Data Science

Author: Murphy  |  2025-03-23
Photo by Christina @ wocintechchat.com on Unsplash

When starting a new data science model, you must first evaluate the data given to you. New data science initiatives often begin with a dataset and a conversation with a subject matter expert or another contact. The subject matter expert provides additional context about what the dataset means: outliers or exceptions, what they consider "normal" or "abnormal," and what "always" or "never" happened. But what if "never" doesn't mean "never" and "always" doesn't mean "always"? These are known as "probabilistic words," and they contain critical information that cannot be found elsewhere. This article discusses how to use probabilistic words to learn about your data and improve your models.

"Never" doesn't mean "never"

Defining Probabilistic Words

Probabilistic words are words that express uncertainty or probability. They include words such as "maybe," "perhaps," "probably," "likely," "unlikely," "possible," "impossible," and so on. These words are used to indicate an implicit distribution of someone's beliefs or confidence in an event.

Everyone, consciously or unconsciously, assigns a probability to these probabilistic words in their heads. However, the exact probability a person attaches to a word is obscured by personal interpretation, which makes it hard to connect the words to data. For example, does "usually" mean 40% or 80% of the time? The answer can vary between people and situations. Once the interpretation behind these words is extracted, it can be incorporated into the model development process.
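To make that gap concrete, here is a tiny illustration (all numbers invented) of how differently three people might quantify the same word:

```python
# Hypothetical elicited probabilities (0-1) for the word "usually"
# from three interviewees -- illustrative numbers only.
elicited = {
    "analyst_a": 0.40,
    "analyst_b": 0.65,
    "analyst_c": 0.80,
}

# The spread between respondents shows how ambiguous the word is.
spread = max(elicited.values()) - min(elicited.values())
print(f"'usually' spans {spread:.0%} across respondents")
```

A 40-point spread like this means two people can agree that something "usually" happens while holding very different beliefs about the data.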

The range of probabilities in people's minds can vary dramatically.

Probabilistic Words & Previous Work

Two studies of probabilistic words stand out. The first is "Words of Estimative Probability," written by Sherman Kent of the US Central Intelligence Agency (CIA) in 1964 (Words of Estimative Probability (cia.gov)). The second is "If You Say Something Is ‘Likely,' How Likely Do People Think It Is?" by Andrew Mauboussin and Michael J. Mauboussin, published in the Harvard Business Review in 2018.

Kent's study looked to remedy the fact that, within the intelligence industry, people often describe how likely an event is in non-specific terms. Using a sample of reports, he built a mapping between words and probabilities to put numbers to the uncertainty in people's statements. The original table ranged from 0% (impossibility) to 100% (certainty), with a "general area of possibility" in between, and contained seven probabilistic phrases (in order of increasing certainty): "impossible", "almost certainly not", "probably not", "chances about even", "probable", "almost certain", and "certain".

Later, Andrew and Michael Mauboussin followed up on the study with a refreshed survey that included a more diverse set of words. Their goal was to increase the number of participants and expand beyond the intelligence community, so they polled users on the web to link words to their interpreted probabilities. The authors also sought to identify differences across contextual factors such as gender and whether respondents learned English as a second language. One lesson from their study is that, to avoid misinterpretation when sharing data insights, people should explain data with probabilities rather than probabilistic words. In addition, they should use a clear methodology to collect those probabilities.

This section is just a brief overview of the articles; I highly recommend reading them in their entirety. But the question remains – what does this mean for data science?

The range of "usual" to "rare" can vary between people.

Applying Probabilistic Words in Data Science

Connecting vague word meanings to concrete examples is an excellent way to expand knowledge about a dataset and augment a dataset with additional knowledge. This additional information can increase the information available to your models and improve model performance.

Learn About Your Data Context

Beyond traditional data exploration techniques, there are many ways to learn more about a dataset. Most involve discussions with an interviewee: a subject matter expert, a content moderation team, an accountant, a user, or anyone else related to the dataset, industry, or problem.

When speaking to the interviewee, be prepared to identify their probabilistic words. Start by asking about the general statistical behavior of the data, such as correlations, and treat each correlation as a hypothesis to be proven or disproven by the interviewee from their experience. The goal is to listen for the qualifiers people put before their opinion on an action. You can use these probabilistic words to identify what is normal, what is abnormal, and what is an outlier in their experience. This can also be used to double-check whether your dataset aligns with your task. For example, is there a bias in the dataset or among the people involved that you do not know about? Does something the interviewee considers abnormal happen frequently in the data?

When identifying normal data, abnormal data, and outliers, it helps to have the image of a distribution in your head or drawn out. By looking for probabilistic words, we are looking to identify where a sample of events from our interviewee falls on the distribution. However, it's essential to ensure that we collect data from the interviewee's point of view. An event that seems normal in the data could be very unexpected or unusual inside the business process – and that knowledge is gold when modeling.

An example of an implicit distribution regarding interest rate changes.

For example, put yourself in the shoes of a data scientist tasked with developing a model to predict whether the Federal Reserve will increase interest rates. When learning about interest rates, Federal Reserve moves, and market reactions, it is crucial to gather an expert's point of view on what they believe will impact these decisions. Say we ask a fund manager what they think will happen to interest rates, and they say, "it will probably increase at a slower rate." In this case, it is important to ask them to quantify in probabilistic terms (ratios, percentages, etc.) what "probably" means to them. By asking for their understanding in percentage terms, we can begin to build an understanding of:

  • What "probably" means to them, and what economic context is needed to make it happen
  • What would cause an almost certain positive or negative interest rate move
  • What they think the Federal Reserve usually does in this situation
  • What would an unusual negative interest rate move look like
  • What would an unusual positive interest rate move look like

Follow up by asking the interviewee to describe the context of each response in detail. Clarify the probabilistic likelihood of each word they use, and listen for verbally explained statistical insights. These insights include multi-collinearity, secondary effects, and other factors that would affect the performance of the model but are not in the training dataset. To reduce the bias of the results, try to interview several people.

With this information, you are more knowledgeable about what could cause a rate change and can build out a picture of what traders believe. The data could even be used to build a web-scraped sentiment model that translates external sentiment into business sentiment.

Augment Your Dataset

Using probabilistic words, a dataset can be augmented to include insights. For example, you could add a categorical column that specifies a flag for "unusual" circumstances. You could use this data to predict feedback on a larger data set, use it as raw input to a Machine Learning model, and quantify the value of this "human" information to your data.
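A minimal sketch of this augmentation, assuming a pandas DataFrame and a hypothetical rule elicited from an interviewee ("a rate move larger than 0.5 in absolute terms is unusual"):

```python
import pandas as pd

# Toy dataset of daily interest-rate changes -- illustrative values only.
df = pd.DataFrame({"rate_change": [0.10, -0.05, 0.75, 0.20, -0.60]})

# Encode the interviewee's feedback as a categorical flag:
# they described absolute moves above 0.5 as "unusual".
df["feedback"] = (
    df["rate_change"].abs().gt(0.5).map({True: "unusual", False: "normal"})
)
print(df)
```

The new `feedback` column can now serve as a feature for a model or as a label for the "weak learning" approach described next.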

An example dataset with feedback integrated into the data.

Using the data to predict feedback on a larger dataset is called "weak learning". In this case, a model is built that uses the sample of feedback to predict what the feedback would be on the rest of the dataset. This means that a sample of feedback can be expanded to cover the whole dataset. The expanded feedback can then be used as input to another model or in exploratory data analysis. The benefit of this approach is that a small sample of feedback can be stretched across a large set of data. However, this comes at the cost of accuracy: since the model is trained on a small sample, there is a higher chance it will have increased bias or fail to behave exactly as the interviewee would.
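A sketch of the idea with scikit-learn, using synthetic data in place of real interview feedback (the features, labels, and sample size are all hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Raw features for the full dataset; only the first 20 rows have
# interviewee feedback (1 = "unusual", 0 = "normal") -- synthetic stand-ins.
X = rng.normal(size=(200, 3))
y_sample = (X[:20, 0] > 0).astype(int)  # feedback on the labeled sample

# "Weak learning": fit on the labeled sample, then propagate
# predicted feedback across the remaining unlabeled rows.
weak_model = LogisticRegression().fit(X[:20], y_sample)
predicted_feedback = weak_model.predict(X[20:])

# The expanded labels can now feed a downstream model or EDA.
print(predicted_feedback[:10])
```

In practice you would validate the weak model against a held-out slice of the interview feedback before trusting its expanded labels.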

If you need instant feedback to make predictions as part of a model, then "weak learning" can be used to build a "feedback on the fly" system. As the model makes online predictions, the "weak learning" model takes the incoming raw data, predicts what feedback would be, and then passes the raw data and predicted feedback to the main online model. This allows you to build a functional model without needing a human involved all the time.
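One way to sketch that online flow, with toy callables standing in for the trained weak-learning and main models (both are hypothetical):

```python
def feedback_on_the_fly(raw_row, weak_model, main_model):
    """Predict feedback for an incoming row, then pass the raw data
    plus the predicted feedback to the main online model."""
    feedback = weak_model(raw_row)  # stand-in for weak_model.predict()
    return main_model({**raw_row, "feedback": feedback})

# Toy stand-ins: the weak model flags large moves as "unusual",
# and the main model scores the row using the injected feedback.
weak = lambda row: "unusual" if abs(row["rate_change"]) > 0.5 else "normal"
main = lambda row: 0.9 if row["feedback"] == "unusual" else 0.1

print(feedback_on_the_fly({"rate_change": 0.75}, weak, main))
```

The key design point is that the main model never sees a row without a feedback value, even though no human is in the loop at prediction time.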

Another great part about collecting this data is that it can be used to quantify how much more information interviews provide than raw data alone. This can be evaluated by building a model with and without the feedback. After both models are trained, the difference between their scores gives you the relative value of the feedback. If your model scores 15% better with the feedback data than the model trained only on raw data, you have evidence that interviews improve your model's performance. If that 15% improvement can be linked to business impact, it can help justify interviewing costs and put a dollar value on the feedback. For example, if a forecasting model improves performance by 15%, and that translates to $200,000 of value, then the feedback is worth that $200,000.
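The arithmetic behind that valuation, with invented scores and an invented per-point dollar figure:

```python
# Hypothetical evaluation scores for models trained with and without
# the interview-derived feedback features.
score_raw = 0.70
score_with_feedback = 0.805

relative_gain = (score_with_feedback - score_raw) / score_raw
print(f"Relative improvement from feedback: {relative_gain:.0%}")

# If the business values each percentage point of improvement at,
# say, $13,333 (purely illustrative), the dollar value follows directly.
dollar_value = relative_gain * 100 * 13_333
print(f"Approximate value of feedback: ${dollar_value:,.0f}")
```

Swapping in your own scores and per-point business value turns this into a quick justification for the interviewing effort.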

Build Your Own Probabilistic Survey

Now that probabilistic words have been introduced, you can use this knowledge to create your own probabilistic surveys.

To get started, take inspiration from the common probabilistic words in the original studies referenced earlier. Feel free to add your own words and the probabilistic words frequently used by your organization. It may also help to spend a week identifying common words used in your meetings and keep a running list of the most frequent ones. An example from my professional background is "uncertainty": if I were writing a probabilistic survey for my organization, I would want to include items like "great uncertainty" and "almost certain" in the list of words to get feedback on. Remember, you can always add words later and gather more responses, so perfection isn't required.

Once a list of words has been gathered, you need an architecture to collect responses from people or another data source. If you already have a data source, you can use your favorite method to feed the data into your processes. If you're collecting feedback from people inside your organization, it helps to have a simple survey architecture set up to facilitate collection. This could be something like Google Forms, Microsoft Forms, or Streamlit. My usual go-to is Streamlit, as it is quick to set up, is built in Python, and can run on my local PC as needed or be hosted online.

You could also collect basic metadata about each person as they supply feedback. Metadata like level in the company, department, and years of experience can be useful for segmenting how different groups use probabilistic words. Once a significant amount of feedback has been collected, you can aggregate the data and analyze the distribution of responses for each word. From these distributions, you will be able to answer questions like:

  • What words have the same probability meaning?
  • Do words that have similar definitions have different probabilities associated with them? For example, "certainty" and "sure thing" could have 70% and 95% average probabilities associated with them.
  • What is the spread of probabilities for a word?
  • What are the distribution characteristics of the words? (Mean, median, mode, standard deviation, etc.)
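These summary statistics are straightforward to compute once responses are collected; for instance (with invented survey responses):

```python
import statistics

# Hypothetical survey responses: each person's probability (%) for a word.
responses = {
    "almost certain": [90, 95, 85, 92, 97],
    "probable": [70, 75, 60, 80, 65],
    "great uncertainty": [20, 35, 25, 40, 30],
}

# Per-word distribution characteristics: central tendency and spread.
for word, probs in responses.items():
    print(
        f"{word}: mean={statistics.mean(probs):.0f}% "
        f"median={statistics.median(probs):.0f}% "
        f"stdev={statistics.stdev(probs):.1f} "
        f"spread={max(probs) - min(probs)}"
    )
```

Words with large spreads are the ones most likely to cause miscommunication and are worth pinning down explicitly in interviews.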

After this analysis, you can use this information to help connect the dots between nondeterministic words collected during interviews, feedback sessions, and exploratory data analysis. This data can be used in a data pipeline to classify probabilities in statements, identify unique situations, and improve model outcomes.

All images are by the author unless otherwise specified.
