Vector Representations for Machine Learning


Machine learning engineers leverage numerical representations of the world to build and train predictive algorithms.

In the context of supervised learning, these representations allow the computer to learn the relationship between the input features and the target variable.

Let's imagine a vector as simply a list of numbers

X = [1, 2, 3, 4, 5]

This list is related to the target variable y

X = [1, 2, 3, 4, 5]; y = 1

The machine learning model learns the relationship between the features and the target and outputs a prediction. In this case it is a classification, where the predicted class is identified by the number 1.
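To make this concrete, here is a minimal sketch in Python using scikit-learn; the feature vectors, labels, and choice of model are made up purely for illustration and are not part of the original example.

# A minimal sketch: fitting a classifier on purely numerical
# feature vectors (made-up data, illustrative only).
from sklearn.linear_model import LogisticRegression

X = [
    [1, 2, 3, 4, 5],   # one observation = one vector of features
    [5, 4, 3, 2, 1],
    [1, 1, 2, 2, 3],
    [9, 8, 7, 6, 5],
]
y = [1, 0, 1, 0]       # target class for each observation

model = LogisticRegression()
model.fit(X, y)                          # learn the feature-target relationship
print(model.predict([[2, 2, 3, 4, 5]]))  # prediction for a new vector, e.g. [1]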

In this post, I will write about how vectors can be used to represent complex concepts in a number format.

The rationale is that a machine learning model cannot learn from observations that are not provided to it in numerical format.

Text, images, sounds and other input observations must first be transformed into a numerical format suitable for learning.

There are various techniques for transforming a phenomenon into vectors, and these depend on the type of data we are working with:

  • We will start by introducing the concept of One-Hot Encoding, a technique used to represent words as numerical vectors
  • Next, we will discuss the limitations of this technique and introduce the concept of embeddings, which allows words, images, sounds, and more to be represented as numerical vectors far smaller than the thousands of dimensions required by One-Hot Encoding
  • We will also mention the TF-IDF and bag of words models, which are fundamental in text vectorization

How do we encode a phenomenon into a vector?

We will use text as our running example. The choice is a natural one: as we can guess, machine learning models cannot use text directly for learning. We first need to turn each character or word into a number.

Let's say we want to create a numerical representation of the following words

  • King
  • Queen
  • Prince
  • Princess

The simplest way to encode these words would be to assign each of them a number, sequentially.

Image by author

The words have been correctly transformed into numerical format, following this mapping:

map = {
 "King": 1,
 "Queen": 2,
 "Prince": 3,
 "Princess": 4
}

But there's a problem. If we fed this data to a predictive model, it would treat Prince and Princess as numerically larger, and therefore more important, than King and Queen, even though these numbers are just arbitrary labels.

Obviously this would give the model misleading information, and it would learn wrong relationships. We need to make our numerical representation more precise.

One-Hot Encoding

To solve the numeric representation problem described above, the One-Hot Encoding technique can be used.

In this case, each word would be represented by a numerical vector with a size equal to the total number of words to be represented. The vector would have all values equal to zero, except one, which represents the specific word.

For example, in the case of the four words "King", "Queen", "Prince" and "Princess", each word would be represented by an array of four elements, with the value "1" in the position corresponding to the word and "0" in all other positions.

This technique solves the problem of assigning a higher mathematical value to words that are no more important than the others in the numerical representation.

Image by author

Now our model has a "balanced" vectorial representation for each word belonging to the dataset (which in this case consists of only 4 words).
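As a small illustration, here is how this encoding could be written in Python; the one_hot helper is hypothetical and written only for this example.

# One-Hot Encoding of the four example words.
words = ["King", "Queen", "Prince", "Princess"]

def one_hot(word, vocabulary):
    # Vector of zeros with a single 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

for w in words:
    print(w, one_hot(w, words))
# King [1, 0, 0, 0]
# Queen [0, 1, 0, 0]
# Prince [0, 0, 1, 0]
# Princess [0, 0, 0, 1]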

But… what if our vocabulary is made up of thousands or even millions of words? Considering that there are about 270,000 words in the Italian dictionary, applying one-hot encoding would be problematic to say the least.

The computational resources needed to carry out this encoding would be considerable, and the final representation would be "only" balanced: it contains no information about the relationships between the words.

Embeddings

To overcome the limitations of One-Hot Encoding, we can use the technique called embedding. It allows words to be represented as numerical vectors of controllable size, far smaller than the thousands of dimensions required by One-Hot Encoding.

The idea is to create a numerical representation of the words that takes into account the semantic relationships between the words themselves.

In practice, each word is represented as a vector of real numbers, where each dimension represents a different aspect of the meaning of the word.

The intuition behind embeddings is simple: related words should appear close together in vector space, while unrelated words should appear far apart.

Let's try to create a graph where we capture some of the characteristics of the words mentioned before.

Image by author

We see how close the words prince and princess are to each other, just like king and queen.

Assuming that the gender variable can take only two values, M and F (we use 0 and 1), and that the age variable can take only three values, Young, Middle-aged and Elderly (we use 0, 1 and 2), we can see how embeddings represent these relationships:

Image by author

This representation manages to capture the noble status of an individual by using the dimensions of gender and age.

Moving along the X axis, we can observe the gender dimension (0: male, 1: female), which separates king and prince from queen and princess. Moving along the Y axis, we can observe the age dimension, which separates the elderly king and queen from the young prince and princess.
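To make the idea tangible, here is a small Python sketch of these hand-crafted two-dimensional "embeddings"; the exact coordinates are illustrative assumptions, not learned vectors.

# Hand-crafted 2D vectors: x = gender (0 male, 1 female), y = age (0-2).
import numpy as np

embeddings = {
    "King":     np.array([0.0, 2.0]),   # male, elderly
    "Queen":    np.array([1.0, 2.0]),   # female, elderly
    "Prince":   np.array([0.0, 0.0]),   # male, young
    "Princess": np.array([1.0, 0.0]),   # female, young
}

def distance(a, b):
    return np.linalg.norm(embeddings[a] - embeddings[b])

print(distance("King", "Queen"))      # 1.0  -> close: same age, different gender
print(distance("King", "Princess"))   # ~2.24 -> farther apart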

In this way, word embeddings can be used as input to machine learning models, allowing complex concepts to be more accurately represented in a numerical format.

In this example we have only two dimensions. In practice, neural networks are trained with the specific task of learning these representations across many more dimensions.
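As a hint of how this looks in practice, here is a minimal sketch of a trainable embedding layer in PyTorch; the vocabulary, the 8-dimensional size, and the variable names are assumptions made only for this example.

# A learnable embedding layer: each word id maps to a trainable vector.
import torch
import torch.nn as nn

vocab = {"King": 0, "Queen": 1, "Prince": 2, "Princess": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

ids = torch.tensor([vocab["King"], vocab["Queen"]])
vectors = embedding(ids)   # shape (2, 8); the values are updated during training
print(vectors.shape)       # torch.Size([2, 8])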

To put that into perspective, models like GPT-3 use more than 12,000 dimensions.

A milestone in the industry

Embeddings can be used not only for words, but also to represent images, sounds, and more.

The use of vector representations is critical in today's machine learning. The various innovations and technologies in the field of deep learning cascade from the concept of vectorization.

Models like GPT-3.5 are born from combining vector representations, well-studied optimization algorithms, and large amounts of computational resources.

There is theoretically no limit to this approach.

More data → Higher quality vectors → Models that will use those vectors for better training.

Limits of embeddings

Although embeddings are a very useful technique for representing complex concepts in numerical format, they also have limitations.

In particular, it is important to underline that embeddings are built from the training data, and can therefore be influenced by any bias present in that data.

As mentioned, the quality of the embeddings depends on the quality of the training data. If the training data is not representative of the domain in which the model will be used, the embeddings may not be able to capture all semantic relationships between concepts.

Also, embeddings can require a lot of memory to store, especially if the number of dimensions is large. This can be especially problematic for machine learning models that need to run on resource-constrained devices, such as mobile devices.

Other ways of representing text

Since text is the most common data format around us (just think of the huge amount of textual data on the internet), some text vectorization techniques are common and well known.

One of these is the TF-IDF transformation, a text vectorization technique that assigns a weight to each word based on its frequency within a document and its overall frequency across the corpus.

This way, words that appear frequently within a document but rarely within the corpus will have more weight than those that appear frequently everywhere. This technique is widely used in the field of Natural Language Processing for text analysis.
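As a quick sketch, TF-IDF vectors can be computed with scikit-learn's TfidfVectorizer; the three documents below are made up for illustration.

# TF-IDF vectorization of three toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the king and the queen",
    "the prince and the princess",
    "the queen and the princess",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())  # ['and' 'king' 'prince' 'princess' 'queen' 'the']
print(X.shape)                             # (3, 6)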

I invite the interested reader to learn more about the TF-IDF model by reading the following article

Text Clustering with TF-IDF in Python

TF-IDF is based on the bag of words model, which represents a document as an unordered set of words, ignoring sentence structure and word order.

In this way, the bag of words can be used to represent any document as an array of numeric values, where each value represents the frequency of a word within the document. Of course, it provides no adequate representation of the relationships between words, which is exactly what embeddings add.
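For comparison, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer, again on made-up documents.

# Bag of words: each document becomes a vector of raw word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the king and the queen", "the queen and the princess"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['and' 'king' 'princess' 'queen' 'the']
print(X.toarray())
# [[1 1 0 1 2]
#  [1 0 1 1 2]]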

Conclusion

In this post we have seen how vectors can be used to represent complex concepts in a numeric format.

It is important for a data scientist to think in terms of vectorization. Questions like

  • How can I convert this stimulus into a number?
  • How is this data interpreted by the neural network?
  • How can I improve this representation?

are critical, and the team that can adequately answer these questions will create better systems.

Data scientists see the world in terms of vectors.


