From Encodings to Embeddings

Author:Murphy  |  View: 26598  |  Time: 2025-03-23 12:45:55

In this article, we will talk about two fundamental concepts in data representation and Machine Learning: encoding and embedding. The content of this article is partly taken from one of my lectures in the CS246 Mining Massive DataSet (MMDS) course at Stanford University. I hope you find it useful.

Introduction

All Machine Learning (ML) methods work with input feature vectors, and almost all of them require the input features to be numerical. From an ML perspective, there are four types of features:

  1. Numerical (continuous or discrete): numerical data can be continuous or discrete. Continuous data can assume any value within a range, whereas discrete data takes distinct values. An example of a continuous numerical variable is height, and an example of a discrete numerical variable is age.
  2. Categorical (ordinal or nominal): categorical data represents characteristics such as eye color and hometown. Categorical data can be ordinal or nominal. In an ordinal variable, the data falls into ordered categories that are ranked in some particular way. An example is skill level, which takes values in [beginner, intermediate, advanced]. A nominal variable has no order among its values. An example is eye color, which takes values in [black, brown, blue, green].
  3. Time series: a time series is a sequence of numbers collected at regular intervals over some period of time. Unlike the previous variables, this data is ordered in time. An example is the average home sale price in the USA over the years.
  4. Text: any document is text data, which we often represent as a 'bag of words'.

To feed any of these variables to an ML model, we have to convert them into numerical form. Both encoding and embedding techniques do this.
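As a small illustration of turning a non-numerical feature into numbers, here is a minimal sketch of the 'bag of words' representation mentioned in the list above. The function name `bag_of_words`, the whitespace tokenization, and the example sentence are assumptions made for illustration only; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Count how often each lowercase, whitespace-separated token occurs."""
    tokens = document.lower().split()
    return Counter(tokens)


# Hypothetical example sentence (not taken from the article).
counts = bag_of_words("the cat sat on the mat")

# Fix an ordering of the vocabulary so the counts form a numerical vector.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Word order is discarded here: only the count of each word survives, which is what makes the representation a "bag".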

Encoding

Encoding is the process of converting raw data, such as text, images, or audio, into a structured numerical format that can be easily processed by computers. There are three ways to encode a categorical variable:

1️⃣ Integer encoding

2️⃣ One-hot encoding

3️⃣ Multi-hot encoding (this is the extension of one-hot encoding)
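The three encodings listed above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the helper names (`integer_encode`, `one_hot_encode`, `multi_hot_encode`) are hypothetical, the category list reuses the eye-color values from the introduction, and the multi-hot input is an invented case in which several categories apply to one sample at once.

```python
def integer_encode(value, categories):
    """Map a category to its position in the category list."""
    return categories.index(value)


def one_hot_encode(value, categories):
    """Return a vector with a single 1 at the position of the category."""
    return [1 if category == value else 0 for category in categories]


def multi_hot_encode(values, categories):
    """Return a vector with a 1 for every category present in `values`."""
    present = set(values)
    return [1 if category in present else 0 for category in categories]


# Category order is an assumption; any fixed order works as long as it is reused.
colors = ["black", "brown", "blue", "green"]

blue_as_integer = integer_encode("blue", colors)
blue_as_one_hot = one_hot_encode("blue", colors)
# Hypothetical sample to which two categories apply at once.
brown_and_green = multi_hot_encode(["brown", "green"], colors)

print(blue_as_integer, blue_as_one_hot, brown_and_green)
```

Integer encoding assigns an order to the categories (here black < brown < blue < green), which suits ordinal variables but can mislead a model when the variable is nominal. One-hot and multi-hot vectors avoid that implied order, at the cost of one dimension per category.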

To explain each method let's work through the following example:
