From Encodings to Embeddings

In this article, we will talk about two fundamental concepts in the fields of data representation and Machine Learning: Encoding and Embedding. The content of this article is partly taken from one of my lectures in the CS246 Mining Massive Datasets (MMDS) course at Stanford University. I hope you find it useful.
Introduction
All Machine Learning (ML) methods work with input feature vectors, and almost all of them require input features to be numerical. From an ML perspective, there are four types of features:
- Numerical (continuous or discrete): numerical data can be continuous or discrete. Continuous data can assume any value within a range, whereas discrete data takes distinct values. An example of a continuous numerical variable is `height`, and an example of a discrete numerical variable is `age`.
- Categorical (ordinal or nominal): categorical data represents characteristics such as eye color and hometown. Categorical data can be ordinal or nominal. In an ordinal variable, the data falls into ordered categories that are ranked in some particular way. An example is `skill level`, which takes values of [`beginner`, `intermediate`, `advanced`]. A nominal variable has no order among its values. An example is `eye color`, which takes values of [`black`, `brown`, `blue`, `green`].
- Time series: a time series is a sequence of numbers collected at regular intervals over some period of time. Unlike the previous variables, this data is ordered in time. An example is the `average home sale price over years in the USA`.
- Text: any document is text data, which we often represent as a "bag of words".
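To make the four feature types concrete, here is a small sketch in plain Python (the specific values and field names are illustrative assumptions, not taken from the lecture). It also shows the one property that separates ordinal from nominal variables: an ordinal value has a rank we can read off directly.

```python
# Illustrative examples of the four feature types (values are assumptions):
features = {
    "height_cm": 172.5,             # numerical, continuous
    "age": 34,                      # numerical, discrete
    "skill_level": "intermediate",  # categorical, ordinal
    "eye_color": "brown",           # categorical, nominal
    "home_price_by_year": [310_000, 325_000, 342_000],  # time series (ordered)
    "bio": "enjoys hiking and photography",             # text
}

# An ordinal variable has a meaningful order we can exploit directly;
# a nominal variable like eye_color has no such ranking.
skill_order = ["beginner", "intermediate", "advanced"]
skill_rank = skill_order.index(features["skill_level"])  # -> 1
```

Note that for `eye_color` no analogous ranking exists, which is exactly why nominal variables need a different treatment (such as one-hot encoding, discussed below).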
To feed any of these variables to an ML model, we first have to convert them into numerical form. Both encoding and embedding techniques do this.
Encoding
Encoding is the process of converting raw data, such as text, images, or audio, into a structured numerical format that can be easily processed by computers. There are three ways to encode a categorical variable:
1️⃣ Integer encoding
2️⃣ One-hot encoding
3️⃣ Multi-hot encoding (an extension of one-hot encoding)
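Before the worked example, here is a minimal sketch of all three encodings in plain Python, using the `eye color` categories from above (the specific inputs are assumptions for illustration):

```python
# Categories of the nominal variable from the earlier example.
colors = ["black", "brown", "blue", "green"]

# 1) Integer encoding: map each category to a distinct integer (its index).
int_code = {c: i for i, c in enumerate(colors)}  # "brown" -> 1

# 2) One-hot encoding: a vector with a single 1 at the category's index.
def one_hot(value, categories):
    return [1 if c == value else 0 for c in categories]

# 3) Multi-hot encoding: a vector with a 1 for every category present,
#    used when an example can belong to several categories at once.
def multi_hot(values, categories):
    return [1 if c in values else 0 for c in categories]

# one_hot("blue", colors)               -> [0, 0, 1, 0]
# multi_hot({"black", "green"}, colors) -> [1, 0, 0, 1]
```

Integer encoding imposes an arbitrary order on a nominal variable, which is why one-hot encoding is usually preferred for nominal data; each method is explained in detail below.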
To explain each method let's work through the following example: