Simplifying Transformers: State of the Art NLP Using Words You Understand – Part 2: Inputs


Inputs

Dragons hatch from eggs, babies spring out from bellies, AI-generated text starts from inputs. We all have to start somewhere. What kind of inputs? It depends on the task at hand. If you're building a language model, software that knows how to generate relevant text (the Transformer architecture is useful in diverse scenarios), the input is text. But can a computer receive any kind of input (text, image, sound) and magically know how to process it? It cannot.

I'm sure you know people who aren't very good with words but are great with numbers. The computer is something like that. It can't process text directly on the CPU/GPU (where the calculations happen), but it can certainly work with numbers! As you will soon see, the way we represent these words as numbers is a crucial ingredient in the secret sauce.
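To make that concrete, here is a minimal sketch of the idea in plain Python, with a made-up four-word vocabulary (real vocabularies hold tens of thousands of entries):

```python
# Toy vocabulary: every word the model knows gets a unique integer ID.
vocab = {"dragons": 0, "hatch": 1, "from": 2, "eggs": 3}

def words_to_ids(sentence: str) -> list[int]:
    """Map each word to its integer ID so the machine can crunch it."""
    return [vocab[word] for word in sentence.lower().split()]

print(words_to_ids("Dragons hatch from eggs"))  # -> [0, 1, 2, 3]
```

The numbers themselves carry no meaning yet; they are just handles the machine can compute with.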

Image from the original paper by Vaswani, A. et al.

Tokenizer

Tokenization is the process of transforming the corpus (all the text you've got) into smaller parts that the machine can make better use of. Say we have a dataset of 10,000 Wikipedia articles. We take the text and transform (tokenize) it into those smaller parts. There are many ways to tokenize text; let's see how OpenAI's tokenizer handles the following text:

"Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes:
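You can reproduce this locally with OpenAI's open-source tiktoken library, which exposes the same encodings as the web demo; a minimal sketch, assuming the cl100k_base encoding:

```python
import tiktoken

# Load one of OpenAI's published encodings (used by recent models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Many words map to one token, but some don't: indivisible."
token_ids = enc.encode(text)

print(token_ids)                  # integer IDs, one per token
print(len(text), len(token_ids))  # characters vs. tokens
print(enc.decode(token_ids))      # decoding round-trips to the original text
```

Note that the exact IDs and token boundaries depend on which encoding you load; different models ship with different tokenizers.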

