Going Under the Hood of Character-Level RNNs: A NumPy-based Implementation Guide


Introduction

Recurrent neural networks (RNNs) are a powerful type of neural network that have the ability to process sequential data, such as time series or natural language. In this article, we will walk through the process of building a Vanilla RNN from scratch using NumPy. We will begin by discussing the theory and intuition behind RNNs, including their architecture and the types of problems they are well-suited for solving. Next, we will dive into the code, explaining the various components of the RNN and how they interact with one another. Finally, we will demonstrate the effectiveness of our RNN by applying it to a real-world dataset.

Specifically, we will be implementing a many-to-many, character-level RNN that uses sequential, online learning. This means that the network processes the input sequences one character at a time and updates the parameters of the network after each character. This allows the network to learn on-the-fly and adapt to new patterns in the data as they are encountered.

A character-level RNN means that the input and output are individual characters, rather than words or sentences. This allows the network to learn the underlying patterns and dependencies between characters in a piece of text. The many-to-many architecture refers to the fact that the network receives an input sequence of characters and generates an output sequence of characters. This is different from a many-to-one architecture, where the network receives an input sequence and generates only one output, or a one-to-many architecture, where the network receives only one input and generates an output sequence.

I used Andrej Karpathy's code (found [here](https://github.com/j0sephsasson/numpy-NN)) as a foundation for my implementation, making several modifications to improve versatility and reliability. I expanded the code to support multiple layers, and also restructured it for better readability and reusability. This project builds on my previous work of creating a simple ANN using NumPy. The source code for that can be found here.


Theory & Intuition

RNNs can be contrasted with traditional feedforward neural networks (ANNs), which do not have a "memory" mechanism and process each input independently. ANNs are well-suited for problems where the input and output have a fixed size and the input does not contain sequential dependencies. In contrast, RNNs are able to handle variable length input sequences and maintain a "memory" of past inputs through a hidden state.

The hidden state allows RNNs to capture temporal dependencies and make predictions based on the entire input sequence. To summarize, the network uses information from previous time steps to inform its processing of current inputs. Additionally, more complex NLP architectures can handle long-term dependencies (GPT-3 was trained using a sequence length of 2048), where information from the beginning of the input sequence is still relevant for predicting the output at the end of the sequence. This ability to maintain a "memory" gives RNNs & transformers a significant advantage over ANNs when it comes to processing sequential data.
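To make the recurrence concrete, here is a minimal sketch of a hidden state being threaded through a sequence (illustrative only; the shapes and names here are stand-ins, not the implementation we build below):

import numpy as np

hidden_size, input_size = 4, 3
Wxh = np.random.randn(hidden_size, input_size) * 0.01   # input -> hidden weights
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden weights
b = np.zeros((hidden_size, 1))

h = np.zeros((hidden_size, 1))  # the "memory" starts empty
for x_t in [np.random.randn(input_size, 1) for _ in range(5)]:  # one time step per input
    h = np.tanh(np.dot(Wxh, x_t) + np.dot(Whh, h) + b)  # new memory depends on old memory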

Recently, transformer architectures such as GPT-3 and BERT have become increasingly popular for a wide range of NLP tasks. These architectures are based on self-attention mechanisms that allow the network to selectively focus on different parts of the input sequence. This allows the network to capture long-term dependencies without the need for recurrence, making it more efficient and easier to train than RNNs. The transformer architectures have been shown to achieve state-of-the-art results on a wide range of NLP tasks and have been used in many real-world applications.

Although the transformer architectures are more complex than the vanilla RNNs and have different characteristics, the vanilla RNNs still have an important role to play in the field of deep learning. They are simple to understand, easy to implement and debug, and can be used as a building block for other more complex architectures. In this article, we will focus on the vanilla RNNs and peek under the hood to see how they really work.

Three main types of vanilla RNNs are:

  • one-to-many: input a single item and output a sequence, e.g. image captioning (a picture of a dog in, the caption ‘picture of a dog' out)
  • many-to-one: input a sequence and receive a single output, e.g. sentiment analysis (a sentence in, a sentiment out)
  • many-to-many: input a sequence and output a sequence, e.g. continuing a sentence character by character (seen below)

We will be implementing the many-to-many architecture as seen below.

Source: Kaivan Kamali, Deep Learning (Part 2) – Recurrent neural networks (RNN) (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/RNN/tutorial.html

From this point on, we will denote the hidden state at time step t using h[t]. In the figure this is s[t].

As you can see, the hidden state from the previous time step h[t-1] is combined with the current input x[t], and this is repeated over the number of time steps. Inside the RNN block, we are updating the hidden state for the current time step.

For clarification, a time step is just a character, such as ‘a' or ‘d'. An input sequence contains a variable number of characters or time steps, also known as sequence length, which is a hyper-parameter of the network.
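As a tiny concrete example (illustrative only): with seq_length = 5, the string "hello" forms one input sequence of five time steps, one character each.

for t, ch in enumerate("hello"):
    print(f"time step {t}: {ch!r}")
# time step 0: 'h'
# ...
# time step 4: 'o'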


The Code

Table of Contents

  1. Prepare Data
  2. RNN Class
  3. Forward Pass
  4. Backward Pass
  5. Optimizer
  6. Training

Prepare Data

## start with data
data = open('path-to-data', 'r').read() # should be simple plain text file

chars = list(set(data))
data_size, vocab_size = len(data), len(chars)

print('data has {} characters, {} unique.'.format(data_size, vocab_size))

char_to_idx = { ch:i for i,ch in enumerate(chars) }
idx_to_char = { i:ch for i,ch in enumerate(chars) }

We are reading in the data as a string from a plain text file and tokenizing the characters. Each unique character (there are 65 in this dataset) will be mapped to an integer, and vice versa.
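For example, we can round-trip a character through the two lookup tables (the exact integer varies, since it depends on the ordering of the set built from your file):

ch = data[0]
ix = char_to_idx[ch]
print(ch, ix, idx_to_char[ix])  # e.g. 'F' 42 'F' -- the index itself will differ per run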

Let's sample an input & target sequence to our RNN.

pointer, seq_length = 0, 8

x = [char_to_idx[ch] for ch in data[pointer:pointer+seq_length]]

y = [char_to_idx[ch] for ch in data[pointer+1:pointer+seq_length+1]]

print(x)
>> [2, 54, 53, 62, 13, 28, 20, 54] # our RNN input sequence

print(y)
>> [54, 53, 62, 13, 28, 20, 54, 13] # our RNN target sequence

for t in range(seq_length):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

>>  when input is [2] the target: 54
    when input is [2, 54] the target: 53
    when input is [2, 54, 53] the target: 62
    when input is [2, 54, 53, 62] the target: 13
    when input is [2, 54, 53, 62, 13] the target: 28
    when input is [2, 54, 53, 62, 13, 28] the target: 20
    when input is [2, 54, 53, 62, 13, 28, 20] the target: 54
    when input is [2, 54, 53, 62, 13, 28, 20, 54] the target: 13

The inputs are a tokenized sequence; the targets are the same sequence shifted one character to the right.
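If you would like that slicing as a reusable helper, a small convenience wrapper (my own addition, not part of the original code) might look like:

def sample_sequence(data, char_to_idx, pointer, seq_length):
    """Return one (x, y) pair; the targets are the inputs shifted right by one."""
    x = [char_to_idx[ch] for ch in data[pointer:pointer + seq_length]]
    y = [char_to_idx[ch] for ch in data[pointer + 1:pointer + seq_length + 1]]
    return x, y

x, y = sample_sequence(data, char_to_idx, pointer=0, seq_length=8)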

RNN Class

from typing import Any
import numpy as np

class RNN:
    def __init__(self, hidden_size, vocab_size, seq_length, num_layers):
        pass 

    def __call__(self, *args: Any, **kwds: Any):
        """RNN Forward Pass"""

        pass 

    def backward(self, targets, cache):
        """RNN Backward Pass"""

        pass

    def update(self, grads, lr):
        """Perform Parameter Update w/ Adagrad"""

        pass

    def predict(self, hprev, seed_ix, n):
        """
        Make predictions using the trained RNN model.

        Parameters:
        hprev (numpy array): The previous hidden state.
        seed_ix (int): The seed letter index to start the prediction with.
        n (int): The number of characters to generate for the prediction.

        Returns:
        ixes (list): The list of predicted character indices.
        hs (numpy array): The final hidden state after making the predictions.
        """

        pass 
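One note: the body of predict is never shown in this article, so here is a minimal sampling sketch consistent with the forward pass we build below, adapted from Karpathy's min-char-rnn sampling loop. Treat it as an assumption about the author's code, not a copy of it; it returns just the index list, which is how train() consumes it later.

def predict(self, hprev, seed_ix, n):
    """Sample n character indices from the model, starting from seed_ix."""
    x = np.zeros((self.vocab_size, 1))  # one-hot encode the seed character
    x[seed_ix] = 1
    hs = [np.copy(h) for h in hprev]  # per-layer hidden states
    ixes = []
    for _ in range(n):
        for l in range(self.num_layers):
            hs[l] = np.tanh(np.dot(self.Wxh[l], x) + np.dot(self.Whh[l], hs[l]) + self.bh[l])
        y = np.dot(self.Why, hs[-1]) + self.by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(self.vocab_size), p=p.ravel())  # sample the next char
        x = np.zeros((self.vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes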

Let's start by discussing the components of an RNN, in contrast to a basic ANN.

In a traditional feedforward neural network, the parameters governing the interactions between layers are represented by a single weight matrix, denoted as W. However, in a Recurrent Neural Network (RNN), the interactions between layers are represented by multiple matrices. In my code, these matrices are specifically: Wxh, Whh, and Why, representing the weights between input and hidden layers, hidden to hidden layers, and hidden to output layers respectively.

The Wxh matrix connects the input layer to the hidden layer, and is used to transform the input at each time step into a set of activations for the hidden layer. The Whh matrix connects the hidden layer at time step t-1 to the hidden layer at time step t, and is used to propagate the hidden state from one time step to the next. The Why matrix connects the hidden layer to the output layer, and is used to transform the hidden state into the final output of the network.

In summary, the main difference between the weights in an ANN and an RNN is that the ANN has one weight matrix, while the RNN has multiple weight matrices that are used to transform the input, propagate the hidden state, and produce the final output. These multiple weights matrices in the RNN allow it to maintain a memory of past inputs and move information through time.
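Concretely, with the hyper-parameters we use later (hidden_size = 100, vocab_size = 65), the shapes are:

# Wxh: (100, 65)  -- maps a one-hot input (65, 1) into hidden space (100, 1)
# Whh: (100, 100) -- carries the hidden state (100, 1) from one time step to the next
# Why: (65, 100)  -- maps the hidden state (100, 1) to output logits (65, 1)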

The Constructor

def __init__(self, hidden_size, vocab_size, seq_length, num_layers):
    self.name = 'RNN'
    self.hidden_size = hidden_size
    self.vocab_size = vocab_size
    self.num_layers = num_layers

    # model parameters
    self.Wxh = [np.random.randn(hidden_size, vocab_size)*0.01 for _ in range(num_layers)] # input to hidden
    self.Whh = [np.random.randn(hidden_size, hidden_size)*0.01 for _ in range(num_layers)] # hidden to hidden
    self.Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
    self.bh = [np.zeros((hidden_size, 1)) for _ in range(num_layers)] # hidden bias
    self.by = np.zeros((vocab_size, 1)) # output bias

    # memory variables for training (ada grad from karpathy's github)
    self.iteration, self.pointer = 0, 0
    self.mWxh = [np.zeros_like(w) for w in self.Wxh]
    self.mWhh = [np.zeros_like(w) for w in self.Whh] 
    self.mWhy = np.zeros_like(self.Why)
    self.mbh, self.mby = [np.zeros_like(b) for b in self.bh], np.zeros_like(self.by)
    self.loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0

    self.running_loss = []

Here we are defining our RNN parameters as discussed above. Something interesting to take note of – the parameters Why and by represent a linear layer, and could be abstracted even more, into a separate class such as PyTorch's ‘nn.Linear' module. However, we will keep them as a part of the RNN class for this implementation.
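As a sketch of what that abstraction might look like (hypothetical; we will not actually use it here):

class Linear:
    """A bare-bones stand-in for something like PyTorch's nn.Linear."""
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(out_features, in_features) * 0.01
        self.b = np.zeros((out_features, 1))

    def __call__(self, x):
        return np.dot(self.W, x) + self.b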

Forward Pass

def __call__(self, *args: Any, **kwds: Any) -> Any:
    """RNN Forward Pass"""

    x, y, hprev = kwds['inputs'], kwds['targets'], kwds['hprev']

    loss = 0
    xs, hs, ys, ps = {}, {}, {}, {} # inputs, hidden state, output, probabilities
    hs[-1] = np.copy(hprev)

    # forward pass
    for t in range(len(x)):
        xs[t] = np.zeros((self.vocab_size,1)) # encode in 1-of-k representation
        xs[t][x[t]] = 1
        hs[t] = np.copy(hprev)

        if kwds.get('dropout', False): # use dropout layer (mask)

            for l in range(self.num_layers):
                dropout_mask = (np.random.rand(*hs[t-1][l].shape) < (1-0.5)).astype(float)
                hs[t-1][l] *= dropout_mask
                hs[t][l] = np.tanh(np.dot(self.Wxh[l], xs[t]) + np.dot(self.Whh[l], hs[t-1][l]) + self.bh[l]) # hidden state
                hs[t][l] = hs[t][l] / (1 - 0.5)

        else: # no dropout layer (mask)

            for l in range(self.num_layers):
                hs[t][l] = np.tanh(np.dot(self.Wxh[l], xs[t]) + np.dot(self.Whh[l], hs[t-1][l]) + self.bh[l]) # hidden state

        ys[t] = np.dot(self.Why, hs[t][-1]) + self.by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][y[t],0]) # softmax (cross-entropy loss)

    self.running_loss.append(loss)

    return loss, hs[len(x)-1], {'xs':xs, 'hs':hs, 'ps':ps}

Let's start at the top, and break it down. What is happening here?

Outside the loop, once per sequence

  • hprev is our initial hidden state
  • We initialize dictionaries to hold our inputs, hidden states, logits, and probabilities. We will need these during the backward pass.
  • We initialize the loss to zero.
  • We store our initial hidden state hprev as hs[-1] (representing time step t-1).

Inside the loop, for every time step in the sequence

  • Perform one-hot encoding on our input sequence
  • Update the hidden state at time step ‘t' and layer ‘l'
h[t][l] = tanh(Wxh[l] · x[t] + Whh[l] · h[t-1][l] + bh[l])
  • Also, you probably noticed there is some functionality to perform dropout.

Dropout is a regularization technique that aims to prevent overfitting by randomly "dropping out" (setting to zero) certain neurons during training. In the code above, the dropout layer is applied before updating the hidden state at time step t, layer l, by multiplying the hidden state at t-1 with a dropout mask. The mask is a binary array where each element is 1 with probability 1-p (the keep probability; here p = 0.5) and 0 otherwise. By zeroing a random subset of neurons in the hidden state, we prevent the network from becoming too dependent on any single neuron, which makes it more robust and less likely to overfit the training data. After applying dropout, the hidden state is scaled by dividing it by (1-p), so that its expected value matches the no-dropout case.
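Here is the masking step in isolation, with drop probability p = 0.5 (a standalone illustration, separate from the RNN code above):

p = 0.5
h = np.random.randn(5, 1)                                  # some hidden state
mask = (np.random.rand(*h.shape) < (1 - p)).astype(float)  # 1 with probability 1-p
h_dropped = (h * mask) / (1 - p)                           # rescale to preserve the expected value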

  • ys[t] gives us our linear layer output for the current time step
  • ps[t] gives us our final softmax output (probabilities) for the current time step

Calculations for ys[t] and ps[t] are outside the second loop because there is only one linear layer, as opposed to an arbitrary number of RNN layers.
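For reference, ps[t] is an ordinary softmax. A standalone version looks like this; note it includes the common max-subtraction trick for numerical stability, which the forward pass above omits:

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtracting the max leaves the result unchanged but avoids overflow
    return e / np.sum(e)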

  • Finally, we return the loss, and hs[len(x)-1] is used as hprev for our next sequence. We use the cache to fetch the gradients during the backward pass.

The choice was made to use the indexing [t][l] to store the hidden state for the l-th layer at time step t. This is because the model processes the input sequence one timestep at a time, and at each timestep, it updates the hidden state for each layer. By using the indexing [t][l], we are able to keep track of the hidden state for each layer at each timestep, allowing us to easily perform the necessary computations for the forward pass.

Additionally, this indexing allows for easy access to the hidden state of the last timestep, which is returned as hs[len(x)-1], as it is the hidden state of the last timestep in the sequence for each layer. This returned hidden state is used as the initial hidden state for the next sequence during the training process.
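To make the indexing concrete:

# hs[-1]        -> the initial hidden states for all layers (copied from hprev)
# hs[3][0]      -> layer 0's (hidden_size, 1) hidden state after time step 3
# hs[t][-1]     -> the top layer's state at time step t (this is what feeds the output layer)
# hs[len(x)-1]  -> every layer's state after the final time step (becomes the next hprev)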

Let's perform the forward pass. Remember, there is no batch dimension.

# Initialize RNN
num_layers = 3
hidden_size = 100
seq_length = 8

rnn = RNN(hidden_size=hidden_size, vocab_size=vocab_size, seq_length=seq_length, num_layers=num_layers)

x = [char_to_idx[ch] for ch in data[rnn.pointer:rnn.pointer+seq_length]]

y = [char_to_idx[ch] for ch in data[rnn.pointer+1:rnn.pointer+seq_length+1]]

# initialize hidden state with zeros
hprev = [np.zeros((hidden_size, 1)) for _ in range(num_layers)] 

## Call RNN
loss, hprev, cache = rnn(inputs=x, targets=y, hprev=hprev)

print(loss)
>> 33.38852380987117

Backward Pass

First, some intuition for the backward pass of an RNN.

The key difference between backpropagation in a basic ANN and an RNN is the way the error is propagated through the network. While both ANNs and RNNs propagate the error from the output layer to the input layer, RNNs also propagate the error backwards through time, adjusting the weights and biases at each time step. This allows RNNs to process sequential data and maintain a "memory" in the form of its hidden state.

The BPTT (backpropagation through time) algorithm works by unrolling the RNN over time, creating a computational graph for each time step. The graph for this network can be seen here.

The gradients are then calculated for each time step and accumulated over the entire sequence.

def backward(self, targets, cache):
    """RNN Backward Pass"""

    # unpack cache
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']

    # initialize gradients to zero
    dWxh, dWhh, dWhy = [np.zeros_like(w) for w in self.Wxh], [np.zeros_like(w) for w in self.Whh], np.zeros_like(self.Why)
    dbh, dby = [np.zeros_like(b) for b in self.bh], np.zeros_like(self.by)
    dhnext = [np.zeros_like(h) for h in hs[0]]

    for t in reversed(range(len(xs))):

        dy = np.copy(ps[t])

        # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
        dy[targets[t]] -= 1 

        dWhy += np.dot(dy, hs[t][-1].T)
        dby += dy

        for l in reversed(range(self.num_layers)):
            dh = np.dot(self.Why.T, dy) + dhnext[l]
            dhraw = (1 - hs[t][l] * hs[t][l]) * dh # backprop through tanh nonlinearity
            dbh[l] += dhraw
            dWxh[l] += np.dot(dhraw, xs[t].T)
            dWhh[l] += np.dot(dhraw, hs[t-1][l].T)
            dhnext[l] = np.dot(self.Whh[l].T, dhraw)

    return {'dWxh':dWxh, 'dWhh':dWhh, 'dWhy':dWhy, 'dbh':dbh, 'dby':dby}

Same as the forward pass, let's break it down.

The first thing this function does is initialize the gradients for the weights and biases to zero, similar to what happens in a feedforward ANN. This is something that confused me, so I am going to elaborate a bit more.

By resetting the gradients to zero before every sequence, it ensures that the gradients calculated for the current sequence do not accumulate or add up with the gradients calculated in the previous sequences.

This prevents the gradients from becoming too large, which can cause the optimization process to diverge and negatively impact model performance. Additionally, it allows for weight updates to be performed independently for each sequence, which can lead to more stable and consistent optimization.

Then, it loops through the input sequence in reverse, performing the following computations for each time step t:

Notice the comment backprop into y; the link in that comment explains what is happening perfectly. I also go into depth on this in a previous article you can check out here.

  • Calculates the gradient of the hidden state hs[t][l] with respect to the loss, denoted by dh
  • Calculates the gradient of the raw hidden state, denoted by dhraw

What is the difference between dh and dhraw? Good question.

The difference is that dh is the gradient of the loss with respect to the hidden state hs[t][l], obtained by backpropagating the gradient of the softmax probabilities ps[t] through the output layer, plus the gradient flowing back from the next time step, dhnext[l]. dhraw is that same gradient pushed further back through the tanh nonlinearity: we element-wise multiply dh by the derivative of tanh, which is (1 - hs[t][l] * hs[t][l]).
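The tanh step in isolation: if h = tanh(z), then dL/dz = (1 - h^2) * dL/dh, which is exactly the dhraw line:

z = np.array([[0.5]])
h = np.tanh(z)              # forward: h = tanh(z)
dh = np.array([[1.0]])      # pretend upstream gradient
dhraw = (1 - h * h) * dh    # backward: gradient pushed through the tanh
print(dhraw)                # ~0.786, since 1 - tanh(0.5)**2 = 0.786...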

  • Calculates the gradient of the hidden bias bh[l]
  • Calculates the gradient of the input-hidden weights Wxh[l] with respect to the loss, denoted by dWxh[l].
  • Calculates the gradient of the hidden-hidden weights Whh[l] with respect to the loss, denoted by dWhh[l].
  • Calculates the gradient of the next hidden state dhnext[l].
In our notation, the per-time-step computations are:

dy = ps[t], then dy[y[t]] -= 1 (the softmax/cross-entropy gradient)
dWhy += dy · hs[t][-1]^T,  dby += dy
dh = Why^T · dy + dhnext[l]
dhraw = (1 - hs[t][l]^2) ⊙ dh
dbh[l] += dhraw,  dWxh[l] += dhraw · xs[t]^T,  dWhh[l] += dhraw · hs[t-1][l]^T
dhnext[l] = Whh[l]^T · dhraw

Let's perform the backward pass.

# Initialize RNN
num_layers = 3
hidden_size = 100
seq_length = 8

rnn = RNN(hidden_size=hidden_size, vocab_size=vocab_size, seq_length=seq_length, num_layers=num_layers)

x = [char_to_idx[ch] for ch in data[rnn.pointer:rnn.pointer+seq_length]]

y = [char_to_idx[ch] for ch in data[rnn.pointer+1:rnn.pointer+seq_length+1]]

# initialize hidden state with zeros
hprev = [np.zeros((hidden_size, 1)) for _ in range(num_layers)] 

## Call RNN
loss, hprev, cache = rnn(inputs=x, targets=y, hprev=hprev)
grads = rnn.backward(targets=y, cache=cache)

Finally, we return the gradients so we can update the parameters, which segues into the next topic: optimization.

Optimizer

Like it says in the RNN class ‘update' method, we will be using Adagrad for this implementation.

Adagrad is an optimization algorithm that adapts the learning rate of each parameter in a neural network individually, based on the historical gradient information.

It's particularly useful for handling sparse data and is often used in natural language processing tasks. Adagrad makes adjustments to the learning rate at each iteration, ensuring that the model learns from the data as quickly and efficiently as possible.
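On a single scalar parameter, the whole algorithm fits in a few lines (illustrative gradient values):

w, m, lr = 0.0, 0.0, 0.1
for g in [0.5, -0.2, 0.4]:           # an example stream of gradients
    m += g * g                       # accumulate the squared gradients
    w -= lr * g / np.sqrt(m + 1e-8)  # the effective step size shrinks as m grows

With that picture in mind, here is the full update method: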

def update(self, grads, lr):
    """Perform Parameter Update w/ Adagrad"""

    # unpack grads
    dWxh, dWhh, dWhy = grads['dWxh'], grads['dWhh'], grads['dWhy']
    dbh, dby = grads['dbh'], grads['dby']

    # loop through each layer
    for i in range(self.num_layers):

        # clip gradients to mitigate exploding gradients
        np.clip(dWxh[i], -5, 5, out=dWxh[i])
        np.clip(dWhh[i], -5, 5, out=dWhh[i])
        np.clip(dbh[i], -5, 5, out=dbh[i])

        # perform parameter update with Adagrad
        self.mWxh[i] += dWxh[i] * dWxh[i]
        self.Wxh[i] -= lr * dWxh[i] / np.sqrt(self.mWxh[i] + 1e-8)
        self.mWhh[i] += dWhh[i] * dWhh[i]
        self.Whh[i] -= lr * dWhh[i] / np.sqrt(self.mWhh[i] + 1e-8)
        self.mbh[i] += dbh[i] * dbh[i]
        self.bh[i] -= lr * dbh[i] / np.sqrt(self.mbh[i] + 1e-8)

    # clip gradients for Why and by
    np.clip(dWhy, -5, 5, out=dWhy)
    np.clip(dby, -5, 5, out=dby)

    # perform parameter update with Adagrad
    self.mWhy += dWhy * dWhy
    self.Why -= lr * dWhy / np.sqrt(self.mWhy + 1e-8)
    self.mby += dby * dby
    self.by -= lr * dby / np.sqrt(self.mby + 1e-8)

This block of code updates the parameters of the RNN using the Adagrad optimization algorithm. It keeps track of the sum of squared gradients for each parameter (mWxh, mWhh, mbh, mWhy, and mby) and divides the learning rate by the square root of that sum plus a small constant (1e-8, for numerical stability), effectively adapting the learning rate for each parameter. It also clips the gradients to prevent them from exploding.

Adagrad adapts the learning rate for each parameter, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For parameters whose gradients are rare, the effective learning rate stays large, so the model can make bigger adjustments when those gradients do arrive; for parameters whose gradients are frequent, the effective learning rate shrinks, keeping updates small and stable. This is in contrast to a fixed learning rate, which could either under-correct or over-correct individual parameters.

Let's perform the parameter update.

# Initialize RNN
num_layers = 3
hidden_size = 100
seq_length = 8

rnn = RNN(hidden_size=hidden_size, vocab_size=vocab_size, seq_length=seq_length, num_layers=num_layers)

x = [char_to_idx[ch] for ch in data[rnn.pointer:rnn.pointer+seq_length]]

y = [char_to_idx[ch] for ch in data[rnn.pointer+1:rnn.pointer+seq_length+1]]

# initialize hidden state with zeros
hprev = [np.zeros((hidden_size, 1)) for _ in range(num_layers)] 

## Call RNN
loss, hprev, cache = rnn(inputs=x, targets=y, hprev=hprev)
grads = rnn.backward(targets=y, cache=cache)
rnn.update(grads=grads, lr=1e-1)

Training

The final piece is actually training the network, where the input sequences are fed through the network, the error is calculated and the optimizer updates the weights and biases.

def train(rnn, epochs, data, lr=1e-1, use_drop=False):

    for _ in range(epochs):

        # prepare inputs (we're sweeping from left to right in steps seq_length long)
        if rnn.pointer+seq_length+1 >= len(data) or rnn.iteration == 0:

            hprev = [np.zeros((hidden_size, 1)) for _ in range(rnn.num_layers)]  # reset RNN memory

            rnn.pointer = 0 # go from start of data

        x = [char_to_idx[ch] for ch in data[rnn.pointer:rnn.pointer+seq_length]]
        y = [char_to_idx[ch] for ch in data[rnn.pointer+1:rnn.pointer+seq_length+1]]

        if use_drop:
            loss, hprev, cache = rnn(inputs=x, targets=y, hprev=hprev, dropout=True)
        else:
            loss, hprev, cache = rnn(inputs=x, targets=y, hprev=hprev)

        grads = rnn.backward(targets=y, cache=cache)
        rnn.update(grads=grads, lr=lr)

        # update loss
        rnn.loss = rnn.loss * 0.999 + loss * 0.001

        ## show progress now and then
        if rnn.iteration % 1000 == 0: 
            print('iter {}, loss: {}'.format(rnn.iteration, rnn.loss))

            sample_ix = rnn.predict(hprev, x[0], 200)
            txt = ''.join(idx_to_char[ix] for ix in sample_ix)
            print('Sample')
            print('----\n {} \n----'.format(txt))

        rnn.pointer += seq_length # move data pointer
        rnn.iteration += 1 # iteration counter

## hyper-params
num_layers = 2
hidden_size = 128
seq_length = 13

# Initialize RNN
rnn = RNN(hidden_size=hidden_size, 
          vocab_size=vocab_size, 
          seq_length=seq_length, 
          num_layers=num_layers)

train(rnn=rnn, epochs=15000, data=data)

This block of code is pretty straightforward. We perform a forward pass and a backward pass and update the model parameters once per iteration; note that each "epoch" here processes a single sequence of seq_length characters.

Something I would like to point out –

The loss is updated by a weighted average of the current loss and the previous loss.

The current loss is multiplied by 0.001 and added to the previous running loss, which is multiplied by 0.999. This means each new loss has only a small impact on the running value, while the history dominates. As a result, the reported loss fluctuates less and is more stable over time.

By using an EMA (exponential moving average), it is easier to monitor the performance of the network and detect when it is overfitting or underfitting.
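In isolation, the smoothing looks like this (illustrative loss values):

smooth = -np.log(1.0 / vocab_size) * seq_length  # start from the expected initial loss
for loss in [33.0, 31.5, 30.2]:                  # raw per-sequence losses
    smooth = smooth * 0.999 + loss * 0.001       # each raw loss nudges the running value slightly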

Loss & text prediction at iteration zero. Image by author.
Loss & text prediction at iteration 14,000. Image by author.
Loss after 50,000 epochs.

The training process of our RNN has been successful, we can see the decrease in loss and the improved quality of the generated samples. However, it is important to note that generating original Shakespeare is a complex task, and this particular implementation is a simple vanilla RNN. Therefore, there is room for further improvement and experimentation with different architectures and techniques.


Conclusion

In conclusion, this article has demonstrated the implementation and training of a character-level RNN using NumPy. The many-to-many architecture and online learning approach allow the network to adapt to new patterns in the data as they are encountered, resulting in improved sample generation. While this network is quasi-capable of generating original Shakespeare text, it is important to note that this is a simplified version, and there are many other architectures and techniques that can be explored for much better performance.

Full code & repo here.

Feel free to get in touch & ask questions, or make improvements to the code.

Thanks for reading!

Tags: Artificial Intelligence Data Science Machine Learning NLP
