Reinforcement Learning for Inventory Optimization Series III: Sim-to-Real Transfer for the RL Model

Update: this is the third article in my blog series Reinforcement Learning for Inventory Optimization. Below are the links to the other articles in the series. Please check them out if you are interested.
Reinforcement Learning for Inventory Optimization Series I: An RL Model for Single Retailers
Reinforcement Learning for Inventory Optimization Series II: An RL Model for A Multi-Echelon Network
In the previous two articles, I built RL models for a single retailer and for a multi-echelon supply chain network. In both cases, I trained the RL model using a simulator of the inventory system built on a historical demand dataset that follows a mixture of normal distributions whose parameters depend on the day of the week. I then tested the RL models on a test set in which the demand data follow the same distribution as the historical training data. This essentially assumes that the simulator perfectly matches the real world in which we would apply the RL models, and that the historical demand data perfectly represent the future demand pattern. However, this assumption is rarely true in real-world applications.
How to deal with the gap between the training and test distributions is a well-known problem in machine learning. At its core it is the bias vs. variance trade-off: if we rely too heavily on the training set, the model is prone to overfitting, meaning it performs well on the training set but not on the test set. The same issue arises in RL.
Performance Deterioration due to the Gap between Simulator and Real World
Take the RL models I built previously as an example. They were trained over a large number of episodes on the same historical demand data, so the learned inventory policies are very likely overfitted to the demand pattern represented by that data. The policies will perform well if the same demand pattern continues in the future, but their performance will deteriorate if the future demand pattern deviates from the historical one.
As a numerical illustration, let's assume that the future demand distribution in the real world deviates from the historical demand data in the simulator used to train the RL models. I took the RL model trained in my first article and applied it to two future demand scenarios in the real world. In one scenario, the means of the mixture of normal distributions increase by 1; in the other, they decrease by 1. The demand distribution structure is given in the table below (each entry is Normal(mean, standard deviation)).

Day of week    Historical/simulator demand    Demand increase scenario    Demand decrease scenario
Mon-Thu        Normal(3, 1.5)                 Normal(4, 1.5)              Normal(2, 1.5)
Fri            Normal(6, 1)                   Normal(7, 1)                Normal(5, 1)
Sat-Sun        Normal(12, 2)                  Normal(13, 2)               Normal(11, 2)
For both the demand increase and the demand decrease scenario (the last two columns of the table above), I generated 100 demand datasets, each consisting of 52 weeks of data drawn from the scenario's own demand distribution (a minimal sketch of this generation step follows the table below). I then applied the DQN policy trained on the historical/simulator demand data (the first data column in the table above) to both scenarios; the corresponding average profits are given in the table below.

Scenario           Average profit of the DQN policy trained on historical/simulator data
Demand increase    $25077.61
Demand decrease    $13890.99
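Here is the sketch of the scenario generation just mentioned. The helper generate_scenario_datasets and its defaults are my own, not code from the original articles; it simply reuses the day-of-week mixture from the demand-generation code shown later in this article, with every mean shifted by +1 or -1.

import numpy as np

def generate_scenario_datasets(mean_shift, n_datasets=100, n_weeks=52, seed=42):
    # Build n_datasets demand paths of n_weeks weeks each, using the same
    # day-of-week mixture as the historical data but with every mean shifted
    # by mean_shift (+1 for the increase scenario, -1 for the decrease scenario).
    rng = np.random.default_rng(seed)
    datasets = []
    for _ in range(n_datasets):
        demand = []
        for _ in range(n_weeks):
            for _ in range(4):  # Mon-Thu
                demand.append(max(np.round(rng.normal(3 + mean_shift, 1.5)), 0))
            demand.append(max(np.round(rng.normal(6 + mean_shift, 1)), 0))  # Fri
            for _ in range(2):  # Sat-Sun
                demand.append(max(np.round(rng.normal(12 + mean_shift, 2)), 0))
        datasets.append(demand)
    return datasets

increase_datasets = generate_scenario_datasets(mean_shift=1)
decrease_datasets = generate_scenario_datasets(mean_shift=-1)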
To gauge how well the trained DQN policy performs, let's assume for each scenario that we know ahead of time how the future/real-world demand distribution will deviate, and generate a training set of 52 weeks of demand data from that scenario's own distribution. We then train a brand-new RL model on each scenario's training set and apply it to that scenario. The corresponding average profits are given in the table below.

Scenario           Average profit of the DQN policy trained on the scenario's own data
Demand increase    $27399.79
Demand decrease    $14707.44
As we can see from the comparison of the two tables above ($25077.61 vs $27399.79 and $13890.99 vs $14707.44), the RL model's performance deteriorates when there is a gap between the simulator and the real world.
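The profits above come from rolling out the fixed DQN policy on each generated demand dataset and averaging the total profit over the 100 datasets. A minimal evaluation helper might look like the sketch below; the make_env constructor and the policy interface are assumptions standing in for the inventory simulator and DQN agent from the first article, not their actual code.

import numpy as np

def evaluate_policy(policy, make_env, demand_datasets):
    # Roll out a fixed policy on each demand dataset and return the average
    # total profit. make_env(demand) is assumed to build the inventory
    # simulator around one demand path, and policy(state) is assumed to
    # return the order quantity chosen by the trained DQN.
    total_profits = []
    for demand in demand_datasets:
        env = make_env(demand)
        state = env.reset()
        done = False
        episode_profit = 0.0
        while not done:
            state, reward, done = env.step(policy(state))  # reward = per-period profit
            episode_profit += reward
        total_profits.append(episode_profit)
    return np.mean(total_profits)

# e.g., evaluate_policy(dqn_policy, make_env, increase_datasets)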
Bridging the Gap Using Domain Randomization
To bridge the gap between the simulator and the real world, one option in practice is to select a shorter window of the most recent historical demand data, assume that this window better represents the future demand pattern or trend, and frequently retrain the RL model as the window is updated. However, we can also address this problem during the training process in the first place, using the concept of domain randomization.
Domain randomization is a technique typically used for sim-to-real transfer when applying RL to robotics. RL applications in robotics also face a reality gap: the mismatch between simulation and the real world degrades the performance of the policies once the RL models are transferred to real robots [1]. In the robotics context, the core idea of domain randomization is to randomize the physical parameters of the simulated environment (e.g., friction coefficients and visual properties such as objects' appearance) when training the RL model. The model then experiences a wider range of situations, more like those in the real environment, so the learned policies generalize better to it. The figure below illustrates the intuition behind domain randomization.

To implement domain randomization in the context of inventory optimization, we can randomize the historical demand data. Specifically, we extend the historical demand dataset by adding normally distributed noise to the demand observations. The implementation of this idea is given in the code block below.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Generate one year (52 weeks) of historical demand data. Daily demand follows
# a mixture of normal distributions depending on the day of the week:
# Mon-Thu ~ Normal(3, 1.5), Fri ~ Normal(6, 1), Sat-Sun ~ Normal(12, 2).
demand_hist = []
for i in range(52):
    for j in range(4):
        random_demand = np.random.normal(3, 1.5)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_hist.append(random_demand)
    random_demand = np.random.normal(6, 1)
    if random_demand < 0:
        random_demand = 0
    random_demand = np.round(random_demand)
    demand_hist.append(random_demand)
    for j in range(2):
        random_demand = np.random.normal(12, 2)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_hist.append(random_demand)
plt.hist(demand_hist)  # visualize the historical demand distribution
plt.show()

## add randomized noise (domain randomization)
# Create 10 noisy copies of the historical demand, each with its own random
# shift for Mon-Thu, Fri, and Sat-Sun demand.
demand_hist_randomized = []
rand_list = []
for k in range(10):
    demand = demand_hist.copy()
    rand1 = np.random.normal(1, 1)  # shift applied to Mon-Thu demand
    rand2 = np.random.normal(1, 1)  # shift applied to Fri demand
    rand3 = np.random.normal(1, 1)  # shift applied to Sat-Sun demand
    rand_list.append([rand1, rand2, rand3])
    for i in range(52):
        for j in range(4):
            demand[i*7+j] += rand1
            if demand[i*7+j] < 0:
                demand[i*7+j] = 0
            demand[i*7+j] = np.round(demand[i*7+j])
        demand[i*7+4] += rand2
        if demand[i*7+4] < 0:
            demand[i*7+4] = 0
        demand[i*7+4] = np.round(demand[i*7+4])
        for j in range(5, 7):
            demand[i*7+j] += rand3
            if demand[i*7+j] < 0:
                demand[i*7+j] = 0
            demand[i*7+j] = np.round(demand[i*7+j])
    demand_hist_randomized.extend(demand)
Here I effectively created 10 historical demand scenarios by adding random noise to the demand observations in each scenario. I then concatenated the scenarios into a single historical demand dataset used for training, pretending that we have 10 years of demand data. Note that the essence of this idea aligns with the data augmentation technique used in computer vision, where images are manipulated to enrich the training set and avoid overfitting.
I tried different values for the mean of the normal distribution used to generate the random noise. Initially I thought a mean of 0 made sense, since it would produce roughly equal numbers of demand increase and decrease scenarios. Interestingly, after trying out different values, I found that a mean above 0 (i.e., more demand increase scenarios) gave better test results for this particular example. This might be because the demand data are truncated at 0, so even with zero-mean noise there is not much room for the demand distribution to shift left. Generating noise with a positive mean therefore lets the RL model learn from a wider range of informative scenarios.
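To make this kind of experimentation easier, the randomization step above can be wrapped in a small helper with the noise mean exposed as a parameter. The function below is a sketch of my own (its name and defaults are not from the original code); it reproduces the same day-of-week grouping as the code block above.

import numpy as np

def randomize_demand(demand_hist, n_copies=10, noise_mean=1.0, noise_std=1.0, seed=None):
    # Domain randomization: build n_copies noisy versions of the historical
    # demand, shifting Mon-Thu, Fri, and Sat-Sun demand by their own offsets.
    rng = np.random.default_rng(seed)
    randomized = []
    for _ in range(n_copies):
        demand = list(demand_hist)
        shifts = rng.normal(noise_mean, noise_std, size=3)  # Mon-Thu, Fri, Sat-Sun
        for i in range(len(demand) // 7):
            for j in range(7):
                group = 0 if j < 4 else (1 if j == 4 else 2)
                demand[i*7 + j] = max(np.round(demand[i*7 + j] + shifts[group]), 0)
        randomized.extend(demand)
    return randomized

# e.g., compare training sets built with different noise means:
# demand_mean0 = randomize_demand(demand_hist, noise_mean=0.0)
# demand_mean1 = randomize_demand(demand_hist, noise_mean=1.0)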
Now we train the DQN model on the new 10 years of historical demand data obtained by domain randomization, and then apply the learned DQN policy to the demand increase and decrease scenarios described in the previous section (detailed settings in the first table). The following table shows the average profit obtained in each scenario.

Scenario           Average profit of the DQN policy trained with domain randomization
Demand increase    $26432.61
Demand decrease    $14311.52
We see a clear improvement compared to the results without domain randomization ($26432.61 vs $25077.61 in the increase scenario, and $14311.52 vs $13890.99 in the decrease scenario).
To show that the performance improvement is not obtained by chance, I further created 100 randomized test demand scenarios by adding random noise to the means and standard deviations of the demand distribution mixture. See the code below.
# Create 100 randomized test scenarios: each scenario perturbs the means and
# standard deviations of the demand mixture with its own random noise.
demand_test = []
for k in range(100, 200):
    np.random.seed(k)
    rand1 = np.random.normal(0, 1)    # shift of the Mon-Thu mean
    rand2 = np.random.normal(0, 1)    # shift of the Fri mean
    rand3 = np.random.normal(0, 1)    # shift of the Sat-Sun mean
    rand4 = np.random.normal(0, 0.5)  # shift of the Mon-Thu standard deviation
    rand5 = np.random.normal(0, 0.5)  # shift of the Fri standard deviation
    rand6 = np.random.normal(0, 0.5)  # shift of the Sat-Sun standard deviation
    demand_future = []
    for i in range(52):
        for j in range(4):
            random_demand = np.random.normal(3 + rand1, 1.5 + rand4)
            if random_demand < 0:
                random_demand = 0
            random_demand = np.round(random_demand)
            demand_future.append(random_demand)
        random_demand = np.random.normal(6 + rand2, 1 + rand5)
        if random_demand < 0:
            random_demand = 0
        random_demand = np.round(random_demand)
        demand_future.append(random_demand)
        for j in range(2):
            random_demand = np.random.normal(12 + rand3, 2 + rand6)
            if random_demand < 0:
                random_demand = 0
            random_demand = np.round(random_demand)
            demand_future.append(random_demand)
    demand_test.append(demand_future)
On this test set, the DQN policy trained without domain randomization gives an average profit of $20163.84, while the DQN policy trained with domain randomization gives an average profit of $21887.77, again showing a performance improvement.
Conclusion
In this article, I focused on pointing out and addressing the gap between the training/simulation and test/real-world environments for the RL models I built previously for inventory optimization. It is good to see that techniques such as domain randomization from the robotics field can also be effective in the inventory optimization area. The experimental results suggest that it is good practice to manipulate historical demand data with techniques like domain randomization to enrich the training set, so that an RL model trained in the lab generalizes better to the real world.
References
[1] Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey