Using Q-Learning to find the optimal price

Alejandro Dominguez
Published in Docplanner Tech
Jul 16, 2020 · 6 min read

Increasing profit is one of the main objectives of a company. One of the strategies to achieve this is to offer a fair and competitive price for your products or services, which translates into increased demand. But how do you find this “optimal” price, the point where the combination of sales volume and price maximizes profit?

The answer is not so simple. Even if we find a good price for a specific moment in time, there is no guarantee it will remain good in the future as market conditions change. In this article, we investigate the use of a Reinforcement Learning (RL) technique to estimate the optimal price of a product, that is, the price that maximizes profit.

RL provides several features that make it attractive for this type of problem. It can learn from recent experience and adapt to gradual market changes, while balancing short-term and long-term rewards without losing sight of the objectives specified in the model.

Among the algorithms used in RL, we have selected Q-Learning to solve our optimal-price problem. This model allows us to approximate the expected reward while continuing to learn over time.

We compare the performance with a Monte-Carlo simulation.

RL concepts

RL is an area of artificial intelligence that studies how software agents should make decisions in order to maximize a reward. The environment is typically stated in the form of a Markov decision process (MDP).

The basic reinforcement learning model consists of:

  1. A set of states S.
  2. A set of actions A.
  3. Rules of transition between states.
  4. Rules that determine the immediate reward of a transition.
  5. Rules that describe what the agent observes (the environment).

RL algorithms keep a balance between exploitation and exploration. Exploitation lets the agent take actions that were a priori good and keep collecting rewards, while exploration lets it discover new states and take actions with unknown rewards. In this way, RL algorithms try to learn a policy that takes “good” actions based on past experience.

There are two main types of RL algorithms: model-based and model-free. Model-based algorithms use a transition function (a model of the environment) to estimate the optimal policy. Model-free algorithms, on the other hand, ignore the model and estimate rewards by sampling.

One of the best known model-free algorithms is Q-learning.

Q-Learning

Q-Learning is one of the most commonly used techniques in RL. It is a value-based algorithm that uses the Q function to find the optimal action. The fundamental objective is to maximize the value of the Q function, in such a way that it helps us take the best action in each state.

The value of Q(s, a) is updated using the Bellman equation and estimates the expected reward. The following formula shows how Q(s, a) is iteratively updated.
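For reference, the standard Q-Learning update rule is:

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \]

where α is the learning rate, γ is the discount factor, r is the immediate reward, and s_{t+1} is the state reached after taking action a_t.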

Q-Learning converges to an optimal policy even while acting sub-optimally (off-policy learning). In other words, the policy being updated (the greedy one) is different from the behavior policy used to collect experience.

As it is a trial-and-error algorithm, you have to explore enough at first to get good approximations of the expected value (E[X]). Eventually, exploration is reduced and the best-known actions are exploited.
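As a minimal sketch, this trade-off is often implemented with an ε-greedy policy whose ε decays over time; the schedule and names below are illustrative assumptions, not necessarily what our agent used:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Decay epsilon over time: explore heavily at first, exploit more and more later.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one training episode, choosing actions with epsilon_greedy_action ...
    epsilon = max(epsilon_min, epsilon * decay)
```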

Modeling optimal prices problem with RL

So let’s model an MDP for the problem of finding the price that maximizes profit; from now on, we will call this price the optimal price: the price point at which the seller’s total profit is maximized. In this case, we have assumed that there is an optimal price that is fair and attractive to potential customers and is directly correlated with the action of buying.

The states are the possible prices the product can take, from the minimum value at which we would be willing to sell, up to the maximum value that we still consider a “fair price”. Bounding the states this way helps the method converge and restricts exploration to values that make sense.

There are three actions we can take on our environment: keep, increase, or decrease the price. For the last two, the step size by which the price is increased or decreased is defined as a model parameter.

Rewards were defined as the price at which the product was sold, or zero if it was not purchased.
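A minimal sketch of this MDP in Python follows; the price range, step size, and purchase model are illustrative assumptions rather than the exact values used in our experiments:

```python
import random

MIN_PRICE, MAX_PRICE, STEP = 1.0, 50.0, 1.0         # assumed price grid (the states)
ACTIONS = ["keep", "increase", "decrease"]           # the three actions

def buy_probability(price):
    """Assumed purchase model: the probability of a sale decreases as the price rises."""
    return max(0.0, 1.0 - price / MAX_PRICE)

def transition(price, action):
    """Apply an action and clip the new price to the allowed range of states."""
    if action == "increase":
        return min(MAX_PRICE, price + STEP)
    if action == "decrease":
        return max(MIN_PRICE, price - STEP)
    return price                                     # "keep"

def reward(price):
    """Reward: the price if the (simulated) customer buys, zero otherwise."""
    return price if random.random() < buy_probability(price) else 0.0
```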

How to get the optimal price given Q?

So how do we obtain the optimal price once Q has been updated? It is enough to select an initial price, which in our case would be state s₀, and repeatedly follow the action with the highest value, max{Q(sᵢ, a)}, as long as that action is different from keeping the price.

We could also pick the state with the highest value for the “keep price” action among all explored states. Recall that the value of Q(s, a) is precisely the expected reward.
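As a rough sketch, and reusing the ACTIONS and transition helpers from the environment sketch above, extracting the recommended price from a learned Q table (here a hypothetical dictionary mapping each price state to its three action values) could look like this:

```python
def recommended_price(Q, start_price, max_steps=100):
    """From the start price, greedily follow Q until keeping the price is the best action."""
    price = start_price
    for _ in range(max_steps):
        values = Q[price]
        best = max(range(len(values)), key=lambda a: values[a])
        if ACTIONS[best] == "keep":                  # keeping the price is optimal here
            return price
        price = transition(price, ACTIONS[best])
    return price

# Alternative: pick the state whose "keep" action value (expected reward) is highest.
# best_price = max(Q, key=lambda s: Q[s][ACTIONS.index("keep")])
```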

Testing our agent

To train and test our agent, we have defined a random variable that follows the distribution function shown in Figure 1.0. This cumulative distribution function (CDF) gives the probability that a customer buys the product at a given price: as the price increases, the probability of purchase decreases, and as the price approaches zero (free), the probability of purchase rises.

Figure 1.0
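A minimal sketch of the training loop, reusing the helpers defined in the sketches above; the hyperparameters α, γ, and the ε schedule are illustrative assumptions:

```python
import random

# Tabular Q-Learning over the price grid; hyperparameters below are assumptions.
ALPHA, GAMMA = 0.1, 0.9
prices = [MIN_PRICE + i * STEP for i in range(int((MAX_PRICE - MIN_PRICE) / STEP) + 1)]
Q = {p: [0.0, 0.0, 0.0] for p in prices}             # one row of action values per state

epsilon, epsilon_min, decay = 1.0, 0.05, 0.9995
price = random.choice(prices)                         # start from an arbitrary price
for step in range(50_000):
    a = epsilon_greedy_action(Q[price], epsilon)
    next_price = transition(price, ACTIONS[a])
    r = reward(next_price)                            # sold at next_price, or zero
    # Bellman update from the Q-Learning section above.
    Q[price][a] += ALPHA * (r + GAMMA * max(Q[next_price]) - Q[price][a])
    price = next_price
    epsilon = max(epsilon_min, epsilon * decay)
```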

In Figure 2.0 we see the expected profit at the optimal price alongside the profit obtained by selling the product at the price recommended by our agent.

To simulate the optimal profit function, we used a Monte-Carlo simulation following the aforementioned distribution (Figure 1.0).
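A sketch of such a Monte-Carlo estimate, again using the assumed buy_probability model and price grid from the sketches above rather than the real purchase data:

```python
def monte_carlo_profit(price, n_trials=10_000):
    """Estimate the expected profit at a fixed price by simulating many customers."""
    sales = sum(1 for _ in range(n_trials) if random.random() < buy_probability(price))
    return price * sales / n_trials

# Expected-profit curve over the whole price grid; its argmax is the optimal price.
profit_curve = {p: monte_carlo_profit(p) for p in prices}
optimal_price = max(profit_curve, key=profit_curve.get)
```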

As we can see, the agent appears to converge to the optimal profit.

Figure 2.0

Now let’s look at how the agent explored prices until it found the “optimal” price. The agent began exploring with a totally unfair price and then adapted it based on customers’ purchases.

Figure 3.0

After finding the optimal price, the agent keeps exploring to try to improve profit (its long-term goal). This constant exploration allows it to adapt to changes in the environment.

Conclusions

Although there are other models for solving this profit-maximizing pricing problem, RL has demonstrated its versatility.

We have observed that convergence can be slow, which is a serious drawback when we are talking about prices. This is why it is recommended to initialize the Q values from previous experience and to select a starting price that we do not consider unfair.

Another important consideration is that, given the nature of the Q-Learning algorithm, this model will not converge well for products whose price changes seasonally and abruptly.

If you enjoyed this post, please hit the clap button below :) You can also follow us on Facebook, Twitter, and LinkedIn.
