The democratisation of AI and the advent of big data technologies have disrupted quantitative finance practice. Various ML and DL models provide the next generation of nonlinear and non-intuitive time-series modelling compared to traditional econometrics. The same applies to optimisation problems. Reinforcement learning provides an alternative approach to the stochastic optimisation that has traditionally been used in the context of portfolio management.

In this talk I will showcase an end-to-end asset management pipeline based on recent AI developments. I will show how to build an autonomous portfolio manager which learns to rebalance the portfolio assets dynamically. Not only does the autonomous agent use AI (say, an actor-critic type of network as in DDPG) to learn, it can also use various other AI components to help with the learning. In particular, I discuss how a predictive AI component, such as a nonlinear dynamic Boltzmann machine, can improve the learning of the agent. This component uses AI to improve on well-known autoregressive models to predict the prices of the assets in the portfolio for the next time step. These predictions can then be fed into the learning agent to restrict the exploration of the whole state-action space. I also discuss the possibility of using a data-generating component (using GANs) to learn the conditional distribution of asset prices and then generate synthetic data to overcome the problem of limited historical data.

With all this, I plan to provide:

- Conceptual understanding of how AI/Big Data change the traditional quantitative finance practice.
- What a real example of an end-to-end data pipeline looks like, and how the different components of a complex model work together in a unified platform architecture.

– Hi everyone, I am Nima Nooshi and I’m a customer success engineer at Databricks. Today, I will be talking about reinforcement learning applied to the problem of financial portfolio optimization. The ultimate goal is to design and implement an automatic trading bot with Spark. However, I figured that before I jump into the implementation, I should go through the theory of portfolio optimization and how AI, and in particular reinforcement learning, can be applied to it. I will be talking about the specific implementation in follow-up talks, which I will be presenting at upcoming Spark + AI conferences. Today’s agenda is the following. First, we define financial portfolio optimization and discuss how it has been approached with stochastic optimization methods, starting with Markowitz in the 50s. Then we talk about how we can translate the problem of portfolio optimization into a framework to which reinforcement learning can be applied, namely a Markov decision process.

Here we’ll talk about some specifics of financial markets that bring some challenges in the general formulation of the MDP. After that, we’ll talk about model based and model free reinforcement learning and talk about how these models can be applied to the optimization problem, and we’ll discuss some of the pros and cons of those algorithms, which puts us in a good position to start with a decent implementation in the following sessions.

Okay, what is financial portfolio optimization? Well, it is a subcategory of the broader class of fund allocation problems. Imagine something like an index fund, which tries to construct a portfolio according to some defined index weights. Or an even more straightforward scenario: an equally weighted portfolio, which rebalances the holdings each day such that each individual asset gets an equal share of the capital. These are examples of capital-based fund allocations.
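To make the equally weighted case concrete, here is a minimal sketch in Python. The function name, the share counts and the prices are hypothetical, and the sketch assumes fractional shares, no transaction costs, and rebalancing against the current portfolio value.

```python
def equal_weight_rebalance(holdings, prices):
    """Return new share counts so that every asset holds the same dollar value."""
    # Portfolio value at today's prices.
    total_value = sum(h * p for h, p in zip(holdings, prices))
    # Each asset gets an equal dollar share of that value.
    target_value = total_value / len(prices)
    return [target_value / p for p in prices]


holdings = [10.0, 5.0, 2.0]    # hypothetical share counts
prices = [20.0, 50.0, 100.0]   # hypothetical prices
new_holdings = equal_weight_rebalance(holdings, prices)
```

After the call, every position has the same dollar value and the total portfolio value is unchanged.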

We can also think of other kinds of allocation, which are not focused on the dollar amount of the positions. For example, instead of allocating the same amount of capital to each individual asset to come up with the equally weighted portfolio, one can think of a portfolio which is fully diversified with respect to some measure of risk. Here, instead of having the same dollar amount in each position, we have an equal amount of risk.
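One simple way to illustrate a risk-based allocation is inverse-volatility weighting. This is an assumption chosen for illustration, not something the talk prescribes: it equalizes each position's standalone volatility and ignores correlations between assets. The volatilities below are hypothetical.

```python
def inverse_volatility_weights(vols):
    """Weights proportional to 1 / volatility, normalized to sum to one."""
    inverse = [1.0 / v for v in vols]
    total = sum(inverse)
    return [x / total for x in inverse]


vols = [0.10, 0.20, 0.40]                       # hypothetical annual volatilities
weights = inverse_volatility_weights(vols)
# Standalone risk of each position: weight times volatility.
risk = [w * v for w, v in zip(weights, vols)]
```

The least volatile asset receives the largest weight, and every entry of `risk` is the same.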

In general, one can optimize some function to construct a portfolio, right? Imagine that you want a portfolio that extracts the maximum return, or you want to maximize the expected return per unit of volatility at the end of the investment timeframe, because you do not only care about the return but also want to manage risk. So, we can formulate the general problem as follows: given a set of M assets with a defined price history, and an initial portfolio or initial endowment that you have at time zero, you want to find an allocation which maximizes some objective function, denoted by gamma here. This function could simply be the expected return, which maximizes your wealth at the end of the investment period. Or it could be the expected return divided by some measure of risk, like volatility, or any other function which describes the investment goal. Objective functions are, in general, some function of the projected distribution of the asset returns at the end of the investment horizon. This means that portfolio optimization is naturally a stochastic optimization problem.

One can solve it in two ways. You can think of a static optimization, where you project the estimate of the next-period returns all the way forward to the end of the investment horizon, come up with the distribution of your returns at the end of the horizon, and solve the optimization just once for that. Or you can estimate the next-period returns dynamically and solve the optimization problem sequentially, which sets up a dynamic optimization problem.
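The general problem described above can be written compactly as follows. Only M and the objective gamma come from the talk. The weight vector w, the horizon T, the terminal portfolio return R_T(w) and the full-investment constraint are notation assumed here for illustration.

```latex
w^{*} = \arg\max_{w \in \mathbb{R}^{M}} \; \Gamma\big(R_T(w)\big)
\quad \text{subject to} \quad \sum_{i=1}^{M} w_i = 1,
\qquad \text{for example} \quad
\Gamma = \mathbb{E}\big[R_T(w)\big]
\quad \text{or} \quad
\Gamma = \frac{\mathbb{E}\big[R_T(w)\big]}{\sqrt{\operatorname{Var}\big[R_T(w)\big]}} .
```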

It was Markowitz who pioneered the attempts to solve the stochastic optimization problems from the last slide. His framework consists of two steps. In the first step, no matter which objective function you have, you just need to solve the mean-variance optimization problem. It is a constrained quadratic optimization, which tries to minimize the portfolio variance while constraining the expected portfolio return at a target level. The solutions for different values of the target mean define a curve in the mean-variance plane, which is called the efficient frontier. Once you have the efficient frontier, you can solve a one-dimensional problem: optimizing the custom objective function along that curve in the mean-variance plane. So one might think of using this method to periodically solve the static optimization problem, to make it a dynamic setup, right? Well, it seems to be a viable option until you try it in the real world, because in the real world you have transaction costs. The problem is that the solutions to the optimization for two consecutive periods might be so far away from each other that rebalancing the portfolio, and the costs of the transactions, lead to a sub-optimal policy. In fact, myopically optimal actions can cause sub-optimal cumulative rewards at the end of the period.
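The two-step procedure can be sketched for the simplest possible case of two assets. With two fully invested assets, each target mean pins down the weights uniquely, so the mean-variance curve can be traced on a grid instead of solving a quadratic program. All numbers below are hypothetical, and the mean-to-volatility ratio stands in for the custom objective.

```python
import math


def portfolio_mean_var(w1, mu, sigma, rho):
    """Mean and variance of a two-asset portfolio with weights (w1, 1 - w1)."""
    w2 = 1.0 - w1
    mean = w1 * mu[0] + w2 * mu[1]
    var = ((w1 * sigma[0]) ** 2 + (w2 * sigma[1]) ** 2
           + 2.0 * w1 * w2 * rho * sigma[0] * sigma[1])
    return mean, var


mu = (0.08, 0.03)      # hypothetical expected returns
sigma = (0.20, 0.05)   # hypothetical volatilities
rho = 0.1              # hypothetical correlation

# Step 1: trace the curve in the mean-variance plane (long-only grid).
grid = [i / 100.0 for i in range(101)]
frontier = [(w1, *portfolio_mean_var(w1, mu, sigma, rho)) for w1 in grid]

# Step 2: one-dimensional search for the custom objective along that curve.
best_w1, best_mean, best_var = max(frontier, key=lambda p: p[1] / math.sqrt(p[2]))
```

Note that the lower branch of the traced curve, below the minimum-variance point, is not efficient. The search in step 2 simply never selects it for this objective.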

So now that we have talked a little bit about the portfolio optimization problem, how it was formulated in terms of stochastic optimization, and the attempts that were made to solve those problems, we can discuss how to formulate portfolio optimization as a Markov decision process and apply some of the methods of reinforcement learning to solve it. So first of all, how is an MDP defined?

Let’s assume a setup where, at each time step, an agent starts from an initial state and takes an action, which is some kind of interaction with the environment. The environment gives a reward to the agent and changes the state, right? If the state transition probability, which is determined by the environment, is only a function of the current state and not of the whole history up to this point in time, the dynamical system is called a Markov decision process.
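In symbols, the Markov property described above can be stated as follows. The notation s_t for the state and a_t for the action at time t is assumed here.

```latex
\Pr\big(s_{t+1} \mid s_t, a_t\big)
= \Pr\big(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t\big)
```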

Okay, what does this look like for a trading agent? At the beginning of each period, the agent has to rebalance the portfolio and come up with a vector of asset holdings. This defines the action: the action of a trading bot is directly the portfolio weights that it comes up with at the end of each period. What about the reward? What is the reward function of the environment? In general, identifying the reward is a little bit more challenging.

And what is a reward? A reward is basically a scalar value which fully specifies the goals of the agent, such that maximizing the expected cumulative reward over many steps leads to the optimal solution of the task.

Let’s look at some examples.

Take games, for example. The goal of the agent is very well defined in games, right? Either you win or you lose a game, and this can be cleanly divided into separate reward signals for each time step. If you win the game at the end of a time step, you get a reward of one. If you lose the game at the end of a time step, you get a reward of minus one. Otherwise you get a reward of zero. So it is very well defined and easily divisible into separate time steps. However, take a trading agent who wants to maximize the return but at the same time does not want to expose his fund to extreme market downtrends and crashes. He does this, for example, by managing the value at risk of his portfolio. So the objective of the agent is clearly defined, but dividing this objective into sequential reward signals might be a very challenging task.

Now, let’s talk about the state and the observation. At any step, we can only observe asset prices, so the observation is given by the prices of all assets. This is clear. We also know that one-period prices do not fully capture the state of the market. You cannot predict the whole state of the market by just looking at yesterday’s prices, for example. This makes financial markets a bit more challenging: in general, financial markets are not a fully observable Markov decision process. They are only partially observable, because we as agents can only observe the prices.

So what this means is that the state the agent has is different from the state of the environment. There are some solutions for building the whole environment state from the state of the agent. The most obvious solution is to build the state of the environment from the whole history of the observations, which is not scalable. Or we can approximate the environment state by some parametrized function of past observations.

When we’re working with time series, as we are in financial markets, it is natural to assume that the state-generating function is not only a function of the observations but also a function of the past states, right? So we think of models which have some kind of memory.

Let’s look at some examples. GARCH models are widely used in quantitative finance, and they are constructed in this way. Assume that the state of the market at each time can be fully represented by the volatilities of the individual assets. This assumption basically says: if you know the volatilities, you know the full state of the market. Under that assumption, GARCH models build a rather simple mapping from past volatilities and current observations, which are the prices, to the volatilities for the current time step. They can therefore build the full state of the market from the past observations and past states.
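As an illustration of that mapping, here is a minimal sketch of the standard GARCH(1,1) variance recursion for a single asset. The parameter values and the choice to start at the unconditional variance are assumptions made for this example, and the parameters are fixed rather than fitted.

```python
import math


def garch_update(prev_var, prev_return, omega, alpha, beta):
    """GARCH(1,1): new variance from the past variance and the latest return."""
    return omega + alpha * prev_return ** 2 + beta * prev_var


def garch_state_path(prices, omega=1e-6, alpha=0.10, beta=0.85):
    """Build the hidden state (variance) path from observed prices."""
    returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]
    var = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    variances = [var]
    for r in returns:
        # The new state depends on the past state and the new observation.
        var = garch_update(var, r, omega, alpha, beta)
        variances.append(var)
    return returns, variances


prices = [100.0, 101.0, 99.0, 102.0, 102.0]   # hypothetical price history
returns, variances = garch_state_path(prices)
```

The recursion is exactly the "memory" discussed above: each new state is a function of the previous state and the latest observation.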

We can also look at other models, like stochastic volatility models in the continuous domain. They do the same: they build the volatilities, which are hidden states of the market, by fitting a kind of stochastic process to them. In this way, they are able to generate the hidden states, which are the volatilities, and produce a full representation of the market.

But obviously one can use a more sophisticated featurization of the hidden state of the market. It does not have to be as simple as just volatilities. One can have a more complicated representation, and neural networks, for example, can build such complicated models of the market state. But the common thing among all of these models is that the state of the environment is built using the past observations and past states, and the state of the agent at the current time is not enough to come up with the whole state of the financial market, or the returns for the next period.

Okay, now that we have talked a little bit about the MDP formulation of portfolio optimization, I want to go through some of the main components of the reinforcement learning formalism, to put us in a position to come up with the algorithms that we eventually want to implement.

Policies. A policy is simply a mapping from a state which the agent experiences to an action that it takes. It could be a deterministic policy, which means that if the agent finds itself in a certain state, it will always take a certain action. Or it could be a probabilistic policy, which means that it will choose an action from the spectrum of all possible actions with some predefined probability.
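The difference between the two kinds of policy can be sketched as follows. The softmax mapping, the two candidate allocations and their probabilities are all hypothetical choices made for illustration, and the state here is just one signal per asset.

```python
import math
import random


def softmax(scores):
    """Map arbitrary scores to positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_policy(state):
    """The same state always produces the same portfolio weights."""
    return softmax(state)


def stochastic_policy(state, rng):
    """Pick one of several candidate allocations with fixed probabilities."""
    uniform = [1.0 / len(state)] * len(state)
    candidates = [softmax(state), uniform]
    return rng.choices(candidates, weights=[0.8, 0.2], k=1)[0]


state = [0.02, -0.01, 0.00]            # hypothetical per-asset signal
fixed_action = deterministic_policy(state)
sampled_action = stochastic_policy(state, random.Random(1))
```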

Next, the concept of a value function. What is a value function? It is defined as the expected amount of reward one can get from an MDP, starting from a given state and following a certain policy.

For example, if we define the reward of a trading bot to be just the log return of the portfolio at the end of each time step, the value function would be the expected cumulative return at the end of the investment horizon.
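A minimal sketch of that reward definition, assuming no transaction costs and weights that are held for exactly one period. The price and weight paths are hypothetical. The function computes the cumulative reward along one realized path, and the value function would be the expectation of this quantity over many paths.

```python
import math


def step_reward(weights, prices_now, prices_next):
    """Log return of the portfolio over one period for the given weights."""
    gross = sum(w * p1 / p0 for w, p0, p1 in zip(weights, prices_now, prices_next))
    return math.log(gross)


def cumulative_reward(weight_path, price_path):
    """Sum of per-step log returns along one realized price path."""
    steps = zip(weight_path, price_path, price_path[1:])
    return sum(step_reward(w, p0, p1) for w, p0, p1 in steps)


price_path = [[100.0, 50.0], [110.0, 50.0], [110.0, 55.0]]   # hypothetical prices
weight_path = [[0.5, 0.5], [0.5, 0.5]]                       # hypothetical actions
total = cumulative_reward(weight_path, price_path)
```

Because log returns add up across time steps, `math.exp(total)` is the ratio of final to initial wealth, which is why this reward divides cleanly into per-step signals.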

And models. What are models? A model is just the agent’s representation of the environment, and it defines the transition probabilities of the states in the environment. For example, if you assume that the next-step returns of the financial time series follow a Gaussian distribution, the model of the environment is fully defined by the transition probability of a Gaussian distribution.

So now that we have all the ingredients in place, we want to talk about model-based reinforcement learning in portfolio optimization: what the setup looks like and how we can build algorithms based on it. We start from our familiar MDP setup, where an agent interacts with the environment and gets rewards based on the actions it takes. But now the idea is that the agent first tries to learn a model of the environment from the transitions it has been experiencing. So it is not going to optimize the policy directly from experience. It first learns a model from the transitions, and then, based on that model, it solves the optimization. At each time step, the agent first predicts the next state, because it has a model of the environment. It predicts the next state and the reward it will get for the action it took, then it observes the real transition and the real reward from the environment, and then it can incrementally update its model, because it has a loss function on which the model can be trained.

So what are the advantages of this paradigm? There are some advantages, especially in financial portfolio optimization. The most important one is that there have been a lot of studies of the behavior of financial markets and the properties of financial time series data. It is very easy to build those findings directly into a model-based reinforcement learning paradigm. You can put all those findings explicitly into a model, and then have a model that best describes the financial market transitions. Things like volatility clustering, heavy tails of the returns, tail dependence among different assets, the existence of jumps, and non-stationarity can be directly modeled and learned from the data.

But obviously there are some disadvantages too. Because you have an explicit model that you first have to learn, there are sources of error and approximation coming in. If your model is not an accurate representation of the environment, the optimal policies that you learn based on that model won’t be optimal at all, because the model does not describe the market as well as it should.

So let’s formulate everything that we’ve been talking about in model-based reinforcement learning. What should we do?

Well, in general, if we want to use model-based reinforcement learning, we need to gather some experience by interacting with the environment and figure out the model from the experience that we have been gathering, right?

But in finance, it is a little bit easier, because the interactions that we have with the environment, which are basically the transactions that we make, do not affect the state transitions. What I mean by that is that any time we buy or sell an asset in the market, we can assume that the transaction does not change the prices. So we can separate the action from the transition, and we have a setup with only the transition of the prices. We can look at the history of the prices or the returns and train a supervised model on that. The whole approach looks something like this: we pick a parameterized model, which predicts the next state transition or comes up with the probability distribution of the next-period returns. We pick an appropriate loss function so that we can train that model. And then we find the parameters which minimize our loss function, training the whole model on our data set.
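The three steps (pick a model, pick a loss, minimize it) can be sketched with the simplest possible choice: a Gaussian model of one asset's returns, with the negative log-likelihood as the loss. This particular model is an assumption made for illustration. For a Gaussian, the minimizer is available in closed form, so no numerical optimizer is needed. The return history is hypothetical.

```python
import math


def gaussian_nll(returns, mu, sigma):
    """Loss function: negative log-likelihood of the returns under N(mu, sigma^2)."""
    n = len(returns)
    squared_error = sum((r - mu) ** 2 for r in returns)
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) + squared_error / (2.0 * sigma ** 2)


def fit_gaussian(returns):
    """Parameters that minimize the loss: sample mean and (biased) sample std."""
    n = len(returns)
    mu = sum(returns) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / n)
    return mu, sigma


history = [0.010, -0.020, 0.015, 0.000, -0.005]   # hypothetical daily returns
mu_hat, sigma_hat = fit_gaussian(history)
fitted_loss = gaussian_nll(history, mu_hat, sigma_hat)
```

Any other parameter choice gives a loss at least as large as `fitted_loss`, which is what "find the parameters which minimize our loss function" means here.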

Let’s put all this into a generalized algorithm that we can use for any type of model-based reinforcement learning. The input to this algorithm is simple. You have your trading universe, so you have to define what kind of assets you want to trade. You need to define the parametric model which you think best predicts the returns of the market. And you need to come up with a loss function which describes the deviations of the model predictions from the observed returns. For example, you can have a normal GARCH model with non-Gaussian innovations, and the corresponding loss function would be a likelihood. You could use maximum likelihood estimation on a batch data set to first initialize the model, or learn its parameters, and then move into an online training setup for the reinforcement learning. The rest of the algorithm is simple. You use the batch data that you have gathered, which is simply your history of the prices.

You use that to learn the parameters of the model. Then you start to iterate over the time steps. You predict the next step from the model that you have learned on your batch data. You observe the returns (not the state: you only observe the prices and the returns) by just stepping forward. You build your state from the observation and the state history that you have been gathering.

This is part of your model: one part of your model is responsible for building the environment state from the observations it has been making. Then you calculate the deviation between the state that you built from the observations and the state that you predicted.

Then you incrementally update the parameters based on the gradient of that loss function and rebalance the portfolio based on the model that you have. You can use any kind of model-based control to solve the optimization problem. As soon as you have a model for your environment, you can sample from that model, for example in a Monte Carlo setup: you generate samples of the returns up to the end of the investment horizon and estimate the objective function from them, such as expected returns, volatilities, or whatever you have put in as the objective to optimize for, and then solve the optimization problem on those samples. As soon as you have the model, you can control your policy using different policy methods, or just resort to Monte Carlo, and then reiterate until you converge. This is the whole scheme of model-based reinforcement learning: you learn the model and, at the same time, use that model to plan and solve the optimization.
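The loop described above can be sketched end to end under strong simplifying assumptions: independent Gaussian log returns per asset, an exponentially weighted online update in place of a full gradient step on a likelihood, and a grid of constant-weight candidates scored by Monte Carlo. Every name and number below is hypothetical, and the code is an illustration rather than the implementation from the talk.

```python
import math
import random


class GaussianReturnModel:
    """Toy environment model: independent Gaussian log returns per asset."""

    def __init__(self, mu, var):
        self.mu = list(mu)
        self.var = list(var)

    def update(self, observed, lr=0.05):
        """Incremental update from the prediction error of one observed step."""
        for i, r in enumerate(observed):
            error = r - self.mu[i]
            self.mu[i] += lr * error                         # gradient step on 0.5 * error**2
            self.var[i] += lr * (error ** 2 - self.var[i])   # running variance estimate

    def sample_path(self, horizon, rng):
        """Sample one path of per-asset log returns up to the horizon."""
        return [[rng.gauss(m, math.sqrt(v)) for m, v in zip(self.mu, self.var)]
                for _ in range(horizon)]


def plan_weights(model, horizon, candidates, rng, n_paths=200):
    """Monte Carlo planning: score each candidate on paths sampled from the model."""
    paths = [model.sample_path(horizon, rng) for _ in range(n_paths)]

    def score(weights):
        totals = [sum(math.log(sum(w * math.exp(r) for w, r in zip(weights, step)))
                      for step in path)
                  for path in paths]
        mean = sum(totals) / len(totals)
        std = math.sqrt(sum((t - mean) ** 2 for t in totals) / len(totals))
        return mean / std if std > 0 else mean   # objective: return per unit of volatility

    return max(candidates, key=score)


rng = random.Random(0)
model = GaussianReturnModel(mu=[0.0, 0.0], var=[1e-4, 1e-4])    # stands in for the batch fit
candidates = [[w / 10, 1 - w / 10] for w in range(11)]
observed_returns = [[0.010, 0.001], [-0.004, 0.002], [0.007, 0.000]]  # hypothetical data

for observed in observed_returns:
    model.update(observed)   # learn from the deviation between prediction and observation
    weights = plan_weights(model, horizon=5, candidates=candidates, rng=rng)  # rebalance
```

The design choice worth noting is that the model update never looks at the chosen weights, which reflects the assumption that trades do not move prices.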

Okay, instead of learning a predictive model of the transitions first and then using that model to come up with the optimal policy, why not learn the optimal policy based on the value function directly? Assume that the value function is defined as the cumulative return you will be getting at the end of the investment horizon. Then you can use a general function for the policy, which parameterizes how you rebalance the portfolio at each time step, and at the same time use another function to parameterize the amount of cumulative return you will get if you rebalance the portfolio accordingly.

So, this is a typical actor-critic setup, which is one of the state-of-the-art methods in model-free reinforcement learning and could be directly applied to the problem of automatic trading bots. Here in the graph, I have pictured how the networks could look. The actor network gets the observations, which are the prices.

Based on those observations, it first builds a state of the environment and then uses that state to come up with an action, which is the portfolio weights. The critic network takes those portfolio weights together with the observations of the prices, builds a state from them again, and then comes up with a value: how much cumulative return that rebalancing will give you at the end of your investment. So it basically rolls the action out to the end of the investment horizon, looks at the returns that you will be getting, and gives you an estimate of the value of the action picked by your actor network. This setup can be jointly trained with different state-of-the-art algorithms. I just put a generic DDPG here, the deep deterministic policy gradient algorithm, which could be applied to this specific problem. This is something that I will be trying to do, alongside model-based reinforcement learning, to show how we can implement these in Spark and use some of the Spark features to parallelize the model training, and to give an idea of how a full implementation will look. So, let’s briefly talk about the challenges and problems that all of the models we’ve talked about have.
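The division of labor between the two networks can be sketched with linear functions in place of neural networks. This is not DDPG: there is no replay buffer, no target network, and no actor gradient step. It only shows the actor mapping a state to weights, the critic scoring a state-action pair, and a one-step temporal-difference update for the critic. All names, shapes and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Map arbitrary scores to positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor(state, actor_params):
    """Actor: state -> portfolio weights (linear scores followed by softmax)."""
    scores = [sum(p * s for p, s in zip(row, state)) for row in actor_params]
    return softmax(scores)


def critic(state, action, critic_params):
    """Critic: (state, action) -> estimated cumulative return (linear)."""
    features = list(state) + list(action)
    return sum(p * f for p, f in zip(critic_params, features))


def critic_td_update(critic_params, state, action, reward,
                     next_state, next_action, gamma=0.99, lr=0.01):
    """One semi-gradient TD(0) step on the critic parameters."""
    target = reward + gamma * critic(next_state, next_action, critic_params)
    td_error = target - critic(state, action, critic_params)
    features = list(state) + list(action)
    new_params = [p + lr * td_error * f for p, f in zip(critic_params, features)]
    return new_params, td_error


state = [0.01, -0.02]                        # hypothetical state: latest returns
actor_params = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical actor parameters
action = actor(state, actor_params)
critic_params = [0.0, 0.0, 0.0, 0.0]
critic_params, td_error = critic_td_update(
    critic_params, state, action, reward=0.01, next_state=state, next_action=action)
```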

So, as I said before, it is very important and crucial for reinforcement learning algorithms and the MDP formulation to have a clear reward signal, right?

It is challenging for a generalized portfolio optimization framework to come up with reward function generators. If you have some sort of complicated risk functional, like value at risk or any other quantile-based risk measure of the portfolio returns, it might be a problem to engineer a reward-generating function. The other thing is the environment: the financial market is a very complicated environment. It has a lot of features, which makes it very hard for the models to learn effectively. And on top of that, there is a general theme in financial markets: the signal-to-noise ratio is pretty low compared to other areas where reinforcement learning has been successfully applied, things like games, image processing, text classification and so on. So the nature of financial markets, the fact that they are a very noisy environment, makes it very hard for reinforcement learning algorithms to learn.

In addition to those problems, there are some problems specific to model-free and model-based reinforcement learning in financial markets.

For example, if you want to use model-free reinforcement learning, there is a limited amount of training data. With financial time series, if you want to learn a model that uses daily returns or daily prices, you have about 250 data points per year. If you want to train your model on a history of 10 years, you will not have more than about 2,000 to 3,000 data points. This is a very small amount of data, which, combined with the fact that financial markets are a very noisy environment, makes the models very prone to overfitting and unable to generalize well to out-of-sample data.

In model-based reinforcement learning, you can have some of these specific characteristics of the financial markets modeled explicitly in your algorithm, but then you have to come up with ways to cope with model uncertainty, changing models, inaccurate models and the hyperparameters of the models, and these will directly affect the optimal portfolios and optimal solutions that you have at the end of the day.

So there are some ideas that use the benefits of both worlds to make reinforcement learning a viable option for portfolio optimization: hybrid methods, where you learn a model and at the same time use it to generate samples and augment the data, to cope with the problem of limited training data, and then use model-free reinforcement learning on the data generated from the model to get more accurate solutions. But all of this has to be tested and considered very carefully.

So, this was my presentation. In this first part, I just wanted to give you the theory: how it will look, what the challenges of using reinforcement learning are, and the theory behind it. In the next part, I will try to implement a fully integrated solution based on reinforcement learning algorithms, different types of algorithms, to train an automatic trading bot which can come up with the optimal portfolios at the end of a certain investment horizon.
