We will now extend the ideas covered thus far to the cases where:

The data is a sequence of observations
The learned model is not only a passive predictor, but an agent that interacts with the environment.

Learning Dynamics¶

Consider an example data set

$S = \{x_1, \cdots, x_m \}$.

where the subscript defines a time order, i.e. $x_{t'}$ is an event that happens after $x_t$ if $t' > t$. Each observation $x_t$ describes the state of the observed dynamical environment at a time step $t$. Data sets comprising such sequences of state observations are called time series. In time series prediction problems, the task is then to learn the dynamics of the system that generates the data. Prime applications of time series prediction include:

Natural language processing: Generate text, translate text, answer questions, etc.
Option pricing: Predict the future price of a stock, a commodity, etc.
Robotics: Predict the future state of a robot, e.g., its position, velocity, etc.

Markov property¶

Due to the same reasons as before, we will assume that the data is generated by a random process, i.e. $x_t$ is a random variable. The joint distribution of the data is then given by: $P(x_1, \cdots, x_m)$. Remember the product rule $P(A,B) = P(A|B) P(B)$. Let us apply it to the joint distribution of the data:

\begin{align*} P(x_1, \cdots, x_m) &= P(x_1) P(x_2, \cdots, x_m | x_1) \\ & =P(x_1) P(x_2 | x_1) P(x_3, \cdots, x_m | x_1, x_2) \\ & \vdots \\ &= P(x_1) \prod_{t=2}^m P(x_t | x_1, \cdots, x_{t-1}). \end{align*}

Hence, in the straightforward case, all observations depend directly on their entire history. In most real-world applications, there is a way to choose the feature set in such a way that all the interesting properties of the past are represented in the feature set. For instance, instead of storing a full history of the price of commodity, we can store its average and peak prices. This way, we can break the dependency of the future on the past. This is called the Markov property:

$P(x_t | x_1, \cdots, x_{t-1}) = P(x_t | x_{t-1})$.

Applied to the joint distribution of the data given above, we get:

\begin{align*} P(x_1, \cdots, x_m) &= P(x_1) \prod_{t=2}^m P(x_t | x_{t-1}) \\ &= P(x_1) P(x_2 | x_1) P(x_3 | x_2) \cdots P(x_m | x_{m-1}). \end{align*}

Recurrent neural nets¶

If we are interested in using the Markov property but not interested capturing the uncertainty in the data, it would be sufficient to learn a function approximator that maps the features of the present to the features of one time step into the future:

\begin{align*} x_{t+1} &:= f_{\theta}(x_t,t). \end{align*}

When $f_{\theta}$ is chosen as a neural net, the resulting model is called Recurrent Neural Net (RNN). It is also common to learn to predict the change in the features instead of the features themselves:

\begin{align*} x_{t+1} &:= x_t + f_{\theta}(x_t,t). \end{align*}

When the neural net architecture is chosen naively as $x_{t+1} := W \tanh(x_t)$, the computational dependencies between variables appear as below:

\begin{align*} x_1 &:= W \tanh(x_0,0)\\ x_2 &:= W \tanh(W \tanh(x_0,0))\\ x_3 &:= W \tanh(W \tanh(W \tanh(x_0)))\\ &\vdots\\ x_N &:= W \tanh(W \tanh(\cdots W \tanh(x_0),\cdots)) \end{align*}

Remarkably, while the computation of $x_N$ does not depend on $x_0$ directly, it depends on it indirectly through all the intermediate computations. This is also why the Markov property is not such a major restriction in many practical instances.

One can also use an arbitrarily deeper architecture to model the per-time-step update. For instance, a one hidden layer architecture would be:

\begin{align*} x_{t+1} &:= W_1 \tanh( W_2\tanh(x_t)). \end{align*}

Latent state spaces and Seq2Seq design¶

There may also be cases where it is plausible to assume that the observations can be explained by a dynamical process that is not directly observable. Imagine that the motion of a rigid body object is monitored by a camera. The resulting video is a sequence of high-dimensional pixel intensities, say $1000 \times 1000$ pixel resolution and three color channels, amounting to a 3M dimensional feature space! However, the observed motion can be explained by physical laws that govern the positions (3D) and momentum (3D) coordinates of the object, amounting to 6 dimensions. The 6-dimensional state space is called the latent state space. Such latent spaces can be modeled by RNNs well. For instance, let $z_t$ denote the latent state at time $t$. Then, the RNN can be written as:

\begin{align*} z_{t+1} &:= W_s \tanh(z_t),\\ x_{t+1} &:= W_o \tanh(z_{t+1}). \end{align*}

In some applications, we are interested in feeding an input sequence $u_1, \cdots, u_T$ and expect a suitable output sequence $y_1, \cdots, y_{T'}$ which may potentially be not time-aligned, i.e. $u_t$ and $x_t$ may not correspond to a concurrent event. The sequences may even have different lengths, i.e. $T \neq T'$. A typical example is machine translation, where the input is a sentence in one language and the output is the same sentence in another language. In such cases, it is common to assume that there is a universal grammar governing the latent dynamics that generate word sequences. Such an RNN would look as below:

\begin{align*} z_{t+1} &:= W_s \tanh(z_t) + W_i \tanh(u_t), \qquad t = A, \ldots, B \\ x_{t+1} &:= W_o \tanh(z_{t+1}), \qquad t = C, \ldots, D. \end{align*}

This is called the seq2seq design. The terms such as $W_i \tanh(u_t)$ that map an input space to a latent space are called encoders, and the terms such as $W_o \tanh(z_{t+1})$ that map a latent space to an output space are called decoders. Real-world applications use more complex architectures than the example above, but the basic idea is the same.

A common choice in machine translation applications is $C=B+1$, i.e., read the whole sentence first and translate afterwards.

In [2]:

from IPython import display
display.Image("fig/seq2seq.png")

Out[2]:

No description has been provided for this image

In a seq2seq model, the only information the decoder has about the encoder is the last hidden state $x_B$ of the encoded sequence. This causes big information loss especially for long sequences. This is called the bottleneck problem. The problem is addressed by attention mechanisms, the most popular of which is the Transformer architecture. The term gives the name to the letter "T" of the famous ChatGPT. The Transformer architecture is beyond the scope of this course, but the interested reader is referred to the original paper.

Training RNNs: Back-propagation Through Time (BPTT)¶

Given an observed ground-truth sequence as $x_1, \ldots, x_T$ and predictions made by an RNN as $\widehat{x}_t = f_\theta(\widehat{x}_{t-1})$, our goal solve:

\begin{align*} \arg \min_{\theta} \sum_{t=1}^T \frac{1}{2}(x_t - \widehat{x}_t)^2 \end{align*}

by gradient-descent. This is direct application of the mean-squared loss we used in regression to the time series prediction problem. The gradient of the loss with respect to the parameters is:

\begin{align*} \theta &:= \theta- \gamma \sum_{t=1}^T (x_t - \widehat{x}_t) \frac{d \widehat{x}_t}{d\theta}\\ &= \theta - \gamma \sum_{t=1}^T (x_t - \widehat{x}_t) \sum_{j=1}^t \Bigg ( \prod_{k=j+1}^t \frac{\partial \widehat{x}_k }{\partial \widehat{x}_{k-1}} \Bigg ) \frac{\partial \widehat{x}_j }{\partial \theta} \end{align*}

Note that we unroll the RNN multiple time steps forward to collect gradient signal from each individual time step. These gradient signals are then combined to update a single set of parameters that are shared across all time steps. This approach, called Back-Propagation Through Time (BPTT), is in the essence of how RNNs capture regularities that overarch single time steps. On the downside, the fact that the gradient is a product of many terms makes rapidly grow or shrink depending on the properties of the Jacobian term $\frac{\partial \widehat{x}_k }{\partial \widehat{x}_{k-1}}$, causing severe numerical instability problems when $T$ is large. This is called the vanishing/exploding gradients problem. This problem is addressed by gated units that encapsulate a portion of the gradient signal in a newly introduced term and its parameters, called gates. Most popular gated units are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). Attention models and their most famous variant, the Transformer mentioned above, also build on the same principle.

Let us see an example recurrent neural net implementation below. The data set comprises a sequence of 100 observations coming from the following process:

\begin{align*} x_t = \sin(2\pi 0.1 t ) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0,0.04) \end{align*}

for $t \in \{ 0.0, 0.01, \ldots, 9.99, 10.0 \}$.

In [3]:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic time series data
def generate_time_series(n_points, frequency=0.1):
    t = np.linspace(0.0, 10.0, n_points)
    x = np.sin(2 * np.pi * frequency * t) + np.random.normal(0, 0.2, n_points)
    return x

# Create a dataset
n_points = 100
data = generate_time_series(n_points)
data = torch.from_numpy(data).float()

# Split the data into train and test sets
n_train = int(n_points * 0.7)
data_tr = data[:n_train]
data_ts = data[n_train:]

class RNN(nn.Module):
    def __init__(self, num_covariates, hidden_size=256):
        super(RNN, self).__init__()
        self.nn = nn.Sequential( nn.Linear(num_covariates, hidden_size), 
                    nn.Tanh(), nn.Linear(hidden_size, num_covariates) )

    def forward(self, Data_0, Data):
        x = Data_0.view(1, -1)
        x_pred = torch.zeros(Data.shape[0])

        for ii in range(Data.shape[0]):
            x = self.nn(x)
            x_pred[ii] = x
        return x_pred

n_hidden = 256
learning_rate = 0.001
num_epochs = 5000
horizon = 6

model = RNN(num_covariates=1, hidden_size=n_hidden)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = 0.0

    idx = np.random.randint(0,n_train-horizon)
    
    optimizer.zero_grad()
    loss = criterion(data_tr[idx:idx+horizon], 
                     model(data_tr[idx], data_tr[idx:idx+horizon]))
    loss.backward()
    optimizer.step()

# Testing loop
with torch.no_grad():
    err = 0
    Nts = data_ts.shape[0]
    real_data = torch.zeros(Nts)
    pred_data = torch.zeros(Nts)
    for ii in range(0, Nts, horizon):
        Data_ii = data_ts[ii:ii+horizon]
        if Data_ii.shape[0] < horizon:
            break
        Data_pred = model(Data_ii[0], Data_ii[:-1])

        real_data[ii:ii+horizon] = Data_ii
        pred_data[ii] = Data_ii[0]
        pred_data[ii+1:ii+horizon] = Data_pred
        err  += criterion(Data_pred,Data_ii[1])

    print(err.item()/Nts)

plt.figure()
plt.plot(np.arange(0, n_train),
         data_tr, label='training data')
plt.plot(np.arange(n_train,n_points), 
         data_ts, label='test data')
plt.plot(np.arange(n_train,n_points), 
         pred_data, label='predictions', linewidth=3)
plt.legend()
plt.show()

/Users/kandemir/opt/anaconda3/envs/py10/lib/python3.10/site-packages/torch/nn/modules/loss.py:530: UserWarning: Using a target size (torch.Size([])) that is different to the input size (torch.Size([5])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)

0.00872806211312612

Reinforcement Learning¶

We have thus far used learning models for recognition and prediction tasks. One can also use machine learning to build intelligent agents. An agent is a model that can change its environment by taking actions. This way, the model not only observes the environment but also interacts with it. The framework that studies such models is called reinforcement learning.

In [6]:

from IPython import display
display.Image("fig/sl-vs-rl.png")

Out[6]:

Following the Markov property, we modeled the transition of a time series from one time step to next as $P(x_{t+1} | x_t)$. In order to allow the agent to influence the change, it suffices to condition also on the action $a_t$ taken by the agent at time $t$, giving rise to the transition model:

\begin{align*} P(x_{t+1} | x_t, a_t). \end{align*}

This transition model characterizes the environment of the agent to be trained.

The dependency of the state transitions on actions of an agent is the first key difference of the reinforcement learning setup from from time series prediction. The second key difference is that the environment responds to the agent's actions by giving rewards, values that quantify the degree of plausibility of the agent's action at a given state. For instance, in an Atari game, the reward could be the score of the game.

In [7]:

from IPython import display
display.Image("fig/rl-block.png")

Out[7]:

In its most generic form, the reward function is a probability distribution as below:

\begin{align*} P(r_t | x_t, a_t). \end{align*}

In practice, the dependency on the taken action may be dropped or a deterministic reward function may be assumed. The reward function characterizes the supervision signal given to the agent. The agent is trained to take actions that will bring as high total reward as possible during its interactions with the environment. In formal terms, the agent learns a policy $\pi(a_t | x_t)$ that maps states to actions. The policy may be a probabilistic rule, hence $\pi$ may correspond to a probability distribution. It may also be a deterministic rule, in which case $\pi$ is a function.

The agent has to act optimally while it is interacting with the environment. It can do so by updating its policy based on its own evaluation of its performance at any time step $t$. It uses a metric, called the return, which is the discounted sum of rewards it expects to receive in the future:

\begin{align*} G_t &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4}, + \ldots \\ &= \sum_{t=0}^{\infty} \gamma^t r_{t+1} \end{align*}

where $\gamma \in [0,1)$ is a discount factor. The discount factor is used to account for the fact that the return estimate is not perfect. The sooner the reward is received, the more certain it is. One can of course truncate the sequence on the right hand side if the duration of the environment interactions is known beforehand.

Notice the remarkable fact that the return can be expressed recursively:

\begin{align*} G_t &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \ldots \\ &= r_{t+1} + \gamma \left ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \ldots \right ) \\ &= r_{t+1} + \gamma \underbrace{\left ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \ldots \right )}_{G_{t+1}} \\ &= r_{t+1} + \gamma G_{t+1} \end{align*}

The agent's goal is to maximize the expected return:

\begin{align*} \arg \max_{\pi} \mathbb{E} \left [ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right ]. \end{align*}

where the expectation is with respect to all factors of randomness, such as the transition model, the reward function, and the policy.

Assume that the agent has taken action $a_t$ when the environment was at state $x_t$, however the next state $x_{t+1}$ is not yet observed. For instance, a chess player saw the board position $x_t$, played move $a_t$ and waits for the opponent to move. Then the expected return can also be expressed recursively as

\begin{align*} Q(x_t, a_t) &= \mathbb{E} \left [ r_t + \gamma G_{t+1} | x_t, a_t \right ] \\ &= r_t + \gamma \mathbb{E} \left [ G_{t+1} | x_t, a_t\right ] \\ &= r_t + \gamma \mathbb{E} \left [ r_{t+1} + \gamma G_{t+2} | x_t, a_t\right ] \\ &= r_t + \gamma \mathbb{E}_{p(x_{t+1}|x_t, a_t)} \left [ \mathbb{E} \left [ r_{t+1} + \gamma G_{t+2} | x_{t+1} \right ] | x_t, a_t \right ] \\ &= r_t + \gamma \mathbb{E}_{p(x_{t+1}|x_t, a_t)} \left [ \mathbb{E}_{\pi(a_{t+1} | x_{t+1})} \left [ \mathbb{E} \left [ r_{t+1} + \gamma G_{t+2} | x_{t+1}, a_{t+1} \right ] | x_{t+1} \right ] | x_t, a_t \right ] \\ &= r_t + \gamma \mathbb{E}_{p(x_{t+1}|x_t, a_t)} \left [ \mathbb{E}_{\pi(a_{t+1} | x_{t+1})} \left [ Q(x_{t+1}, a_{t+1})| x_{t+1} \right ] | x_t, a_t \right ] \end{align*}

This recursive relationship between $Q(x_t, a_t)$ and $Q(x_{t+1}, a_{t+1})$ is called the Bellman equation. We will use this equation to convert the reinforcement learning problem to a regression problem below. To remind, $Q$ is the agent's own estimate on the value of being at state $x_t$ and taking action $a_t$. If this estimate is exact, then the agent can maximize its expected return by taking the action that maximizes $Q(x_t, a_t)$ at each time step:

\begin{align*} \pi(a_t | x_t) = \arg \max_{a'} Q(x_t, a'). \end{align*}

This is called a greedy policy. Plugging the greedy policy function into the Bellman equation, we get:

\begin{align*} Q(x_t, a_t) = r_t + \gamma \mathbb{E}_{p(x_{t+1}|x_t, a_t)} \left [ \max_{a'} Q(x_{t+1}, a')| x_t, a_t \right ] \end{align*}

However, in the course of training, we only have an inaccurate estimate of $Q$. Let us denote it as $Q_{\theta}$, a function approximator such as a neural net parameterized by $\theta$. We would like to update $\theta$ in such a way that the Bellman equation holds as closely as possible:

\begin{align*} \arg \min_{\theta} \left ( Q_{\theta}(x_t, a_t) - \left ( r_t + \gamma \mathbb{E}_{p(x_{t+1}|x_t, a_t)} \left [ \max_{a'} Q_{\theta}(x_{t+1}, a')| x_t, a_t \right ] \right ) \right )^2. \end{align*}

This is a regression problem. The loss function is called the Bellman error. The only obstacle for implementing this loss is the expectation with respect to the transition model $p(x_{t+1}|x_t, a_t)$. We can approximate this expectation by simply sampling from the transition model by taking actions with respect to some policy $\mu$, storing the $(x_t, a_t, r_t, x_{t+1})$ tuples in a data set $S$, and doing gradient-descent updates on the evaluation of the Bellman error on minibatches of the data set. The resulting algorithm is called Q-learning. The algorithm is summarized below:

Initialize $Q_{\theta}$ randomly
Reset environment to initial state $x$
Initialize $S = \emptyset$
For $t = 1, \ldots, T$:
- Sample $a$ from $\mu(a | x)$
- Act $a$ in the environment and receive $r$ and $x'$
- Store $(x,a,r,x')$ in $S$
- Sample a minibatch $B$ from $S$
- Update $\theta$ by gradient-descent on the Bellman error on $B$: \begin{align*} \theta &:= \theta - \alpha \sum_{(x_t, a_t, r_t, x_{t+1}) \in B} \left ( Q_{\theta}(x_t, a_t) - \left ( r_t + \gamma \max_{a'} Q_{\theta}(x_{t+1}, a') \right ) \right ) \nabla_{\theta} Q_{\theta}(x_t, a_t). \end{align*}
- Set $x := x'$
Return $Q_{\theta}$

It may not be obvious in the first sight why we need a separate behavior policy. The reason is that the greedy policy chooses the optimal action by treating the current estimate of $Q$ as perfect. This is called exploitation. However, in cases where it is not, the greedy policy may get stuck in the areas of the environment where the return is suboptimal. It may discover states with higher rewards than the ones visited thus far by taking random actions every now and then. This is called exploration. The agent is in fact trying to solve an ill-posed problem: It learns to act optimally based on its experiences, but its experiences are shaped by its past actions. This is called the exploration-exploitation dilemma. The behavior policy is used to balance exploration and exploitation. The most popular choice is the $\epsilon$-greedy policy:

\begin{align*} \mu(a_t | x_t) = \begin{cases} \arg \max_{a'} Q_{\theta}(x_t, a') & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon. \end{cases} \end{align*}

Such reinforcement learning algorithms that use a different behavior policy than the greedy policy are called off-policy algorithms. The ones that use the greedy policy are called on-policy algorithms. The Q-learning algorithm above is an off-policy algorithm.

The data set $S$ used above is typically implemented as a fixed-sized queue, i.e. a Last-In-First-Out (LIFO) buffer. This approach is called experience replay and the data set is called a replay buffer. The motivation for using a fixed-sized queue is that in many cases the behavior policy is derived from the current estimate of the target policy, as in the case of the $\epsilon$-greedy approach. Hence, the data points collected in the far past are no longer representative of the dynamics resulting from the current estimate.

A sample implementation is given in the additional Jupyter notebook shared in itsLearning.