The triple barrier method

A detailed explanation of the triple barrier method, a technique for labelling data for asset price prediction.
labelling
machine learning
quantitative finance
Author

Cem Sirin

Published

March 26, 2023

Contents

  1. Introducing the triple barrier method
    1. Intuition on consistency
    2. Intuition on barrier heights
  2. Implementation in R
  3. References

There are some existing posts/articles about the triple barrier method, but they are not very detailed, so I will write a rather in-depth one here and go over the method exhaustively. I hope that you can just Ctrl + F (or Cmd + F on Mac) to find the part you are interested in.

Introducing the triple barrier method

The triple barrier method is a way to label data for asset price prediction. It was coined by López de Prado (2018) and has been gaining popularity in the quantitative finance community. The method may seem foreign to the academic community, but it is actually what day traders have been doing manually for a long time: when a day trader places an order, they set a certain price point that triggers automatically to either take their profit or stop their loss.

Example of a stop-loss and take-profit order, explained on Forex Springboard

Above you can see the two horizontal barriers that are placed to trigger exit orders. You may ask where the third barrier of the triple barrier method is. The third barrier is a vertical barrier: the time horizon \(h\), the maximum time that the position can be held. The label \(y_t\) for a given time \(t\) is defined as follows:

\[ y_t=\left\{\begin{array}{cl} 1 & \text { if the upper barrier is hit first } \\ 0 & \text { if the vertical barrier is hit } \\ -1 & \text { if the lower barrier is hit first } \end{array}\right. \]

Intuition on consistency

Now, let me wear my statistician hat and make some remarks. After we have our labels, we want to believe that the label \(y_t\) for time \(t\) follows some conditional probability distribution \(f(y_t|x_t)\) given features \(x_t\). Since we are constructing our own labels, it is up to us to construct sound ones. If some orders take profit in 5 minutes and others in 5 days, the labels will not be consistent. By consistency, I mean that we should expect similar market conditions for each label class. Our predictive features \(x_t\) capture the market conditions at time \(t\). If the market conditions behind a single label class are drastically different, the conditional probability distribution \(f(y_t|x_t)\) will not have much predictive power.

Intuition on barrier heights

There are other variants of the triple barrier method with dynamic barriers, but in the vanilla version the barriers are static. Thus, in order to implement it, you need three ingredients for each time \(t\):

  • upper barrier \(u_t\),
  • lower barrier \(l_t\), and
  • horizon \(h_t\).

In practice, you cannot have a fixed horizon \(h_t\) for every time \(t\), since markets are not open 24/7. Based on the resolution of your data, you can either filter out the entry points that do not fit your time horizon, or use the largest feasible horizon \(h_t = \min\{ h^{\text{max}}, h_t^* \}\), where \(h_t^*\) is the time remaining until the market closes.
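For intraday data, capping the horizon at the session close could look something like the sketch below. This is a rough illustration rather than part of the labelling code later in this post; cap_horizon, session_end, and bar_minutes are hypothetical names and parameters that you would adapt to your own data.

cap_horizon <- function(timestamps, h_max, session_end = "15:00:00", bar_minutes = 5) {
    # session close on each bar's trading day (assumes a single time zone)
    close_time <- as.POSIXct(paste(as.Date(timestamps), session_end))
    # remaining bars until the close, assuming equally spaced bars
    h_star <- floor(as.numeric(difftime(close_time, timestamps, units = "mins")) / bar_minutes)
    # the feasible horizon is the smaller of the desired and the remaining one
    pmin(h_max, h_star)
}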

To calculate the upper and lower barriers, you can use either (1) a fixed value or (2) a volatility-based value. I have seen cases where both methods are viable, and the interpretation of the two cases is crucial when you reach the feature engineering stage. Case (2) is scaled by the volatility, so you may think of it as removing the effect of volatility; this can be advantageous if the trends in the market are volatility independent. Case (1) is sensitive to the volatility, so it may be more suitable if the trends in the market are volatility dependent. I usually try both cases and see which one performs better, but the crucial part is to interpret the labels correctly.
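As a quick sketch of the two cases (my own helper, not code from this post; make_barriers, fixed_pct, and the numbers below are made up for illustration):

# barriers around an entry price, either fixed or volatility-scaled (sketch)
make_barriers <- function(price, volatility = NA, alpha = 3, fixed_pct = 0.01,
                          type = c("volatility", "fixed")) {
    type <- match.arg(type)
    offset <- if (type == "fixed") fixed_pct else volatility * alpha
    # (1 - offset) is the first-order approximation of 1 / (1 + offset)
    list(upper = price * (1 + offset), lower = price * (1 - offset))
}

make_barriers(19000, volatility = 0.009)                # Case (2): volatility-based
make_barriers(19000, type = "fixed", fixed_pct = 0.01)  # Case (1): fixed 1 percent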

For Case (2), I have seen many people estimating volatility using prices, but I think it is better to use returns. Moreover, I do not advise using the built-in standard deviation functions of your programming language. Instead, I take the square of the returns and then use an exponential moving average (EMA) to smooth them. Squaring the returns amounts to assuming that the expected return is 0, which is roughly how the market operates for many assets (e.g. forex). In my experience, I have not seen much difference between using an EMA and other moving averages; the more important part is the span of the moving average. I usually use a span of 20, but this is another tuning parameter that you can play with.
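In symbols, the estimator described above is (my notation, with \(r_t\) the simple return and the EMA span of 20 treated as a tuning parameter):

\[ r_t = \frac{p_t}{p_{t-1}} - 1, \qquad \hat{\sigma}_t = \sqrt{\operatorname{EMA}_{20}\!\left(r_t^2\right)} \]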

Let \(\hat{\sigma}_t\) be the estimated volatility of the asset. The upper and lower barriers are calculated as follows:

\[ \begin{aligned} u_t &= p_t \cdot (1 + \hat{\sigma}_t \alpha) \\ l_t &= p_t / (1 + \hat{\sigma}_t \alpha) \end{aligned} \]

where I consider \(\alpha\) a tuning parameter. The higher the value of \(\alpha\), the fewer \(1\) and \(-1\) labels are generated.

For demonstration, I will code in R, since I am using Quarto to write this article, and it’s aesthetically pleasing to read. Let’s download Nikkei 225 data from Yahoo Finance and see how it works.

Implementation in R

library(tidyverse)
library(quantmod)
library(xts)

# download data
getSymbols("^N225", from = "2017-01-01", to = "2017-12-31")
[1] "^N225"
colnames(N225) <- c("open", "high", "low", "close", "volume", "adjusted")

# tuning parameter for the barrier width
alpha <- 3

# convert to a data frame, compute returns, volatility, and the barriers
N225 <- N225 %>%
    as.data.frame() %>%
    mutate(date = index(N225)) %>%
    select(date, high, low, close) %>%
    drop_na() %>% # drop missing values
    mutate(returns = close / lag(close) - 1) %>% # simple returns
    mutate(volatility = EMA(returns^2, 20)^0.5) %>% # EMA of squared returns, span 20
    # the lower barrier uses (1 - x), the first-order approximation of 1 / (1 + x)
    mutate(u_bar = close * (1 + volatility * alpha), l_bar = close * (1 - volatility * alpha)) %>%
    drop_na() %>% # drop the rows lost to lag() and the EMA warm-up
    as_tibble() # convert to tibble
# take a look at the data
head(N225)
# A tibble: 6 × 8
  date         high    low  close   returns volatility  u_bar  l_bar
  <date>      <dbl>  <dbl>  <dbl>     <dbl>      <dbl>  <dbl>  <dbl>
1 2017-02-02 19171. 18867. 18915. -0.0122      0.00993 19478. 18351.
2 2017-02-03 19061. 18831. 18918.  0.000191    0.00944 19454. 18382.
3 2017-02-06 19076. 18899. 18977.  0.00309     0.00903 19491. 18463.
4 2017-02-07 18971. 18805. 18911. -0.00347     0.00866 19402. 18420.
5 2017-02-08 19009. 18876. 19008.  0.00512     0.00839 19486. 18529.
6 2017-02-09 18991. 18875. 18908. -0.00526     0.00814 19369. 18446.

Let’s visualize the data. At the moment I am testing out rtemis, a library for machine learning and visualization. It’s not on CRAN yet, but you can install it from GitHub.

library(plotly)
# rtemis is not on CRAN; install it from GitHub first, e.g.
# remotes::install_github("egenn/rtemis")
library(rtemis)
colorp <- rtpalette("imperialCol")
length(colorp)
[1] 26
colorp[17]
$lemonYellow
[1] "#FFDD00"
# plot
plot_ly(N225, x = ~date, y = ~close, type = "scatter", mode = "lines", name = "close", color = I(colorp[[24]])) %>%
    add_trace(y = ~u_bar, mode = "lines", name = "upper barrier", line = list(dash = 'dot'), color = I(colorp[[7]])) %>%
    add_trace(y = ~l_bar, mode = "lines", name = "lower barrier", line = list(dash = 'dot'), color = I(colorp[[16]])) %>%
    layout(title = "Nikkei 225", yaxis = list(title = "Price", type = 'log'), xaxis = list(title = "Date"))
label_3plb <- function(df, h = 10) {
    # triple barrier labels: u_bar and l_bar are already columns of df,
    # and h is the vertical barrier (maximum holding period in bars)
    df$y <- NA
    df$ambiguity <- NA

    for (i in 1:(nrow(df) - h)) {
        # index of the first bar within the horizon that crosses each barrier
        u_cross <- which(df$u_bar[i] < df$high[i:(i + h)])[1]
        l_cross <- which(df$l_bar[i] > df$low[i:(i + h)])[1]

        if (is.na(u_cross) & is.na(l_cross)) {
            df$y[i] <- 0   # neither horizontal barrier hit: vertical barrier
        } else if (is.na(u_cross) & !is.na(l_cross)) {
            df$y[i] <- -1  # only the lower barrier hit
        } else if (!is.na(u_cross) & is.na(l_cross)) {
            df$y[i] <- 1   # only the upper barrier hit
        } else if (u_cross == l_cross) {
            df$ambiguity[i] <- TRUE  # both hit on the same bar: order unknown
        } else if (u_cross < l_cross) {
            df$y[i] <- 1   # upper barrier hit first
        } else {
            df$y[i] <- -1  # lower barrier hit first
        }
    }
    return(df)
}
# label the data
N225 <- label_3plb(N225, h = 10)
paste("Number of 1, 0, and -1 labels:", sum(N225$y == 1, na.rm = T), sum(N225$y == 0, na.rm = T), sum(N225$y == -1, na.rm = T))
[1] "Number of 1, 0, and -1 labels: 82 89 46"
paste("Percentage of ambiguity:", sum(!is.na(N225$ambiguity)) / nrow(N225))
[1] "Percentage of ambiguity: 0"
# drop ambiguity and missing values
N225 <- N225 %>% select(-ambiguity) %>% drop_na()
# make y column a factor
N225$y <- as.factor(N225$y)
# name the labels | buy, hold, sell
N225$y_n <- ifelse(N225$y == 1, 'Buy', ifelse(N225$y == 0, 'Hold', 'Sell'))

# plot
plot_ly(data = N225, x = ~date, y = ~close, type = 'scatter', mode = 'lines', color = I('gray'), showlegend = FALSE) %>%
    add_trace(x = ~date, y = ~close, color = ~y_n, type = 'scatter', mode = 'markers', marker = list(size = 10, opacity = 0.8), showlegend = T, colors = c('green', 'black', 'red')) %>%
    layout(title = 'Nikkei 225', xaxis = list(title = 'Date'), yaxis = list(title = 'Price', type = 'log'), legend = list(title = list(text = 'Label')))

References

  • López de Prado, M. (2018). Advances in financial machine learning. John Wiley & Sons.