Hello to Time Series in R

Xinbin Huang

2019-02-26

Setup Guide

  1. Install R 3.5: https://cran.r-project.org/
  2. Install RStudio Desktop (Free): https://www.rstudio.com/products/rstudio/download/#download
  3. Download and extract the code to this talk: https://github.com/xinbinhuang/r-time-series/archive/master.zip
  4. Click on the .Rproj file
  5. packrat::restore()

Who Am I?

  • Alumni of the UBC Master of Data Science Program
  • Co-organizer of School of AI
  • Data Scientist at LiveStories

Acknowledgements

Why R? (not Python sorry)

  • R with Statistical Analysis in mind.
  • A lot of packages/models in R that Python simply doesn’t have.
  • CRAN Time Series Task View

What is Time Series?

  • Data recorded in time order
  • Time as the primary axis

https://blog.timescale.com/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/

Why should I care?

  • Economic forecasting
  • Target setting
  • Budget/inventory management
  • Sensors
  • Monitoring/Log data

How do we do Time Series Analysis?

  • Exploratory Data Analysis (EDA) to check if the data is staionary
  • Transform the data if needed
    • Box-Cox Transformation
  • Model Selection & Validation
  • Forecasting

“Jargons”

  • Trend: a long-term increase or decrease in the data.
  • Seasonality: a periodic pattern over a fixed period
  • Cyclic: rises and falls that are not of a fixed frequency.
  • Stationary: properties do not vary across time

Stationary or NOT

Non-staionary

https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/

Stationary

https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/

The Dataset

  • Time series version of the Titanic Dataset
  • Airline Passanger
    • Monthly totals of international airline passengers, 1949 to 1960.
head(AirPassengers, 12)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118

EDA & Processing

AirPassengers %>% tsdisplay()

Decomposition

AirPassengers %>% 
    stl(s.window = 'periodic') %>% 
    autoplot()

Model Fitting

  • Time-series train-test split
  • Time-series cross-validation
# keep the last year as test set
train <- window(AirPassengers, end=1959+11/12)
test <- window(AirPassengers, start=1960)

Models

  • Linear regression
  • ARIMA
  • Exponential smoothing
# try different hyperparameters!

# linear regression
mod1 <- tslm(train ~ trend + season) 
# ARIMA
mod2 <- auto.arima(train) 
# exponential smoothing
mod3 <- hw(train, seasonal = 'multiplicative') 

Model Validation

  • Make predictions on the test set
  • Validate model performance with the test set
  • Don’t use training set to validate your model performance
    • OVERFITTING
h <- length(test)

# benchmark prediction: use value in the previous period
benchmark <- naive(train, h=h) 

mods <- list(
    lm = mod1,
    arima = mod2,
    exp_smooth = mod3
)
# make predictions on test set
preds <- map(mods, ~ forecast(.x, h=h))

Validation

  • Use score metric to compare models
# RMSE score for comparison
accuracy(benchmark, test)[, 'RMSE']
## Training set     Test set 
##     31.33213    102.97653
map(preds, ~ accuracy(.x ,test, h=h)[, 'RMSE'])
## $lm
## Training set     Test set 
##     22.35917     49.47908 
## 
## $arima
## Training set     Test set 
##     9.907267    23.931703 
## 
## $exp_smooth
## Training set     Test set 
##     9.949946    16.583116

How do they look on a plot?

Forecasting!!!

# train on the whole data set
final_mod <- hw(AirPassengers, seasonal = 'multiplicative')

# let's have a 3 year future forecast
h <- 12 * 3

fc <- forecast(final_mod, h)

Forecast plot

Prophet

  • A robust forecasting framework by Facebook Data Science team
  • Best with strong seasonal effects and several seasons of historical data.
  • Robust to missing data, outliers, and shift in trend
format_ts <- function(ts) {
    ds <- as.Date(zoo::as.yearmon(time(ts)))
    y <- as.matrix(ts)
    return(data.frame(ds = ds, y = y))
}

air_passengers <- format_ts(train) # format the input as required by Prophet

# fitting 
mod4 <- prophet(air_passengers, seasonality.mode = "multiplicative")
## Initial log joint probability = -2.41026
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance

Prophet - Validation

future <- make_future_dataframe(mod4, 
                                periods = 12, 
                                freq = 'month')
forecast <- predict(mod4, future)

yhat <- forecast$yhat[133:144] # subtract test set period
rmse <- sqrt(mean(yhat - test)^2 ) # calculate RMSE score
paste('RMSE: ', rmse)
## [1] "RMSE:  15.387713599209"

Train on the whole set and Forecast

air_passengers <- format_ts(AirPassengers)

prophet_mod <- prophet(air_passengers, seasonality.mode = "multiplicative")
## Initial log joint probability = -2.46502
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
future <- make_future_dataframe(prophet_mod, 
                                periods = 12, 
                                freq = 'month')
forecast <- predict(prophet_mod, future)

Forecast Plot

dyplot.prophet(prophet_mod, forecast)

Other stuff??

  • Tune hyperparameters for your data
  • Python? (i.e. statsmodels)
  • Feature engineering
  • Tree-based algorithm (i.e. LightGBM)
  • Deep Learning (i.e. LSTM)