back

next

 

 

 

 

Forecasting with R and SQL

 

 

OTech Magazine #6 – Winter 2014 – featured an article by me called “Time Series Forecasting in SQL.”

I demonstrated how to forecast with a simple time series model solely using analytical functions, enabling

me to forecast for hundred thousand items in 1½ minute.

 

But that was just a single statistical model… There are many variants of forecasting models, and who knows if the one model I programmed in SQL is the best one for my data?

 

R, on the other hand, is a programming language for statistical computing and plotting. It is widely used by statisticians and data miners, and there are many packages in R for various statistic analysis of data. I can find R packages containing many forecasting models built-in with the capability of automatically running multiple models and comparing accuracies for selecting the best model for the actual data.

 

This article describes how I can accomplish forecasting with multiple statistical models using R connecting to the database with the free ROracle driver.

 

 

R environment used

 

Download and install R from for example:  http://cran.rstudio.com/

Download and install RStudio Desktop (Open Source Ed.):

http://www.rstudio.com/products/RStudio/

Download ROracle zip file from OTN:

http://www.oracle.com/technetwork/database/database-technologies/r/roracle/downloads/index.html

 

In RStudio :

- Menu “Tools | Check for Package Updates”: select all and update.

- Menu “Tools | Install Packages”: install package “forecast” with dependencies.

- Menu “Tools | Install Packages”: install package “DBI” with dependencies.

- Menu “Tools | Install Packages”: install the ROracle zip using “Install from file.”

 

Documentation for ROracle methods can be found here:

http://cran.r-project.org/web/packages/ROracle/ROracle.pdf

 

The scripts (both SQL and R) used for this article can be downloaded here:

http://goo.gl/ihW632

 

 

A couple of representative products

 

I’ll make a table containing montly sales for two items (products):

create table sales (

   item varchar2(10)

 , mth  date

 , qty  number

);

 

insert into sales values ('Snowchain', date '2011-01-01',   79);

insert into sales values ('Snowchain', date '2011-02-01',  133);

insert into sales values ('Snowchain', date '2011-03-01',   24);

...

insert into sales values ('Sunshade' , date '2013-10-01',   11);

insert into sales values ('Sunshade' , date '2013-11-01',    3);

insert into sales values ('Sunshade' , date '2013-12-01',    5);

 

The snowchain sells well every winter but hardly anything in the summertime and the trend is that it sells more and more every year. The sunshade on the other hand sells well in summer with a downward trend less and less every year.

 

 

Figure 1: 2011-2013 sales for the two items

 

 

Getting the data into R

 

I tell R I want to use the ROracle and forecast packages:

library(ROracle)

library(forecast)

 

I get a connection to the database (use your own connection info):

ora.username <- "hr"

ora.password <- "oracle"

ora.host     <- "localhost"

ora.port     <- "1521"

ora.service  <- "orcl"

 

ora.drv <- dbDriver("Oracle")

ora.con <- dbConnect(

  ora.drv,

  username = ora.username,

  password = ora.password,

  dbname = sprintf(

    "(DESCRIPTION = (ADDRESS_LIST =

      (ADDRESS = (PROTOCOL = TCP)(HOST = %s)(PORT = %s))

      )(CONNECT_DATA =(SERVICE_NAME = %s)))",

    ora.host, ora.port, ora.service

  )

)

 

dbGetQuery(ora.con, "alter session set nls_date_format='YYYY-MM-DD'")

 

(Note that ora.username is not an object ora with an attribute username. The dots used like this in R is simply part of name, so the variable is called ora.username. Using prefix and dot in variable names is just to group related variables.)

 

I make some variables for what item I want and what period:

parm.item   <- "Sunshade"

parm.fcyear <- 2014

parm.fcmth  <- 1

 

fc.startyear <- parm.fcyear - 3

fc.startmth  <- parm.fcmth

if (parm.fcmth == 1) {

  fc.endyear <- parm.fcyear - 1

  fc.endmth  <- 12

} else {

  fc.endyear <- parm.fcyear

  fc.endmth  <- parm.fcmth - 1

}

 

I want to forecast Sunshade from January 2014 and 12 months forward. So the 3 years data I want to read will be from January 2011 until December 2013.

 

Having calculated the period, I create a dataframe with values for binding:

fc.binds <- data.frame(

  item      = parm.item,

  startdate = sprintf('%04i-%02i-01',fc.startyear,fc.startmth),

  enddate   = sprintf('%04i-%02i-01',fc.endyear,fc.endmth),  stringsAsFactors = FALSE

)

 

I can now create my query:

fc.qry <- dbSendQuery(

  ora.con,

  "select qty

    from sales

   where item = :1

     and mth >= to_date(:2,'YYYY-MM-DD')

     and mth <= to_date(:3,'YYYY-MM-DD')

   order by mth",

  fc.binds

)

 

And fetch the data:

fc.sales <- dbFetch(fc.qry)

To view the data in the R console, I simply type the variable name:

fc.sales

And get this output:

   QTY

1    4

2    6

3   32

4   45

...

33  13

34  11

35   3

36   5

 

I turn those data into a Time Series object:

fc.ts <- ts(

  fc.sales$QTY,

  frequency = 12,

  start     = c(fc.startyear, fc.startmth),

  end       = c(fc.endyear, fc.endmth)

)

 

And I can view the time series data as well:

fc.ts

 

Which gives this result:

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

2011   4   6  32  45  62  58  85  28  24  19   6   8

2012   2  13  29  60  29  78  56  22  11  13   5   3

2013   2   8  28  26  23  46  73  25  13  11   3   5

 

Or alternative to viewing the raw numbers, I can plot a graph:

plot(fc.ts, xlim=range(2011:2015), ylim=range(0:100))

 

Figure 2: Sunshade 2011-2013 timeseries plot

 

 

 

 advertisement

 

 

 

Forecasting

 

With the time series object, I can do forecasts using many different models. If I don’t know what model fits best, an easy start can be to let R decide:

fc.forecast <- forecast(fc.ts, h=12)

 

I can then simply view the forecast result:

fc.forecast

 

Which returns the forecast including confidence intervals:

         Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95

Jan 2014       2.110382  1.442871  2.777893  1.089512  3.131252

Feb 2014       7.540539  5.092141  9.988938  3.796037 11.285042

Mar 2014      21.250707 14.176120 28.325294 10.431060 32.070354

Apr 2014      34.251114 22.573197 45.929031 16.391281 52.110946

May 2014      29.132621 18.970345 39.294897 13.590762 44.674481

Jun 2014      54.531521 35.087999 73.975042 24.795220 84.267821

Jul 2014      53.062682 33.740310 72.385054 23.511664 82.613700

Aug 2014      19.862952 12.481938 27.243965  8.574666 31.151238

Sep 2014      13.624737  8.461952 18.787523  5.728938 21.520537

Oct 2014      11.683766  7.172190 16.195342  4.783906 18.583626

Nov 2014       3.756779  2.279451  5.234107  1.497401  6.016157

Dec 2014       4.691476  2.813754  6.569197  1.819749  7.563203

 

Or in a plot:

plot(fc.forecast, xlim=range(2011:2015), ylim=range(0:100))

 

Figure 3: Forecast using automatically chosen model

 

Note the forecast automatically chose model (M,N,M). It did this by running multiple models and determining which one mathematically fit the data best.

 

 

Other models

 

If I am unhappy with the model chosen by the automatic forecast, I can do other models by creating a specific model object and performing the forecast on that.

For example I can choose model (A,N,A) instead of (M,N,M) like this:

fc.etsana   <- ets(fc.ts, model='ANA')

fc.fcetsana <- forecast(fc.etsana, h=12)

plot(fc.fcetsana, xlim=range(2011:2015), ylim=range(0:100))

 

 

Figure 4: Forecast using manually chosen (A,N,A) model

 

Or I can move away from the simpler time series models and instead use Arima models. For example use auto.arima that picks the best fitting Arima model:

fc.arima    <- auto.arima(fc.ts)

fc.fcarima  <- forecast(fc.arima, h=12)

plot(fc.fcarima, xlim=range(2011:2015), ylim=range(0:100))

 

 

Figure 5: Forecast using automatically chosen Arima model

 

If I wish I can do manual Arima models specifying different parameters than (0,0,0)(1,0,0), or I can use Holt-Winters or Croston or exponential smoothing models.

The forecast package supports many different models. You can find more information on github:

https://github.com/robjhyndman/forecast

 

 

Returning the forecast to Oracle

 

Let’s say I’ve picked the automatic forecast as the one I like best. And I’d like to insert the forecast into a table in my database. I could do just the actual forecast, but I want also to insert the confidence intervals, so I create this table:

create table r_forecast

(

   item varchar2(10)

 , mth  date

 , mean number

 , lo80 number

 , hi80 number

 , lo95 number

 , hi95 number

);

 

In R I will create a dataframe containing what I want to insert:

fc.data <- data.frame(

  ITEM = parm.item,

  MTH  = sprintf(

           '%04i-%02i-01',

           trunc(time(fc.forecast$mean)[1:12]),

           round(1+12*time(fc.forecast$mean)[1:12]%%1)

         ),

  MEAN = fc.forecast$mean[1:12],

  LO80 = fc.forecast$lower[,1],

  HI80 = fc.forecast$upper[,1],

  LO95 = fc.forecast$lower[,2],

  HI95 = fc.forecast$upper[,2]

)

 

That creates 12 rows. A few remarks may be needed here:

fc.forecast is an object with elements that can be addressed with $ notation.

$lower is a two-dimensional array for the lower bounds of the confidence intervals, first index is month, second index is 1=80% confidence, 2=95% confidence (the percentages could be chosen in the forecast call, 80% and 95% are the defaults.) So fc.forecast$lower[,1] is a one-dimensional array with the 12 values for each month for LO80 (lower bound of 80% confidence interval.)

Similar for HI80, LO95 and HI95.

 

$mean is a time series object containing the actual forecast. $mean[1:12] is the actual data, while the time() function returns the 12 individual times of the time series object. So I can observe the values of that function result:

time(fc.forecast$mean)[1:12]

 

And I will see 12 number values measured in years:

 [1] 2014.000 2014.083 2014.167 2014.250 2014.333 2014.417

 [7] 2014.500 2014.583 2014.667 2014.750 2014.833 2014.917

 

So I use trunc(time(fc.forecast$mean)[1:12]) to get an array of 12 whole years, and round(1+12*time(fc.forecast$mean)[1:12]%%1) to get an array of 12 month numbers. Then sprintf() formats the 12 year and months into format YYYY-MM-DD.

 

At the very beginning when creating the database connection I set the session NLS_DATE_FORMAT to YYYY-MM-DD to make it easy here. It is possible to use “proper Date” datatypes in R and get them into Oracle, but it requires quite a bit of worrying about timezones, which will have to be saved for another article.

My parm.item is not an array of 12 values, it is just a single value. But when the various parts of the dataframe does not have the same number of elements, those that are “too short” are simply repeated until they all have same length.

 

So I can view the contents of my constructed dataframe:

fc.data

 

And it looks very much like what I want to insert into my database table:

       ITEM        MTH      MEAN      LO80      HI80      LO95      HI95

1  Sunshade 2014-01-01  2.110382  1.442871  2.777893  1.089512  3.131252

2  Sunshade 2014-02-01  7.540539  5.092141  9.988938  3.796037 11.285042

3  Sunshade 2014-03-01 21.250707 14.176120 28.325294 10.431060 32.070354

4  Sunshade 2014-04-01 34.251114 22.573197 45.929031 16.391281 52.110946

5  Sunshade 2014-05-01 29.132621 18.970345 39.294897 13.590762 44.674481

6  Sunshade 2014-06-01 54.531521 35.087999 73.975042 24.795220 84.267821

7  Sunshade 2014-07-01 53.062682 33.740310 72.385054 23.511664 82.613700

8  Sunshade 2014-08-01 19.862952 12.481938 27.243965  8.574666 31.151238

9  Sunshade 2014-09-01 13.624737  8.461952 18.787523  5.728938 21.520537

10 Sunshade 2014-10-01 11.683766  7.172190 16.195342  4.783906 18.583626

11 Sunshade 2014-11-01  3.756779  2.279451  5.234107  1.497401  6.016157

12 Sunshade 2014-12-01  4.691476  2.813754  6.569197  1.819749  7.563203

 

Inserting then becomes quite simple:

dbWriteTable(ora.con, "R_FORECAST", fc.data, append=TRUE)

If the table does not exist, dbWriteTable() will create the table, though it will be with some default datatypes that may not be what you want. For adhoc stuff it’ll probably be OK. If append=FALSE, it will drop the table and recreate it.

One thing here to note: dbWriteTable() behaves like DDL even in those cases where it does not create table, but simply inserts (appends) to existing table. Which means that it commits! You can’t use it as part of a transaction.

So I can now query my forecast table in the database:

select *

  from r_forecast

 where item = 'Sunshade'

 order by item, mth;

 

And see I got the desired result:

ITEM       MTH             MEAN      LO80      HI80      LO95      HI95

---------- ---------- --------- --------- --------- --------- ---------

Sunshade   2014-01-01     2.110     1.443     2.778     1.090     3.131

Sunshade   2014-02-01     7.541     5.092     9.989     3.796    11.285

Sunshade   2014-03-01    21.251    14.176    28.325    10.431    32.070

Sunshade   2014-04-01    34.251    22.573    45.929    16.391    52.111

Sunshade   2014-05-01    29.133    18.970    39.295    13.591    44.674

Sunshade   2014-06-01    54.532    35.088    73.975    24.795    84.268

Sunshade   2014-07-01    53.063    33.740    72.385    23.512    82.614

Sunshade   2014-08-01    19.863    12.482    27.244     8.575    31.151

Sunshade   2014-09-01    13.625     8.462    18.788     5.729    21.521

Sunshade   2014-10-01    11.684     7.172    16.195     4.784    18.584

Sunshade   2014-11-01     3.757     2.279     5.234     1.497     6.016

Sunshade   2014-12-01     4.691     2.814     6.569     1.820     7.563

 

 

Repeating for multiple items

 

So far I’ve covered Forecast1.R script doing forecast for item ‘Sunshade’ only. Now I move on to Forecast2.R script looping over multiple items.

truncate table r_forecast;

 

Having cleared the forecast table, in R I query the item id’s I’ll loop over:

item.binds <- data.frame(

  startdate = sprintf('%04i-%02i-01',fc.startyear,fc.startmth),

  enddate   = sprintf('%04i-%02i-01',fc.endyear,fc.endmth),

  stringsAsFactors = FALSE

)

 

item.qry <- dbSendQuery(

  ora.con,

  "select item

    from sales

    where mth >= to_date(:1,'YYYY-MM-DD')

      and mth <= to_date(:2,'YYYY-MM-DD')

    group by item

   having count(*) = 36

    order by item",

  item.binds

)

 

item.items <- dbFetch(item.qry)

 

Inspecting item.items I have this result:

      ITEM

1 Snowchain

2  Sunshade

 

Naturally I could have queried an item table, but in this case I wish to be certain I only query items that actually have a full 36 month history to get the best forecast. If that was not an issue, I could have chosen other ways.

Then R allows me to loop over the elements in item.items:

for (item.item in item.items$ITEM) {

  print(item.item)

 

  fc.binds <- data.frame(

    item      = item.item,

    startdate = sprintf('%04i-%02i-01',fc.startyear,fc.startmth),

    enddate   = sprintf('%04i-%02i-01',fc.endyear,fc.endmth),

    stringsAsFactors = FALSE

  )

 

  fc.qry <- dbSendQuery(

...

...

  dbWriteTable(ora.con, "R_FORECAST", fc.data, append=TRUE)

}

 

Inside the loop I create my fc.binds using the item.item from the loop.

Then I use the binds to query, fetch, create time series, forecast, create data frame and insert to table – just like I did before for ‘Sunshade’. It just happens for every item in the loop.

 

 

Re-forecasting as time goes by

 

Now I am going to pretend a month has gone by. Sales data has now accumulated for January 2014:

insert into sales values ('Snowchain', date '2014-01-01',   68);

insert into sales values ('Sunshade' , date '2014-01-01',    7);

commit;

 

I could now truncate the r_forecast table and change Forecast2.R script to start forecasting from February 2014:

parm.fcyear <- 2014

parm.fcmth  <- 2

 

But that would loose some forecasting history. Alternative to truncating I could delete everything in the r_forecast table later than January 2014 before running Forecast2.R script for February. But yet another alternative is shown in Forecast3.R script, where I don’t delete, but rather overwrite data in r_forecast.

Script Forecast3.R sets parm.fcmth  <- 2 as shown above so it forecasts from February 2014. It loops over the items just like Forecast2.R, but instead of inserting into r_forecast, it inserts to a temporary table:

  dbWriteTable(ora.con, "R_FORECAST_TMP", fc.data, append=TRUE)

The temporary table is temporary, but otherwise an exact copy of r_forecast:

create global temporary table r_forecast_tmp

(

   item varchar2(10)

 , mth  date

 , mean number

 , lo80 number

 , hi80 number

 , lo95 number

 , hi95 number

)

on commit preserve rows;

 

I need to preserve rows, becase dbWriteTable implicitly commits.

Having inserted to temporary table, I execute a merge and clean up after:

  dbGetQuery(

    ora.con,

    "merge into r_forecast fc

     using r_forecast_tmp fct

     on (

       fc.item = fct.item and fc.mth = fct.mth

     )

     when matched then update

       set fc.mean  = fct.mean,

           fc.lo80  = fct.lo80,

           fc.hi80  = fct.hi80,

           fc.lo95  = fct.lo95,

           fc.hi95  = fct.hi95

     when not matched then insert values (

       fct.item, fct.mth, fct.mean, fct.lo80, fct.hi80, fct.lo95, fct.hi95

     )"

  )

  dbGetQuery(ora.con, "delete from R_FORECAST_TMP")

  dbCommit(ora.con)

 

Script Forecast3.R does this within the loop, so it merges/deletes/commits for every item. It could also be that only the insert into temporary table is within the loop and then a single large merge after the loop. It’ll depend on your use case whether one or the other is preferable.

 

Having executed Forecast3.R I query the r_forecast table for Sunshade again:

ITEM       MTH             MEAN      LO80      HI80      LO95      HI95

---------- ---------- --------- --------- --------- --------- ---------

Sunshade   2014-01-01     2.110     1.443     2.778     1.090     3.131

Sunshade   2014-02-01     8.767     5.475    12.059     3.732    13.802

Sunshade   2014-03-01    28.304    17.348    39.259    11.549    45.059

Sunshade   2014-04-01    39.454    23.738    55.170    15.418    63.490

Sunshade   2014-05-01    37.933    22.405    53.462    14.184    61.683

Sunshade   2014-06-01    56.996    33.049    80.942    20.372    93.619

Sunshade   2014-07-01    70.023    39.863   100.183    23.897   116.149

Sunshade   2014-08-01    23.074    12.896    33.251     7.509    38.639

Sunshade   2014-09-01    14.839     8.143    21.535     4.598    25.079

Sunshade   2014-10-01    12.484     6.726    18.242     3.677    21.290

Sunshade   2014-11-01     4.114     2.176     6.053     1.150     7.079

Sunshade   2014-12-01     4.776     2.480     7.072     1.264     8.287

Sunshade   2015-01-01     5.305     2.704     7.906     1.327     9.283

 

The values for 2014-01-01 are unchanged, the rest has been updated with the fresh new forecast that is a bit more optimistic, since January sold more than expected. So rather than just a yearly forecast, by doing this monthly I keep improving the forecast as new sales data is accumulated.

 

 

Conclusion

 

Using forecast package in R can easily apply many models to the data, compare the accuracies, and pick the most accurate model. If the forecast package demonstrated here does not have the models you want, there are other R packages to be found out there. Same techniques can be used with those packages as well.

If you invest licence in Oracle R Enterprise, you get R capabilities within the database. That will be more efficient of course, but if your budget is limited, the free ROracle driver makes things like this possible. And if it is a matter of just letting an R script run over night once a month – well, it might be sufficient for your use case.

 

Or you can use the R scripts to research your data and find out which forecasting model suit your data best, and then you find the formulas for just that one model and code it with analytic functions, similar to how I did in last issue of OTechMag (#6 – Winter 2014.)

 

Good luck with combining R and Oracle.

 

Kim Berg Hansen

Blog

Quizzes

 

OTECH MAGAZINE #7  spring 2015        copyright otech magazine 2015

www.otechmag.com