Batchwise backfitting estimation engine for GAMLSS using very large data sets.

## Batchwise backfitting engine.
opt_bbfit(x, y, family, shuffle = TRUE, start = NULL, offset = NULL,
  epochs = 1, nbatch = 10, verbose = TRUE, ...)

bbfit(x, y, family, shuffle = TRUE, start = NULL, offset = NULL,
  epochs = 1, nbatch = 10, verbose = TRUE, ...)

## Parallel version.
opt_bbfitp(x, y, family, mc.cores = 1, ...)

## Loglik contribution plot.
contribplot(x, ...)

Arguments

x

For function bbfit(), the x list, as returned from function bamlss.frame, holding all model matrices and other information that is used for fitting the model. For the updating functions, an object as returned from function smooth.construct or smoothCon. For function contribplot(), a "bamlss" object estimated with bbfit() using argument select = TRUE.

y

The model response, as returned from function bamlss.frame.

family

A bamlss family object, see family.bamlss.

shuffle

Should observations be shuffled?

start

A named numeric vector containing possible starting values; the names are based on the parameters of the model.

offset

Can be used to supply model offsets for use in fitting, returned from function bamlss.frame.

epochs

For how many epochs should the algorithm run?

nbatch

Number of batches. Can also be a number between 0 and 1, specifying the fraction of observations that should be used for fitting.

verbose

Print information during runtime of the algorithm.

mc.cores

On how many cores should estimation be started?

...

For opt_bbfitp(), all arguments to be passed to bbfit().

Details

The algorithm uses batch-wise estimation of regression coefficients and smoothing variances. The smoothing variances are estimated on a hold-out batch. This way, models for very large data sets can be estimated. Note that the algorithm can work in combination with the ff and ffbase packages, i.e., the entire data set is never held in the computer's RAM. Therefore, the data can either be stored as a comma-separated file on disc or provided as an "ffdf" data frame, see also the examples.

The optimizer functions use additional arguments:

  • batch_ids. This argument can either be a list of indices specifying the batches that should be used for estimation, or a vector of length 2, where the first element specifies the number of observations that should be sampled for each batch and the second element specifies the number of batches, see the example.

  • nu, the step length control parameter. Defaults to nu = 0.05. If argument slice = TRUE then nu = 1.

  • loglik, defaults to loglik = FALSE. If set to loglik = TRUE the "out-of-sample" log-likelihood is used for smoothing variance estimation.

  • aic, defaults to aic = FALSE. If set to aic = TRUE the "out-of-sample" AIC is used for smoothing variance estimation.

  • eps_loglik, defaults to eps_loglik = 0.01. This argument specifies the relative change in the "out-of-sample" log-likelihood that is needed such that a model term gets updated.

  • select, defaults to select = FALSE. If set to select = TRUE, the algorithm only selects the model term with the largest contribution in the "out-of-sample" log-likelihood for updating in each iteration/batch.

  • always, defaults to always = FALSE. If set to always = TRUE no log-likelihood contribution checks will be used and model terms are always updated.

  • K, defaults to K = 2. This argument controls the penalty on the degrees of freedom in the computation of the AIC.

  • slice, defaults to slice = FALSE. If set to slice = TRUE, slice sampling using the "out-of-sample" log-likelihood or AIC is used for smoothing variance estimation. Moreover, always = TRUE, eps_loglik = -Inf and nu = 1 are set. If slice is an integer n, slice sampling is started after n iterations; before that, smoothing variances are optimized.
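As a minimal illustration of the list form of batch_ids, the following base-R sketch draws 50 batches of 1000 row indices each from a hypothetical data set with 27000 rows (the sizes match the example below; this is plain R, not bamlss code):

```r
## Sketch: batch_ids supplied as a list of row indices, one element per
## batch. Each element indexes the rows used for one batch-wise update.
set.seed(42)
n <- 27000
batch_ids <- lapply(1:50, function(i) sample.int(n, size = 1000))

length(batch_ids)       ## number of batches
length(batch_ids[[1]])  ## observations per batch
```

The equivalent shortcut in the optimizer call is batch_ids = c("nobs" = 1000, "nbatch" = 50).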

When using function opt_bbfitp, the parameter updates are stored as "mcmc" objects. In this case the traceplots can be visualized using plot.bamlss.
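The burn-in averaging applied to these stored iterates, e.g. by coef(b, FUN = mean, burnin = 10) in the example below, can be sketched in base R (simulated numbers and hypothetical coefficient names, not bamlss output):

```r
## Sketch: opt_bbfitp() stores one parameter vector per iteration;
## computing starting values amounts to discarding a burn-in phase and
## averaging the remaining iterates column-wise.
set.seed(1)
iterates <- matrix(rnorm(50 * 3), nrow = 50,
  dimnames = list(NULL, c("(Intercept)", "b1", "b2")))  # 50 iterations
burnin <- 10
start <- colMeans(iterates[(burnin + 1):nrow(iterates), , drop = FALSE])
```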

Value

For function opt_bbfit() a list containing the following objects:

fitted.values

A named list of the fitted values of the modeled parameters of the selected distribution.

parameters

The estimated set of regression coefficients and smoothing variances.

shuffle

Logical, whether the observations were shuffled.

runtime

The runtime of the algorithm.

See also

bamlss, bamlss.frame

Examples

## Simulate data.
set.seed(123)
d <- GAMart(n = 27000, sd = -1)

## Write data to disc.
tf <- tempdir()
write.table(d, file.path(tf, "d.raw"), quote = FALSE, row.names = FALSE, sep = ",")

## Model formula.
f <- list(
  y ~ s(x1,k=40) + s(x2,k=40) + s(x3,k=40) + te(lon,lat,k=10),
  sigma ~ s(x1,k=40) + s(x2,k=40) + s(x3,k=40) + te(lon,lat,k=10)
)

## Specify 50 batches with 1000 observations.
batch_ids <- c("nobs" = 1000, "nbatch" = 50)

## Note, can also be a list of indices, e.g.
## batch_ids <- lapply(1:50, function(i) { sample(1:nrow(d), size = 1000) })

## Different flavors:
## (1) Using "out-of-sample" aic for smoothing
##     variance estimation. Update is only accepted
##     if the "out-of-sample" log-likelihood is
##     increased. If data is a filepath, the data set is
##     read into R using package ff and model and
##     design matrices are processed with ff. This may
##     take some time depending on the size of the data.
set.seed(1)
b1 <- bamlss(f, data = file.path(tf, "d.raw"),
  sampler = FALSE, optimizer = opt_bbfit,
  batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = -Inf,
  always = FALSE)
#>   .. creating directory 'ff_data_bamlss' for storing matrices. Note, the directory is may not deleted and matrices can be used for another model. Use delete = TRUE in the bamlss call. Before starting a new model you can set overwrite = TRUE to overwrite existing data.
#>   .. ..     100%  .. ff processing term s(x1) 
#>   .. ..     100%
#>   .. ff processing term s(x2) 
#>   .. ..     100%
#>   .. ff processing term s(x3) 
#>   .. ..     100%
#>   .. ff processing term te(lon,lat) 
#>   .. ..     100%
#>   .. ..     100%  .. ff processing term s(x1) 
#>   .. ff processing term s(x2) 
#>   .. ff processing term s(x3) 
#>   .. ff processing term te(lon,lat) 
#>   .. df mu term s(x1) 4.000005 
#>   .. df mu term s(x2) 4.000005 
#>   .. df mu term s(x3) 4.000005 
#>   .. df mu term te(lon,lat) 10.0012 
#> opening ff ff_data_bamlss/sx1.ff
#>   .. df sigma term s(x1) 4.000004 
#> opening ff ff_data_bamlss/sx2.ff
#>   .. df sigma term s(x2) 4.000005 
#> opening ff ff_data_bamlss/sx3.ff
#>   .. df sigma term s(x3) 4.000005 
#> opening ff ff_data_bamlss/telonlat.ff
#>   .. df sigma term te(lon,lat) 10.00654 
#>    * iter 1, nobs 1000, edf 27.080100
#>    * iter 2, nobs 2000, eps 1.952600, edf 27.560000
#>    * iter 3, nobs 3000, eps 1.510800, edf 32.280000
#>    * iter 4, nobs 4000, eps 0.496400, edf 31.620000
#>    * iter 5, nobs 5000, eps 0.242300, edf 36.880000
#>    * iter 6, nobs 6000, eps 0.218300, edf 26.780000
#>    * iter 7, nobs 7000, eps 0.122300, edf 34.900000
#>    * iter 8, nobs 8000, eps 0.079800, edf 27.800000
#>    * iter 9, nobs 9000, eps 0.109500, edf 29.700000
#>    * iter 10, nobs 10000, eps 0.100300, edf 28.980000
#>    * iter 11, nobs 11000, eps 0.103600, edf 29.210000
#>    * iter 12, nobs 12000, eps 0.082500, edf 30.680000
#>    * iter 13, nobs 13000, eps 0.080700, edf 29.100000
#>    * iter 14, nobs 14000, eps 0.072200, edf 28.380000
#>    * iter 15, nobs 15000, eps 0.046900, edf 30.420000
#>    * iter 16, nobs 16000, eps 0.085100, edf 28.350000
#>    * iter 17, nobs 17000, eps 0.043600, edf 31.340000
#>    * iter 18, nobs 18000, eps 0.055500, edf 32.500000
#>    * iter 19, nobs 19000, eps 0.049400, edf 30.870000
#>    * iter 20, nobs 20000, eps 0.054800, edf 26.390000
#>    * iter 21, nobs 21000, eps 0.044100, edf 31.980000
#>    * iter 22, nobs 22000, eps 0.103400, edf 33.770000
#>    * iter 23, nobs 23000, eps 0.056100, edf 26.400000
#>    * iter 24, nobs 24000, eps 0.055100, edf 23.320000
#>    * iter 25, nobs 25000, eps 0.088600, edf 31.510000
#>    * iter 26, nobs 26000, eps 0.055500, edf 28.650000
#>    * iter 27, nobs 27000, eps 0.036900, edf 29.890000
#>    * iter 28, nobs 28000, eps 0.054900, edf 25.590000
#>    * iter 29, nobs 29000, eps 0.027500, edf 24.870000
#>    * iter 30, nobs 30000, eps 0.044300, edf 31.670000
#>    * iter 31, nobs 31000, eps 0.032600, edf 18.770000
#>    * iter 32, nobs 32000, eps 0.098100, edf 22.390000
#>    * iter 33, nobs 33000, eps 0.035400, edf 21.710000
#>    * iter 34, nobs 34000, eps 0.033600, edf 21.840000
#>    * iter 35, nobs 35000, eps 0.040700, edf 21.380000
#>    * iter 36, nobs 36000, eps 0.027200, edf 19.060000
#>    * iter 37, nobs 37000, eps 0.027500, edf 15.450000
#>    * iter 38, nobs 38000, eps 0.063000, edf 10.860000
#>    * iter 39, nobs 39000, eps 0.034600, edf 33.240000
#>    * iter 40, nobs 40000, eps 0.437100, edf 22.360000
#>    * iter 41, nobs 41000, eps 0.051400, edf 21.000000
#>    * iter 42, nobs 42000, eps 0.216400, edf 12.680000
#>    * iter 43, nobs 43000, eps 0.007600, edf 13.890000
#>    * iter 44, nobs 44000, eps 0.047900, edf 34.380000
#>    * iter 45, nobs 45000, eps 0.051300, edf 28.020000
#>    * iter 46, nobs 46000, eps 0.083900, edf 22.960000
#>    * iter 47, nobs 47000, eps 0.015900, edf 7.900000
#>    * iter 48, nobs 48000, eps 0.043300, edf 27.470000
#>    * iter 49, nobs 49000, eps 0.018300, edf 17.370000
#>    * iter 50, nobs 50000, eps 0.051500, edf 39.100000

#> 
#> elapsed time: 23.86sec

## Plot estimated effects.
## plot(b1)

## Plot coefficient paths for x3 in mu.
## pathplot(b1, name = "mu.s.s(x3).b")

## (2) Same, but always update; this mimics classic SGD.
##     Note, for prediction only the last iteration is
##     used in this case. To use more iterations use opt_bbfitp(),
##     then iterations are stored as "mcmc" objects and we can
##     predict using the burnin argument, e.g.,
##     p <- predict(b2, model = "mu", burnin = 35)
set.seed(2)
b2 <- bamlss(f, data = file.path(tf, "d.raw"),
  sampler = FALSE, optimizer = opt_bbfit,
  batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = -Inf,
  always = TRUE)
#>   .. creating directory 'ff_data_bamlss' for storing matrices. Note, the directory is may not deleted and matrices can be used for another model. Use delete = TRUE in the bamlss call. Before starting a new model you can set overwrite = TRUE to overwrite existing data.
#>   .. ..     100%  .. ff processing term s(x1) 
#>   .. ..     100%
#>   .. ff processing term s(x2) 
#>   .. ..     100%
#>   .. ff processing term s(x3) 
#>   .. ..     100%
#>   .. ff processing term te(lon,lat) 
#>   .. ..     100%
#>   .. ..     100%  .. ff processing term s(x1) 
#>   .. ff processing term s(x2) 
#>   .. ff processing term s(x3) 
#>   .. ff processing term te(lon,lat) 
#>   .. df mu term s(x1) 4.000004 
#>   .. df mu term s(x2) 4.000004 
#>   .. df mu term s(x3) 4.000005 
#>   .. df mu term te(lon,lat) 10.01284 
#> opening ff ff_data_bamlss/sx1.ff
#>   .. df sigma term s(x1) 4.000004 
#> opening ff ff_data_bamlss/sx2.ff
#>   .. df sigma term s(x2) 4.000004 
#> opening ff ff_data_bamlss/sx3.ff
#>   .. df sigma term s(x3) 4.000005 
#> opening ff ff_data_bamlss/telonlat.ff
#>   .. df sigma term te(lon,lat) 10.00109 
#>    * iter 1, nobs 1000, edf 33.481200
#>    * iter 2, nobs 2000, eps 1.465300, edf 34.570000
#>    * iter 3, nobs 3000, eps 0.632200, edf 34.570000
#>    * iter 4, nobs 4000, eps 0.240500, edf 27.890000
#>    * iter 5, nobs 5000, eps 0.239400, edf 34.600000
#>    * iter 6, nobs 6000, eps 0.191600, edf 32.290000
#>    * iter 7, nobs 7000, eps 0.281200, edf 35.120000
#>    * iter 8, nobs 8000, eps 0.087700, edf 28.560000
#>    * iter 9, nobs 9000, eps 0.143000, edf 27.790000
#>    * iter 10, nobs 10000, eps 0.098200, edf 31.850000
#>    * iter 11, nobs 11000, eps 0.081400, edf 31.300000
#>    * iter 12, nobs 12000, eps 0.075100, edf 32.160000
#>    * iter 13, nobs 13000, eps 0.057200, edf 30.660000
#>    * iter 14, nobs 14000, eps 0.143900, edf 32.030000
#>    * iter 15, nobs 15000, eps 0.100600, edf 29.400000
#>    * iter 16, nobs 16000, eps 0.084300, edf 30.340000
#>    * iter 17, nobs 17000, eps 0.099700, edf 30.320000
#>    * iter 18, nobs 18000, eps 0.044500, edf 29.600000
#>    * iter 19, nobs 19000, eps 0.050300, edf 29.230000
#>    * iter 20, nobs 20000, eps 0.444900, edf 30.980000
#>    * iter 21, nobs 21000, eps 0.054200, edf 30.980000
#>    * iter 22, nobs 22000, eps 0.060800, edf 31.070000
#>    * iter 23, nobs 23000, eps 0.051500, edf 32.140000
#>    * iter 24, nobs 24000, eps 0.048000, edf 29.910000
#>    * iter 25, nobs 25000, eps 0.075300, edf 29.830000
#>    * iter 26, nobs 26000, eps 0.046900, edf 31.940000
#>    * iter 27, nobs 27000, eps 0.026700, edf 32.440000
#>    * iter 28, nobs 28000, eps 0.025800, edf 27.150000
#>    * iter 29, nobs 29000, eps 0.046800, edf 29.330000
#>    * iter 30, nobs 30000, eps 0.045000, edf 32.780000
#>    * iter 31, nobs 31000, eps 0.121300, edf 29.490000
#>    * iter 32, nobs 32000, eps 0.043400, edf 27.670000
#>    * iter 33, nobs 33000, eps 0.032400, edf 33.820000
#>    * iter 34, nobs 34000, eps 0.031600, edf 27.860000
#>    * iter 35, nobs 35000, eps 0.037800, edf 31.720000
#>    * iter 36, nobs 36000, eps 0.091400, edf 31.170000
#>    * iter 37, nobs 37000, eps 0.046900, edf 33.840000
#>    * iter 38, nobs 38000, eps 0.034600, edf 34.190000
#>    * iter 39, nobs 39000, eps 0.036400, edf 33.750000
#>    * iter 40, nobs 40000, eps 0.047900, edf 31.440000
#>    * iter 41, nobs 41000, eps 0.024200, edf 30.870000
#>    * iter 42, nobs 42000, eps 0.079900, edf 34.080000
#>    * iter 43, nobs 43000, eps 0.057700, edf 29.520000
#>    * iter 44, nobs 44000, eps 0.132700, edf 35.900000
#>    * iter 45, nobs 45000, eps 0.048200, edf 36.380000
#>    * iter 46, nobs 46000, eps 0.065000, edf 39.180000
#>    * iter 47, nobs 47000, eps 0.038100, edf 33.960000
#>    * iter 48, nobs 48000, eps 0.036300, edf 36.370000
#>    * iter 49, nobs 49000, eps 0.058100, edf 33.570000
#>    * iter 50, nobs 50000, eps 0.127600, edf 32.940000

#> 
#> elapsed time: 23.06sec

## Plot coefficient paths for x3 in mu.
## pathplot(b2, name = "mu.s.s(x3).b")

## (3) Boosting-type flavor, only update the model term with
##     the largest contribution to the "out-of-sample"
##     log-likelihood. In this case, if edf = 0 during
##     runtime of the algorithm, no model term has an additional
##     contribution and the algorithm converges. This
##     behavior is controlled by argument eps_loglik: the
##     higher eps_loglik, the more restrictive the
##     updating step.

## Initialize intercepts.
set.seed(0)
b0 <- bamlss(y ~ 1, data = d, sampler = FALSE, optimizer = opt_bbfitp,
  batch_ids = batch_ids, slice = TRUE)

## Compute starting values, remove the first
## 10 iterates and compute the mean of the
## remaining iterates.
start <- coef(b0, FUN = mean, burnin = 10)

## Start boosting, only update a term if the relative change in
## the "out-of-sample" log-likelihood is at least
## eps_loglik = 0.001.
set.seed(3)
b3 <- bamlss(f, data = d, sampler = FALSE, optimizer = opt_bbfit,
    batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = 0.001,
    select = TRUE, always = FALSE, start = start)

## Plot log-likelihood contributions.
## contribplot(b3)
## In this case, the algorithm did not converge;
## we need more iterations/batches.

## Note, prediction uses last iterate.
p3 <- predict(b3, model = "mu")

## (4) Use slice sampling under the "out-of-sample"
##     log-likelihood for estimation of smoothing
##     variances. In this case model terms are always
##     updated and parameter paths behave like an MCMC
##     chain. Therefore, use opt_bbfitp(), which stores
##     parameter paths as "mcmc" objects that we can
##     inspect using traceplots. Note, nu = 1 if
##     slice = TRUE.
set.seed(4)
b4 <- bamlss(f, data = d, sampler = FALSE, optimizer = opt_bbfitp,
  batch_ids = batch_ids, aic = TRUE, slice = TRUE)
#>   .. df mu term s(x1) 4.000001 
#>   .. df mu term s(x2) 4.000001 
#>   .. df mu term s(x3) 4.000001 
#>   .. df mu term te(lon,lat) 10.00511 
#>   .. df sigma term s(x1) 4.000001 
#>   .. df sigma term s(x2) 4.000001 
#>   .. df sigma term s(x3) 4.000001 
#>   .. df sigma term te(lon,lat) 9.999999 
#>    * iter 1, nobs 1000, edf 42.375800
#>    * iter 2, nobs 2000, eps 0.668700, edf 39.730000
#>    * iter 3, nobs 3000, eps 0.947800, edf 44.540000
#>    * iter 4, nobs 4000, eps 0.714500, edf 47.630000
#>    * iter 5, nobs 5000, eps 1.567000, edf 47.020000
#>    * iter 6, nobs 6000, eps 0.779100, edf 39.980000
#>    * iter 7, nobs 7000, eps 1.540500, edf 41.130000
#>    * iter 8, nobs 8000, eps 0.690600, edf 39.430000
#>    * iter 9, nobs 9000, eps 1.081500, edf 40.790000
#>    * iter 10, nobs 10000, eps 0.713700, edf 35.940000
#>    * iter 11, nobs 11000, eps 0.614100, edf 35.050000
#>    * iter 12, nobs 12000, eps 0.725900, edf 35.230000
#>    * iter 13, nobs 13000, eps 0.647100, edf 42.490000
#>    * iter 14, nobs 14000, eps 1.400600, edf 48.700000
#>    * iter 15, nobs 15000, eps 0.598500, edf 51.240000
#>    * iter 16, nobs 16000, eps 0.771900, edf 45.120000
#>    * iter 17, nobs 17000, eps 0.951200, edf 42.990000
#>    * iter 18, nobs 18000, eps 0.728000, edf 45.860000
#>    * iter 19, nobs 19000, eps 6.769500, edf 41.650000
#>    * iter 20, nobs 20000, eps 0.863600, edf 42.020000
#>    * iter 21, nobs 21000, eps 1.067900, edf 40.000000
#>    * iter 22, nobs 22000, eps 1.304300, edf 42.820000
#>    * iter 23, nobs 23000, eps 0.905400, edf 42.450000
#>    * iter 24, nobs 24000, eps 0.814300, edf 44.000000
#>    * iter 25, nobs 25000, eps 0.604800, edf 36.220000
#>    * iter 26, nobs 26000, eps 0.653800, edf 43.940000
#>    * iter 27, nobs 27000, eps 0.827700, edf 47.020000
#>    * iter 28, nobs 28000, eps 0.870500, edf 49.360000
#>    * iter 29, nobs 29000, eps 0.973400, edf 42.350000
#>    * iter 30, nobs 30000, eps 0.651400, edf 49.110000
#>    * iter 31, nobs 31000, eps 0.720200, edf 46.580000
#>    * iter 32, nobs 32000, eps 0.903500, edf 47.630000
#>    * iter 33, nobs 33000, eps 3.202200, edf 42.550000
#>    * iter 34, nobs 34000, eps 1.014400, edf 43.860000
#>    * iter 35, nobs 35000, eps 0.765900, edf 46.190000
#>    * iter 36, nobs 36000, eps 0.976500, edf 46.420000
#>    * iter 37, nobs 37000, eps 1.931500, edf 46.580000
#>    * iter 38, nobs 38000, eps 0.537900, edf 39.260000
#>    * iter 39, nobs 39000, eps 0.910900, edf 41.610000
#>    * iter 40, nobs 40000, eps 4.386900, edf 41.060000
#>    * iter 41, nobs 41000, eps 0.553400, edf 38.960000
#>    * iter 42, nobs 42000, eps 0.846100, edf 41.230000
#>    * iter 43, nobs 43000, eps 1.410300, edf 48.280000
#>    * iter 44, nobs 44000, eps 0.638400, edf 38.390000
#>    * iter 45, nobs 45000, eps 1.625000, edf 40.390000
#>    * iter 46, nobs 46000, eps 0.763700, edf 39.120000
#>    * iter 47, nobs 47000, eps 1.464600, edf 41.890000
#>    * iter 48, nobs 48000, eps 3.535800, edf 35.830000
#>    * iter 49, nobs 49000, eps 0.543300, edf 30.000000
#>    * iter 50, nobs 50000, eps 0.909500, edf 33.310000

#> 
#> elapsed time:  6.34sec

## plot(b4)

## Plot parameter updates/samples.
## plot(b4, which = "samples")

## Predict with burnin and compute mean
## prediction of the last 20 iterates.
p4 <- predict(b4, model = "mu", burnin = 30, FUN = mean)