
GAMsetup {mgcv}

R Documentation

Set up GAM using penalized regression splines

Description

Sets up the design matrix X, penalty matrices S_i and linear equality constraint matrix C for a GAM defined in terms of penalized regression splines. Various other information characterising the bases used is also returned. The output is such that the model can be fitted and smoothing parameters estimated by the method of Wood (2000), as implemented in routine mgcv(). This function is usually called by gam() rather than directly.
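As a rough illustration of how a design matrix and penalty matrices are used together, the sketch below minimises the penalized least squares objective ||y - Xp||^2 + sum_i sp[i] * p'S_i p for fixed smoothing parameters. It does not call GAMsetup or mgcv. The polynomial design matrix, the second-difference penalty `S1`, the helper `pen_fit` and the value of `sp` are all assumptions made for illustration, and the constraint Cp=0 is omitted.

```r
set.seed(1)
n <- 50
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, 0, 0.2)

# Toy design matrix (assumption): polynomial basis with columns x^0, ..., x^5
X <- outer(x, 0:5, "^")

# Toy penalty (assumption): squared second differences of the coefficients
D <- diff(diag(ncol(X)), differences = 2)
S1 <- crossprod(D)

# Hypothetical helper: penalized least squares for fixed smoothing parameters
pen_fit <- function(X, y, S_list, sp) {
  # total penalty matrix: sum_i sp[i] * S_i
  S_total <- Reduce(`+`, Map(function(S, l) l * S, S_list, sp))
  # solve the penalized normal equations (X'X + S_total) p = X'y
  solve(crossprod(X) + S_total, crossprod(X, y))
}

p0 <- pen_fit(X, y, list(S1), sp = 0)    # unpenalized fit
p1 <- pen_fit(X, y, list(S1), sp = 10)   # penalized fit

pen0 <- sum((D %*% p0)^2); pen1 <- sum((D %*% p1)^2)   # penalty values
rss0 <- sum((y - X %*% p0)^2); rss1 <- sum((y - X %*% p1)^2)
```

Increasing `sp` trades goodness of fit for a smaller penalty: `pen1` is no larger than `pen0`, while `rss1` is no smaller than `rss0`.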

Usage

GAMsetup(G)

Arguments

G

is the single argument to this function: it is a list containing at least the elements listed below:

m

the number of smooth terms in the model

df

an array of G$m integers specifying the maximum d.f. for each spline term.

n

the number of data to be modelled

nsdf

the number of user supplied columns of the design matrix for any parametric model parts

dim

An array of dimensions for the smooths. dim[i] is the number of covariates that smooth i is a function of.

sp

array of supplied smoothing parameters. If fit.method is "magic" then this may be a mixture of positive numbers, which are used as the smoothing parameters, and negative numbers, which indicate that the corresponding parameters are to be estimated. With "mgcv" this is unused.

fix

An array of logicals indicating whether each smooth term has fixed degrees of freedom or not.

s.type

An array giving the type of basis used for each term: 0 for a cubic regression spline, 1 for a thin plate regression spline (t.p.r.s.).

p.order

An array giving the order of the penalty for each term. 0 for auto selection.

x

an array of G$n element arrays of data and (optionally) design matrix columns. The first G$nsdf elements of G$x should contain the elements of the columns of the design matrix corresponding to the parametric part of the model. The remaining G$m elements of G$x are the values of the covariates that are arguments of the spline terms. Note that the smooths will be centred and no intercept term will be added unless an array of 1's is supplied as part of G$x.

vnames

Array of variable names, including the constant, if present.

w

prior weights on response data.

by

a 2-d array of by variables (i.e. covariates that multiply a smooth term). by[i,j] is the jth value for the ith by variable. There are only as many rows of this array as there are by variables in the model (often 0). The rownames of by give the by variable names.

by.exists

an array of logicals: by.exists[i] is TRUE if the ith smooth has a by variable associated with it, FALSE otherwise.

knots

a compact array of user supplied knot locations for each smooth, in the order corresponding to the row order in G$x. There are G$dim[i] arrays of length G$n.knots[i] for the ith smooth - all these arrays are packed end to end in 1-d array G$knots - zero length 1 for no knots.

n.knots

array giving the number of user supplied knots for the basis of each smooth term, with 0 where none are supplied.

fit.method

one of "mgcv" for the Wood (2000) method or "magic" for a more recent and in principle more stable alternative.

min.sp

lower bounds on the smoothing parameters: only possible if fit.method is "magic".

H

the offset penalty matrix, NULL for none. This is the coefficient matrix of any user supplied fixed penalty.
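The packing of user supplied knots into the 1-d array G$knots can be sketched as follows. `pack_knots` is a hypothetical helper, not part of mgcv, and the input layout (one matrix per smooth, one column per covariate) is an assumption chosen to match the description of knots and n.knots above.

```r
# Hypothetical helper (not part of mgcv): pack per-smooth knots end to end.
# knot_list[[i]] is NULL (no knots supplied) or an n.knots[i] by dim[i]
# matrix with one column per covariate of smooth i.
pack_knots <- function(knot_list) {
  n.knots <- vapply(knot_list,
                    function(k) if (is.null(k)) 0L else nrow(k),
                    integer(1))
  # as.vector() on a matrix concatenates its columns, giving dim[i] arrays
  # of length n.knots[i]; unlist() then joins the smooths end to end
  knots <- as.numeric(unlist(lapply(knot_list, function(k) {
    if (is.null(k)) numeric(0) else as.vector(k)
  })))
  list(knots = knots, n.knots = n.knots)
}

k1 <- matrix(seq(0, 1, length.out = 4), ncol = 1)   # 1-d smooth with 4 knots
k2 <- cbind(c(0.1, 0.5, 0.9), c(0.2, 0.4, 0.6))     # 2-d smooth with 3 knots
packed <- pack_knots(list(k1, NULL, k2))
packed$n.knots         # 4 0 3
length(packed$knots)   # 4*1 + 0 + 3*2 = 10
```

The two components of `packed` could then be stored as the knots and n.knots elements of G, with the smooths listed in the same order as their rows in G$x.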

Value

A list H, containing the elements of G (the input list) plus the following:

`X`	the full design matrix.
`S`	If `fit.method` is `"magic"` then this is a one dimensional array containing the non-zero elements of the penalty matrices. Let `start[k+1]<-start[k]+H$df[1:(k-1)]^2` and `start[1]<-0`. Then penalty matrix `k` has `H$S[start[k]+i+H$df[i]*(j-1)]` on its ith row and jth column. To get the kth full penalty matrix, the matrix so obtained would be inserted into a full matrix of zeroes with its 1,1 element at `H$off[k],H$off[k]`. If `fit.method` is `"mgcv"` then this is a list of penalty matrices, again stored as the smallest matrices that include all the non-zero elements of the penalty matrix concerned.
`off`	is an array of offsets, used to facilitate efficient storage of the penalty matrices and to indicate where in the overall parameter vector the parameters of the ith spline reside (e.g. first parameter of ith spline is at `p[off[i]+1]`).
`C`	a matrix defining the linear equality constraints on the parameters used to define the model (i.e. C in Cp=0).
`UZ`	Array containing matrices which transform from a t.p.r.s. basis to the equivalent t.p.s. basis (for t.p.r.s. terms only). The packing method is as follows: set `start[1]<-0` and `start[k+1]<-start[k]+(M[k]+n)*tp.bs[k]` where `n` is the number of data, `M[k]` is the penalty null space dimension and `tp.bs[k]` is zero for a cubic regression spline and the basis dimension for a t.p.r.s. Then element `i,j` of the UZ matrix for model term `k` is: `UZ[start[k]+i+(j-1)*(M[k]+n)]`.
`Xu`	Set of unique covariate combinations for each term. The packing method is as follows: set `start[1]<-0` and `start[k+1]<-start[k]+(xu.length[k])*tp.dim[k]` where `xu.length[k]` is the number of unique covariate combinations and `tp.dim[k]` is zero for a cubic regression spline and the dimension of the smooth (i.e. the number of covariates it is a function of) for a t.p.r.s. Then element `i,j` of the Xu matrix for model term `k` is: `Xu[start[k]+i+(j-1)*(xu.length[k])]`.
`xu.length`	Number of unique covariate combinations for each t.p.r.s. term.
`covariate.shift`	All covariates are centred around zero before bases are constructed - this is an array of the applied shifts.
`xp`	matrix whose rows contain the covariate values corresponding to the parameters of each cubic regression spline. The cubic regression splines are parameterized using their y values at a series of x values, and these rows contain those x values. Note that these will be covariate shifted.
`rank`	an array giving the ranks of the penalty matrices.
`m.free`	this is only for use with `"magic"` and is the number of smoothing parameters that must be estimated.
`m.off`	again only for `"magic"`: the offsets for the penalty matrices for the penalties with smoothing parameters that must be estimated.
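The packed storage described for `S` (with `"magic"`), `UZ` and `Xu` stores each matrix column by column, one matrix after another. A minimal sketch of recovering block `k` from such an array is given below. `unpack_block` and its arguments `nr` and `nc` (the row and column counts of each block) are hypothetical names, not part of mgcv. For `UZ`, for example, the descriptions above would imply `nr[k] = M[k]+n` and `nc[k] = tp.bs[k]`; placing a penalty block inside the full penalty matrix using `off` is not covered here.

```r
# Hypothetical helper (not part of mgcv): extract block k from a 1-d array
# in which matrices are stored column by column, one after another.
unpack_block <- function(packed, nr, nc, k) {
  # start[1] = 0 and start[k+1] = start[k] + nr[k] * nc[k]
  start <- c(0, cumsum(nr * nc))
  idx <- start[k] + seq_len(nr[k] * nc[k])
  # matrix() fills by column, so element i,j of the result is
  # packed[start[k] + i + (j - 1) * nr[k]]
  matrix(packed[idx], nr[k], nc[k])
}

# Round trip on toy data: pack two matrices, then recover the second
A <- matrix(1:6, 3, 2)
B <- matrix(7:10, 2, 2)
packed <- c(as.vector(A), as.vector(B))
B2 <- unpack_block(packed, nr = c(3, 2), nc = c(2, 2), k = 2)
identical(B2, B)
```

A block with zero rows or columns (such as a cubic regression spline term in `UZ`, where `tp.bs[k]` is zero) occupies no space in the packed array and is returned as an empty matrix.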

Author(s)

Simon N. Wood simon@stats.gla.ac.uk

References

Wood, S.N. (2000) Modelling and Smoothing Parameter Estimation with Multiple Quadratic Penalties. J.R.Statist.Soc.B 62(2):413-428

Wood, S.N. (2003) Thin plate regression splines. J.R.Statist.Soc.B 65(1):95-114

http://www.stats.gla.ac.uk/~simon/

Examples

set.seed(0)
n <- 100                        # number of observations to simulate
x <- runif(5 * n, 0, 1)         # simulate covariates
x <- array(x, dim = c(5, n))    # put into array for passing to GAMsetup
# begin simulating some data (pi is a built-in constant in R)
y <- 2 * sin(pi * x[2, ])
y <- y + exp(2 * x[3, ]) - 3.75887
y <- y + 0.2 * x[4, ]^11 * (10 * (1 - x[4, ]))^6 + 10 * (10 *
     x[4, ])^3 * (1 - x[4, ])^10 - 1.396
sig2 <- -1                      # noise variance is abs(sig2); negative sign signals GCV use
e <- rnorm(n, 0, sqrt(abs(sig2)))
y <- y + e                      # simulated data
w <- matrix(1, n, 1)            # weight matrix
par(mfrow = c(2, 2))            # scatter plots of simulated data
plot(x[2, ], y); plot(x[3, ], y); plot(x[4, ], y); plot(x[5, ], y)
x[1, ] <- 1                     # first row of x is an array of 1's (the intercept)
# create list for passing to GAMsetup
# nsdf = 1: one parametric row of x (the intercept), followed by m = 4 covariate rows
G <- list(m = 4, n = n, nsdf = 1, df = c(15, 15, 15, 15), dim = c(1, 1, 1, 1),
     s.type = c(0, 0, 0, 0), by = 0, by.exists = c(FALSE, FALSE, FALSE, FALSE),
     p.order = c(0, 0, 0, 0), x = x, n.knots = rep(0, 4), fit.method = "mgcv")
H <- GAMsetup(G)
H$y <- y                        # add data to H
H$sig2 <- sig2                  # add variance (signalling GCV use in this case) to H
H$w <- w                        # add weights to H
H$sp <- array(-1, H$m)
H$fix <- array(FALSE, H$m)
H$conv.tol <- 1e-6; H$max.half <- 15
H$min.edf <- 5; H$fixed.sp <- 0
H <- mgcv(H)                    # select smoothing parameters and fit model
