Multipollutant models



Last Updated: 05-06-2019 14:00


Collecting high-dimensional data has implications for our modelling approaches, because problems with multiple testing and (multi)collinearity arise. This session discusses three modelling strategies that address these issues: dimension reduction, penalized regression (regularization) and Bayesian variable selection. Dimension reduction techniques such as principal component regression, partial least squares (PLS) regression or sparse PLS project a high-dimensional X (and Y) onto a smaller subspace. Penalized regression adds a penalty on the size of the coefficients to the model fit criterion (so-called regularization), accepting some bias in exchange for lower variance. Ridge regression penalizes the squared L2 norm of the coefficients, which shrinks them towards zero and limits their impact on the model without removing any of them. Lasso regression penalizes the L1 norm, which can shrink some coefficients to exactly zero, so the final model may include fewer features. How the penalty is tuned depends on the goal, e.g. variable selection would need a somewhat higher penalty. Elastic net (ENET) tries to combine the best of lasso and ridge by using a weighted sum of the ridge and lasso penalties. All methods discussed are affected by the scale of the variables, so the need for centering and scaling should be considered. In general, model quality is judged on generalizability (reproducibility and transportability, which can be assessed with cross-validation), explanation (sensitivity and specificity) and stability (independence of random features in the data).
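The difference between the ridge, lasso and elastic net penalties can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the session material: the simulated exposures (two strongly correlated ones, x1 and x2, plus two pure-noise ones, x3 and x4), the helper names standardize, soft_threshold and elastic_net, and the objective (1/2n)*||y - Xb||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2), fitted by simple coordinate descent on centered and scaled exposures. Here alpha = 0 gives ridge, alpha = 1 gives lasso, and values in between give elastic net.

```python
import random

def standardize(columns):
    """Center each column to mean 0 and scale it to unit (population) variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        scaled.append([(v - mean) / sd for v in col])
    return scaled

def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(columns, y, lam, alpha, n_sweeps=300):
    """Coordinate descent for the elastic net objective described above.

    Assumes standardized columns and a centered outcome (hypothetical helper,
    not a library function). alpha=0 is ridge, alpha=1 is lasso.
    """
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            xj = columns[j]
            # Covariance of x_j with the residual that excludes feature j.
            rho = sum(a * r for a, r in zip(xj, resid)) / n + beta[j]
            # The L1 part thresholds, the L2 part rescales.
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for a, r in zip(xj, resid)]
                beta[j] = new
    return beta

# Simulated data (assumption): only x1 truly affects the outcome.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.3 * rng.gauss(0, 1) for v in x1]  # strongly correlated with x1
x3 = [rng.gauss(0, 1) for _ in range(n)]      # noise exposure
x4 = [rng.gauss(0, 1) for _ in range(n)]      # noise exposure
y_raw = [2.0 * v + rng.gauss(0, 1) for v in x1]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

X = standardize([x1, x2, x3, x4])
# Smallest penalty at which the lasso sets every coefficient to zero.
lam_max = max(abs(sum(a * b for a, b in zip(col, y)) / n) for col in X)

ridge = elastic_net(X, y, lam=0.5 * lam_max, alpha=0.0)
lasso = elastic_net(X, y, lam=0.5 * lam_max, alpha=1.0)
enet = elastic_net(X, y, lam=0.5 * lam_max, alpha=0.5)
for name, coefs in [("ridge", ridge), ("lasso", lasso), ("enet", enet)]:
    print(name, [round(b, 3) for b in coefs])
```

In this setup the ridge fit should keep all four coefficients nonzero but shrunken, while the lasso fit should set the noise exposures to exactly zero. In practice the penalty would be chosen by cross-validation with an established implementation rather than fixed at half of lam_max as done here.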