## Introduction

`qgcomp`

is a package to implement g-computation for analyzing the effects of exposuremixtures. Quantile g-computation yields estimates of the effect of increasingall exposures by one quantile, simultaneously. This, it estimates a “mixtureeffect” useful in the study of exposure mixtures such as air pollution, diet,and water contamination.

Using terminology from methods developed for causal effect estimation, quantileg-computation estimates the parameters of a marginal structural model thatcharacterizes the change in the expected potential outcome given a joint interventionon all exposures, possibly conditional on confounders. Under the assumptions ofexchangeability, causal consistency, positivity, no interference, and correctmodel specification, this model yields a causal effect for an interventionon the mixture as a whole. While these assumptions may not be met exactly, theyprovide a useful road map for how to interpret the results of a qgcomp fit, andwhere efforts should be spent in terms of ensuring accurate model specificationand selection of exposures that are sufficient to control co-pollutant confounding.

### The model

Say we have an outcome \(Y\), some exposures \(\mathbb{X}\) and possibly some other covariates (e.g. potential confounders) denoted by \(\mathbb{Z}\).

The basic model of quantile g-computation is a joint marginal structural model given by

[\mathbb{E}(Y^{\mathbf{X}_q} | \mathbf{Z,\psi,\eta}) = g(\psi_0 + \psi_1 S_q + \mathbf{\eta Z})]

where \(g(\cdot)\) is a link function in a generalized linear model (e.g. the inverse logit function in the case of a logistic model for the probability that \(Y=1\)), \(\psi_0\) is the model intercept, \(\mathbf{\eta}\) is a set of model coefficients for the covariates and \(S_q\) is an “index” that represents a joint value of exposures. Quantile g-computation (by default) transforms all exposures \(\mathbf{X}\) into \(\mathbf{X}_q\), which are “scores” taking on discrete values 0,1,2,etc. representing a categorical “bin” of exposure. By default, there are four bins with evenly spaced quantile cutpoints for each exposure, so \({X}_q=0\) means that \(X\) was below the observed 25th percentile for that exposure. The index \(S_q\) represents all exposures being set to the same value (again, by default, discrete values 0,1,2,3). Thus, *the parameter \(\psi_1\) quantifies the expected change in the outcome, given a one quantile increase in all exposures simultaneously,* possibly adjusted for \(\mathbf{Z}\).

There are nuances to this particular model form that are available in the `qgcomp`

package which will be explored below. There exists one special case of quantile g-computation that leads to fast fitting: linear/additive exposure effects. Here we simulate “pre-quantized” data where the exposures \(X_1, X_2, X_3\) can only take on values of 0,1,2,3 in equal proportions. The model underlying the outcomes is given by the linear regression:

[\mathbb{E}(Y | \mathbf{X,\beta}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3]

with the true values of \(\beta_0=0, \beta_1 =0.25, \beta_2 =-0.1, \beta_3=0.05\), and \(X_1\) is strongly positively correlated with \(X_2\) ($\rho=0.95$) and negatively correlated with \(X_3\) ($\rho=-0.3$). In this simple setting, the parameter \(\psi_1\) will equal the sum of the \(\beta\) coefficients (0.2). Here we see that qgcomp estimates a value very close to 0.2 (as we increase sample size, the estimated value will be expected to become increasingly close to 0.2).

`library("qgcomp")set.seed(543210)qdat = simdata_quantized(n=5000, outcomtype="continuous", cor=c(.95, -0.3), b0=0, coef=c(0.25, -0.1, 0.05), q=4)head(qdat)`

`## x1 x2 x3 y## 1 2 2 3 0.5539253## 2 2 3 1 -0.1005701## 3 2 2 3 0.3782267## 4 1 2 2 -0.4557419## 5 0 0 1 0.6124436## 6 1 1 2 0.7569574`

`cor(qdat[,c("x1", "x2", "x3")])`

`## x1 x2 x3## x1 1.00000 0.95552 -0.30368## x2 0.95552 1.00000 -0.29840## x3 -0.30368 -0.29840 1.00000`

`qgcomp(y~x1+x2+x3, expnms=c("x1", "x2", "x3"), data=qdat)`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.3)## x1 x3 ## 0.775 0.225 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.0966)## x2 ## 1 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.0016583 0.0359609 -0.07214 0.068824 -0.0461 0.9632## psi1 0.2035810 0.0219677 0.16053 0.246637 9.2673 <2e-16`

## How to use the `qgcomp`

package

Here we use a running example from the `metals`

dataset from the from the package`qgcomp`

to demonstrate some features of the package and method.

Namely, the examples below demonstrate use of the package for:

- Fast estimation of exposure effects under a linear model for quantized exposures for continuous (normal) outcomes
- Estimating conditional and marginal odds/risk ratios of a mixture effect for binary outcomes
- Adjusting for non-exposure covariates when estimating effects of the mixture
- Allowing non-linear and non-homogeneous effects of individual exposures and the mixture as a whole by including product terms
- Using qgcomp to fit a time-to-event model to estimate conditional and marginal hazard ratios for the exposure mixture

For analogous approaches to estimating exposure mixture effects, illustrative examples can be seen in the `gQWS`

package help files, which implementsweighted quantile sum (WQS) regression, and at https://jenfb.github.io/bkmr/overview.html, which describes Bayesian kernel machine regression.

The `metals`

dataset from the from the package `qgcomp`

, comprises a set of simulated well water exposures and two health outcomes (one continuous, one binary/time-to-event). The exposures are transformed to have mean = 0.0, standard deviation = 1.0. The data are used throughout to demonstrate usage and features of the `qgcomp`

package.

`library("ggplot2")data("metals", package="qgcomp")head(metals)`

`## arsenic barium cadmium calcium chloride chromium## 1 0.09100165 0.08166362 15.0738845 -0.7746662 -0.15408335 -0.05589104## 2 0.17018302 -0.03598828 -0.7126486 -0.6857886 -0.19605499 -0.03268488## 3 0.13336869 0.09934014 0.6441992 -0.1525231 -0.17511844 -0.01161098## 4 -0.52570747 -0.76616263 -0.8610256 1.4472733 0.02552401 -0.05173287## 5 0.43420529 0.40629920 0.0570890 0.4103682 -0.24187403 -0.08931824## 6 0.71832662 0.19559582 -0.6823437 -0.8931696 -0.03919936 -0.07389407## copper iron lead magnesium manganese mercury## 1 1.99438050 19.1153352 21.072630908 -0.5109546 2.07630966 -1.20826726## 2 -0.02490169 -0.2039425 -0.010378362 -0.1030542 -0.36095395 -0.68729723## 3 0.25700811 -0.1964581 -0.063375935 0.9166969 -0.31075240 0.44852503## 4 0.75477075 -0.2317787 -0.002847991 2.5482987 -0.23350205 0.20428158## 5 -0.09919923 -0.1698619 -0.035276281 -0.5109546 0.08825996 1.19283834## 6 -0.05622285 -0.2129300 -0.118460981 -1.0059145 -0.30219838 0.02875033## nitrate nitrite ph selenium silver sodium## 1 1.3649492 -1.0500539 -0.7125482 0.23467592 -0.8648653 -0.41840695## 2 -0.1478382 0.4645119 0.9443009 0.65827253 -0.8019173 -0.09112969## 3 -0.3001660 -1.4969868 0.4924330 0.07205576 -0.3600140 -0.11828963## 4 0.3431814 -0.6992263 -0.4113029 0.23810705 1.3595205 -0.11828963## 5 0.0431269 -0.5041390 0.3418103 -0.02359910 -1.6078044 -0.40075299## 6 -0.3986575 0.1166249 1.2455462 -0.61186017 1.3769466 1.83722597## sulfate total_alkalinity total_hardness zinc mage35 y## 1 -0.1757544 -1.31353389 -0.85822417 1.0186058 1 -0.6007989## 2 -0.1161359 -0.12699789 -0.67749970 -0.1509129 0 -0.2022296## 3 -0.1616806 0.42671890 0.07928399 -0.1542524 0 -1.2164116## 4 0.8272415 0.99173604 1.99948142 0.1843372 0 0.1826311## 5 -0.1726845 -0.04789549 0.30518957 -0.1529379 0 1.1760472## 6 -0.1385631 1.98616621 -1.07283447 -0.1290391 0 -0.4100912## disease_time disease_state## 1 6.168764e-07 1## 2 4.000000e+00 0## 3 4.000000e+00 0## 4 4.000000e+00 0## 5 1.813458e+00 1## 6 2.373849e+00 1`

### Example 1: linear model

`# we save the names of the mixture variables in the variable "Xnm"Xnm <- c( 'arsenic','barium','cadmium','calcium','chromium','copper', 'iron','lead','magnesium','manganese','mercury','selenium','silver', 'sodium','zinc')covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')# Example 1: linear model# Run the model and save the results "qc.fit"system.time(qc.fit <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, 'y')], family=gaussian()))`

`## Including all model terms as exposures of interest`

`## user system elapsed ## 0.012 0.000 0.012`

`# user system elapsed # 0.011 0.002 0.018 # contrasting other methods with computational speed# WQS regression (v3.0.1 of gWQS package)#system.time(wqs.fit <- gWQS::gwqs(y~wqs,mix_name=Xnm, data=metals[,c(Xnm, 'y')], family="gaussian"))# user system elapsed # 35.775 0.124 36.114 # Bayesian kernel machine regression (note that the number of iterations here would # need to be >5,000, at minimum, so this underestimates the run time by a factor# of 50+#system.time(bkmr.fit <- kmbayes(y=metals$y, Z=metals[,Xnm], family="gaussian", iter=100))# user system elapsed # 81.644 4.194 86.520 `

First note that qgcomp can be very fast relative to competing methods (with their example times given from single runs from a laptop).

One advantage of quantile g-computation over other methods that estimate“mixture effects” (the effect of changing all exposures at once), is that itis very computationally efficient. Contrasting methods such as WQS (`gWQS`

package) and Bayesian Kernel Machine regression (`bkmr`

package),quantile g-computation can provide results many orders of magnitude faster.For example, the example above ran 3000X faster for quantile g-computationversus WQS regression, and we estimate the speedup would be severalhundred thousand times versus Bayesian kernel machine regression.

The speed relies on an efficient method to fit qgcomp when exposures are added additively to the model. When exposures are added using non-linear terms or non-additive terms (see below for examples), then qgcomp will be slower but often still faster than competetive approaches.

Quantile g-computation yields fixed weights in the estimation procedure, similarto WQS regression. However, note that the weights from `qgcomp.glm.noboot`

can be negative or positive. When all effects are linear and in the samedirection (“directional homogeneity”), quantile g-computation is equivalent toweighted quantile sum regression in large samples.

The overall mixture effect from quantile g-computation (psi1) is interpreted asthe effect on the outcome of increasing every exposure by one quantile, possiblyconditional on covariates. Given the overall exposure effect, the weights areconsidered fixed and so do not have confidence intervals or p-values.

`# View results: scaled coefficients/weights and statistical inference about# mixture effectqc.fit`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.39)## calcium iron barium silver arsenic mercury sodium chromium ## 0.72216 0.06187 0.05947 0.03508 0.03447 0.02451 0.02162 0.02057 ## cadmium zinc ## 0.01328 0.00696 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.124)## magnesium copper lead manganese selenium ## 0.475999 0.385299 0.074019 0.063828 0.000857 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.356670 0.107878 -0.56811 -0.14523 -3.3062 0.0010238## psi1 0.266394 0.071025 0.12719 0.40560 3.7507 0.0002001`

Now let’s take a brief look under the hood. `qgcomp`

works in steps. First, the exposure variables are “quantized” or turned into score variables based on the total number of quantiles from the parameter `q`

. You can access these via the `qx`

object from the `qgcomp`

fit object.

`# quantized datahead(qc.fit$qx)`

`## arsenic_q barium_q cadmium_q calcium_q chromium_q copper_q iron_q lead_q## 1 2 2 3 0 1 3 3 3## 2 2 2 0 0 2 2 1 2## 3 2 2 3 1 3 3 1 1## 4 1 0 0 3 1 3 0 2## 5 2 3 2 3 0 1 2 2## 6 3 2 0 0 0 2 1 1## magnesium_q manganese_q mercury_q selenium_q silver_q sodium_q zinc_q## 1 1 3 0 2 0 0 3## 2 2 0 1 3 1 3 0## 3 3 1 2 2 1 3 0## 4 3 1 2 2 3 3 3## 5 1 3 3 1 0 0 0## 6 0 1 2 0 3 3 2`

You can re-fit a linear model using these quantized exposures. This is the “underlying model” of a qgcomp fit.

`# regression with quantized dataqc.fit$qx$y = qc.fit$fit$data$y # first bring outcome back into the quantized datanewfit <- lm(y ~ arsenic_q + barium_q + cadmium_q + calcium_q + chromium_q + copper_q + iron_q + lead_q + magnesium_q + manganese_q + mercury_q + selenium_q + silver_q + sodium_q + zinc_q, data=qc.fit$qx)newfit`

`## ## Call:## lm(formula = y ~ arsenic_q + barium_q + cadmium_q + calcium_q + ## chromium_q + copper_q + iron_q + lead_q + magnesium_q + manganese_q + ## mercury_q + selenium_q + silver_q + sodium_q + zinc_q, data = qc.fit$qx)## ## Coefficients:## (Intercept) arsenic_q barium_q cadmium_q calcium_q chromium_q ## -0.3566699 0.0134440 0.0231929 0.0051773 0.2816176 0.0080231 ## copper_q iron_q lead_q magnesium_q manganese_q mercury_q ## -0.0476113 0.0241278 -0.0091464 -0.0588190 -0.0078872 0.0095572 ## selenium_q silver_q sodium_q zinc_q ## -0.0001059 0.0136815 0.0084302 0.0027125`

Here you can see that, *for a GLM in which all quantized exposures enter linearly and additively into the underlying model* the overall effect from `qgcomp`

is simply the sum of the adjusted coefficients from the underlying model.

`sum(newfit$coefficients[-1]) # sum of all coefficients excluding intercept and confounders, if any`

`## [1] 0.2663942`

`coef(qc.fit) # overall effect and intercept from qgcomp fit`

`## (intercept) psi1 ## -0.3566699 0.2663942`

This equality is why we can fit qgcomp so efficiently under such a model, but `qgcomp`

is a much more general method that can allow for non-linearity and non-additivity in the underlying model, as well as non-linearity in the overall model. These extensions are described in some of the following examples.

### Example 2: conditional odds ratio, marginal odds ratio in a logistic model

This example introduces the use of a binary outcome in `qgcomp`

via the`qgcomp.glm.noboot`

function, which yields a conditional odds ratio or the`qgcomp.glm.boot`

, which yields a marginal odds ratio or risk/prevalence ratio. Thesewill not equal each other when there are non-exposure covariates (e.g.confounders) included in the model because the odds ratio is not collapsible (bothare still valid). Marginal parameters will yield estimates of the populationaverage exposure effect, which is often of more interest due to betterinterpretability over conditional odds ratios. Further, odds ratios are notgenerally of interest when risk ratios can be validly estimated, so `qgcomp.glm.boot`

will estimate the risk ratio by default for binary data (set rr=FALSE toallow estimation of ORs when using `qgcomp.glm.boot`

).

`qc.fit2 <- qgcomp.glm.noboot(disease_state~., expnms=Xnm, data = metals[,c(Xnm, 'disease_state')], family=binomial(), q=4)qcboot.fit2 <- qgcomp.glm.boot(disease_state~., expnms=Xnm, data = metals[,c(Xnm, 'disease_state')], family=binomial(), q=4, B=10,# B should be 200-500+ in practice seed=125, rr=FALSE)qcboot.fit2b <- qgcomp.glm.boot(disease_state~., expnms=Xnm, data = metals[,c(Xnm, 'disease_state')], family=binomial(), q=4, B=10,# B should be 200-500+ in practice seed=125, rr=TRUE)`

Compare a qgcomp.glm.noboot fit:

`qc.fit2`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.392)## barium zinc chromium magnesium silver sodium ## 0.3520 0.2002 0.1603 0.1292 0.0937 0.0645 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.696)## selenium copper arsenic calcium manganese cadmium mercury lead ## 0.2969 0.1627 0.1272 0.1233 0.1033 0.0643 0.0485 0.0430 ## iron ## 0.0309 ## ## Mixture log(OR) (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## (Intercept) 0.26362 0.51615 -0.74802 1.27526 0.5107 0.6095## psi1 -0.30416 0.34018 -0.97090 0.36258 -0.8941 0.3713`

with a qgcomp.glm.boot fit:

`qcboot.fit2`

`## Mixture log(OR) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## (Intercept) 0.26362 0.39038 -0.50151 1.02875 0.6753 0.4995## psi1 -0.30416 0.28009 -0.85314 0.24481 -1.0859 0.2775`

with a qgcomp.glm.boot fit, where the risk/prevalence ratio is estimated, rather than the odds ratio:

`qcboot.fit2b`

`## Mixture log(RR) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## (Intercept) -0.56237 0.16106 -0.87804 -0.24670 -3.4917 0.00048## psi1 -0.16373 0.13839 -0.43497 0.10752 -1.1830 0.23679`

### Example 3: adjusting for covariates, plotting estimates

In the following code we run a maternal age-adjusted linear model with`qgcomp`

(`family = "gaussian"`

). Further, we plot both the weights, as well as the mixture slopewhich yields overall model confidence bounds, representing the bounds that, for each value of thejoint exposure are expected to contain the true regression line over 95% of trials (so-called 95% ‘pointwise’ bounds for the regression line). The pointwise comparison bounds, denoted by error bars on the plot, represent comparisons of the expected difference in outcomes at each quantile, with referenceto a specific quantile (which can be specified by the user, as below). These pointwise bounds are similar to the boundscreated in the bkmr package when plotting the overall effect of all exposures. The pointwise bounds can be obtained via the pointwisebound.boot function. To avoid confusion between “pointwise regression” and “pointwise comparison” bounds, the pointwise regression bounds are denoted as the “model confidence band” in the plots, since they yield estimates of the same type of bounds as the `predict`

function in R when applied to linear model fits.

Note that the underlying regression model is on the quantile ‘score’, which takes on values integer values0, 1, …, q-1. For plotting purposes (when plotting regression line results from qgcomp.glm.boot),the quantile score is translated into a quantile (range = [0-1]). This is not a perfect correspondence,because the quantile g-computation model treats the quantile score as a continuous variable, but the quantile category comprises a range of quantiles.For visualization, we fix the ends of the plot at the mid-points of the first and last quantile cut-point,so the range of the plot will change slightly if ‘q’ is changed.

`qc.fit3 <- qgcomp.glm.noboot(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride + chromium + copper + iron + lead + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=Xnm, metals, family=gaussian(), q=4)qc.fit3`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.381)## calcium barium iron silver arsenic mercury chromium zinc ## 0.74466 0.06636 0.04839 0.03765 0.02823 0.02705 0.02344 0.01103 ## sodium cadmium ## 0.00775 0.00543 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.124)## magnesium copper lead manganese selenium ## 0.49578 0.35446 0.08511 0.06094 0.00372 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.348084 0.108037 -0.55983 -0.13634 -3.2219 0.0013688## psi1 0.256969 0.071459 0.11691 0.39703 3.5960 0.0003601`

`plot(qc.fit3)`

![plot of chunk adjusting for covariates a](figure/adjusting for covariates a-1.png)

From the first plot we see weights from `qgcomp.glm.noboot`

function, which include bothpositive and negative effect directions. When the weights are all on a single side of the null,these plots are easy to in interpret since the weight corresponds to the proportion of theoverall effect from each exposure. WQS uses a constraint in the model to forceall of the weights to be in the same direction - unfortunately such constraintslead to biased effect estimates. The `qgcomp`

package takes a different approachand allows that “weights” might go in either direction, indicating that some exposuresmay beneficial, and some harmful, or there may be sampling variation due to usingsmall or moderate sample sizes (or, more often, systematic bias such as unmeasuredconfounding). The “weights” in `qgcomp`

correspond to the proportion of the overall effectwhen all of the exposures have effects in the same direction, but otherwise theycorrespond to the proportion of the effect *in a particular direction*, whichmay be small (or large) compared to the overall “mixture” effect. NOTE: the leftand right sides of the plot should not be compared with each other because thelength of the bars corresponds to the effect size only relative to other effectsin the same direction. The darkness of the bars corresponds to the overall effectsize - in this case the bars on the right (positive) side of the plot are darkerbecause the overall “mixture” effect is positive. Thus, the shading allows oneto make informal comparisons across the left and right sides: a large, darklyshaded bar indicates a larger independent effect than a large, lightly shaded bar.

`qcboot.fit3 <- qgcomp.glm.boot(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride + chromium + copper + iron + lead + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=Xnm, metals, family=gaussian(), q=4, B=10,# B should be 200-500+ in practice seed=125)qcboot.fit3`

`## Mixture slope parameters (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.342787 0.114983 -0.56815 -0.11742 -2.9812 0.0030331## psi1 0.256969 0.075029 0.10991 0.40402 3.4249 0.0006736`

`p = plot(qcboot.fit3)`

![plot of chunk adjusting for covariates b](figure/adjusting for covariates b-1.png)

We can change the referent category for pointwise comparisons via the `pointwiseref`

parameter:

`plot(qcboot.fit3, pointwiseref = 3)`

![plot of chunk adjusting for covariates c](figure/adjusting for covariates c-1.png)

Using `qgcomp.glm.boot`

also allows us to assesslinearity of the total exposure effect (the second plot). Similar output is availablefor WQS (`gWQS`

package), though WQS results will generally be less interpretablewhen exposure effects are non-linear (see below how to do this with `qgcomp.glm.boot`

).

The plot for the `qcboot.fit3`

object (using g-computation with bootstrap variance)gives predictions at the joint intervention levels of exposure. It also displaysa smoothed (graphical) fit.

Note that the uncertainty intervals given in the plot are directly accessible via the `pointwisebound`

(pointwise comparison confidence intervals) and `modelbound`

functions (confidence interval for the regression line):

`pointwisebound.boot(qcboot.fit3, pointwiseref=3)`

`## quantile quantile.midpoint linpred mean.diff se.diff ll.diff## 0 0 0.125 -0.34278746 -0.5139387 0.15005846 -0.8080479## 1 1 0.375 -0.08581809 -0.2569694 0.07502923 -0.4040240## 2 2 0.625 0.17115127 0.0000000 0.00000000 0.0000000## 3 3 0.875 0.42812064 0.2569694 0.07502923 0.1099148## ul.diff ll.linpred ul.linpred## 0 -0.2198295 -0.6368966 -0.04867828## 1 -0.1099148 -0.2328727 0.06123650## 2 0.0000000 0.1711513 0.17115127## 3 0.4040240 0.2810660 0.57517523`

`qgcomp:::modelbound.boot(qcboot.fit3)`

`## quantile quantile.midpoint linpred m se.pw ll.pw## 0 0 0.125 -0.34278746 -0.34278746 0.11498301 -0.56815001## 1 1 0.375 -0.08581809 -0.08581809 0.04251388 -0.16914376## 2 2 0.625 0.17115127 0.17115127 0.04065143 0.09147593## 3 3 0.875 0.42812064 0.42812064 0.11294432 0.20675384## ul.pw ll.simul ul.simul## 0 -0.117424905 -0.4622572 -0.161947578## 1 -0.002492425 -0.1191843 -0.005233921## 2 0.250826606 0.1386622 0.223888629## 3 0.649487428 0.3081934 0.566961520`

Because qgcomp estimates a joint effect of multiple exposures, we cannot, in general, assess model fit by overlaying predictions from the plots above with the data. Hence, it is useful to explore non-linearity by fitting models thatallow for non-linear effects, as in the next example.

### Example 4: non-linearity (and non-homogeneity)

`qgcomp`

(and `qgcomp.glm.boot`

) addresses non-linearity in a way similar to standard parametric regression models, which lends itself to being able to leverage R language features for non-linear parametric models (or, more precisely, parametric models that deviate from a purely additive, linear function on the link function basis via the use of basis function representation of non-linear functions).Here is an example where we use a feature of the R language for fitting modelswith interaction terms. We use `y~. + .^2`

as the model formula, which fits a modelthat allows for quadratic term for every predictor in the model.

Similar approaches could be used to include interaction terms between exposures,as well as between exposures and covariates.

`qcboot.fit4 <- qgcomp(y~. + .^2, expnms=Xnm, metals[,c(Xnm, 'y')], family=gaussian(), q=4, B=10, seed=125)plot(qcboot.fit4)`

![plot of chunk non-linear non-hom intro](figure/non-linear non-hom intro-1.png)

Note that allowing for a non-linear effect of all exposures induces an apparentnon-linear trend in the overall exposure effect. The smoothed regression line isstill well within the confidence bands of the marginal linear model(by default, the overall effect of joint exposure is assumed linear,though this assumption can be relaxed via the ‘degree’ parameter in qgcomp.glm.boot,as follows:

`qcboot.fit5 <- qgcomp(y~. + .^2, expnms=Xnm, metals[,c(Xnm, 'y')], family=gaussian(), q=4, degree=2, B=10, rr=FALSE, seed=125)plot(qcboot.fit5)`

![plot of chunk overall non-linearity](figure/overall non-linearity-1.png)

Once again, we can access numerical estimates of uncertainty:

`qgcomp::pointwisebound.boot(qcboot.fit5)`

`## quantile quantile.midpoint linpred mean.diff se.diff ll.diff## 0 0 0.125 -0.89239044 0.0000000 0.0000000 0.0000000## 1 1 0.375 -0.18559680 0.7067936 0.6165306 -0.5015841## 2 2 0.625 0.12180659 1.0141970 0.6044460 -0.1704954## 3 3 0.875 0.02981974 0.9222102 0.3537861 0.2288022## ul.diff ll.linpred ul.linpred## 0 0.000000 -0.8923904 -0.8923904## 1 1.915171 -1.3939746 1.0227810## 2 2.198889 -1.0628859 1.3064991## 3 1.615618 -0.6635882 0.7232277`

`qgcomp:::modelbound.boot(qcboot.fit5)`

`## quantile quantile.midpoint linpred m se.pw ll.pw## 0 0 0.125 -0.89239044 -0.89239044 0.70335801 -2.2709468## 1 1 0.375 -0.18559680 -0.18559680 0.09874085 -0.3791253## 2 2 0.625 0.12180659 0.12180659 0.14624914 -0.1648365## 3 3 0.875 0.02981974 0.02981974 0.83609757 -1.6089014## ul.pw ll.simul ul.simul## 0 0.486165922 -2.08813110 0.18308760## 1 0.007931712 -0.32577422 -0.02886665## 2 0.408449640 -0.06371183 0.43431431## 3 1.668540859 -1.19007238 1.57263047`

Ideally, the smooth fit will look very similar to the model prediction regressionline.

#### Interpretation of model parameters

As the output below shows, setting “degree=2” yields a second parameter in the model fit ($\psi_2$). The output of qgcomp now corresponds to estimates of the marginal structural model given by[\mathbb{E}(Y^{\mathbf{X}_q}) = g(\psi_0 + \psi_1 S_q + \psi_2 S_q^2)]

`qcboot.fit5`

`## Mixture slope parameters (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.89239 0.70336 -2.27095 0.48617 -1.2688 0.2055## psi1 0.90649 0.93820 -0.93235 2.74533 0.9662 0.3347## psi2 -0.19970 0.32507 -0.83682 0.43743 -0.6143 0.5395`

so that \(\psi_2\) can be interpreted similar to quadratic terms that might appear in a generalized linear model. \(\psi_2\) estimates the change in the outcome for an additional unit of squared joint exposure, over-and-above the linear effect given by \(\psi_1\). Informally, this is a way of formally assessing specific types of non-linearity in the joint exposure-response curves, and there are many other (slightly incorrect but intuitively useful) ways of interpreting parameters for squared terms in regressions (beyond the scope of this document). Intuition from generalized linear models applies directly to the models fit by quantile g-computation.

### Example 5: Comparing model fits and further exploring non-linearity

Exploring a non-linear fit in settings with multiple exposures is challenging. One way to explore non-linearity, as demonstrated above, is to to include all 2-way interaction terms (including quadratic terms, or “self-interactions”). Sometimes this approach is not desired, either because the number of terms in the model can become very large, or because some sort of model selection procedure is required, which risks inducing over-fit (biased estimates and standard errors that are too small). Short of having a set of a priori non-linear terms to include, we find it best to take a default approach (e.g. taking all second order terms) that doesn’t rely on statistical significance, or to simply be honest that the search for a non-linear model is exploratory and shouldn’t be relied upon for robust inference. Methods such as kernel machine regression may be good alternatives, or supplementary approaches to exploring non-linearity.

NOTE: qgcomp necessarily fits a regression model with exposures that have a small number of possible values, based on the quantile chosen. By package default, this is `q=4`

, but it is difficult to fully examine non-linear fits using only four points, so we recommend exploring larger values of `q`

, which will change effect estimates (i.e. the model coefficient implies a smaller change in exposures, so the expected change in the outcome will also decrease).

Here, we examine a one strategy for default and exploratory approaches to mixtures that can be implemented in qgcomp using a smaller subset of exposures (iron, lead, cadmium), which we choose via the correlation matrix. High correlations between exposures may result from a common source, so small subsets of the mixture may be useful for examining hypotheses that relate to interventions on a common environmental source or set of behaviors. Note that we can still adjust for the measured exposures, even though only 3 our exposures of interest are considered as the mixture of interest. Note that we will require a new R package to help in exploring non-linearity: `splines`

. Note that `qgcomp.glm.boot`

must be used in order to produce the graphics below, as `qgcomp.glm.noboot`

does not calculate the necessary quantities.

#### Graphical approach to explore non-linearity in a correlated subset of exposures using splines

`library(splines)# find all correlations > 0.6 (this is an arbitrary choice)cormat = cor(metals[,Xnm])idx = which(cormat>0.6 & cormat <1.0, arr.ind = TRUE)newXnm = unique(rownames(idx)) # iron, lead, and cadmiumqc.fit6lin <- qgcomp.glm.boot(y ~ iron + lead + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=10)qc.fit6nonlin <- qgcomp.glm.boot(y ~ bs(iron) + bs(cadmium) + bs(lead) + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=10, degree=2)qc.fit6nonhom <- qgcomp.glm.boot(y ~ bs(iron)*bs(lead) + bs(iron)*bs(cadmium) + bs(lead)*bs(cadmium) + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=10, degree=3)`

It helps to place the plots on a common y-axis, which is easy due to dependence of the qgcomp plotting functions on ggplot. Here’s the linear fit :

`pl.fit6lin <- plot(qc.fit6lin, suppressprint = TRUE, pointwiseref = 4)pl.fit6lin + coord_cartesian(ylim=c(-0.75, .75)) + ggtitle("Linear fit: mixture of iron, lead, and cadmium")`

![plot of chunk graphical non-linearity 1b](figure/graphical non-linearity 1b-1.png)

Here’s the non-linear fit :

`pl.fit6nonlin <- plot(qc.fit6nonlin, suppressprint = TRUE, pointwiseref = 4)pl.fit6nonlin + coord_cartesian(ylim=c(-0.75, .75)) + ggtitle("Non-linear fit: mixture of iron, lead, and cadmium")`

![plot of chunk graphical non-linearity 2](figure/graphical non-linearity 2-1.png)

And here’s the non-linear fit with statistical interaction between exposures (recalling that this will lead to non-linearity in the overall effect):

`pl.fit6nonhom <- plot(qc.fit6nonhom, suppressprint = TRUE, pointwiseref = 4)pl.fit6nonhom + coord_cartesian(ylim=c(-0.75, .75)) + ggtitle("Non-linear, non-homogeneous fit: mixture of iron, lead, and cadmium")`

![plot of chunk graphical non-linearity 3](figure/graphical non-linearity 3-1.png)

#### Caution about graphical approaches

The underlying conditional model fit can be made extremely flexible, and the graphical representation of this (via thesmooth conditional fit) can look extremely flexible. Simply matching the overall (MSM) fit to this line is nota viable strategy for identifying parsimonious models because that would ignore potential for overfit. Thus,caution should be used when judging the accuracy of a fit when comparing the “smooth conditional fit” to the“MSM fit.”

`qc.overfit <- qgcomp.glm.boot(y ~ bs(iron) + bs(cadmium) + bs(lead) + mage35 + bs(arsenic) + bs(magnesium) + bs(manganese) + bs(mercury) + bs(selenium) + bs(silver) + bs(sodium) + bs(zinc), expnms=Xnm, metals, family=gaussian(), q=8, B=10, degree=1)qc.overfit`

`## Mixture slope parameters (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.064420 0.123768 -0.307001 0.178162 -0.5205 0.6030## psi1 0.029869 0.031972 -0.032795 0.092534 0.9342 0.3507`

`plot(qc.overfit, pointwiseref = 5)`

![plot of chunk graphical caution](figure/graphical caution-1.png)

Here, there is little statistical evidence for even a linear trend, which makes thesmoothed conditional fit appear to be overfit. The smooth conditional fit can be turned off, as below.

`plot(qc.overfit, flexfit = FALSE, pointwiseref = 5)`

![plot of chunk graphical caution 2](figure/graphical caution 2-1.png)

### Example 6: Miscellaneous other ways to allow non-linearity.

Note that these are included as examples of *how* to include non-linearities, and are not intended asa demonstration of appropriate model selection. In fact, qc.fit7b is generally a bad idea in smallto moderate sample sizes due to large numbers of parameters.

#### using indicator terms for each quantile

`qc.fit7a <- qgcomp.glm.boot(y ~ factor(iron) + lead + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=20, deg=2)# underlying fitsummary(qc.fit7a$fit)$coefficients`

`## Estimate Std. Error t value Pr(>|t|)## (Intercept) -0.052981109 0.08062430 -0.6571357 5.114428e-01## factor(iron)1 0.046571725 0.09466914 0.4919420 6.230096e-01## factor(iron)2 -0.056984300 0.09581659 -0.5947227 5.523395e-01## factor(iron)3 0.143813131 0.09558055 1.5046276 1.331489e-01## factor(iron)4 0.053319057 0.09642069 0.5529835 5.805601e-01## factor(iron)5 0.303967859 0.09743261 3.1197753 1.930856e-03## factor(iron)6 0.246259568 0.09734385 2.5297907 1.176673e-02## factor(iron)7 0.447045591 0.09786201 4.5681217 6.425565e-06## lead -0.009210341 0.01046668 -0.8799679 3.793648e-01## cadmium -0.010503041 0.01086440 -0.9667387 3.342143e-01## mage35 0.081114695 0.07274583 1.1150426 2.654507e-01## arsenic 0.021755516 0.02605850 0.8348720 4.042502e-01## magnesium -0.010758356 0.02469893 -0.4355798 6.633587e-01## manganese 0.004418266 0.02551449 0.1731670 8.626011e-01## mercury 0.003913896 0.02448078 0.1598763 8.730531e-01## selenium -0.058085344 0.05714805 -1.0164012 3.100059e-01## silver 0.020971562 0.02407397 0.8711302 3.841658e-01## sodium -0.062086322 0.02404447 -2.5821454 1.014626e-02## zinc 0.017078438 0.02392381 0.7138679 4.756935e-01`

`plot(qc.fit7a)`

![plot of chunk non-linear examples](figure/non-linear examples-1.png)

#### interactions between indicator terms

`qc.fit7b <- qgcomp.glm.boot(y ~ factor(iron)*factor(lead) + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=10, deg=3)# underlying fit#summary(qc.fit7b$fit)$coefficientsplot(qc.fit7b)`

![plot of chunk non-linear examples 2](figure/non-linear examples 2-1.png)

#### breaks at specific quantiles (these breaks act on the quantized basis)

`qc.fit7c <- qgcomp.glm.boot(y ~ I(iron>4)*I(lead>4) + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc, expnms=newXnm, metals, family=gaussian(), q=8, B=10, deg=2)# underlying fitsummary(qc.fit7c$fit)$coefficients`

`## Estimate Std. Error t value## (Intercept) -5.910113e-02 0.05182385 -1.140423351## I(iron > 4)TRUE 3.649940e-01 0.06448858 5.659824144## I(lead > 4)TRUE -9.004067e-05 0.06181587 -0.001456595## cadmium -6.874749e-03 0.01078339 -0.637531252## mage35 7.613672e-02 0.07255110 1.049422029## arsenic 2.042370e-02 0.02578001 0.792230124## magnesium -3.279980e-03 0.02427513 -0.135116878## manganese 1.055979e-02 0.02477453 0.426235507## mercury 9.396898e-03 0.02435057 0.385900466## selenium -4.337729e-02 0.05670006 -0.765030761## silver 1.807248e-02 0.02391112 0.755819125## sodium -5.537968e-02 0.02403808 -2.303831424## zinc 2.349906e-02 0.02385762 0.984970996## I(iron > 4)TRUE:I(lead > 4)TRUE -1.828835e-01 0.10277790 -1.779405131## Pr(>|t|)## (Intercept) 2.547332e-01## I(iron > 4)TRUE 2.743032e-08## I(lead > 4)TRUE 9.988385e-01## cadmium 5.241120e-01## mage35 2.945626e-01## arsenic 4.286554e-01## magnesium 8.925815e-01## manganese 6.701456e-01## mercury 6.997578e-01## selenium 4.446652e-01## silver 4.501639e-01## sodium 2.169944e-02## zinc 3.251821e-01## I(iron > 4)TRUE:I(lead > 4)TRUE 7.586670e-02`

`plot(qc.fit7c)`

![plot of chunk non-linear examples 3](figure/non-linear examples 3-1.png)

Note one restriction on exploring non-linearity: while we can use flexible functions such as splines for individual exposures, the overall fit is limited via the `degree`

parameter to polynomial functions (here a quadratic polynomial fits the non-linear model well, and a cubic polynomial fits the non-linear/non-homogeneous model well - though this is an informal argument and does not account for the wide confidence intervals). We note here that only 10 bootstrap iterations are used to calculate confidence intervals (to increase computational speed for the example), which is far too low.

#### Statistical approach explore non-linearity in a correlated subset of exposures using splines

The graphical approaches don’t give a clear picture of which model might be preferred, but we can compare the model fits using AIC, or BIC (information criterion that weigh model fit with over-parameterization). Both of these criterion suggest the linear model fits best (lowest AIC and BIC), which suggests that the apparently non-linear fits observed in the graphical approaches don’t improve prediction of the health outcome, relative to the linear fit, due to the increase in variance associated with including more parameters.

`AIC(qc.fit6lin$fit)`

`## [1] 676.0431`

`AIC(qc.fit6nonlin$fit)`

`## [1] 682.7442`

`AIC(qc.fit6nonhom$fit)`

`## [1] 705.6187`

`BIC(qc.fit6lin$fit)`

`## [1] 733.6346`

`BIC(qc.fit6nonlin$fit)`

`## [1] 765.0178`

`BIC(qc.fit6nonhom$fit)`

`## [1] 898.9617`

### Example 7: time-to-event analysis and parallel processing

- The
`qgcomp`

package utilizes the Cox proportional hazards models as the underlying model fortime-to-event analysis. The interpretation of a`qgcomp.glm.noboot`

fit parameter is a conditional (on confounders)hazard ratio for increasing all exposures at once. The`qc.survfit1`

object demonstrates a time-to-event analysis with`qgcompcox.noboot`

. The default plot is similar to that of`qgcompcox.noboot`

,in that it yields weights and an overall mixture effect

`# non-bootstrapped version estimates a marginal structural model for the # confounder-conditional effectsurvival::coxph(survival::Surv(disease_time, disease_state) ~ iron + lead + cadmium + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc + mage35, data=metals)`

`## Call:## survival::coxph(formula = survival::Surv(disease_time, disease_state) ~ ## iron + lead + cadmium + arsenic + magnesium + manganese + ## mercury + selenium + silver + sodium + zinc + mage35, ## data = metals)## ## coef exp(coef) se(coef) z p## iron -0.056447 0.945117 0.156178 -0.361 0.7178## lead 0.440735 1.553849 0.203264 2.168 0.0301## cadmium 0.023325 1.023599 0.105502 0.221 0.8250## arsenic -0.003812 0.996195 0.088897 -0.043 0.9658## magnesium 0.099399 1.104507 0.064730 1.536 0.1246## manganese -0.014065 0.986033 0.064197 -0.219 0.8266## mercury -0.060830 0.940983 0.072918 -0.834 0.4042## selenium -0.231626 0.793243 0.173655 -1.334 0.1823## silver 0.043169 1.044114 0.070291 0.614 0.5391## sodium 0.057928 1.059638 0.063883 0.907 0.3645## zinc 0.057169 1.058835 0.047875 1.194 0.2324## mage35 -0.458696 0.632107 0.238370 -1.924 0.0543## ## Likelihood ratio test=23.52 on 12 df, p=0.02364## n= 452, number of events= 205`

`qc.survfit1 <- qgcomp.cox.noboot(survival::Surv(disease_time, disease_state) ~ .,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4)qc.survfit1`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.32)## barium zinc magnesium chromium silver sodium iron ## 0.3432 0.1946 0.1917 0.1119 0.0924 0.0511 0.0151 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.554)## selenium copper calcium arsenic manganese cadmium lead mercury ## 0.2705 0.1826 0.1666 0.1085 0.0974 0.0794 0.0483 0.0466 ## ## Mixture log(hazard ratio) (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## psi1 -0.23356 0.24535 -0.71444 0.24732 -0.9519 0.3411`

`plot(qc.survfit1)`

Marginal hazards ratios (and bootstrapped quantile g-computation in general) uses a slightly different approach to effect estimation that makes it more computationally demanding than other `qcomp`

functions.

To estimate a marginal hazards ratio, the underlying model is fit, and then new outcomes are simulated under the underlying model with a baseline hazard estimator (Efron’s) - this simulation requires a large sample (controlled by MCsize) for accuracy. This approach is similar to other g-computation approaches to survival analysis, but this approach uses the exact survival times, rather than discretized survival times as are common in most g-computation analysis. Plotting a `qgcompcox.boot`

object yields a set of survival curves (e.g.`qc.survfit2`

) which comprise estimated survival curves (assuming censoring and late entry at random, conditional on covariates in the model) that characterize conditional survival functions (i.e. censoring competing risks) at various levels of joint-exposure (including the overall average - which may be slightly different from the observed survival curve, but should more or less agree).

`# bootstrapped version estimates a marginal structural model for the population average effect#library(survival)qc.survfit2 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ .,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4, B=5, MCsize=1000, parallel=TRUE, parplan=TRUE)qc.survfit2`

`## Mixture log(hazard ratio) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## psi1 -0.24020 0.14942 -0.53305 0.052653 -1.6076 0.1079`

`# testing proportional hazards (note that x=TRUE is not needed (and will cause an error if used))survival::cox.zph(qc.survfit2$fit)`

`## chisq df p## arsenic 0.18440 1 0.668## barium 0.10819 1 0.742## cadmium 0.05345 1 0.817## calcium 0.00206 1 0.964## chromium 1.23974 1 0.266## copper 0.28518 1 0.593## iron 3.46739 1 0.063## lead 0.17575 1 0.675## magnesium 2.12900 1 0.145## manganese 0.58720 1 0.444## mercury 0.00136 1 0.971## selenium 0.15247 1 0.696## silver 0.01040 1 0.919## sodium 0.09352 1 0.760## zinc 1.51261 1 0.219## GLOBAL 9.82045 15 0.831`

`p2 = plot(qc.survfit2, suppressprint = TRUE) p2 + labs(title="Linear log(hazard ratio), overall and exposure specific")`

All bootstrapped functions in `qgcomp`

allow parellelization via the parallel=TRUE parameter (demonstrated with the non-liner fit in `qc.survfit3`

). Only 5 bootstrap iterations are used here, which is not nearly enough for inference, and will actually be slower for parallel processing due to some overhead when setting up the parallel processes.

`qc.survfit3 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4, B=5, MCsize=1000, parallel=TRUE, parplan=TRUE)qc.survfit3`

`## Mixture log(hazard ratio) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## psi1 -0.19672 0.72426 -1.6162 1.2228 -0.2716 0.7859`

`p3 = plot(qc.survfit3, suppressprint = TRUE) p3 + labs(title="Non-linear log(hazard ratio) overall, linear exposure specific ln-HR")`

*Technical Note:* this mode of usage is designed for simplicity. The implementation relies on the `future`

and `future.apply`

packages. Use guidelines of the `future`

package dictates that the user should be able to control the future “plan”, rather than embedding it in functions as has been done here. This slightly more advanced usage (which allows nesting within larger parallel schemes such as simulations) is demonstrated here by setting the “parplan” parameter to FALSE and explicitly specifying a “plan” outside of qgcomp functions. This will move much of the overhead due to parallel processing outside of the actual qgcomp functions. The final code block of this vignette shows how to exit the “plan” and return to standard evaluation via `plan(sequential)`

- doing so at the end means that the next parallel call (with parplan=FALSE) we make will have lower overhead and run slightly faster.

`future::plan(future::multisession)# parallel evaluationqc.survfit3 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4, B=5, MCsize=1000, parallel=TRUE, parplan=FALSE)qc.survfit3`

`## Mixture log(hazard ratio) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## psi1 -0.21112 0.30702 -0.81287 0.39063 -0.6876 0.4917`

`p3 = plot(qc.survfit3, suppressprint = TRUE) p3 + labs(title="Non-linear log(hazard ratio) overall, linear exposure specific ln-HR")`

Returning to substance: while `qgcompcox.boot`

fits a smooth hazard ratio function, the hazard ratios contrasting specific quantiles with a referent quantile can be obtained, as demonstrated with `qc.survfit4`

. As in `qgcomp.glm.boot`

plots, the conditional model fit and the MSM fit are overlaid as a way to judge how well the MSM fits the conditional fit (and whether, for example non-linear terms should be added or removed from the overall fit via the degree parameter - we note here that we know of no statistical test for quantifying the difference between these lines, so this is up to user discretion and the plots are provided as visuals to aid in exploratory data analysis).

`qc.survfit4 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4, B=5, MCsize=1000, parallel=TRUE, parplan=FALSE, degree=2)qc.survfit4`

`## Mixture log(hazard ratio) (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)## psi1 -2.73325 4.77640 -12.0948 6.6283 -0.5722 0.5672## psi2 0.84012 1.76149 -2.6123 4.2926 0.4769 0.6334`

`# examining the overall hazard ratio as a function of overall exposurehrs_q = exp(matrix(c(0,0,1,1,2,4,3,9), ncol=2, byrow=TRUE)%*%qc.survfit4$msmfit$coefficients)colnames(hrs_q) = "Hazard ratio"print("Hazard ratios by quartiles (min-25%,25-50%, 50-75%, 75%-max)")`

`## [1] "Hazard ratios by quartiles (min-25%,25-50%, 50-75%, 75%-max)"`

`hrs_q`

`## Hazard ratio## [1,] 1.0000000## [2,] 0.1505986## [3,] 0.1217189## [4,] 0.5279735`

`p4 = plot(qc.survfit4, suppressprint = TRUE) p4 + labs(title="Non-linear log(hazard ratio), overall and exposure specific") `

Testing proportional hazards is somewhat complicated with respect to interpretation. Consider first a linear fit from `qgcomp.cox.noboot`

. Because the underlying model of a linear qgcomp fit is equivalent to the sum of multiple parameters, it is not clear how proportional hazards might be best tested for the mixture. One could examine test statistics for each exposure, but there may be some exposures for which the test indicates non-proportional hazards and some for which the test does not.

The “GLOBAL” test in the cox.zph function from the `survival`

package comes closest to what we might want, and gives an overall assessment of non-proportional hazards for all model terms simultaneously (including non-mixture covariates). While this seems somewhat undesirable due to non-specificity, it is not necessarily important that only the mixture have proportional hazards, so it is useful and easily interpretable to have a global test of fit via GLOBAL.

`# testing proportional hazards (must set x=TRUE in function call)qc.survfit1ph <- qgcomp.cox.noboot(survival::Surv(disease_time, disease_state) ~ .,expnms=Xnm, data=metals[,c(Xnm, 'disease_time', 'disease_state', "mage35")], q=4, x=TRUE)survival::cox.zph(qc.survfit1ph$fit)`

`## chisq df p## arsenic 1.57e-01 1 0.691## barium 1.28e-01 1 0.721## cadmium 5.14e-02 1 0.821## calcium 9.16e-04 1 0.976## chromium 1.25e+00 1 0.263## copper 3.42e-01 1 0.559## iron 3.51e+00 1 0.061## lead 1.59e-01 1 0.690## magnesium 2.08e+00 1 0.149## manganese 5.78e-01 1 0.447## mercury 4.87e-06 1 0.998## selenium 1.32e-01 1 0.716## silver 1.30e-02 1 0.909## sodium 1.06e-01 1 0.745## zinc 1.53e+00 1 0.216## mage35 1.73e-02 1 0.895## GLOBAL 9.93e+00 16 0.870`

For a potentially non-linear/ non-additive fit from `qgcomp.cox.boot`

, the issue is slightly more complicated by the fact that the algorithm will fit both the underlying model and a marginal structural model using the predictions from the underlying model. In order for the predictions to yield valid causal inference, the underlying model must be correct (which implies that proportional hazards hold). The marginal structural model proceeds assuming the underlying model is correct. Currently there is no simple way to allow for non-proportional hazards in the marginal structural model, but non-proportional hazards can be implemented in the conditional model via standard approaches to non-proportional hazards such as time-exposure-interaction terms. This is a rich field and discussion is beyond the scope of this document.

`# testing global proportional hazards for model (note that x=TRUE is not needed (and will cause an error if used))phtest3 = survival::cox.zph(qc.survfit3$fit)phtest3$table[dim(phtest3$table)[1],, drop=FALSE]`

`## chisq df p## GLOBAL 206.5578 120 1.53907e-06`

Late entry and counting-process style data will currently yield results in qgcomp.cox functions. There has been some testing of this in limited settings, but we note that this is still an experimental feature at this point that may not be valid in all cases and so it is not documented here. As much effort as possible to validate results through other means is needed when using qgcomp in data subject to late-entry or when using counting-process style data.

### Example 8: clustering

Clustering on the individual or group level means that there are individual or group level characteristics which result in covariance between observations (e.g. within individual variance of an outcome may be much lower than the between individual variance). For linear models, the error term is assumed to be independent between observations, and clustering breaks this assumption. Ways to relax this assumption include empirical variance estimation and cluster-appropriate robust variance estimation (e.g. through the `sandwich`

package in R). Another way is to use cluster-based bootstrapping, which samples clusters, rather than individual observations. `qgcomp.glm.boot`

can be leveraged to produce clustering consistent estimates of standard errors for independent effects of exposure as well as the effect of the exposure as a whole. This is done using the `id`

parameter of `qgcomp.glm.boot`

(which can only handle a single variable and so may not efficient for nested clustering, for example).

Below is a simple example with one simulated exposure. First the exposure data are ‘pre-quantized’ (so that one can verify that standard errors are appropriate using other means - this is not intended to show a suggested practice). Next the data are analyzed using a 1-component mixture in qgcomp - again, this is for verification purposes. The `qgcomp.glm.noboot`

result yields a naive standard error of 0.0310 for the mixture effect:

`set.seed(2123)N = 250t = 4dat <- data.frame(row.names = 1:(N*t))dat <- within(dat, { id = do.call("c", lapply(1:N, function(x) rep(x, t))) u = do.call("c", lapply(1:N, function(x) rep(runif(1), t))) x1 = rnorm(N, u) y = rnorm(N) + u + x1})# pre-quantizeexpnms = c("x1")datl = quantize(dat, expnms = expnms)qgcomp.glm.noboot(y~ x1, data=datl$dat, family=gaussian(), q = NULL)`

`## Including all model terms as exposures of interest`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.955)## x1 ## 1 ## ## Scaled effect size (negative direction, sum of negative coefficients = 0)## None## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.463243 0.057934 -0.57679 -0.34969 -7.996 3.553e-15## psi1 0.955015 0.031020 0.89422 1.01581 30.787 < 2.2e-16`

`# neither of these ways yields appropriate clustering#qgcomp.glm.noboot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL)#qgcomp.glm.boot(y~ x1, data=datl$dat, family=gaussian(), q = NULL, MCsize=1000)`

while the `qgcomp.glm.boot`

result (MCsize=5000, B=500) yields a corrected standard error of 0.0398, which is much closer to the sandwich estimate of 0.0409 than the naive estimator (a second `qgcomp.glm.boot`

fit with fewer bootstrap iterations and smaller MCsize is included for display). The standard errors from the uncorrected fit are too low, but this may not always be the case.

`# clustering by specifying id parameter onqgcomp.glm.boot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL, MCsize=1000, B = 5)`

`## Including all model terms as exposures of interest`

`## ## Note: using all possible values of exposure as the## intervention values`

`## Mixture slope parameters (bootstrap CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.463243 0.084446 -0.62875 -0.29773 -5.4856 5.223e-08## psi1 0.955015 0.055037 0.84714 1.06289 17.3521 < 2.2e-16`

`#qgcomp.glm.boot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL, MCsize=1000, B = 500)# Mixture slope parameters (bootstrap CI):# # Estimate Std. Error Lower CI Upper CI t value# (Intercept) -0.4632 0.0730 -0.606 -0.32 3.3e-10# psi1 0.9550 0.0398 0.877 1.03 0# This can be verified using the `sandwich` package #fitglm = glm(y~x1, data=datl$dat)#sw.cov = sandwich::vcovCL(fitglm, cluster=~id, type = "HC0")[2,2]#sqrt(sw.cov)# [1] 0.0409`

### Example 9: partial effects

Returning to our original example (and adjusting for covariates), note that the default output for a qgcomp.*.noboot object includes “sum of positive/negative coefficients.” These can be interpreted as “partial mixture effects” or effects of exposures with coefficients in a particular direction. This is displayed graphically via a plot of the qgcomp “weights,” where all exposures that contribute to a given partial effect are on the same side of the plot. Unfortunately, this does not yield valid inference for a true partial effect because it is a parameter conditional on the fitted results and thus does not represent the type of *a priori* hypothesis that is amenable to hypothesis testing and confidence intervals. Another way to think about this is that it is a data adaptive parameter and thus is subject to issues of overfit that are similar to issues with making inference from step-wise variable selection procedures.

`(qc.fit.adj <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=Xnm, family=gaussian()))`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.441)## calcium barium iron arsenic silver chromium cadmium zinc ## 0.76111 0.06133 0.05854 0.02979 0.02433 0.02084 0.01395 0.01195 ## mercury sodium ## 0.01130 0.00688 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.104)## copper magnesium lead manganese selenium ## 0.44654 0.42124 0.09012 0.03577 0.00631 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.460408 0.134007 -0.72306 -0.19776 -3.4357 0.0006477## psi1 0.337539 0.089358 0.16240 0.51268 3.7774 0.0001805`

`plot(qc.fit.adj)`

Fortunately, there is a way towards estimation of “partial effects.” One way to do this is sample splitting, where the data are first randomly partitioned into a “training” and a “validation” data set. The assessment of whether a coefficient is positive or not occurs in the “training” set and then effect estimation for positive/negative partial effects occurs in the validation data. Basic simulations can show that such a procedure can yield valid (often called “honest”) hypothesis tests and confidence intervals for partial effects, provided that there is no separate data exploration in the combined dataset to select the models. In the `qgcomp`

package, the partitioning of the datasets into “training” and “validation” sets is done by the user, which prevents issues that may arise if this is done naively on a dataset that contains clusters (where we should partition based on clusters, rather than observations) or multiple observations per individual (all observations from an individual should be partitioned together). Here is an example of simple partitioning on a dataset that contains one observation per individual with no clustering. The downside of sample splitting is the loss of precision, because the final “validation” dataset comprises only a fraction of the original sample size. Thus, the estimation of partial effects is most appropriate with large sample sizes. We also note that these partial effect are only well defined when all effects are linear and additive, since whether a variable contributes to a positive or negative partial effect would depend on the value of that variable, so the valid estimation of “partial effects” is limited to settings in which the `qgcomp.\*.noboot`

objects are used for inference.

`# 40/60% training/validation splitset.seed(123211)trainidx <- sample(1:nrow(metals), round(nrow(metals)*0.4))valididx <- setdiff(1:nrow(metals),trainidx)traindata <- metals[trainidx,]validdata <- metals[valididx,]dim(traindata) # 40% of total`

`## [1] 181 26`

`dim(validdata) # 60% of total`

`## [1] 271 26`

The qgcomp package then facilitates the analysis of these partitioned data to allow valid estimation and hypothesis testing of partial effects. The `qgcomp.partials`

function is used to estimate partial effects. Note that the variables with “negative effect sizes” differs slightly from the overall analysis given in the `qc.fit`

object that represents our first pass analysis on these data. This is to be expected, and is a feature of this approach: different random subsets of the data will be expected to yield different estimated effects. If the true effect is null, then the estimated effects will vary from positive to negative around the null, and sample splitting is an important way to distinguish between estimates that reflect underlying patterns in the entire dataset from estimates that are simply due to natural sampling variation inherent to small and moderate samples. Note that fitting on these smaller datasets can sometimes result in perfect collinearity of exposures, in which case setting `bayes=TRUE`

may be necessary to apply a ridge penalty to estimates.

`splitres <- qgcomp.partials( fun="qgcomp.glm.noboot", f=y~., q=4, traindata=traindata[,c(Xnm, covars, "y")],validdata=validdata[,c(Xnm, covars, "y")], expnms=Xnm, bayes=FALSE, .fixbreaks = TRUE, .globalbreaks=FALSE )splitres`

`## ## Variables with positive effect sizes in training data: arsenic, barium, cadmium, calcium, chromium, iron, selenium, silver, sodium, zinc## Variables with negative effect sizes in training data: copper, lead, magnesium, manganese, mercury## Partial effect sizes estimated in validation data## Positive direction Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.46347 0.16202 -0.78102 -0.14591 -2.8605 0.004573## psi1 0.35150 0.10849 0.13887 0.56413 3.2400 0.001351## ## Negative direction Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.020164 0.097018 -0.210316 0.16999 -0.2078 0.8355## psi1 0.028852 0.063752 -0.096101 0.15380 0.4526 0.6512`

`plot(splitres$pos.fit)`

Consistent with our overall results, the overall effect of metals on the outcome `y`

is predominantly positive, which is driven mainly by calcium. The partial positive effect of psi=0.42 is slightly attenuated from the partial positive effect given in the original fit (0.44), but is slightly larger than the overall effect from the original fit (psi1=0.34). We note that the effect direction of cadmium is negative, even though it was selected based on positive associations in the training data. This suggests this variable has effects that are close to the null and their direction will depend on which subset of the data are used. This feature allows valid testing of hypotheses - a `global null`

in which no exposures have effects will be characterized by variables that randomly switch effect directions between training and validation datasets, which will yield partial effect estimates close to the null with hypothesis tests that have appropriate type-1 error rates in large datasets.

By default (subject to change) quantile cut points (“breaks”) are defined within the training data and applied to the validation data. You may also change this behavior to allow the breaks to be defined using quantiles from the entire dataset, which treats the quantiles as fixed. This will be expected to improve stability in small samples and may eventually replace the default behavior as the quantiles themselves are not generally treated as random variables within quantile g-computation. For this particular dataset (and seed value), there is little impact of this setting on the results.

`splitres_alt <- qgcomp.partials( fun="qgcomp.glm.noboot", f=y~., q=4, traindata=traindata[,c(Xnm, covars, "y")],validdata=validdata[,c(Xnm, covars, "y")], expnms=Xnm, bayes=FALSE, .fixbreaks = TRUE, .globalbreaks=TRUE )splitres_alt`

`## ## Variables with positive effect sizes in training data: arsenic, barium, cadmium, calcium, chromium, iron, selenium, silver, sodium, zinc## Variables with negative effect sizes in training data: copper, lead, magnesium, manganese, mercury## Partial effect sizes estimated in validation data## Positive direction Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.45550 0.16996 -0.78862 -0.12238 -2.6800 0.007832## psi1 0.34492 0.11364 0.12219 0.56766 3.0352 0.002648## ## Negative direction Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.0079779 0.0981166 -0.200283 0.18433 -0.0813 0.9353## psi1 0.0328941 0.0648437 -0.094197 0.15999 0.5073 0.6124`

One careful note: when there are multiple exposures with small positive or negative effects, the partial effects may be biased towards the null in studies with moderate or small sample sizes. This occurs because, in the training set, some exposures with small effects are likely to be mis-classified with regard to their effect direction. In some instances, both the positive and negative partial effects can be in the same direction. This occurs if individual effects are predominantly in one direction, but some are small and subject to having mis-classified directions. As one example: if there is a null overall effect, but there is a positive partial effect driven strongly by one exposure and a balancing negative partial effect driven by numerous weaker associations, partial effect estimates will not sum to the overall effect because the negative partial effect will experience more downward bias in typical sample sizes. Thus, when the overall effect does not equal the sum of the partial effects, there is likely some bias in at least one of the partial effect estimates. This is not a unique feature of quantile-based g-computation, but is also be a concern for methods that focus on estimation of partial effects, such as weighted quantile sum regression.

The larger question about interpretation (and its worth) of partial effects is left to the analyst. For large datasets with well characterized exposures that have plausible subsets of exposures that would be positively/negatively linearly associated with the outcome, the variables that partition into negative/positive partial effects may make some substantive sense. In more realistic settings that typify exposure mixtures, the partitioning will result in groups that don’t entirely make sense. The “partial effect” yields the effect of increasing all exposures in the subset defined by positive coefficients in the training data, while holding all other exposures and confounders constant. In the setting where this corresponds to real world patterns (e.g. all exposures in the positive partial effect share a source), then this may be interpretable roughly as the effect of an action to intervene on the source of these exposures. In most settings, however, interpretation will not be this clear and should not be expected to map onto potential real-world interventions. We note that this is not a function of the quantile g-computation method, but just part of the general messiness of working with exposures mixture data.

A more justifiable approach in terms of mapping effect estimates onto potential real-world actions would be choosing subsets of exposures based on prior subject matter knowledge, as we demonstrated above in example 5. This does not require sample splitting, but it does come at the expense of having to know more about the exposures and outcome under analysis. For example, our simulated outcome `y`

may represent some outcome that we would expect to increase with so-called “essential” metals, or those that are necessary (at some small amount) for normal human functioning, but it may decrease with “potentially toxic” (non-essential) metals, or those that have no known biologic function and are more likely to cause harm rather than improve physiologic processes that lead to improved (larger) values of `y`

. Qgcomp can be used to assess effects of these “sub-mixtures.”

Here are results for the essential metals:

`nonessentialXnm <- c( 'arsenic','barium','cadmium','chromium','lead','mercury','silver')essentialXnm <- c( 'sodium','magnesium','calcium','manganese','iron','copper','zinc','selenium')covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')(qc.fit.essential <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=essentialXnm, family=gaussian()))`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.357)## calcium iron zinc selenium ## 0.908914 0.077312 0.013128 0.000646 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.0998)## copper magnesium sodium manganese ## 0.4872 0.4304 0.0486 0.0338 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.340130 0.111562 -0.55879 -0.12147 -3.0488 0.002435## psi1 0.257136 0.074431 0.11125 0.40302 3.4547 0.000604`

Here are results for the non-essential metals:

`(qc.fit.nonessential <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=nonessentialXnm, family=gaussian()))`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.0751)## barium arsenic silver mercury cadmium ## 0.2827 0.2494 0.2420 0.1763 0.0496 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.0129)## lead chromium ## 0.888 0.112 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.055549 0.081509 -0.215304 0.10420 -0.6815 0.4959## psi1 0.062267 0.052508 -0.040648 0.16518 1.1858 0.2363`

As shown from these results, the essential metals and minerals demonstrate an overall positive joint association with the outcome (controlling for non-essentials), whereas the partial effect of non-essential metals and minerals is close to null. This is close to the interpretation of the data adaptive selection of partial effects demonstrated above, but is interpretable in terms of how we might intervene (e.g. increase consumption of foods that are higher in essential metals and minerals).

### Example 10: multinomial outcomes

For outcomes modeled as 3+ discrete categories, qgcomp joint effect estimates are interpreted as a ratio of the probability of being in the referent category of the outcome and the probability of being in the index category.

First, we’ll bring in data and create a multinomial outcome.

`data("metals") # from qgcomp package# create categorical outcome from the existing continuous outcome (usually, one will already exist)metals$ycat = factor(quantize(metals, "y",q=4)$data$y, levels=c("0", "1", "2", "3"), labels=c("cct", "ccg", "aat", "aag")) # restrict to smaller dataset for simplicitysmallmetals = metals[,c("ycat", "arsenic", "lead", "cadmium", "mage35")]`

Next, fit the model.

`### 1: Define mixture and underlying model ####mixture = c("arsenic", "lead", "cadmium")f2 = ycat ~ arsenic + lead + cadmium + mage35rr = qgcomp.multinomial.noboot( f2, expnms = mixture, q=4, data = smallmetals, )rr2 = qgcomp.multinomial.boot( f2, expnms = mixture, q=4, data = smallmetals, B =2, # set to higher values >200 in general usage MCSize=10000 # set to higher values in small samples )summary(rr)`

`## Reference outcome levels:## cct ccg aat aag## Weights## arsenic lead cadmium## ccg -0.19985072 -0.4065664 -0.3935829## aat -0.02996557 1.0000000 -0.9700344## aag -0.36389597 -0.3920415 -0.2440625## ## Sum of positive coefficients ## ccg aat aag ## 0.000000000 0.006243471 0.000000000 ## Sum of negative coefficients ## ccg aat aag ## -0.3255537 -0.1449220 -0.2362489 ## ## Mixture slope parameters (Standard CI):## Estimate Std. Error## ccg.(Intercept) 0.341521 0.324928## aat.(Intercept) 0.098838 0.330695## aag.(Intercept) 0.261681 0.326332## ccg.psi -0.325554 0.204573## aat.psi -0.138679 0.203451## aag.psi -0.236249 0.203446`

`summary(rr2) # differs from `rr` primarily due to low `MCSize` value`

`## Reference outcome levels:## cct ccg aat aag## Mixture slope parameters (Bootstrap CI):## Estimate Std. Error## ccg.(Intercept) 0.294923 0.045838## aat.(Intercept) -0.009291 0.068102## aag.(Intercept) 0.193000 0.076732## ccg.psi1 -0.238836 0.012265## aat.psi1 -0.058937 0.051478## aag.psi1 -0.197627 0.197277`

` plot(rr) #plot(rr2) # not yet functional`

Some of the convenience functions, like `plot`

and `summary`

are available for multinomial fits, and more functionality will be available in the future.

## Missing data, limits of detection and multiple imputation

When carrying out data analysis using quantile g-computation, on can address missing data in much the same way as in standard regression analyses. A common approach is complete case analysis. While regression functions in R will automatically carry out complete case analyses when variables take on the value `NA`

(denoting missingness in R), when using quantile g-computation it is encouraged that one explicitly create the complete case dataset explicitly and use that complete case dataset. Using the pre-installed R packages, this can be accomplished with the `complete.cases`

function.

The reason for this recommendation is that, while the regression analysis will be performed on complete cases (i.e. observations with non-missing values for all variables in the model), the calculation of quantiles for each exposures is done on an exposure-by-exposure basis, which can lead to numerical differences when explicitly using a dataset restricted to complete cases versus relying on automatically removing observations with `NA`

values during analysis.

Here is an artificial example that demonstrates the differences. Three analyses are run: one with the full data, one with complete case data (complete case analysis #1), and one with data in which `arsenic`

has been randomly set to `NA`

(complete case analysis #2).

There are numeric differences between the two complete case analyses, which can be traced to differences in the “quantized” exposures (other than arsenic) in the two approaches, which can be found by the `qx`

data frame that is part of a `qgcompfit`

object.

`Xnm <- c( 'arsenic','barium','cadmium','calcium','chromium','copper', 'iron','lead','magnesium','manganese','mercury','selenium','silver', 'sodium','zinc')covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')asmiss = metalsset.seed(1232)asmiss$arsenic = ifelse(runif(nrow(metals))>0.7, NA, asmiss$arsenic)cc = asmiss[complete.cases(asmiss[,c(Xnm, covars, "y")]),] # complete.cases gives a logical index to subset rowsdim(metals) # [1] 452 26`

`## [1] 452 27`

`dim(cc) # [1] 320 26`

`## [1] 320 27`

Here we have results from the full data (for comparison purposes)

`qc.base <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=metals[,c(Xnm, covars, 'y')], family=gaussian())cat("Full data\n")`

`## Full data`

`qc.base`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.441)## calcium barium iron arsenic silver chromium cadmium zinc ## 0.76111 0.06133 0.05854 0.02979 0.02433 0.02084 0.01395 0.01195 ## mercury sodium ## 0.01130 0.00688 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.104)## copper magnesium lead manganese selenium ## 0.44654 0.42124 0.09012 0.03577 0.00631 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.460408 0.134007 -0.72306 -0.19776 -3.4357 0.0006477## psi1 0.337539 0.089358 0.16240 0.51268 3.7774 0.0001805`

Here we have results from a complete case analysis in which we have set some exposures to be missing and have explicitly excluded data that will be dropped from analysis.

`qc.cc <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=cc[,c(Xnm, covars, 'y')], family=gaussian())cat("Complete case analyses\n")`

`## Complete case analyses`

`cat(" #1 explicitly remove observations with missing values\n")`

`## #1 explicitly remove observations with missing values`

`qc.cc`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.48)## calcium iron zinc chromium silver lead barium arsenic ## 0.78603 0.07255 0.04853 0.03315 0.02555 0.01405 0.01285 0.00728 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.117)## copper magnesium cadmium mercury manganese sodium selenium ## 0.52200 0.23808 0.09082 0.06570 0.04105 0.03921 0.00315 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.509402 0.145471 -0.79452 -0.22428 -3.5017 0.0005315## psi1 0.362582 0.096696 0.17306 0.55210 3.7497 0.0002119`

Finally we have results from a complete case analysis in which we have set some exposures to be missing, but we rely on R’s automated dropping of observations with missing values.

`qc.cc2 <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=asmiss[,c(Xnm, covars, 'y')], family=gaussian())cat(" #1 rely on R handling of NA values\n")`

`## #1 rely on R handling of NA values`

`qc.cc2`

`## Scaled effect size (positive direction, sum of positive coefficients = 0.493)## calcium iron zinc chromium barium silver lead selenium ## 0.76495 0.07191 0.04852 0.03489 0.02151 0.02119 0.01912 0.01444 ## arsenic ## 0.00348 ## ## Scaled effect size (negative direction, sum of negative coefficients = -0.12)## copper magnesium sodium manganese cadmium mercury ## 0.5338 0.2139 0.0965 0.0853 0.0501 0.0204 ## ## Mixture slope parameters (Delta method CI):## ## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)## (Intercept) -0.51673 0.15001 -0.81074 -0.22273 -3.4447 0.0006520## psi1 0.37249 0.10014 0.17621 0.56876 3.7196 0.0002376`

Now we can see a reason for the discrepancy between the methods above: when relying on R to drop missing values by allowing missing values for exposures in the analytic data, the quantiles of exposures will be done on all valid values for each exposure. In the complete case data, the quantiles will only be calculated among those with complete observations. The latter will generally be preferred because the quantiles for each exposure will be calculated on the same sample of individuals.

`# calculation of arsenic quantiles is identicalall.equal(qc.cc$qx$arsenic_q, qc.cc2$qx$arsenic_q[complete.cases(qc.cc2$qx$arsenic_q)])`

`## [1] TRUE`

`# all are equalall.equal(qc.cc$qx$cadmium_q, qc.cc2$qx$cadmium_q[complete.cases(qc.cc2$qx$arsenic_q)])`

`## [1] "Mean relative difference: 0.3823529"`

`# not equal`

### Limits of detection

A common form of missing data that occurs in mixtures are exposure values that are missing due to being below the limit of detection. A common approach to such missing data is imputation, either through filling in small numeric values in place of the missing values, or in a more formal multiple imputation from a parametric model. Notably, with quantile g-computation, if the proportion of values below the limit of detection is below 1/q (the number of quantiles), all appropriate missing data approaches will yield the same answer. Thus, if one has 3 exposures each with 10% of the values below the limit of detection, one can impute small values below those limits (e.g. limit of detection divided by the square root of 2) and proceed with quantile g-computation on the imputed data. This analysis leverages the fact that, even if a value below the limit of detection cannot be known with certainty, the category score used in quantile g-computation is known with certainty. In cases with more than 1/q% measurements below the LOD, the packages `qgcomp`

comes with a convenience function `mice.impute.leftcenslognorm`

that can be used to impute values below the limit of detection from a left censored log-normal regression model.

### Multiple imputation

Multiple imputation uses multiple datasets with different imputed values for each missing value across datasets. Separate analyses are performed on each of these datasets, and the results are combined using standard rules. The function `mice.impute.leftcenslognorm`

can be interfaced with the `mice`

package for efficient programming of multiple imputation based analysis with quantile g-computation. Examples cannot be included here without explicitly installing the `mice`

package, but an example can be seen in the help file for `mice.impute.leftcenslognorm`

.

## FAQ

### Why don’t I get weights from the `boot`

functions?

Users often use the `qgcomp.*.boot`

functions because the want to marginalize over confounders or fit a non-linear joint exposure function. In both cases, the overall exposure response will no longer correspond to a simple weighted average of model coefficients, so none of the `qgcomp.*.boot`

functions will calculate weights. In most use cases, the weights would vary according to which level of joint exposure you’re at, so it is not a straightforward proposition to calculate them (and you may not wish to report 4 sets of weights if you use the default `q=4`

). If you fit an otherwise linear model, you can get weights from a `qgcomp.*.noboot`

which will be very close to the weights you might get from a linear model fit via `qgcomp.*.boot`

functions, but be explicit that the weights come from a different model than the inference about joint exposure effects.

### Do I need to model non-linearity and non-additivity of exposures?

Maybe. The inferential object of qgcomp is the set of \(\psi\) parameters that correspond to a joint exposure response. As it turns out, with correlated exposures non-linearity can disguise itself as non-additivity (Belzak and Bauer [2019] Addictive Behaviors). If we were inferring independent effects, this distinction would be crucial, but for joint effects it may turn out that it doesn’t matter much if you model non-linearity in the joint response function through non-additivity or non-linearity of individual exposures in a given study. Models fit in qgcomp still make the crucial assumption that you are able to model the joint exposure response via parametric models, so that assumption should not be forgotten in an effort to try to disentagle non-linearity (e.g. quadratic terms of exposures) from non-additivity (e.g. product terms between exposures). The important part to note about parametric modeling is that we have to explicitly tell the model to be non-linear, and no adaptation to non-linear settings will happen automatically. Exploring non-linearity is not a trivial endeavor.

### Do I have to use quantiles?

No. You can turn off “quantization” by setting `q=NULL`

or you can supply your own categorization cutpoints via the “breaks” argument. It is up to the user to interpret the results if either of these options is taken.

### Can I cite this document?

Probably not in a scientific manuscript. If you find an idea here that is not published anywhere else and wish to develop it into a full manuscript, feel free! (But probably check with akeil@unc.edu to ask if a paper is already in development.)

## References

Alexander P. Keil, Jessie P. Buckley, Katie M. O’Brien, Kelly K. Ferguson, Shanshan Zhao,Alexandra J. White. A quantile-based g-computation approach to addressing the effects of exposure mixtures. https://doi.org/10.1289/EHP5838

## Acknowledgments

The development of this package was supported by NIH Grant RO1ES02953101. Invaluable code testing was performed by Nicole Niehoff, Michiel van den Dries, Emily Werder, Jessie Buckley, and Katie O’Brien.

`# return to standard processingfuture::plan(future::sequential) # return to standard evaluation`