
Introduction

Accurate variance-covariance estimation is essential for statistical inference in maximum likelihood models. The variance-covariance matrix allows us to:

  • Compute standard errors for parameter estimates
  • Construct confidence intervals
  • Perform hypothesis tests (Wald, z-tests, etc.)
  • Calculate p-values and assess statistical significance

Without a reliable variance-covariance matrix, we cannot make valid inferences about the parameters of our model.

The mlmodels package provides, through its vcov() method, a comprehensive set of variance-covariance estimators, ranging from classical information matrix methods to robust and resampling-based approaches. This vignette explains each method, when to use it, and how the methods relate to each other.

Why Different Variance Estimators Matter

Different estimators make different assumptions about the model:

  • Classical methods (oim, opg) assume the model is correctly specified (the Information Matrix Identity holds).
  • Robust methods relax this assumption and are valid under more general conditions.
  • Resampling methods (bootstrap, jackknife) make even fewer parametric assumptions and are often the most robust, at the cost of higher computation time.

Choosing the right variance estimator is therefore an important part of responsible statistical practice.

Classical Information Matrix Estimators

The package provides two classical variance estimators based on the information matrix:

Observed Information Matrix (oim) — Default

The default variance estimator in mlmodels is the Observed Information Matrix (type = "oim"). It is used automatically unless you specify otherwise.

data(mroz)
mroz$incthou <- mroz$faminc / 1000

# Fit a homoskedastic linear model
fit <- ml_lm(incthou ~ age + I(age^2) + huswage + educ + unem, 
             data = mroz)

# Summary calls vcov() which uses oim as default.
summary(fit)
#> 
#> Maximum Likelihood Model
#>  Type: Homoskedastic Linear Model 
#> ---------------------------------------
#> Call:
#> ml_lm(value = incthou ~ age + I(age^2) + huswage + educ + unem, 
#>     data = mroz)
#> 
#> Log-Likelihood: -2638.68 
#> 
#> Wald significance tests:
#>  all: Chisq(5) = 972.786, Pr(>Chisq) = < 1e-8
#> 
#> Variance type: Original Information Matrix
#> ---------------------------------------
#>                       Estimate Std. Error z value Pr(>|z|)     
#> Value (incthou):  
#>   value::(Intercept) -29.4778     8.6369  -3.413 0.000643 ***
#>   value::age           1.2623     0.3996   3.159 0.001585 ** 
#>   value::I(age^2)     -0.0136     0.0046  -2.943 0.003250 ** 
#>   value::huswage       1.9566     0.0733  26.691  < 2e-16 ***
#>   value::educ          0.9636     0.1356   7.106 1.19e-12 ***
#>   value::unem         -0.2538     0.0957  -2.651 0.008028 ** 
#> Scale (log(sigma)):  
#>   scale::lnsigma       2.0853     0.0258  80.924  < 2e-16 ***
#> ---------------------------------------
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> ---
#> Number of observations: 753 
#> Residual degrees of freedom: 747 
#> Multiple R-squared: 0.5637 Adjusted R-squared: 0.5608
#> AIC: 5291.36  BIC: 5323.72 
#> Residual standard error (sigma): 8.047

v_oim <- vcov(fit, type = "oim")
sqrt(diag(v_oim))
#> value::(Intercept)         value::age    value::I(age^2)     value::huswage 
#>        8.636925227        0.399627579        0.004609191        0.073304149 
#>        value::educ        value::unem     scale::lnsigma 
#>        0.135599149        0.095744322        0.025768404

Notice how the standard errors shown in summary() are identical to those obtained directly from vcov(fit, type = "oim"). summary() displays the type of variance used clearly, above the coefficient table.

The oim estimator is computationally efficient and statistically attractive when the model is correctly specified (i.e., the outcome follows the assumed distribution and the value and, if present, scale equations are appropriately specified).
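
Conceptually, the oim variance is the inverse of the negative Hessian of the log-likelihood evaluated at the estimates. The following is a minimal sketch of that computation, not the package's internal code; logLik_fn() and theta_hat are hypothetical stand-ins for a function returning the log-likelihood at a parameter vector and for the vector of estimates.

library(numDeriv)
H <- hessian(logLik_fn, theta_hat)   # observed Hessian at the estimates (hypothetical inputs)
V_oim_manual <- solve(-H)            # oim variance: inverse of the negative Hessian
sqrt(diag(V_oim_manual))             # standard errors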

Outer Product of Gradients (opg)

An alternative classical estimator is the Outer Product of Gradients (also known as BHHH).

Rather than inverting the negative Hessian, the opg estimator builds the information matrix from the outer products of the per-observation score (gradient) contributions.
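
The following is an illustrative sketch, again not the package's internal code; S is a hypothetical n x k matrix whose i-th row holds the score (gradient) contribution of observation i at the estimates.

V_opg_manual <- solve(crossprod(S))   # inverse of the summed outer products of the scores
sqrt(diag(V_opg_manual))              # standard errors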

v_opg <- vcov(fit, type = "opg")

When a maximum likelihood model is correctly specified, the Information Matrix Identity holds: the negative expected Hessian matrix (the basis of the oim variance) equals, in expectation, the outer product of the gradients matrix (the basis of the opg variance).

In practice, you can compare the two estimators to get a sense of model adequacy:

comp <- data.frame(
  oim = sqrt(diag(v_oim)),
  opg = sqrt(diag(v_opg))
)

comp
#>                            oim         opg
#> value::(Intercept) 8.636925227 9.281669188
#> value::age         0.399627579 0.431781719
#> value::I(age^2)    0.004609191 0.005002672
#> value::huswage     0.073304149 0.044304744
#> value::educ        0.135599149 0.116741374
#> value::unem        0.095744322 0.096801039
#> scale::lnsigma     0.025768404 0.014396434

If the model is well identified and correctly specified, the oim and opg standard errors should be very similar. Substantial differences between them can indicate model misspecification or weak identification.

This is one of the reasons mlmodels provides more robust variance estimators (robust, boot, and jack) — they do not rely on the Information Matrix Identity holding.

The “Hessian may be singular” warning

When computing the oim variance, you may occasionally see the following warning:

! OIM variance may be unreliable due to singularity in the Hessian.

This warning indicates that the observed information matrix is close to singular (nearly non-invertible).
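
If you want to inspect the conditioning yourself, a simple check (mirroring the eigenvalue information reported in the warning) is to look at the eigenvalues of the observed information. This sketch reuses the hypothetical Hessian H from the oim sketch above.

ev <- eigen(-H, symmetric = TRUE, only.values = TRUE)$values
range(ev)           # smallest and largest eigenvalues of the observed information
max(ev) / min(ev)   # a very large (or negative) ratio signals near-singularity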

What this means:
It is a strong diagnostic signal of model misspecification or numerical instability. Common causes include:

  • Multicollinearity among predictors.
  • Overparameterized models.
  • Inappropriate specification of the value or scale equations.
  • Poor identification of some parameters.

What it does NOT mean:

  • It does not automatically mean that all coefficient estimates are meaningless.
  • It does not imply that every misspecified model triggers this warning; only that models that do trigger it are very likely misspecified.

Example: Heteroskedastic Logit Model

A common situation in which this warning appears is when the scale parameters counteract the value parameters, creating very different levels of curvature in the log-likelihood.

We illustrate it with a binary logit example:


fit_het <- ml_logit(inlf ~ age + I(age^2) + huswage + educ + unem,
                    scale = ~ repwage,
                    data = mroz)
#>  Improving initial values by scaling (factor = 2).
#>  Initial log-likelihood: -477.993
#>  Final scaled log-likelihood: -426.248

summary(fit_het, vcov.type = "oim")
#> ! OIM variance may be unreliable due to singularity in the Hessian.
#>  Relative singularity: "value::(Intercept)" has the largest eigenvalue (72495495908044.1), while "scale::repwage" has the smallest (15.666216).
#>  Consider using `type = 'robust'` or revising the model specification.
#> 
#> Maximum Likelihood Model
#>  Type: Heteroskedastic Binary Logit 
#> ---------------------------------------
#> Call:
#> ml_logit(value = inlf ~ age + I(age^2) + huswage + educ + unem, 
#>     scale = ~repwage, data = mroz)
#> 
#> Log-Likelihood: -349.21 
#> 
#> Wald significance tests:
#>  all: Chisq(6) = 1194.371, Pr(>Chisq) = < 1e-8
#>  Mean: Chisq(5) = 6.844, Pr(>Chisq) = 0.2325
#>  Scale: Chisq(1) = 82.955, Pr(>Chisq) = < 1e-8
#> 
#> Variance type: Original Information Matrix
#> ---------------------------------------
#>                       Estimate Std. Error z value Pr(>|z|)     
#> Value (inlf):  
#>   value::(Intercept)  0.55597    0.23180   2.399   0.0165 *  
#>   value::age         -0.02582    0.01120  -2.306   0.0211 *  
#>   value::I(age^2)     0.00033    0.00014   2.327   0.0200 *  
#>   value::huswage     -0.00046    0.00030  -1.511   0.1308    
#>   value::educ        -0.00310    0.00248  -1.250   0.2115    
#>   value::unem         0.00153    0.00109   1.404   0.1604    
#> Scale (log(sigma)):  
#>   scale::repwage     -1.46321    0.16065  -9.108   <2e-16 ***
#> ---------------------------------------
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> ---
#> Number of observations:753 (Successes: 428, Failures: 325)
#> Pseudo R-squared - Cor.Sq.: 0.467 McFadden: 0.3218
#> AIC: 712.42  BIC: 744.79 
#> 
#> Distribution of Std. Deviation (sigma):
#> ---------------------------------------
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.0053  1.0000  0.5603  1.0000  1.0000

In this case, summary() triggers the warning. While the package still displays results, you should not trust the Wald tests, standard errors, z-statistics, or p-values shown under the oim variance when this warning is triggered.

One of the recommendations shown in the warning message is to switch to the robust variance estimator. We explore robust and other more flexible variance options in the next section.

Robust Variance Estimators

The use of more robust variance estimators becomes important when the Information Matrix Identity may not hold. The singular Hessian warning we saw in the previous section is one practical indication of this problem. A more formal diagnostic is the Information Matrix Test (IMtest()), which we cover in the Model Diagnostics vignette.

Robust (Sandwich / Huber-White) Estimator

The most commonly used robust estimator is the sandwich (or Huber-White) estimator. It is the recommended default when you suspect the model may be misspecified or when you see the singular Hessian warning.
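
For intuition, the sandwich estimator combines the two classical ingredients: the inverse observed information as the "bread" and the outer product of the gradients as the "meat". A minimal sketch, reusing the hypothetical H and S from the classical sketches above:

bread <- solve(-H)                         # inverse observed information
meat  <- crossprod(S)                      # outer product of the gradients
V_rob_manual <- bread %*% meat %*% bread   # robust (Huber-White) variance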

# summary() with the vcov.type argument
summary(fit, vcov.type = "robust")
#> 
#> Maximum Likelihood Model
#>  Type: Homoskedastic Linear Model 
#> ---------------------------------------
#> Call:
#> ml_lm(value = incthou ~ age + I(age^2) + huswage + educ + unem, 
#>     data = mroz)
#> 
#> Log-Likelihood: -2638.68 
#> 
#> Wald significance tests:
#>  all: Chisq(5) = 325.926, Pr(>Chisq) = < 1e-8
#> 
#> Variance type: Robust
#> ---------------------------------------
#>                       Estimate Std. Error z value Pr(>|z|)     
#> Value (incthou):  
#>   value::(Intercept) -29.4778     8.4113  -3.505 0.000457 ***
#>   value::age           1.2623     0.3823   3.302 0.000960 ***
#>   value::I(age^2)     -0.0136     0.0044  -3.089 0.002011 ** 
#>   value::huswage       1.9566     0.1360  14.391  < 2e-16 ***
#>   value::educ          0.9636     0.1652   5.834 5.42e-09 ***
#>   value::unem         -0.2538     0.0987  -2.572 0.010109 *  
#> Scale (log(sigma)):  
#>   scale::lnsigma       2.0853     0.0556  37.538  < 2e-16 ***
#> ---------------------------------------
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> ---
#> Number of observations: 753 
#> Residual degrees of freedom: 747 
#> Multiple R-squared: 0.5637 Adjusted R-squared: 0.5608
#> AIC: 5291.36  BIC: 5323.72 
#> Residual standard error (sigma): 8.047

# or vcov() with the type argument
v_robust <- vcov(fit, type = "robust")

When you specify vcov.type = "robust" in summary(), the function uses the robust variance not only for the standard errors, z-values, and p-values in the coefficient table, but also for the Wald significance tests shown at the top of the output.
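
If you want to see what such a Wald test looks like under the hood, here is a hedged sketch for a subset of coefficients. It assumes that coef(fit) returns the full named parameter vector; that accessor is an assumption, not something documented in this vignette.

b <- coef(fit)[c("value::age", "value::I(age^2)")]   # coefficients under test
V <- v_robust[names(b), names(b)]                    # matching block of the robust variance
W <- drop(t(b) %*% solve(V) %*% b)                   # Wald chi-square statistic
pchisq(W, df = length(b), lower.tail = FALSE)        # p-value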

You can also pass a pre-computed variance directly using the vcov argument.

summary(fit, vcov = v_robust)
#> 
#> Maximum Likelihood Model
#>  Type: Homoskedastic Linear Model 
#> ---------------------------------------
#> Call:
#> ml_lm(value = incthou ~ age + I(age^2) + huswage + educ + unem, 
#>     data = mroz)
#> 
#> Log-Likelihood: -2638.68 
#> 
#> Wald significance tests:
#>  all: Chisq(5) = 325.926, Pr(>Chisq) = < 1e-8
#> 
#> Variance type: Robust
#> ---------------------------------------
#>                       Estimate Std. Error z value Pr(>|z|)     
#> Value (incthou):  
#>   value::(Intercept) -29.4778     8.4113  -3.505 0.000457 ***
#>   value::age           1.2623     0.3823   3.302 0.000960 ***
#>   value::I(age^2)     -0.0136     0.0044  -3.089 0.002011 ** 
#>   value::huswage       1.9566     0.1360  14.391  < 2e-16 ***
#>   value::educ          0.9636     0.1652   5.834 5.42e-09 ***
#>   value::unem         -0.2538     0.0987  -2.572 0.010109 *  
#> Scale (log(sigma)):  
#>   scale::lnsigma       2.0853     0.0556  37.538  < 2e-16 ***
#> ---------------------------------------
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> ---
#> Number of observations: 753 
#> Residual degrees of freedom: 747 
#> Multiple R-squared: 0.5637 Adjusted R-squared: 0.5608
#> AIC: 5291.36  BIC: 5323.72 
#> Residual standard error (sigma): 8.047

The displayed results will be identical because the variance matrix is the same.
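
A pre-computed variance is also convenient for confidence intervals. The sketch below builds 95% Wald intervals from the robust variance, again assuming that coef(fit) returns the estimates in the same order as v_robust.

est <- coef(fit)
se  <- sqrt(diag(v_robust))
cbind(lower = est - qnorm(0.975) * se,
      upper = est + qnorm(0.975) * se)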

Clustering

If you suspect that the independence of observations is violated, that is, that observations may be correlated within groups (clusters), then the cluster-robust variance is appropriate:
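
Conceptually, the cluster-robust variance replaces the meat of the sandwich with cluster-level sums of the scores, allowing arbitrary correlation within each cluster. A minimal sketch, reusing the hypothetical S and bread from above, with cl a vector of cluster identifiers:

S_cl <- rowsum(S, group = cl)                # sum the score contributions within each cluster
meat_cl <- crossprod(S_cl)                   # cluster-level meat
V_cl_manual <- bread %*% meat_cl %*% bread   # sandwich with the clustered meat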

# Suspect that income is correlated among education levels.
v_rob_clust <- vcov(fit, type = "robust", cl_var = "educ")
summary(fit, vcov = v_rob_clust)
#> 
#> Maximum Likelihood Model
#>  Type: Homoskedastic Linear Model 
#> ---------------------------------------
#> Call:
#> ml_lm(value = incthou ~ age + I(age^2) + huswage + educ + unem, 
#>     data = mroz)
#> 
#> Log-Likelihood: -2638.68 
#> 
#> Wald significance tests:
#>  all: Chisq(5) = 1062.475, Pr(>Chisq) = < 1e-8
#> 
#> Variance type: Robust | Clusters: 13 (educ)
#> ---------------------------------------
#>                       Estimate Std. Error z value Pr(>|z|)     
#> Value (incthou):  
#>   value::(Intercept) -29.4778     6.2953  -4.683 2.83e-06 ***
#>   value::age           1.2623     0.3009   4.196 2.72e-05 ***
#>   value::I(age^2)     -0.0136     0.0033  -4.122 3.76e-05 ***
#>   value::huswage       1.9566     0.0884  22.132  < 2e-16 ***
#>   value::educ          0.9636     0.1770   5.444 5.22e-08 ***
#>   value::unem         -0.2538     0.0658  -3.857 0.000115 ***
#> Scale (log(sigma)):  
#>   scale::lnsigma       2.0853     0.1112  18.753  < 2e-16 ***
#> ---------------------------------------
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> ---
#> Number of observations: 753 
#> Residual degrees of freedom: 747 
#> Multiple R-squared: 0.5637 Adjusted R-squared: 0.5608
#> AIC: 5291.36  BIC: 5323.72 
#> Residual standard error (sigma): 8.047

Bootstrap and Jackknife Estimators

The package also provides two resampling-based variance estimators:

  • boot - Bootstrap (with optional clustering).
  • jack/jacknife - Jackknife (with optional clustering).

Bootstrap works by repeatedly resampling the estimation data with replacement (default 999 times, but 500 is often sufficient), re-fitting the model on each resampled dataset, and then computing the variance from the resulting matrix of coefficient estimates.
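
A minimal sketch of what this amounts to, not the package's internal code; it assumes, hypothetically, that coef() returns the parameter vector and that update() can refit the model on a new data set:

set.seed(123)
R <- 500
boot_coefs <- t(replicate(R, {
  idx <- sample(nrow(mroz), replace = TRUE)   # resample rows with replacement
  coef(update(fit, data = mroz[idx, ]))       # refit on the bootstrap sample
}))
v_boot_manual <- cov(boot_coefs)              # variance across the bootstrap estimates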

Jackknife works by systematically leaving out one observation at a time and re-fitting the model. The number of replications is therefore equal to the sample size.
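
A corresponding sketch of the delete-one jackknife variance, under the same assumptions about coef() and update():

n <- nrow(mroz)
jack_coefs <- t(sapply(seq_len(n), function(i) {
  coef(update(fit, data = mroz[-i, ]))              # refit without observation i
}))
dev <- sweep(jack_coefs, 2, colMeans(jack_coefs))   # deviations from the jackknife mean
v_jack_manual <- (n - 1) / n * crossprod(dev)       # jackknife variance formula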

# Bootstrap with 500 repetitions
v_boot <- vcov(fit, type = "boot", repetitions = 500, seed = 123)
#>  Bootstrap with 500 repetitions.
#>  0        10        20        30        40        50
#> ====================================================
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#> ====================================================
#> 
#> Bootstrapping finished - 100% of replications converged.
# Jackknife
v_jack <- vcov(fit, type = "jack")
#>  Jackknife variance.
#>  0        10        20        30        40        50
#> ====================================================
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ..................................................
#>  ...
#> ====================================================
#> 
#> Jackknife finished - 100% of replications converged.

When type is "boot" or "jack"/"jacknife", vcov() shows a progress table by default, with green dots for replications whose fits succeed and red crosses for those that fail. You can stop vcov() from displaying it by setting progress = FALSE.

When you set vcov.type to either "boot" or "jack" in summary(), or in other functions that also have a vcov.type argument, the progress display is not shown (progress = FALSE by default).

You can also estimate clustered versions of these variances:

# Bootstrap with 500 repetitions
v_boot_clust <- vcov(fit, type = "boot", cl_var = "educ",
               repetitions = 500, seed = 123,
               progress = FALSE)
# Jackknife
v_jack_clust <- vcov(fit, type = "jack", cl_var = "educ",
               progress = FALSE)

Why use resampling methods?

Traditional robust estimators (like the sandwich estimator) still rely on asymptotic normality for inference - that is, they assume that the sampling distribution of the parameter estimates becomes approximately normal in large samples.

Bootstrap and jackknife take a more direct approach: they estimate the sampling distribution of the parameters empirically from the data itself, without assuming normality. This makes them:

  • More robust to violations of model assumptions.
  • Better at capturing the actual variability in finite samples.
  • Particularly useful when the model is complex or the asymptotic approximations are questionable.

Asymptotic Equivalence

  • The regular bootstrap and jackknife are asymptotically equivalent to the robust (sandwich) estimator.
  • The clustered versions are asymptotically equivalent to the cluster-robust estimator.

Here we compare the standard errors produced by each method.

# Robust comparison
comp <- data.frame(
  robust = sqrt(diag(v_robust)),
  boot = sqrt(diag(v_boot)),
  jack = sqrt(diag(v_jack))
)
comp
#>                         robust       boot        jack
#> value::(Intercept) 8.411271366 8.31747787 8.507232115
#> value::age         0.382284787 0.38167461 0.386935880
#> value::I(age^2)    0.004391868 0.00440611 0.004446869
#> value::huswage     0.135952916 0.13824719 0.144060112
#> value::educ        0.165176906 0.16856717 0.167837527
#> value::unem        0.098676875 0.10157004 0.100966601
#> scale::lnsigma     0.055550631 0.05434783 0.057408605

# Clustered comparison
comp <- data.frame(
  robust = sqrt(diag(v_rob_clust)),
  boot = sqrt(diag(v_boot_clust)),
  jack = sqrt(diag(v_jack_clust))
)
comp
#>                        robust        boot        jack
#> value::(Intercept) 6.29528438 8.683355306 6.310413395
#> value::age         0.30086228 0.410357979 0.297872890
#> value::I(age^2)    0.00329108 0.004470342 0.003268108
#> value::huswage     0.08840541 0.110632230 0.093001512
#> value::educ        0.17701278 0.181024534 0.212512715
#> value::unem        0.06580827 0.088824945 0.065578436
#> scale::lnsigma     0.11119600 0.112927477 0.154391247

The benefit of the resampling methods is that they make fewer parametric assumptions, which often leads to better finite-sample performance.

Important Feature of Our Implementation

When some bootstrap or jackknife replications fail to converge, our vcov() methods compute the variance using only the successful replications. This is statistically more appropriate than including failed optimizations (which is what sandwich::vcovBS() does).
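
In terms of the manual bootstrap sketch above, this simply means dropping failed replications before taking the covariance; converged here is a hypothetical logical flag with one entry per replication.

v_boot_ok <- cov(boot_coefs[converged, , drop = FALSE])   # use only converged replications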

Practical Recommendations

Choosing the right variance estimator depends on your goals and concerns about model specification. Here is a practical guide:

Quick Decision Guide

Situation                                        Recommended Variance     Why
Model is well-specified                          oim                      Most efficient
OIM singular Hessian warning                     robust / boot            Robust to violations
Suspect heteroskedasticity or other violations   robust / boot            Robust to violations
Suspect correlation within groups                robust / boot + cl_var   Accounts for dependence

Best Practices

  1. Start with robust – In most applied work, vcov.type = "robust" is expected in journals, even if you are explicitly modeling heteroskedasticity.
  2. Pre-compute expensive variances – This mostly matters for boot and jack, although it also applies to the other types. When you want to use the same variance for your estimates, predictions, and marginal effects, it is convenient to compute it once, store it, and pass it to the different functions.
  3. Check model diagnostics – Use IMtest() to assess whether classical estimators (oim, opg) are suitable. See the vignette on diagnostics for a discussion about this test and its methods.
  4. When to prefer bootstrap/jackknife
    • Small to moderate sample sizes.
    • Complex models.
    • You want to avoid strong parametric assumptions.
  5. Reporting – Always state in your paper or report which variance estimator (or estimators, if you used several) you used.

Concluding Remarks

The mlmodels package provides a comprehensive and coherent suite of variance-covariance estimators through the unified vcov() function, ranging from classical information matrix estimators (oim, opg) to robust and cluster-robust methods, which can be obtained either analytically (robust) or through more flexible resampling methods (boot and jack).

Our design philosophy has been to give you both power and guidance:

  • Clear, informative warnings when the OIM estimator may be unreliable.
  • Robust default behavior and sensible options for advanced users.
  • A consistent interface, so that switching between variance types is straightforward.
  • Careful handling of edge cases, such as using only successful replications in bootstrap and jackknife procedures.

We built these tools because we observed that other existing packages did not provide this level of detail and/or solid estimation methods. Our goal was to give you the right standard errors for your estimations, predictions, and marginal effects, by simply setting one or two arguments (type and cl_var).

We hope this vignette helps you understand the strengths and appropriate use of each estimator, so you can conduct your analyses with confidence and transparency.

Happy modeling!