Stata vs. R: Ordered logistic regression
I recently experienced a great example of trying to do something relatively basic in R that I could not figure out (okay—that happens all the time for me, but let's pretend). In Stata, running an ordered logistic regression with cluster robust standard errors is straight-forward.
sjPlot
package) than what is shown in the original post. See below:Update
The sjPlot
package accepts polr
models and can produce cluster robust standard errors with very little effort. Here's the solution in R:
library(MASS)
library(sjPlot)
results <- polr(y ~ x, data = df, Hess = TRUE)
tab_model(
results,
vcov.fun = "CL",
vcov.args = list(type = "HC1", cluster = df$cluster_variable),
show.se = TRUE
)
This produces a beautiful HTML table - here's an example with some real data (that matches Stata's results and also automatically reports the thresholds as odds ratios):
suicide_fct | ||||
---|---|---|---|---|
Predictors | Odds Ratios | std. Error | CI | p |
1|2 | 0.25 | 0.01 | 0.23 – 0.28 | <0.001 |
2|3 | 0.76 | 0.04 | 0.69 – 0.84 | <0.001 |
3|4 | 2.37 | 0.14 | 2.12 – 2.65 | <0.001 |
Year [2024] | 0.69 | 0.04 | 0.61 – 0.79 | <0.001 |
Observations | 22365 | |||
R2 Nagelkerke | 0.011 |
Original Post
I recently experienced a great example of trying to do something relatively basic in R that I could not figure out (okay—that happens all the time for me, but let's pretend). In Stata, running an ordered logistic regression with cluster robust standard errors is straight-forward. It only takes one line of code:
ologit y x, or vce(cluster cluster_variable)
Here, the or
option requests the results to be reported in odds ratios and the vce(cluster cluster_variable)
reports cluster-robust standard errors for a given variable representing the clusters. This produces easily readable output in a format you would expect from regression results.
After asking around on BlueSky (thanks P. V. Sundar Balakrishnan) and trying to problem solve with chatGPT, here is the closest I got to reproducing the Stata output for just the coefficient x, not even the thresholds, using R:
library(MASS)
library(sandwich)
library(lmtest)
results <- polr(y ~ x, data = df, Hess = TRUE)
summary(results)
robust_se <- sandwich::vcovCL(results, cluster = df$cluster_variable)
robust_results <- coeftest(results, vcov = robust_se)
print(robust_results)
odds_ratios_coef <- exp(coef(results))
robust_se_coef <- sqrt(diag(robust_se))
robust_se_odds_ratios_coef <- robust_se_coef * exp(coef(results))
z_value <- qnorm(0.975)
lower_ci_coef <- odds_ratios_coef - z_value * robust_se_odds_ratios_coef
upper_ci_coef <- odds_ratios_coef + z_value * robust_se_odds_ratios_coef
odds_ratios_coef_with_ci <- data.frame(
Term = names(odds_ratios_coef),
Odds_Ratio = odds_ratios_coef,
CI_Lower = lower_ci_coef,
CI_Upper = upper_ci_coef
)
odds_ratios_with_ci <- odds_ratios_coef_with_ci[!duplicated(odds_ratios_coef_with_ci$Term), ]
print(odds_ratios_with_ci)
In comparing the results using some actual data, they are close but not exact:
Stata:
---------------------------------------------------------------------------------
| Robust
Concern_Suicide | Odds ratio std. err. z P>|z| [95% conf. interval]
----------------+----------------------------------------------------------------
year_fixed | .6939064 .0441584 -5.74 0.000 .6125375 .7860844
----------------+----------------------------------------------------------------
/cut1 | -1.747472 .101974 -1.947337 -1.547606
/cut2 | -.6400995 .0966894 -.8296072 -.4505917
/cut3 | .4979265 .098837 .3042095 .6916435
---------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation to odds ratios.
R:
Term Odds_Ratio CI_Lower CI_Upper
1 year_fixed 0.6939063 0.6073575 0.7804551
If anyone has a more straight forward way to do this in R, let me know! This definitely seems like a scenario where Stata's implementation of ordered logistic regression is far more usable than R's.