
Why Stata in 2024?

Erik Reinbergs

05 May 2024 • 2 min read

I was lucky enough to be exposed to both R and Stata in grad school. However, I definitely understood Stata before I understood R, so my mind still tends to default to Stata logic. Now I try to keep reasonably up to date on both and routinely use both. Every so often I think I should ditch one or the other, double down, and save myself the trouble of bouncing between them. Of the two, R is more popular in my field and with my collaborators. Plus, it's free and open source. And R is likely the more transferable skill outside of academia at this point. R has many strengths and advantages (which are not reviewed here), so why don't I ditch Stata?

Stata still has some advantages that I don't want to give up entirely. This comes from the perspective of a substantive researcher, not primarily a methodologist, statistician, or programmer.

Simple, concise, and consistent syntax. The syntax of R is more complex, longer, and less consistent between packages. The tidyverse has definitely helped with this in R, but to me, the Stata syntax is still much more straightforward.
Documentation. Stata's documentation is just so good. Not only does the documentation describe the commands and show examples, it has excellent overviews of the methods it's implementing and the options for those methods. Compared to R help files, Stata's documentation for me is so much more useful.
Version control and backward compatibility. I see lots of posts on how best to version control R and the hoops people jump through to try to ensure reproducibility. This is much less of an issue in Stata: a single version line at the top of your analysis .do file tells Stata which release's behavior to use.
Core functionality vs. packages. In R, multiple packages are typically called to complete an analysis, which in itself is not a problem. However, packages sometimes do not play well together, or it is not clear why output from one function does not work with another function or how to fix this.
Trust in results. I have full trust in base R and many of the well established R packages, but my trust in the accuracy of results does decrease a little as additional, less mature packages are added to an analysis. It is definitely easier to trust the quality control process of Stata. Of course, due to the open source nature of R packages, anyone with enough statistics and programming knowledge (and time) can verify the accuracy of any R package. I unfortunately do not have those skills.
Standardization. This is admittedly a double-edged sword. R is powerful because of the maaaany different ways to accomplish tasks of all kinds depending on your preference and need. In Stata, there is often one standard approach for something (although of course other approaches are possible). This makes it much quicker to learn (i.e., not needing to decide on base R vs. dplyr vs. data.table or become familiar with all three based on what different collaborators are using).
Metadata. Stata is very good at handling datasets with notes, variable labels, value labels, and labels for multiple types of missing data. This is a major advantage when working with secondary datasets. This is not as well implemented in R. Kyle Husmann has a good blog post explaining this issue: Three Reasons Education And Social Scientists Prefer Proprietary Software And Data Formats.
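
To make two of these points a bit more concrete (version pinning and metadata), here is a minimal sketch of a .do file. The version number, the tiny dataset, the variable names, and the label name satlbl are all made up for illustration; the commands themselves (version, label, notes, tabulate) are standard Stata.

```stata
* Pin the interpreter to a specific release's behavior.
* (17 is an arbitrary choice for this example.)
version 17

* Build a tiny hypothetical dataset; .a and .b are Stata's
* extended missing values, used here for two kinds of missingness.
clear
input id satisfaction
1 1
2 2
3 .a
4 .b
end

* Attach a variable label, value labels (including labels for
* the extended missing values), and a note to the variable.
label variable satisfaction "Job satisfaction (hypothetical item)"
label define satlbl 1 "Low" 2 "High" .a "Refused" .b "Not asked"
label values satisfaction satlbl
notes satisfaction: Hypothetical survey item, for illustration only.

* Show the labeled categories, including the two missing types,
* then list the variable metadata and the attached note.
tabulate satisfaction, missing
describe
notes
```

The labels and the note are stored in the dataset itself, so they travel with the .dta file when it is saved and shared.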

At the risk of setting off a flame war, my unserious hot take here is that using R is to using Linux as using Stata is to using macOS. And I am admittedly more of a macOS user than a Linux user.