
Right tool, right time: statistical software at IDinsight

IDinsight’s Elizabeth Bennett codes in a government school in Rajasthan, India during the Educate Girls Development Impact Bond baseline. ©IDinsight/Kate Sturla

IDinsight’s analytical approach varies based on the problem a client asks us to help solve, but it is always rooted in rigorous technical methods. There is a wide range of statistical software we could deploy to conduct our analyses, and we continuously think about how to use the right tool at the right time.

Stata has usually been our software of choice for project work. But there are cases where free programming languages like R and Python are better suited to the task at hand. In this post, we discuss the costs and benefits of these three tools (Stata, R, and Python) and how we make the most of their distinct strengths to solve different client problems.

Stata: Manage data and do most statistical tasks with relatively low learning costs

Stata is our preferred software for most day-to-day econometric tasks. These include calculating statistical power, cleaning data, running regressions and interpreting their outputs, and creating data visualizations. Because of Stata’s intuitive nature, straightforward syntax, and easy-to-use help files, it has gained popularity in academic circles as well as the wider global development sector. Many of our teammates have used Stata as part of their academic training prior to joining IDinsight, and many of our partners also tend to be familiar with it, which facilitates collaboration.

Related to this, there is an element of path dependency in our adherence to Stata. Early IDinsighters were proficient in Stata and, recognizing the benefits of having everyone in the organization be familiar with the same software, they established Stata as the organizational standard. New cohorts were trained on Stata during technical orientation and were required to use Stata on the projects that they were joining. While we have considered switching to different statistical software (like R) at various times, we have kept Stata as our default statistical tool since the organization-wide switching costs were higher than the perceived benefits. Instead of switching, we have gradually added other tools and programming languages to our toolkit that offer distinct advantages for certain applications.

R: Organize workflow easily and create graphics more intuitively, but with higher learning costs

R has seen a notable rise as a tool for data science in recent years, but it is more difficult to learn than Stata. It takes longer for non-computer scientists to become comfortable with R’s object-oriented nature: even simple data cleaning in R can be tricky to understand, especially when transitioning from Stata’s straightforward syntax. Documentation in Stata is also more detailed than in R. For these reasons, R requires much more intensive upfront training than Stata for team members initially unfamiliar with statistical programming. Stata also has built-in commands for certain kinds of statistical analysis that take more work to replicate in R, e.g. calculating heteroskedasticity-robust standard errors.

That said, R has advantages over Stata in some areas. The first is its integration with Markdown. Analyses conducted in Markdown-based documents have improved project flow, navigation, and reproducibility. With RMarkdown, everything from data cleaning to data analysis and reporting can be done in one all-inclusive document with collapsible headings and subheadings. Using self-contained documents with end-to-end analysis of data (along with narrative text, charts, and tables) can increase the quality of project work by ensuring that it is transparent and reproducible. Stata currently doesn’t have a default option to create sections in the same code file; instead, we create a “master.do” file that calls other .do files. While Stata Markdown theoretically has the same capabilities as RMarkdown, in practice it is harder to use because compilation errors are difficult to diagnose.

Additionally, we have found that R’s ggplot2 and shiny packages provide more intuitive and extensible graphics capabilities than Stata. ggplot2, based on a unified “grammar of graphics”, allows us to build graphs from modular graphical elements. R offers comprehensive options to customize graphs, from creating rules that selectively color data points to automatically ensuring that bars don’t cover bar labels. While Stata offers similar capabilities, the grammar of graphics makes it intuitive to understand how to make these modifications in R. We have also found it easier in R than in Stata to write concise code to customize graphs, without needing to copy-paste code snippets. Thanks to shiny, analysis and visualization in R can be interactive, app-based, and website-compatible, offering possibilities that were previously available only through costly drag-and-drop software. For example, when using geospatial data, shiny makes it easy to overlay maps with different subsets of the population and carry out preliminary data analysis online.

Thus, while we continue to use Stata for nearly all of our project work, we also use R when we find it suits our needs better. Some teams use Stata for analyses because Stata’s functions are usually better suited to the project needs, but use R to make graphics. Other teams use R for project work beyond graphics because of features like Markdown.

Figure 1: Sample graph made using ggplot2

Python: Use specifically for data scraping and machine learning

Beyond these, certain specialized needs call for other tools. We do much of our work related to data scraping and machine learning in Python, a high-level programming language widely known for its simple syntax.

We often need to scrape data (e.g. village lists for sampling, estimates to benchmark our results to, ancillary data to make a model stronger), and we have found Python to be fast, powerful, customizable, and easy to learn and use for this purpose.
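To illustrate the kind of scraping task we mean, here is a minimal sketch using only Python’s standard library: pulling village names out of an HTML table. The table contents and the parser class are invented for the example; in practice the page would be fetched over HTTP, and a dedicated parsing library would often do the heavy lifting.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a scraped page of village listings.
SAMPLE_HTML = """
<table id="villages">
  <tr><td>Alwar</td></tr>
  <tr><td>Bhilwara</td></tr>
  <tr><td>Churu</td></tr>
</table>
"""

class VillageListParser(HTMLParser):
    """Collect the text content of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.villages = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a table cell.
        if self.in_cell and data.strip():
            self.villages.append(data.strip())

parser = VillageListParser()
parser.feed(SAMPLE_HTML)
# parser.villages -> ["Alwar", "Bhilwara", "Churu"]
```

A real scraping script would wrap this in pagination and error handling, but the core pattern — fetch, parse, extract into a clean list for sampling — stays this compact.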

We have used Python for machine learning in various projects, such as predicting the number of out-of-school girls in North India. The main reason we use Python is that it is quickly becoming the most common language for machine learning across industries. Machine learning algorithms rely on very large training datasets, which makes a language with good data management and optimized algorithmic efficiency incredibly important. Common Python packages such as NumPy, pandas, scikit-learn, and TensorFlow are constantly being updated to improve in these domains. By writing machine learning code in Python, we can leverage these innovations and spend time on the substantive parts of machine learning that are project- and domain-specific, such as feature engineering, model selection, and model tuning. Furthermore, a growing number of packages implement a wide variety of “supervised” machine learning algorithms, making it easier to get started.
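To make that workflow concrete, here is a minimal sketch of the kind of supervised-learning pipeline these packages enable, assuming scikit-learn is installed and using its bundled iris dataset as a stand-in for project data (this is an illustration, not the actual out-of-school-girls model):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data: the bundled iris dataset stands in for real survey data.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Model selection and tuning (n_estimators, max_depth, ...) are where the
# project- and domain-specific judgment mentioned above comes in.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
```

The point is less the particular classifier than how little boilerplate the package handles for us: splitting, fitting, and scoring are each one call, leaving our time for feature engineering and validation.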

Python’s simple syntax, its growing popularity across industries, and the number of pre-existing packages make it a very popular tool for machine learning projects in the global development context. However, because becoming proficient in Python still takes substantial time, and because it is less directly suited to applied statistics than Stata or R, we do not typically use it for most other data cleaning and analysis tasks.

Going forward…

As we continue to think about the right measurement and evaluation tools to apply to various problems in development, we will continue to research and test the effectiveness of different programming languages for each of these tools. New developments may also shift the advantages and disadvantages of different languages. For example, because R is free, the research community may gravitate toward R and away from Stata in the future, meaning that team members who join us from university and our academic partners may have more familiarity with R than Stata.

Each tool has specific applications and so we have found it valuable to add each to our organizational toolkit — without removing any. In the meantime, we welcome comments from others in the sector to continue this conversation: which languages have worked best for you, and for what purposes?