# Data quality check for hierarchical linear data with outliers

*In this post, we discuss how **IDinsight’s DSEM team** implemented a data quality check for pairs of variables displaying linear relationships. We explain why we chose a Bayesian model and the tweaks we made to address the hierarchy and outliers in the data. We will also see how to implement the model using PyMC3.*

Last year, IDinsight built a tool for a national government agency to automatically check the quality of the data reported by dozens of district governments every month. Some of these checks were simple, like checking for null values, illegal values, and outliers.

Another check we implemented was what we call “ratio checks”.[^1] We observed that many of the variables consistently displayed linear relationships in pairs. There were logical reasons for these relationships, too. For example, the number of institutional deliveries and the number of antenatal care registrations each month would be similar, since those who registered for antenatal care would likely seek institutional help when giving birth. We wanted to check that new values would also follow these patterns.

We had three requirements for this check.

**Requirement 1:** The results should be intuitive. Something like a probability that a new point is anomalous given past data would be useful.

**Requirement 2:** These linear relationships look slightly different for different districts. We want an efficient model that takes into account both similarities and differences across the districts.

**Requirement 3:** Past data contains outliers. There might have been data entry errors or a rare event in certain districts that resulted in one variable taking an extreme value relative to the other variable. We want the model to be robust to occasional outliers.

Below, we discuss how robust hierarchical Bayesian linear regression addressed these concerns, and show how to implement it using PyMC3.

#### Quantify uncertainty using a Bayesian model

**Requirement 1:** The results should be intuitive.

For a pair of variables **x, y** with a linear relationship, we want to estimate **p(y|x)** based on past data. We can flag a point **(x_i, y_i)** as an outlier if it’s unlikely that we would observe a value as extreme as **y_i**. We could fit a simple linear regression model:

**y = α + βx + ϵ,  ϵ ∼ Normal(0, σ²)**

which can be viewed as

**Y|x ∼ Normal(α + βx, σ²)**

In Bayesian statistics, we want to find the distribution of the model parameters **θ = {α, β, σ}** given the data that we have already observed. This is called the *posterior*, written **p(θ|y)**.[^2]

Once we have the posterior, we can evaluate how likely it is to see some new data **y’** (given **x’**) by integrating **p(y’|θ)** over the posterior. This distribution, **p(y’|y)**, is called the *posterior predictive distribution*.

To get the posterior, we first set prior distributions over the parameters as follows:

**α ∼ Normal(0, 5)
β ∼ Normal(0, 2)
σ ∼ HalfNormal(5)**

This represents our initial belief about what values of the parameters are likely.

Now, we simulate the posterior using MCMC sampling, which is simple with PyMC3:
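
Below is a minimal sketch of this model in PyMC3. The data names `x_obs` and `y_obs`, and the synthetic data itself, are stand-ins for the real reporting data:

```python
import numpy as np
import pymc3 as pm

# Synthetic stand-in for the historical data
rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 10, size=200)
y_obs = 1.0 + 0.8 * x_obs + rng.normal(0, 1.0, size=200)

with pm.Model() as model:
    # Priors from above
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Gaussian likelihood for the observed y
    y = pm.Normal("y", mu=alpha + beta * x_obs, sigma=sigma, observed=y_obs)

    # Simulate the posterior with MCMC (NUTS by default)
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)
```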

The trace contains samples from the posterior distribution, and its histogram gives us an approximation of the posterior.

**Definitions**

- **Posterior distribution, p(θ|y)** – Distribution over the model parameters given the observed data
- **Posterior predictive distribution, p(y’|y)** – Distribution over unobserved data based on the observed data and the posterior distribution
- **Credible interval** – Interval into which an unobserved quantity falls with a given probability
- **Highest density interval (HDI)** – The narrowest credible interval for a given probability

#### Making predictions

Now that we have the posterior **p(θ|y)**, we can approximate the posterior predictive distribution **p(y’|y)** for new data **y’**: we first sample parameters from the posterior, **θ’ ∼ p(θ|y)**, and then sample from **p(y’|θ’)**.

To sample from the posterior predictive in PyMC3, we have to modify the code slightly. We need to tell the model that **x** is training data (see 0. Data) that can be replaced with some test data later on (see 4. Test data).
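
A sketch of the modified model, with the numbered steps referenced above (`x_new` is an assumed name for the new inputs):

```python
with pm.Model() as model:
    # 0. Data: wrap the inputs so they can be replaced later
    x = pm.Data("x", x_obs)
    y_data = pm.Data("y_data", y_obs)

    # 1. Priors
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # 2. Likelihood
    y = pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y_data)

    # 3. Inference
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)

# 4. Test data: swap in new x values (the dummy y only fixes the shape)
with model:
    pm.set_data({"x": x_new, "y_data": np.zeros_like(x_new)})
    ppc = pm.sample_posterior_predictive(trace)
```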

`ppc["y"]` contains samples drawn from the posterior predictive distribution. Using these, we can compute *credible intervals* at various probabilities. For example, for a given **x’**, there is a 95% probability that the corresponding **y’** falls within the 95% credible interval **(y_a, y_b)**, i.e. **p(y_a < y’ < y_b | x’) = 95%**.[^3] If **y’** falls outside this range, we would flag the data point as anomalous.[^4]
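
For instance, we can compute the HDI (see Definitions above) with ArviZ; `y_new` is an assumed name for the newly reported values:

```python
import arviz as az

# ppc["y"] has shape (n_samples, len(x_new)); prepend a chain
# dimension so ArviZ reads it as (chain, draw, observation)
hdi = az.hdi(ppc["y"][None], hdi_prob=0.95)  # shape (len(x_new), 2)

# Flag points whose reported y falls outside the 95% HDI
is_anomalous = (y_new < hdi[:, 0]) | (y_new > hdi[:, 1])
```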

#### A hierarchical model for hierarchical data

**Requirement 2:** We want an efficient model that takes into account both similarities and differences across the districts.

In our data, the same pair of variables had slightly different intercepts and variances across different districts. A single linear model fit across the entire dataset would ignore this heterogeneity. On the other hand, if we fit a different linear model for each district, we’d lose out on the shared patterns across the districts and might end up with noisy estimates.

Rather than a single global model or dozens of individual models, we chose a hierarchical (also known as partial pooling or multilevel) model, where we assume that the parameters of **p_i(y|x)** for each group (in our case, district) come from a common distribution.

In the first level, we set *hyperpriors* for the global means **(μ_α, μ_β)** and standard deviations **(σ_α, σ_β)** of the group-wise intercepts (**α_i**’s) and slopes (**β_i**’s):

**μ_α ∼ Normal(0, 2)
σ_α ∼ HalfNormal(5)
μ_β ∼ Normal(1, 2)
σ_β ∼ HalfNormal(5)**

In the next level, the parameters above are plugged into the group-wise priors. This is where we take the group-wise variations into account. For each group **i**, the priors are

**α_i ∼ Normal(μ_α, σ_α²)
β_i ∼ Normal(μ_β, σ_β²)**

Then our likelihood for each observation **j** in group **i** is

**Y_ij | x_ij ∼ Normal(α_i + β_i x_ij, σ²)**
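
In PyMC3, this amounts to giving the intercepts and slopes a `shape` equal to the number of districts and indexing into them per observation. A sketch, where `group_idx` and `n_groups` are assumed stand-ins for each observation’s district index:

```python
# Stand-ins: district index (0..n_groups-1) for each observation
n_groups = 3
group_idx = rng.integers(0, n_groups, size=len(x_obs))

with pm.Model() as hierarchical_model:
    # Hyperpriors: where district parameters cluster and how much they vary
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
    mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
    sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)

    # Group-wise priors: one intercept and slope per district
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_groups)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood: each observation uses its own district's parameters
    mu = alpha[group_idx] + beta[group_idx] * x_obs
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)

    trace = pm.sample(2000, tune=1000)
```

(This is the centered parametrization; as noted under Challenges below, a different parametrization may be needed for sampling to converge.)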

#### Robust regression for data with outliers

**Requirement 3:** We want the model to be robust to occasional outliers.

Another issue was that the data contained outliers, either due to data entry errors or some rare events. We wanted a model that could robustly handle those points.

We achieved this by changing the likelihood distribution from Gaussian to Student’s T:

**Y_ij | x_ij ∼ StudentT(α_i + β_i x_ij, σ, ν)**

Student’s T distribution looks a lot like the Gaussian but has heavier tails.[^5] This means that a few outliers are more likely under the model, and hence will not skew the parameter estimates as much.

Combining this with the hierarchical structure, our code looked like the following. Note the priors for each level of the hierarchy, and that we use shape=n_groups to create district-wise parameters. We also set priors over the parameters for Student’s T distribution.
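
A sketch along those lines (`group_idx` and `n_groups` as before; the Poisson mean for ν is an illustrative choice, shifted by one so that ν ≥ 1):

```python
with pm.Model() as robust_hierarchical_model:
    # Hyperpriors shared across districts
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
    mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
    sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)

    # District-wise intercepts and slopes
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_groups)

    # Priors over the Student's T parameters: scale and a discrete
    # prior on the degrees of freedom (see Challenges below)
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Poisson("nu_minus_one", mu=20) + 1

    # Robust likelihood: heavy tails absorb occasional outliers
    mu = alpha[group_idx] + beta[group_idx] * x_obs
    y = pm.StudentT("y", nu=nu, mu=mu, sigma=sigma, observed=y_obs)

    trace = pm.sample(2000, tune=1000)
```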

#### Deployment

The existing data quality checks were running on Airflow, with each data quality check running as its own pipeline. Since inference requires a lot of sampling time, we decided to create a separate inference (or “model fitting”) pipeline just for ratio checks. This way, if an error occurs or if the sampling doesn’t converge, we can address the issue without blocking the check, which can be run using the latest stable model.

We defined a regressor class with fit and predict methods. In the model fitting pipeline, the fit method is called on existing data, and a fitted model containing the trace is uploaded to an S3 bucket. In the data quality check pipeline, the model is pulled from S3 and the predict method is called to estimate the posterior predictive distribution for **y**. We then check whether the actual values of **y** fall within the HDI of the chosen width. See below for a sketch of this implementation.
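
A simplified sketch of such a class (the class name, method signatures, and prior choices are illustrative rather than the exact production code; the S3 upload and download steps are omitted):

```python
import numpy as np
import pymc3 as pm
import arviz as az


class RatioCheckRegressor:
    """Robust hierarchical regressor with a scikit-learn style fit/predict API."""

    def __init__(self, n_groups, draws=2000, tune=1000):
        self.n_groups = n_groups
        self.draws = draws
        self.tune = tune
        self.trace = None  # the only state persisted between pipelines

    def _build_model(self, x, group_idx, y):
        # Rebuild the model graph on every call; only the trace is saved
        with pm.Model() as model:
            mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
            sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
            mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
            sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)
            alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha,
                              shape=self.n_groups)
            beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta,
                             shape=self.n_groups)
            sigma = pm.HalfNormal("sigma", sigma=5)
            nu = pm.Poisson("nu_minus_one", mu=20) + 1
            mu = alpha[group_idx] + beta[group_idx] * x
            pm.StudentT("y", nu=nu, mu=mu, sigma=sigma, observed=y)
        return model

    def fit(self, x, group_idx, y):
        # Model fitting pipeline: sample and keep the trace
        with self._build_model(x, group_idx, y):
            self.trace = pm.sample(self.draws, tune=self.tune,
                                   return_inferencedata=False)
        return self

    def predict(self, x, group_idx, hdi_prob=0.95):
        # Data quality check pipeline: posterior predictive HDI per point
        # (the dummy observations only fix the output shape)
        with self._build_model(x, group_idx, np.zeros_like(x)):
            ppc = pm.sample_posterior_predictive(self.trace)
        return az.hdi(ppc["y"][None], hdi_prob=hdi_prob)
```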

These pipelines can then run on different schedules as required.

#### Final thoughts

We created a data quality check called “ratio check” for linearly related variables. In this check, data points that are unlikely given the historical linear relationship would be marked as outliers. To quantify that unlikeliness, we fit a Bayesian model. Since the linear relationships were distinct yet related across different groups, we adopted a hierarchical structure. To handle outliers in the past data, we used Student’s T likelihood instead of the usual Gaussian likelihood. Finally, we wrapped this process in a class with a scikit-learn style API with fit and predict methods and deployed it through two Airflow pipelines: one pipeline calls fit to regularly train an up-to-date model, and another pipeline calls predict when the data quality check is triggered.

#### Challenges

In this process, we ran into a few difficulties, one of them being the sampling time. To get a more accurate picture of the posterior, we needed to draw thousands of samples for each parameter. We also had to try a different hierarchical parametrization for the sampling to converge. It also turned out that a discrete prior, such as Poisson, for the degrees of freedom in Student’s T distribution was sufficient; a continuous prior like HalfCauchy led to much slower convergence.

Another tricky part was deploying PyMC3 on Airflow, likely because of the way PyMC3 uses Theano on the backend. We learned that the model needs to be rebuilt each time before sampling, and that the only thing that should be saved is the trace. Because we run multiple ratio check pipelines in parallel inside the same Docker container, we also had to make sure that the Theano cache was reset before each sampling run.

#### Future Improvements

Of course, our choice of model only works if the pairs of variables consistently display linear relationships across all groups. We picked the pairs for ratio checks based not just on empirical linear relationships but also on whether it is logical or reasonable for them to be linearly related. However, some external shock could change a relationship, as seemed to have happened in a few of the districts. In one case, one variable continued to increase while the other lagged behind, changing the slope of the relationship. In another, one of the variables displayed a step change, changing the intercept of the linear model. We will need a more flexible model for these cases.

Currently, we run model diagnostics manually (such as looking at trace plots, effective sample sizes, number of divergences, and **R̂** values), but these should be integrated into the model fitting pipeline. If any of the diagnostics indicate potential divergence, we can trigger a notification for human inspection.

#### Resources

Follow along with this notebook to fit the simple, hierarchical, and robust hierarchical models.

There are a number of details and interesting detours we didn’t mention, but here are some resources for you to dig deeper:

- Prior and Posterior Predictive Checks: a PyMC3 tutorial for checking that our priors give a reasonable predictive distribution, before we look at the data (prior predictive check), and checking if the posterior predictive distribution is reasonable (posterior predictive check).
- GLM: Robust Regression using Custom Likelihood for Outlier Classification: a PyMC3 tutorial on outlier detection. While the robust regression was sufficient for our case, the Hogg method, which uses a mixture of a linear model for inliers and another model for outliers, might be more appropriate in other cases.
- Hierarchical Partial Pooling: a PyMC3 tutorial for hierarchical models.
- Why hierarchical models are awesome, tricky, and Bayesian: Thomas Wiecki writes about why one parametrization of hierarchical model might work better than another.
- For a comprehensive reference on Bayesian modelling, see Bayesian Data Analysis by Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin.


[^1]: This is a misnomer, since we’re checking for the linear relationship and not the ratio of the variables, but we have yet to come up with an appropriate name!
[^2]: Technically, the posterior is also conditioned on the observed **x** values, and the posterior predictive distribution is conditioned on both observed and new **x** values, but we omit them for brevity.
[^3]: An N% credible interval can be found over many different ranges of y. We choose the highest density interval (HDI), which captures the same probability over the smallest range of y.
[^4]: This intuitive interpretation is in contrast to confidence intervals in frequentist settings. There, a 95% confidence interval only guarantees that the procedure used to compute the interval will capture the correct parameter 95% of the time; we can’t say anything about realized values.
[^5]: In fact, as the degrees of freedom approach infinity, the Student’s T distribution approaches Normal(0, 1).
