
Data quality check for hierarchical linear data with outliers

Suzin You 2 May 2022

In this post, we discuss how IDinsight’s DSEM team implemented a data quality check for pairs of variables displaying linear relationships. We explain why we chose a Bayesian model and the tweaks we made to address the hierarchy and outliers in the data. We also show how to implement the model using PyMC3.

Photo credit: Carlos Castilla on iStock by Getty Images

Last year, IDinsight built a tool for a national government agency to automatically check the quality of the data reported by dozens of district governments every month. Some of these checks were simple, like checking for null values, illegal values, and outliers.

Another check we implemented was what we call “ratio checks”.¹ We observed that many of the variables consistently displayed linear relationships in pairs. There were logical reasons for these relationships, too. For example, the number of institutional deliveries and the number of antenatal care registrations each month would be similar, since those who registered for antenatal care would likely seek institutional help when giving birth. We wanted to check that new values would also follow these patterns.

For a pair of variables with a linear relationship, we want to check that a new point (x, y) follows the same pattern.

We had three requirements for this check.

Requirement 1: The results should be intuitive. Something like a probability that a new point is anomalous given past data would be useful.

Requirement 2: These linear relationships look slightly different for different districts. We want an efficient model that takes into account both similarities and differences across the districts.

Requirement 3: Past data contains outliers. There might have been data entry errors or a rare event in certain districts that resulted in one variable taking an extreme value relative to the other variable. We want the model to be robust to occasional outliers.

Below, we discuss how robust hierarchical Bayesian linear regression addressed these concerns, and show how to implement it using PyMC3.

Quantify uncertainty using a Bayesian model

Requirement 1: The results should be intuitive.

For a pair of variables (x, y) with a linear relationship, we want to estimate p(y|x) based on past data. We can flag a point (xi, yi) as an outlier if it’s unlikely that we would observe a value as extreme as yi.

We could fit a simple linear regression model:

y = α + βx + ε,   ε ∼ Normal(0, σ²)

which can be viewed as

Y | x ∼ Normal(α + βx, σ²)

In Bayesian statistics, we want to find the distribution of the model parameters θ = {α, β, σ} given the data that we already observed. This is called the posterior, written p(θ|y).²

Once we have the posterior, we can evaluate how likely it is to see some new data y’ (given x’) by integrating p(y’|θ) over the posterior. This distribution, p(y’|y), is called the posterior predictive distribution.

To get the posterior, we first set prior distributions over the parameters as follows:

α ∼ Normal(0,5)
β ∼ Normal(0,2)
σ ∼ HalfNormal(5)

This represents our initial belief about what values of the parameters are likely.

Now, we simulate the posterior using MCMC sampling, which is simple with PyMC3:
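The snippet below is a minimal sketch of what this looks like in PyMC3 (the variable names x and y stand for the historical observations, and the toy data and sampler settings are illustrative):

```python
import numpy as np
import pymc3 as pm

# Toy stand-in for the historical data (the real x, y come from past reports)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.8 * x + rng.normal(0, 1, size=200)

with pm.Model() as model:
    # Priors
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood of the observed y given x
    pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y)

    # Simulate the posterior with MCMC (NUTS); draws and tuning steps are illustrative
    trace = pm.sample(2000, tune=1000)
```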

The trace contains samples from the posterior distribution, and its histogram gives us an approximation of the posterior.

Definitions

  • Posterior distribution (p(θ|y)) – Distribution over model parameters given the observed data
  • Posterior predictive distribution (p(y’|y)) – Distribution over unobserved data based on the observed data and the posterior distribution
  • Credible interval – Interval into which an unobserved parameter falls with a given probability
  • Highest density interval (HDI) – The narrowest credible interval for a given probability

Making predictions

Now that we have the posterior p(θ|y) we can approximate the posterior predictive distribution p(y’|y) for new data y’: We first sample parameters from the posterior, θ’∼p(θ|y), and then sample from p(y’|θ’).

To sample from the posterior predictive in PyMC3, we have to modify the code slightly. We need to tell the model that x is training data (see 0. Data) that can be replaced with some test data later on (see 4. Test data).
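A sketch of the modified model is below; the numbered comments mirror the references in the text, and x_train, y_train, and x_test are placeholder names for the historical and new data:

```python
import numpy as np
import pymc3 as pm

# x_train, y_train: historical observations; x_test: new x values to check

with pm.Model() as model:
    # 0. Data: wrap the inputs in pm.Data so they can be swapped out later
    x_data = pm.Data("x_data", x_train)
    y_data = pm.Data("y_data", y_train)

    # 1. Priors
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # 2. Likelihood
    pm.Normal("y", mu=alpha + beta * x_data, sigma=sigma, observed=y_data)

    # 3. Sample from the posterior
    trace = pm.sample(2000, tune=1000)

    # 4. Test data: swap in the new x values (and a dummy y of matching length),
    #    then draw from the posterior predictive distribution
    pm.set_data({"x_data": x_test, "y_data": np.zeros(len(x_test))})
    ppc = pm.sample_posterior_predictive(trace, var_names=["y"])
```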

ppc["y"] contains samples drawn from the posterior predictive distribution. Using these samples, we can compute credible intervals at various probabilities. For example, for a given x’, there is a 95% probability that the corresponding y’ falls within the 95% credible interval (ya, yb), i.e. p(ya < y’ < yb | x’) = 95%.³ If y’ falls outside this range, we flag the data point as anomalous.⁴
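As a rough sketch of the check itself (assuming y_test holds the newly reported values, and using ArviZ, which PyMC3 relies on, to compute the HDI):

```python
import arviz as az
import numpy as np

samples = ppc["y"]  # shape: (n_draws, n_test_points)

# az.hdi expects a (chain, draw, ...) array, so add a leading chain dimension
hdi = az.hdi(samples[np.newaxis, ...], hdi_prob=0.95)  # shape: (n_test_points, 2)

# Flag points whose reported y falls outside the 95% HDI
is_anomalous = (y_test < hdi[:, 0]) | (y_test > hdi[:, 1])
```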

95% highest density credible interval (HDI) of the posterior predictive distribution p(y’|y) (in the orange area) compared to original data.

A hierarchical model for hierarchical data

Requirement 2: We want an efficient model that takes into account both similarities and differences across the districts.

In our data, the same pair of variables had slightly different intercepts and variances across different districts. A single linear model fit across the entire dataset would ignore this heterogeneity. On the other hand, if we fit a different linear model for each district, we’d lose out on the shared patterns across the districts and might end up with noisy estimates.

Toy hierarchical data. Each group displays a linear pattern, but with differing slopes and variances.

Rather than a single global model or dozens of individual models, we chose a hierarchical (also known as partial pooling or multilevel) model, where we assume that the parameters of pi(y|x) for each group i (in our case, each district) come from a common distribution.

In the first level, we set hyperpriors for the global means (μα, μβ) and variances (σα, σβ) of the group-wise intercepts (αi’s) and group-wise slopes (βi’s).

μα ∼ Normal(0,2)
σα ∼ HalfNormal(5)
μβ ∼ Normal(1,2)
σβ ∼ HalfNormal(5)

In the next level, the parameters above are plugged into the group-wise priors. This is where we take the group-wise variations into account. For each group i the priors are

αi ∼ Normal(μα, σα²)
βi ∼ Normal(μβ, σβ²)

Then our likelihood for each observation j in group i is

Yij | xij ∼ Normal(αi + βi xij, σ²)
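In PyMC3, this hierarchical model might look roughly as follows; group_idx is assumed to be an integer array mapping each observation to its district, and n_groups is the number of districts (both names are illustrative):

```python
import pymc3 as pm

with pm.Model() as hierarchical_model:
    # Hyperpriors: global means and spreads of the group-wise parameters
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
    mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
    sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)

    # Group-wise intercepts and slopes: one per district
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_groups)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood: each observation uses its own district's intercept and slope
    mu = alpha[group_idx] + beta[group_idx] * x
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(2000, tune=2000, target_accept=0.9)
```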

A hierarchical model: 95% HDI of the posterior predictive distribution (orange area) of test data (blue markers).

Robust regression for data with outliers

Requirement 3: We want the model to be robust to occasional outliers.

Another issue was that the data contained outliers, either due to data entry errors or some rare events. We wanted a model that could robustly handle those points.

Linear, but with a few outliers

We achieved this by changing the likelihood distribution from Gaussian to Student’s T:

Yi | xi ∼ StudentT(αi + βi xi, σ, ν)

Student’s T distribution looks a lot like the Gaussian but is more heavy-tailed.⁵ This means a few extreme values are more plausible under the model, so occasional outliers do not skew the parameter estimates as much.

Combining this with the hierarchical structure, our code looked like the following. Note the priors for each level of the hierarchy, and that we use shape=n_groups to create district-wise parameters. We also set priors over the parameters for Student’s T distribution.
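A minimal sketch of what such a model might look like is below, wrapped in a helper function so it can be rebuilt on demand (the function name, group_idx, n_groups, and the Poisson rate for ν are illustrative assumptions):

```python
import pymc3 as pm


def build_robust_model(x, y, group_idx, n_groups):
    """Robust hierarchical regression: one intercept/slope per district, Student's T likelihood."""
    with pm.Model() as model:
        # Level 1: hyperpriors shared across all districts
        mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
        sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
        mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
        sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)

        # Level 2: district-wise intercepts and slopes (shape=n_groups)
        alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)
        beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_groups)

        # Priors over the Student's T scale and degrees of freedom
        sigma = pm.HalfNormal("sigma", sigma=5)
        nu = pm.Poisson("nu", mu=20) + 1  # discrete prior; +1 keeps the degrees of freedom >= 1

        # Heavy-tailed likelihood is less sensitive to occasional outliers
        mu = alpha[group_idx] + beta[group_idx] * x
        pm.StudentT("y", nu=nu, mu=mu, sigma=sigma, observed=y)
    return model


# Fit on historical data (x, y, group_idx are 1-D arrays of equal length):
# with build_robust_model(x, y, group_idx, n_groups):
#     trace = pm.sample(2000, tune=2000, target_accept=0.95)
```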

Deployment

The existing data quality checks were running on Airflow, with a single data quality check running as a single pipeline. Since inference requires lots of sampling time, we decided to create a separate inference (or “model fitting”) pipeline just for ratio checks. This way, if an error occurs or if the sampling doesn’t converge, we can address the issue without blocking the check, which can be run using the latest stable model.

We defined a regressor class with fit and predict methods. In the model fitting pipeline, the fit method is called on existing data, and a fitted model containing the trace is uploaded to an S3 bucket. In the data quality check pipeline, the model is pulled from S3 and the predict method is called to estimate the posterior predictive distribution for y. We then check whether the actual values of y fall within the HDI of the chosen width. See below for a sketch of this implementation.
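As a rough illustration (not the production code), a wrapper class reusing the build_robust_model helper sketched above might look like this; the method signatures are illustrative:

```python
import arviz as az
import numpy as np
import pymc3 as pm


class RatioCheckRegressor:
    """Scikit-learn-style wrapper around the robust hierarchical model (illustrative sketch)."""

    def fit(self, x, y, group_idx, n_groups):
        self.n_groups = n_groups
        with build_robust_model(x, y, group_idx, n_groups):
            # Only the trace is persisted; the model is rebuilt before every sampling run
            self.trace = pm.sample(2000, tune=2000, target_accept=0.95)
        return self

    def predict(self, x_new, group_idx_new, hdi_prob=0.95):
        # Rebuild the model around the new inputs and draw from the posterior predictive
        dummy_y = np.zeros(len(x_new))
        with build_robust_model(x_new, dummy_y, group_idx_new, self.n_groups):
            ppc = pm.sample_posterior_predictive(self.trace, var_names=["y"])
        # Return the HDI bounds the data quality check compares actual values against
        return az.hdi(ppc["y"][np.newaxis, ...], hdi_prob=hdi_prob)


# In the model fitting pipeline, the fitted regressor's trace is serialized and uploaded
# to S3; the data quality check pipeline downloads it and calls predict.
```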

These pipelines can then run on different schedules as required.

Airflow pipelines of 1) model fitting and 2) prediction as part of data quality checks.

Final thoughts

We created a data quality check called “ratio check” for linearly related variables. In this check, data points that are unlikely given the historical linear relationship would be marked as outliers. To quantify that unlikeliness, we fit a Bayesian model. Since the linear relationships were distinct yet related across different groups, we adopted a hierarchical structure. To handle outliers in the past data, we used Student’s T likelihood instead of the usual Gaussian likelihood. Finally, we wrapped this process in a class with a scikit-learn style API with fit and predict methods and deployed it through two Airflow pipelines: one pipeline calls fit to regularly train an up-to-date model, and another pipeline calls predict when the data quality check is triggered.

Challenges

In this process, we ran into a few difficulties, one of them being the sampling time. To get an accurate picture of the posterior, we needed to draw thousands of samples for each parameter. We also had to try a different hierarchical parametrization for the sampling to converge (see the sketch below). It also turned out that a discrete prior, such as Poisson, for the degrees of freedom of the Student’s T distribution was sufficient; a continuous prior like HalfCauchy led to much slower convergence.
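For reference, the kind of reparametrization that typically helps here is the “non-centered” form, where each group-wise parameter is written as the global mean plus a scaled standard-normal offset. A sketch for the intercepts (not necessarily the exact change we made):

```python
import pymc3 as pm

n_groups = 10  # illustrative

with pm.Model():
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)

    # Centered form (can cause divergences when the group-level spread is small):
    # alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)

    # Non-centered form: sample standardised offsets, then shift and scale them
    alpha_offset = pm.Normal("alpha_offset", mu=0, sigma=1, shape=n_groups)
    alpha = pm.Deterministic("alpha", mu_alpha + sigma_alpha * alpha_offset)
```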

Another tricky part was deploying PyMC3 on Airflow, likely because of the way PyMC3 uses Theano as its backend. We learned that the model needs to be rebuilt each time before sampling and that the only thing that should be saved is the trace. Because we run multiple ratio check pipelines in parallel inside the same Docker container, we also had to make sure that the Theano cache was reset before each sampling run.

Future Improvements

Of course, our choice of model only works if the pairs of variables consistently display linear relationships across all groups. We picked the pairs for ratio checks based not just on empirical linear relationships but also on whether it was logical or reasonable for them to be linearly related. However, an external shock could change a relationship, as seems to have happened in a few of the districts. In one case, one variable continued to increase while the other lagged behind, changing the slope of the relationship. In another, one of the variables displayed a step change, changing the intercept of the linear model. We will need a more flexible model for these cases.

Currently, we run model diagnostics manually (such as looking at trace plots, effective sample sizes, number of divergences, and R̂ values), but these should be integrated into the model fitting pipeline. If any of the diagnostics indicate convergence problems, we can trigger a notification for human inspection.

Resources

Follow along with this notebook to fit the simple, hierarchical, and robust hierarchical models.

There are a number of details and interesting detours we didn’t mention, but here are some resources for you to dig deeper:

  1. Prior and Posterior Predictive Checks: a PyMC3 tutorial for checking that our priors give a reasonable predictive distribution, before we look at the data (prior predictive check), and checking if the posterior predictive distribution is reasonable (posterior predictive check).
  2. GLM: Robust Regression using Custom Likelihood for Outlier Classification: a PyMC3 tutorial on outlier detection. While the robust regression was sufficient for our case, the Hogg method, which uses a mixture of a linear model for inliers and another model for outliers, might be more appropriate in other cases.
  3. Hierarchical Partial Pooling: a PyMC3 tutorial for hierarchical models.
  4. Why hierarchical models are awesome, tricky, and Bayesian: Thomas Wiecki writes about why one parametrization of hierarchical model might work better than another.
  5. For a comprehensive reference on Bayesian modelling, see Bayesian Data Analysis by Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin.
  1. This is a misnomer, since we’re checking for the linear relationship and not the ratio of the variables, but we have yet to come up with an appropriate name!
  2. Technically, the posterior is also conditioned on the observed x values, and the posterior predictive distribution is conditioned on both observed and new x values, but we omit them for brevity.
  3. An N% credible interval can be found over many different ranges of y. We choose the highest density interval (HDI), which captures the same probability over the smallest range of y.
  4. This intuitive interpretation is in contrast to confidence intervals in frequentist settings. There, a 95% confidence interval only guarantees that the procedure used to compute the interval captures the correct parameter 95% of the time; we can’t say anything about realized values.
  5. In fact, as the degrees of freedom approach infinity, the (standard) Student’s T distribution approaches the Normal(0,1) distribution.