In this post, I discuss how IDinsight’s DSEM team implemented a data quality check for pairs of variables displaying linear relationships. We explain why we chose a Bayesian model and the tweaks we made to address the hierarchy and outliers in the data. We also show how to implement the model using PyMC3.
Last year, IDinsight built a tool for a national government agency to automatically check the quality of the data reported by dozens of district governments every month. Some of these checks were simple, like checking for null values, illegal values, and outliers.
Another check we implemented was what we call “ratio checks”. We observed that many of the variables consistently displayed linear relationships in pairs. There were logical reasons for these relationships, too. For example, the number of institutional deliveries and the number of antenatal care registrations each month would be similar, since those who register for antenatal care are likely to seek institutional help when giving birth. We wanted to check that new values would also follow these patterns.
We had three requirements for this check.
Requirement 1: The results should be intuitive. Something like a probability that a new point is anomalous given past data would be useful.
Requirement 2: These linear relationships look slightly different for different districts. We want an efficient model that takes into account both similarities and differences across the districts.
Requirement 3: Past data contains outliers. There might have been data entry errors or a rare event in certain districts that resulted in one variable taking an extreme value relative to the other variable. We want the model to be robust to occasional outliers.
Below, we discuss how robust hierarchical Bayesian linear regression addressed these concerns, and show how to implement it using PyMC3.
Requirement 1: The results should be intuitive.
For a pair of variables x, y with a linear relationship, we want to estimate p(y|x) based on past data. We can flag a point (xᵢ, yᵢ) as an outlier if it’s unlikely that we observe a value as extreme as yᵢ.
We could fit a simple linear regression model:
y = α + βx + ε,   ε ∼ Normal(0, σ²)
which can be viewed as
y|x ∼ Normal(α + βx, σ²)
In Bayesian statistics, we want to find the distribution of the model parameters θ = {α, β, σ} given the data we have already observed. This is called the posterior, written p(θ|y).
Once we have the posterior, we can evaluate how likely it is to see some new data y′ (given x′) by integrating p(y′|θ) over the posterior. This distribution, p(y′|y), is called the posterior predictive distribution.
To get the posterior, we first set prior distributions over the parameters as follows:
α ∼ Normal(0, 5),   β ∼ Normal(0, 2),   σ ∼ HalfNormal(5)
This represents our initial belief about what values of the parameters are likely.
Now, we simulate the posterior using MCMC sampling, which is simple with PyMC3:
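The original code block is not reproduced here, but a minimal sketch of this step looks roughly like the following, assuming x and y hold the past observations (the toy data and sampler settings are illustrative):

```python
import numpy as np
import pymc3 as pm

# Toy data for illustration; in practice x and y are the historical values
x = np.random.rand(100) * 10
y = 2.0 + 1.5 * x + np.random.randn(100)

with pm.Model() as model:
    # Priors, matching the ones above
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood: y | x ~ Normal(alpha + beta * x, sigma^2)
    y_obs = pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y)

    # Simulate the posterior with MCMC (NUTS)
    trace = pm.sample(2000, tune=1000)
```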
The trace contains samples from the posterior distribution, and its histogram gives us an approximation of the posterior.
Now that we have the posterior p(θ|y), we can approximate the posterior predictive distribution p(y′|y) for new data y′: we first sample parameters from the posterior, θ′ ∼ p(θ|y), and then sample from p(y′|θ′).
To sample from the posterior predictive in PyMC3, we have to modify the code slightly. We need to tell the model that x is training data (see 0. Data) that can be replaced with some test data later on (see 4. Test data).
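The original snippet is likewise not shown; under the assumption that pm.Data containers are used for this, the pattern is roughly as follows, with x_train, y_train, and x_test assumed to be NumPy arrays:

```python
import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # 0. Data: register the inputs as mutable data containers
    x_data = pm.Data("x_data", x_train)
    y_data = pm.Data("y_data", y_train)

    # Priors
    alpha = pm.Normal("alpha", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood
    y_obs = pm.Normal("y", mu=alpha + beta * x_data, sigma=sigma, observed=y_data)

    # Fit as before
    trace = pm.sample(2000, tune=1000)

with model:
    # 4. Test data: swap in the new x values; y_data gets dummy values of the
    # matching shape so the posterior predictive draws have the right length
    pm.set_data({"x_data": x_test, "y_data": np.zeros_like(x_test, dtype=float)})
    ppc = pm.sample_posterior_predictive(trace)
```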
ppc["y"] contains samples drawn from the posterior predictive distribution. Using these we can compute credible intervals at various probabilities. For example, for a given x′, there is a 95% probability that the corresponding y′ falls within the 95% credible interval (ya, yb), i.e. p(ya < y′ < yb | x′) = 95%. If y′ falls outside this range, we would flag the data point as anomalous.
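As a concrete, illustrative example of the flagging step, suppose ppc["y"] has shape (n_samples, n_test_points) and y_test holds the reported values:

```python
import numpy as np

# 95% equal-tailed credible interval for each test point,
# computed from the posterior predictive samples
lower, upper = np.percentile(ppc["y"], [2.5, 97.5], axis=0)

# Flag reported values that fall outside the interval as anomalous
is_anomalous = (y_test < lower) | (y_test > upper)
```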
Requirement 2: We want an efficient model that takes into account both similarities and differences across the districts.
In our data, the same pair of variables had slightly different intercepts and variances across different districts. A single linear model fit across the entire dataset would ignore this heterogeneity. On the other hand, if we fit a different linear model for each district, we’d lose out on the shared patterns across the districts and might end up with noisy estimates.
Rather than a single global model or dozens of individual models, we chose a hierarchical (also known as partial pooling or multilevel) model, where we assume that the parameters of pᵢ(y|x) for each group (in our case, district) come from a common distribution.
In the first level, we set hyperpriors for the global means (μα, μβ) and standard deviations (σα, σβ) of the group-wise intercepts (the αᵢ's) and group-wise slopes (the βᵢ's).
μα ∼ Normal(0, 2),   σα ∼ HalfNormal(5),   μβ ∼ Normal(1, 2),   σβ ∼ HalfNormal(5)
In the next level, the parameters above are plugged into the group-wise priors. This is where we take the group-wise variations into account. For each group i the priors are
αᵢ ∼ Normal(μα, σα²),   βᵢ ∼ Normal(μβ, σβ²)
Then our likelihood for each observation j in group i is
yᵢⱼ|xᵢⱼ ∼ Normal(αᵢ + βᵢ xᵢⱼ, σ²)
Requirement 3: We want the model to be robust to occasional outliers.
Another issue was that the data contained outliers, either due to data entry errors or some rare events. We wanted a model that could robustly handle those points.
We achieved this by changing the likelihood distribution from Gaussian to Student’s T:
yᵢⱼ|xᵢⱼ ∼ StudentT(αᵢ + βᵢ xᵢⱼ, σ, ν)
Student’s T distribution looks a lot like the Gaussian but has heavier tails. This means that a few extreme values are more likely under the model, and hence will not skew the parameter estimates as much.
Combining this with the hierarchical structure, our code looked like the following. Note the priors for each level of the hierarchy, and that we use shape=n_groups to create district-wise parameters. We also set priors over the parameters for Student’s T distribution.
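The original listing is not reproduced here; the sketch below shows what such a model can look like in PyMC3, assuming x, y, group_idx (the district index of each observation), and n_groups are already defined. The specific hyperprior values and the mean of the Poisson prior are illustrative.

```python
import pymc3 as pm

with pm.Model() as robust_hierarchical_model:
    # Hyperpriors: global means and spreads of the district-wise parameters
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
    mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
    sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)

    # District-wise intercepts and slopes, one per group
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_groups)
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_groups)

    # Priors for the Student's T noise scale and degrees of freedom
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Poisson("nu", mu=5) + 1  # discrete prior; +1 keeps nu >= 1

    # Robust likelihood: heavy tails make occasional outliers less influential
    mu = alpha[group_idx] + beta[group_idx] * x
    y_obs = pm.StudentT("y", nu=nu, mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(2000, tune=2000)
```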
The existing data quality checks were running on Airflow, with each data quality check running as its own pipeline. Since inference requires a lot of sampling time, we decided to create a separate inference (or “model fitting”) pipeline just for ratio checks. This way, if an error occurs or the sampling doesn’t converge, we can address the issue without blocking the check, which can still run using the latest stable model.
We defined a regressor class with fit and predict methods. In the model fitting pipeline, the fit method is called on existing data, and the fitted model containing the trace is uploaded to an S3 bucket. In the data quality check pipeline, the model is pulled from S3 and the predict method is called to estimate the posterior predictive distribution for y. We then check whether the actual values of y fall within the HDI of the chosen width. See below for a sketch of this implementation.
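The sketch below is a hypothetical, simplified version of what such a wrapper might look like; the class name, priors, and the use of an equal-tailed interval as a stand-in for the HDI are all illustrative, and the S3 upload/download plumbing is omitted.

```python
import numpy as np
import pymc3 as pm


class RatioCheckRegressor:
    """Hypothetical sketch of a scikit-learn style wrapper around the model."""

    def __init__(self, n_groups, draws=2000, tune=2000):
        self.n_groups = n_groups
        self.draws = draws
        self.tune = tune
        self.trace = None  # the only artefact that gets persisted (e.g. to S3)

    def _build_model(self, x, y, group_idx):
        # The model is rebuilt from scratch every time; only the trace is reused.
        with pm.Model() as model:
            mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
            sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)
            mu_beta = pm.Normal("mu_beta", mu=1, sigma=2)
            sigma_beta = pm.HalfNormal("sigma_beta", sigma=5)
            alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=self.n_groups)
            beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=self.n_groups)
            sigma = pm.HalfNormal("sigma", sigma=5)
            nu = pm.Poisson("nu", mu=5) + 1
            pm.StudentT("y", nu=nu, mu=alpha[group_idx] + beta[group_idx] * x,
                        sigma=sigma, observed=y)
        return model

    def fit(self, x, y, group_idx):
        # Called in the model fitting pipeline.
        with self._build_model(x, y, group_idx):
            self.trace = pm.sample(self.draws, tune=self.tune)
        return self

    def predict(self, x_new, group_idx_new, interval_prob=0.95):
        # Called in the data quality check pipeline: rebuild the model around the
        # new points and draw from the posterior predictive distribution.
        dummy_y = np.zeros_like(x_new, dtype=float)
        with self._build_model(x_new, dummy_y, group_idx_new):
            ppc = pm.sample_posterior_predictive(self.trace)
        tail = 100 * (1 - interval_prob) / 2
        lower, upper = np.percentile(ppc["y"], [tail, 100 - tail], axis=0)
        return lower, upper
```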
These pipelines can then run on different schedules as required.
We created a data quality check called “ratio check” for linearly related variables. In this check, data points that are unlikely given the historical linear relationship would be marked as outliers. To quantify that unlikeliness, we fit a Bayesian model. Since the linear relationships were distinct yet related across different groups, we adopted a hierarchical structure. To handle outliers in the past data, we used Student’s T likelihood instead of the usual Gaussian likelihood. Finally, we wrapped this process in a class with a scikit-learn style API with fit and predict methods and deployed it through two Airflow pipelines: one pipeline calls fit to regularly train an up-to-date model, and another pipeline calls predict when the data quality check is triggered.
In this process, we ran into a few difficulties, one of them being the sampling time. To get a more accurate picture of the posterior, we needed to draw thousands of samples for each parameter. We also had to try a different hierarchical parametrization for the sampling to converge. It also turned out that a discrete prior, such as Poisson, for the degrees of freedom of Student’s T distribution was sufficient; a continuous prior like HalfCauchy led to much slower convergence.
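The post doesn't name the alternative hierarchical parametrization mentioned above; a common choice for hierarchical models that struggle to converge is the non-centered form, sketched here for the intercepts only (the slopes are handled the same way, and the value of n_groups is illustrative):

```python
import pymc3 as pm

n_groups = 30  # e.g. number of districts; illustrative value

with pm.Model() as noncentered_sketch:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sigma=2)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5)

    # Non-centered parametrization: sample standardized offsets and scale them,
    # instead of sampling alpha ~ Normal(mu_alpha, sigma_alpha) directly
    alpha_offset = pm.Normal("alpha_offset", mu=0, sigma=1, shape=n_groups)
    alpha = pm.Deterministic("alpha", mu_alpha + sigma_alpha * alpha_offset)
```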
Another tricky part was deploying PyMC3 on Airflow, likely because of the way PyMC3 uses Theano on the backend. We learned that the model needs to be rebuilt each time before sampling, and that the only thing that should be saved is the trace. Because we run multiple ratio check pipelines in parallel inside the same Docker container, we also had to make sure that the Theano cache was reset before each sampling run.
Of course, our choice of model only works if the pairs of variables consistently display linear relationships across all groups. We picked the pairs for ratio checks based not just on empirical linear relationships but also on whether it is logical or reasonable for them to be linearly related. However, an external shock could change these relationships, as seems to have happened in a few of the districts. In one case, one variable continued to increase while the other lagged behind, changing the slope of the relationship. In another, one of the variables displayed a step change, changing the intercept of the linear model. We will need a more flexible model for these cases.
Currently, we run model diagnostics manually (such as looking at trace plots, effective sample sizes, the number of divergences, and R̂ values), but these should be integrated into the model fitting pipeline. If any of the diagnostics indicate potential convergence problems, we can trigger a notification for human inspection.
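As an illustration of what that automated step could check, assuming the trace and model from the earlier sketch (the thresholds here are common rules of thumb, not values from the project):

```python
import arviz as az

# Convert the PyMC3 trace into an InferenceData object for diagnostics
idata = az.from_pymc3(trace=trace, model=robust_hierarchical_model)
summary = az.summary(idata)

# Simple automated convergence checks the fitting pipeline could run
n_divergences = int(idata.sample_stats["diverging"].sum())
rhat_ok = bool((summary["r_hat"] < 1.01).all())
ess_ok = bool((summary["ess_bulk"] > 400).all())

if n_divergences > 0 or not rhat_ok or not ess_ok:
    pass  # e.g. trigger a notification for human inspection
```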
Follow along with this notebook to fit the simple, hierarchical, and robust hierarchical models.
There are a number of details and interesting detours we didn’t mention, but here are some resources for you to dig deeper: