In this second piece in our voter registration database auditing series, Solutions Architect John Dziurłaj describes the methods for performing voter registration database audits.
In my last post, I discussed why establishing an audit trail for voter registration databases (VRDBs) is a crucial first step in our auditing journey. Today, I’ll focus on methodologies that can be used to perform the audit.
The nature of VRDBs as information systems necessitates an auditing approach markedly different from that of, say, a risk-limiting audit of election results. While selections on a ballot become fixed when cast, voter registration records are always changing. Thus, in auditing VRDBs, the interest is in ensuring that changes to the database come from legitimate sources and are properly executed.
Anomaly Detection Methods
An exhaustive review of even the most well-curated audit trail would be incredibly time-consuming. I suggest an alternative approach, taken from the data mining space, called anomaly detection. Anomaly detection is the process of identifying data that falls outside an established “normal” range. It can be done via ad hoc analysis, or automated via rule-based, machine learning, or statistical methods.
I would love to recommend one method above all others, but, in reality, each one has its own set of advantages and disadvantages. I’ll provide a brief overview of each, so you can better understand the landscape.
Ad-Hoc Manual Analysis
The simplest approach is to perform exploratory analysis against the audit trail. The resulting queries can range from very simple to highly sophisticated. They rely on subject matter expertise to understand the data model and the kinds of patterns that should arouse suspicion. This method is good for exploring what the audit trail has to offer, but it isn’t a sustainable approach.
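As a rough illustration of what such exploratory analysis can look like, the sketch below tallies a small, made-up audit trail by change type and by user. The row layout, user names, and change types are hypothetical and do not reflect any real VRDB schema; the point is only to show the kind of quick summary an analyst might build before deciding what looks suspicious.

```python
from collections import Counter

# Hypothetical audit-trail rows: (timestamp, user, change_type).
# Field names and values are illustrative, not a real VRDB schema.
audit_trail = [
    ("2020-03-02T09:15", "clerk_a", "address_update"),
    ("2020-03-02T09:40", "clerk_b", "new_registration"),
    ("2020-03-02T23:58", "clerk_a", "status_cancelled"),
    ("2020-03-03T10:05", "clerk_b", "address_update"),
    ("2020-03-03T10:30", "clerk_a", "status_cancelled"),
]


def tally(rows, key_index):
    """Count audit-trail rows by the field at position key_index."""
    return Counter(row[key_index] for row in rows)


changes_by_type = tally(audit_trail, 2)
changes_by_user = tally(audit_trail, 1)

# Print the most frequent change types first.
for change_type, count in changes_by_type.most_common():
    print(change_type, count)
```

A summary like this does not flag anything on its own. It simply gives the analyst a starting point for asking follow-up questions, such as why one user accounts for every cancellation.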
Rule-based anomaly detection takes the patterns found through ad-hoc analysis and formalizes them into rules that can be invoked on demand. Simple rules can be codified using Structured Query Language (SQL), while more complex queries may require an inference or pattern matching engine. Because simple rules can be written in most relational databases, they pose the lowest barrier to entry. However, these rules aren’t “smart,” and must continually be refined to adjust to new threat patterns. Failure to do so may result in false negatives.
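To make the idea of a codified rule concrete, here is a minimal sketch that runs one SQL rule against an in-memory SQLite database from Python. The `audit_trail` table, its columns, the rule itself (more than a given number of cancellations by one user in one day), and the threshold are all assumptions made for illustration; a real rule would be derived from patterns observed in your own audit trail.

```python
import sqlite3

# Hypothetical rule: flag any user who records more than `threshold`
# cancellations on a single day. Table and column names are assumed.
RULE_SQL = """
SELECT changed_by, DATE(changed_at) AS day, COUNT(*) AS cancellations
FROM audit_trail
WHERE change_type = 'status_cancelled'
GROUP BY changed_by, DATE(changed_at)
HAVING COUNT(*) > ?
ORDER BY cancellations DESC
"""


def build_demo_db():
    """Create an in-memory database with a few made-up audit rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_trail "
        "(changed_at TEXT, changed_by TEXT, change_type TEXT)"
    )
    rows = [
        ("2020-03-02 09:15:00", "clerk_a", "status_cancelled"),
        ("2020-03-02 09:20:00", "clerk_a", "status_cancelled"),
        ("2020-03-02 09:25:00", "clerk_a", "status_cancelled"),
        ("2020-03-02 11:00:00", "clerk_b", "status_cancelled"),
        ("2020-03-02 11:30:00", "clerk_b", "address_update"),
        ("2020-03-03 10:00:00", "clerk_a", "status_cancelled"),
    ]
    conn.executemany("INSERT INTO audit_trail VALUES (?, ?, ?)", rows)
    return conn


def run_rule(conn, threshold):
    """Return (user, day, count) rows that exceed the threshold."""
    return conn.execute(RULE_SQL, (threshold,)).fetchall()


conn = build_demo_db()
flagged = run_rule(conn, threshold=2)
print(flagged)
```

Passing the threshold as a query parameter keeps the rule text fixed while letting the cutoff be tuned, which matters because rules like this need regular adjustment as activity patterns change.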
The next technique is a departure from the rule-based approach. Machine learning relies on defining normal activity in the system to serve as a baseline. This “training set” is based on ground truth, i.e., actual “normal” data from the VRDB. As the name implies, the machine learning classifier learns as it sees more and more data. To date, there are no publicly available, peer-reviewed classifiers for voter registration data. The rule-based and machine learning approaches can be used together to increase coverage.
The statistical approach has two advantages that make it very appealing. First, although it requires a deep understanding of applied statistics, it does not require any up-front work; a model is constructed from the dataset itself using its statistical properties. Second, it has been successfully applied to voter registration systems. Dr. R. Michael Alvarez, Professor of Political Science at Caltech and Co-Director of the Caltech/MIT Voting Technology Project, has piloted audits with Orange County, California using the Interquartile Range (IQR) statistical approach.
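For readers unfamiliar with IQR, the sketch below shows the general technique on made-up daily change counts. It is not a reproduction of the Orange County methodology: the data are invented, and the 1.5 multiplier is the conventional Tukey fence, assumed here for illustration.

```python
import statistics

# Made-up number of registration changes recorded per day.
daily_counts = [102, 98, 110, 95, 105, 99, 101, 97, 340, 103, 100, 96]


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs falling outside the IQR fences.

    k=1.5 is the conventional multiplier; it is an assumption here,
    not a value taken from any published VRDB audit.
    """
    # quantiles(n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


outliers = iqr_outliers(daily_counts)
for day_index, count in outliers:
    print(f"day {day_index}: {count} changes")
```

Because the fences are computed from the data itself, no rules or training set need to be prepared in advance. A flagged day is only a prompt for investigation; a registration deadline, for example, could legitimately produce a spike.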
Summary and final thoughts
If you feel overwhelmed by the options presented, consider involving a third party. An external audit is a good practice, and has the added benefit of giving the audit an additional layer of scrutiny and reliability.
The following table summarizes each approach:
|Approach|How it works|Prerequisites|Automated?|
|---|---|---|---|
|Ad Hoc|Manual inspection of audit trail|Knowledge of data model|No|
|Rule-Based|Codified patterns|Knowledge of data model and patterns|Yes|
|Machine Learning|Based on training set input|Knowledge of ML, construction of “normal” baseline|Yes|
|Statistical|Analysis based on statistical model inputs|Knowledge of statistical methods|Yes|
Note that none of these approaches has been widely tested or adopted. Before committing to a particular approach, evaluate its pros and cons, develop a methodology, perform a pilot project, and, assuming the pilot goes well, formulate an adoption and rollout plan. I’ve only scratched the surface with these four approaches. In subsequent posts, I’ll discuss each method in more detail.