Bad Data, Bad Decisions

Thursday, 31 March 2011 14:15

Part 2

In last month’s article I outlined how bad data can lead to bad decision making and what the typical sources of poor data are. This time I will deal with various methods of prevention and correction.

Prevention

I grew up accepting the well-established principle that an ounce of prevention is worth a pound of cure. So I am proposing here that the best way to prevent bad data is never to collect it in the first place, and the best way to do that, at least on electronic forms, is to use programmatic data validation routines to catch obvious errors before they ever hit your database.

Such routines work by examining each field on a form before it is submitted and applying a programmatic reasonableness test to each specific piece of data.

For example, a typical validation routine for an email address might check to make sure that the field is not blank, that the address contains exactly one "@" symbol, and that the portion after the "@" is a domain name containing at least one period.

There are more sophisticated validation checks that can be performed on an email address, but the important point is to check for bad entries in almost every field you collect from an electronic form. By detecting an invalid field at the time of entry, you can prompt the user to correct the entry and prevent the form from being submitted until the necessary corrections have been made.
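To make the idea concrete, here is a minimal sketch of this kind of per-field reasonableness test, written in Python for illustration; the regular expression, field names and error messages are assumptions of the sketch, not a complete or standards-perfect email check.

```python
import re

# Illustrative pattern: a non-empty local part, exactly one "@", and a domain
# containing at least one dot. Real address rules are looser, so treat this as
# a reasonableness test rather than a full standards check.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email(value: str) -> list[str]:
    """Return a list of problems found with an email field (empty if it looks sane)."""
    errors = []
    value = value.strip()
    if not value:
        errors.append("Email address is required.")
    elif not EMAIL_PATTERN.match(value):
        errors.append("Email address must contain an '@' and a domain with a dot.")
    return errors

def validate_form(fields: dict[str, str]) -> dict[str, list[str]]:
    """Run a reasonableness test on each field and collect every problem before submission."""
    problems = {}
    email_errors = validate_email(fields.get("email", ""))
    if email_errors:
        problems["email"] = email_errors
    return problems

if __name__ == "__main__":
    print(validate_form({"email": "gerry.johnson@example"}))
    # {'email': ["Email address must contain an '@' and a domain with a dot."]}
```

Collecting every problem before rejecting the submission, rather than stopping at the first one, lets the user fix all of their entries in a single pass.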

Data Correction

Of course, not all bad data can be detected at the source. For example, consider the question: are Jerry L. Jonson of 16 Clarke St., Altuna, PA, and Gerry L. Johnson of 16 Clark Street, Altoona, Penn., the same person? Well, most likely they are, but there is no way to detect this using the type of validation technique described above. Chances are this kind of bad data already resides in your database and must be corrected after the fact, a process generally referred to as data cleansing or data scrubbing.
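To show the kind of matching this requires, the sketch below standardizes the two records and then scores their similarity with Python's standard-library difflib; the abbreviation table and the 0.85 threshold are assumptions made purely for illustration, and real scrubbing tools apply far richer dictionaries and rules.

```python
from difflib import SequenceMatcher

# Illustrative normalization table; real scrubbing tools ship much larger
# dictionaries of abbreviations, nicknames and postal standards.
ABBREVIATIONS = {"st.": "street", "st": "street", "penn.": "pennsylvania", "pa": "pennsylvania"}

def normalize(text: str) -> str:
    """Lowercase, drop commas, and expand known abbreviations word by word."""
    words = text.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def similarity(a: str, b: str) -> float:
    """Ratio between 0.0 and 1.0; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

record_a = "Jerry L. Jonson, 16 Clarke St., Altuna, PA"
record_b = "Gerry L. Johnson, 16 Clark Street, Altoona, Penn."

score = similarity(record_a, record_b)
print(f"similarity: {score:.2f}")

# A score above some tuned threshold (0.85 is only a guess here) would queue
# the pair for merging or human review rather than automatic deletion.
if score > 0.85:
    print("Likely the same person -- flag for review")
```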

In the early days of computing, most data scrubbing was done by hand. When performed by bleary-eyed humans, the laborious task of finding and then fixing or eliminating incorrect, incomplete or duplicated records was costly, and it often introduced new errors. Now, specialized software tools use sophisticated algorithms to parse, standardize, correct, match and consolidate data. Their functions range from simple cleansing and enhancement of a single data set to matching, correcting and consolidating entries from different databases and file systems.
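As a small example of the consolidation step, the sketch below merges a group of records already judged to describe the same person; the survivorship rule (keep the most complete record and fill its blanks from the others) and the sample data are illustrative assumptions, not how any particular tool works.

```python
def consolidate(duplicates: list[dict[str, str]]) -> dict[str, str]:
    """Merge a group of records already judged to describe the same person.

    Survivorship rule used here (an assumption, not a standard): start from the
    record with the most populated fields, then fill any remaining blanks from
    the other records in the group.
    """
    ranked = sorted(duplicates, key=lambda r: sum(1 for v in r.values() if v), reverse=True)
    merged = dict(ranked[0])
    for record in ranked[1:]:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

print(consolidate([
    {"name": "Jerry L. Jonson", "street": "16 Clarke St.", "phone": ""},
    {"name": "Gerry L. Johnson", "street": "16 Clark Street", "phone": "814-555-0100"},
]))
# {'name': 'Gerry L. Johnson', 'street': '16 Clark Street', 'phone': '814-555-0100'}
```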

Summary

In conclusion, there is no single point at which bad data can be detected or eliminated. Any system of data collection and storage requires safeguards both to prevent the introduction of invalid data at the time of collection and to inspect and analyze existing data stores to ensure their greatest possible relevance and validity.

~Steven Barnes
