Most commercial DQ profiling tools focus on cleansing “dimension” data (as it is called in data-warehouse terminology). However, the quality of facts is almost as important as that of dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data which is infinitely better than nothing (anything is infinitely better than nothing, mathematically speaking).
The problem is called anomaly detection; to be more specific, I am talking about outlier detection. What does that mean to the user?
Example: real-estate data. There have been reports of an American citizen receiving a $200,000 tax bill for their 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you will have to trust me on this. If you don’t want to trust me (which is the right thing to do), imagine a similar problem; I am sure you have encountered one.
The tax man has clearly issued an outlier for this specific sub-class of houses, i.e. 3-4 bedroom houses in average suburbs. For such a house, the tax should be somewhere between $700 and $2,000. A $200K bill is obviously a significant outlier, but a good application should point out an outlier even when it is only slightly out of range, e.g. a $2,500 bill for a house whose tax should fall inside that range.
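To make the basic idea concrete before the full algorithm: here is a minimal Python sketch of that check. The sample tax values are hypothetical, and the 3-standard-deviation threshold is an assumed rule of thumb, not a fixed part of the method.

```python
import statistics

def is_outlier(value, sample, k=3.0):
    """Flag `value` if it lies more than k standard deviations from the sample mean."""
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)
    return abs(value - mean) > k * std

# Hypothetical tax bills for comparable houses (3-4 bedrooms, average suburb).
comparable_taxes = [900, 1100, 1250, 800, 1400, 950, 1300]

print(is_outlier(200_000, comparable_taxes))  # True: far outside the distribution
print(is_outlier(2_500, comparable_taxes))    # True: only slightly out, still flagged
```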
Solution: write a little algorithm that learns the distribution of “fact” values conditioned on several other dimensions. Excuse me for using the data-warehouse terminology (fact and dimension) instead of the usual machine-learning terminology (e.g. features). I think the DW terminology makes more sense here, and I don’t want to justify that, so go ahead and replace the terms with your favorite ones.
Algorithm (a rough Python sketch of the whole procedure follows the list):
1 – Discover facts and dimensions from the data types and their distributions: a numeric column where count(distinct(col)) / count(col) is close to 1 is likely a fact, while a low ratio suggests a dimension. Alternatively, you can ask the user to identify them.
2 – Filter out times and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek or IsHoliday, can be very useful though.
3 – For every Fact do:
3 – a ) For every Dimension do:
3 – a – i ) Measure the statistical distribution of the data filtered by the dimension’s value. Specifically, measure Count, Mean, Variance, Min, and Max.
3 – a – ii ) Store the above statistics in a file along with the selected dimension values, IF Count > 30 (enough samples for the statistics to be trustworthy) and sqrt(Variance) << Max - Min (the bulk of the values is much tighter than the extremes, so a rule based on the mean is informative).
3 – a – iii ) Recurse to (3 – a) and include more dimensions in the condition.
4 – Print out the rules discovered in 3 – a – ii.
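Here is that sketch in Python/pandas. It is a sketch under assumptions, not a definitive implementation: the data is assumed to sit in a single flat pandas DataFrame, the fact/dimension ratio thresholds (0.9 and 0.1) are arbitrary, “sqrt(variance) << max - min” is approximated as “stddev < (max - min) / 10”, and the recursion in 3-a-iii is flattened into combinations of up to max_depth dimensions.

```python
import itertools
import pandas as pd

def discover_roles(df, fact_ratio=0.9, dim_ratio=0.1):
    """Steps 1-2: guess facts and dimensions from types and distinct-value ratios."""
    facts, dims = [], []
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            continue  # step 2: drop raw dates (recurring ones like DayOfWeek survive)
        ratio = df[col].nunique() / len(df)
        if pd.api.types.is_numeric_dtype(df[col]) and ratio > fact_ratio:
            facts.append(col)   # mostly-unique numeric column: likely a fact
        elif ratio < dim_ratio:
            dims.append(col)    # few distinct values: likely a dimension
    return facts, dims

def mine_rules(df, fact, dims, max_depth=3, min_count=30, spread_factor=10):
    """Step 3: for every combination of dimension values, keep the fact's
    distribution stats if the group is big enough and tight enough."""
    rules = []
    for depth in range(1, max_depth + 1):          # 3-a-iii, flattened
        for combo in itertools.combinations(dims, depth):
            stats = df.groupby(list(combo))[fact].agg(
                ['count', 'mean', 'var', 'min', 'max'])
            # 3-a-ii: Count > 30 and stddev much smaller than the range
            tight = (stats['count'] > min_count) & (
                stats['var'] ** 0.5 < (stats['max'] - stats['min']) / spread_factor)
            for key, row in stats[tight].iterrows():
                rules.append({'fact': fact, 'dims': combo, 'values': key, **row})
    return rules
```

Step 4 is then just printing or serializing the returned rules list.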
The above algorithm is not optimized for performance, but in the case of DQ, who cares about performance? Just run it on your Hadoop cluster :).
The output of the above algorithm is a set of rules like: for tax, given 3-bedroom and house and average suburb, the variance is … and the mean is …, which means values are expected to be between, say, $700 and $1,500. Your application can read these rules and apply them to the data, or to the user interface, to help users fix/avoid outliers.
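As a sketch of that application side, continuing the rule format produced by mine_rules above (the 3-standard-deviation band is again an assumed choice, and the sample rule and record are hypothetical):

```python
def check_record(record, rule, k=3.0):
    """Return True if `record` matches the rule's dimension values but its
    fact value falls outside mean +/- k * stddev."""
    values = rule['values'] if isinstance(rule['values'], tuple) else (rule['values'],)
    if any(record[d] != v for d, v in zip(rule['dims'], values)):
        return False  # the rule does not apply to this record
    return abs(record[rule['fact']] - rule['mean']) > k * rule['var'] ** 0.5

# Hypothetical usage: flag a record against one mined rule.
rule = {'fact': 'tax', 'dims': ('bedrooms', 'suburb_class'),
        'values': (3, 'average'), 'mean': 1100.0, 'var': 50000.0}
record = {'bedrooms': 3, 'suburb_class': 'average', 'tax': 200000}
print(check_record(record, rule))  # True: flagged as an outlier
```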
Evaluation: I can’t publish the customer’s data, but I’ll try this on some public data later. Only if people ask.