An easy but effective Data Quality algorithm for abnormality detection in real numbers

Most commercial DQ profiling tools focus on cleansing “Dimension” data (as it is called in Data Warehouse terminology). However, the quality of facts is almost as important as that of dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data, which is infinitely better than nothing! (Anything is infinitely better than nothing, mathematically speaking.)

The problem is called abnormality detection. To be more specific, I am talking about outlier detection. What does that mean to the user?

Example: Real-estate data. There have been reports of an American citizen receiving a $200,000 tax bill for their 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you will have to trust me on this. If you don’t want to trust me (which is the right thing to do), imagine similar problems that I am sure you have encountered.

The tax man has clearly issued an outlier for this specific sub-class of houses, i.e. 3-4 bedroom houses in an average suburb. For such a house, the tax should be somewhere between $700 and $2000. A $200K tax is obviously far off, but a good application should point out the outlier even if the tax is only slightly out of range, e.g. $2500 for a house that should be taxed somewhat less.

Solution: Write a little algorithm that learns the distribution of “fact” values conditioned on several other dimensions. Excuse me for using Data Warehouse terminology (Fact and Dimension) instead of the usual machine learning terminology (e.g. features). I think the DW terminology makes more sense here, and I don’t want to justify it, so go ahead and replace the terms with your favorite ones.

Algorithm:

1 – Discover facts and dimensions from the data types and the value distributions: a column is likely a fact when count(distinct(fact)) / count(fact) is close to 1. Alternatively, you can ask the user to identify them.
2 – Filter out times and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek or IsHoliday, can be very useful though.
3 – For every Fact do:
3 – a ) For every Dimension do:
3 – a – i ) Measure the statistical distribution of the fact over the data filtered by the dimension. Specifically, measure Count, Mean, Variance, Min, and Max.
3 – a – ii ) Store the above statistics in a file along with the selected dimensions, IF Count > 30 and sqrt(variance) << max – min.
3 – a – iii ) Recurse to (3 – a) and include more dimensions in the condition.
4 – Print out the rules discovered in 3 – a – ii.
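
Here is a minimal Python sketch of steps 1, 3 and 4, under a few assumptions made purely for illustration: the data is a list of dicts (one per row), "close to 1" is read as a distinct ratio of at least 0.9, "<<" is read as "less than a quarter of", and the recursion of step 3 – a – iii is replaced by iterating over dimension combinations up to max_dims. The names split_columns, mine_rules, SPREAD_RATIO and max_dims are hypothetical. Step 2 is not covered, so drop raw date columns before calling it.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pvariance

MIN_COUNT = 30       # threshold from step 3-a-ii
SPREAD_RATIO = 0.25  # assumption: "sqrt(variance) << max - min" means std < 0.25 * range


def split_columns(rows, distinct_ratio=0.9):
    """Step 1: numeric columns with mostly distinct values are treated as facts."""
    facts, dims = [], []
    for col in rows[0]:
        values = [r[col] for r in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if numeric and len(set(values)) / len(values) >= distinct_ratio:
            facts.append(col)
        else:
            dims.append(col)
    return facts, dims


def mine_rules(rows, facts, dims, max_dims=2):
    """Steps 3 and 4: statistics of each fact for every dimension combination."""
    rules = []
    for fact in facts:
        for n in range(1, max_dims + 1):
            for combo in combinations(dims, n):
                # Group the fact values by the values of the chosen dimensions.
                groups = {}
                for r in rows:
                    groups.setdefault(tuple(r[d] for d in combo), []).append(r[fact])
                for key, values in groups.items():
                    if len(values) <= MIN_COUNT:
                        continue  # too few samples to trust the statistics
                    var = pvariance(values)
                    low, high = min(values), max(values)
                    if sqrt(var) < SPREAD_RATIO * (high - low):
                        rules.append({
                            "fact": fact,
                            "condition": dict(zip(combo, key)),
                            "count": len(values),
                            "mean": mean(values),
                            "variance": var,
                            "min": low,
                            "max": high,
                        })
    return rules
```

Calling split_columns(rows) and then mine_rules(rows, facts, dims) returns a list of rule dicts that can be printed or written to a file. Note that the number of combinations grows quickly with max_dims, which is one reason the algorithm is slow on wide tables.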

The above algorithm is not optimized for performance, but in the case of DQ, who cares about performance? Just run it on your Hadoop cluster :).

The output of the above algorithm is a set of rules like: For tax: 3-bedroom, and house, and average suburb, variance is … and mean is …, which means the values are expected to be between 700 and 1500. Your application can read these rules and apply them to the data, or surface them in the user interface to help users fix or avoid their outliers.
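
As a rough sketch of how an application might apply such rules, the Python below checks a record against every rule whose condition it matches. It assumes each rule is a dict with "fact", "condition", "mean" and "variance" keys, and that the expected range is the mean plus or minus K standard deviations with K = 2. Both the rule layout and the value of K are assumptions for illustration, and the function names are hypothetical.

```python
from math import sqrt

K = 2.0  # assumption: expected range is mean +/- K standard deviations


def expected_range(rule, k=K):
    """Turn a rule's mean and variance into a (low, high) interval."""
    std = sqrt(rule["variance"])
    return rule["mean"] - k * std, rule["mean"] + k * std


def matching_rules(record, rules):
    """Rules whose fact is present and whose every dimension value matches."""
    return [
        rule for rule in rules
        if rule["fact"] in record
        and all(record.get(d) == v for d, v in rule["condition"].items())
    ]


def find_outliers(record, rules, k=K):
    """Return (rule, low, high) for each matching rule the record violates."""
    violations = []
    for rule in matching_rules(record, rules):
        low, high = expected_range(rule, k)
        if not low <= record[rule["fact"]] <= high:
            violations.append((rule, low, high))
    return violations


# Made-up rule for illustration: tax of 3-bedroom houses in an average suburb.
rule = {
    "fact": "tax",
    "condition": {"type": "house", "bedrooms": 3, "suburb": "average"},
    "mean": 1100.0,
    "variance": 40000.0,
}
print(expected_range(rule))  # (700.0, 1500.0)
```

With this made-up rule, a record for a 3-bedroom house in an average suburb with a tax of 2500 is reported as a violation, while a tax of 1200 is not. A larger K flags fewer records; which value is sensible depends on the data.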

Evaluation: I can’t publish customers’ data, but I’ll do it on some public data later, if people ask.

 

Kaggle is a good start, but needs to be smarter

Kaggle.com is a website that hosts competitions for data scientists. Regardless of what “data scientist” means, and where the line is between a data scientist (who invents techniques and algorithms) and a data analyst (who uses existing tools and techniques to mine knowledge from data), the site is a good start.

As I understand it, the business idea behind Kaggle is partly to gather anonymous data analysts to solve data problems for enterprises. In return, individuals get rewards and businesses get their problems solved. An excellent win-win idea, but not so well implemented.

Personally, I tried to engage in the competitions, but I had a hard time motivating myself. Like (possibly) most users of the site, I was not there to win a competition. It would be great if that happened, but what would really motivate me was learning from others. Unfortunately, Kaggle did not put much thought into promoting collaboration. I was hoping to team up with experts, but I faced a real competition environment, with minimal openness and collaboration. This is a good model for sports competitions, and maybe for the 1% who are good enough and just want to win a competition and get some recognition, but it wastes all the talent that can do good data analysis but doesn’t have the skill or the will to win a competition.

The market for data analysis is huge, and someone has to start winning it. An average result from data analysis is actually as valuable as the best possible result. Remember the renowned Netflix competition: after almost three years and thousands of competitors, the estimation was only 10% better. Let’s face it: the world’s top data analysts wouldn’t care about a $2,000 prize or a free trip to a conference. Young and fresh data analysts also won’t learn enough on Kaggle. Data analysis is not as rewarding as hacking is for teenagers: analysts need to understand math, statistics, and maybe algorithms, so the hacking-competition style doesn’t work for data analysis.

I believe Kaggle is not going to win any significant share of the available market for crowdsourced data analysis. Another start-up with a better approach, targeting average data analysts (or scientists if you like), is going to shine sooner or later.