An easy but effective Data Quality algorithm for abnormality detection in real numbers

Most commercial DQ profiling tools focus on cleansing “dimension” data (in data-warehouse terminology). However, the quality of facts is almost as important as that of dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data which is infinitely better than nothing! (Anything is infinitely better than nothing, mathematically speaking.)

The problem is called abnormality detection; to be more specific, I am talking about outlier detection. What does that mean for the user?

Example: real-estate data. There have been reports of an American citizen receiving a $200,000 tax bill for their 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you will have to trust me on this. If you don’t want to trust me (which is the right thing to do), imagine similar problems that I am sure you have encountered.

The tax man has clearly issued an outlier for this specific sub-class of houses, i.e. 3-4 bedroom houses in average suburbs, where the tax should be somewhere between $700 and $2,000. A $200K bill is obviously far out of range, but a good application should point out the outlier even if the tax is only slightly out of range, e.g. $2,500 for a house that should be taxed near the lower end of that range.

Solution: write a little algorithm that learns the distribution of “fact” values conditioned on several other dimensions. Excuse me for using the data-warehouse terminology (fact and dimension) instead of the usual machine-learning terminology (e.g. features). I think the DW terminology makes more sense here, and I don’t want to justify it, so go ahead and replace the terms with your favourite ones.

Algorithm:

1 – Discover facts and dimensions from the data types and their distributions, e.g. a numeric column where count(distinct(fact)) / count(fact) is close to 1 is likely a fact. Alternatively, you can ask the user to identify them.
2 – Filter out times and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek or IsHoliday, can be very useful though.
3 – For every Fact do:
3 – a ) For every Dimension do:
3 – a – i ) Measure the statistical distribution of the data filtered by the dimension; specifically measure Count, Mean, Variance, Min, and Max.
3 – a – ii ) Store the above statistics in a file along with the selected dimensions, IF Count > 30 and sqrt(Variance) << Max - Min (see the SQL sketch after this list).
3 – a – iii ) Recurse to (3 – a) and include more dimensions in the condition.
4 – Print out the rules discovered in 3 – a – ii.
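To make step 3 concrete, here is what one level of the loop could look like in T-SQL. This is a minimal sketch, not the original implementation: the FactTax table and the Bedrooms, Suburb, and TaxAmount columns are hypothetical, and I read “<<” as “at least an order of magnitude smaller”.

-- Statistics of one fact, conditioned on one pair of dimensions
-- (hypothetical table and column names).
SELECT   Bedrooms,
         Suburb,
         COUNT(*)       AS Cnt,
         AVG(TaxAmount) AS MeanTax,
         VAR(TaxAmount) AS VarTax,
         MIN(TaxAmount) AS MinTax,
         MAX(TaxAmount) AS MaxTax
FROM     dbo.FactTax
GROUP BY Bedrooms, Suburb
HAVING   COUNT(*) > 30                                            -- enough evidence
     AND STDEV(TaxAmount) * 10 < MAX(TaxAmount) - MIN(TaxAmount)  -- spread is tight relative to the range

Each qualifying group becomes a stored rule; the recursion in 3 – a – iii just adds more columns to the GROUP BY.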

The above algorithm is not optimized for performance, but in the case of DQ, who cares about performance? Just run it on your Hadoop cluster :).

The output of the above algorithm is a set of rules like: for tax, given 3-bedroom, and house, and average suburb, the variance is … and the mean is …, which means the values are expected to be between 700 and 1500. Your application can read these rules and apply them to the data, or to the user interface, to help users fix or avoid outliers.
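As a hedged illustration of how an application could consume such a rule (the numbers and names below are made up, not learned from real data):

-- Flag values far from the mean stored for this dimension combination.
DECLARE @mean  money = 1100,   -- from the stored rule
        @stdev money = 200     -- sqrt of the stored variance

SELECT  *
FROM    dbo.FactTax
WHERE   Bedrooms = 3
    AND Suburb   = 'average'
    AND ABS(TaxAmount - @mean) > 3 * @stdev   -- likely outliers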

Evaluation: I can’t publish my customer’s data, but I’ll run it on some public data later, if people ask.

 


A gesture recognition database for Kinect!

I was at a demo of the Kinect SDK yesterday and I was thrilled with the technology. Kinect is a wonderful piece of technology, putting together a lot of cutting-edge technology and some pretty amazing algorithms. The infrared camera does real magic and picks up a 3D image: a good estimation of depth for each pixel.
If you have ever spent time on image processing and finding blobs or objects using pattern recognition techniques, you know what I mean. Having such an accurate estimation of depth for each pixel can revolutionize any object recognition algorithm.
After the session I kept thinking about how I could do something quick and useful with Kinect. Of course everyone else must have been thinking the same, but as Kinect is designed for gaming, people tend to do fun, cool stuff with it. However, I am not a big fan of computer games, so I thought I should do something different.

The first thing that came to mind was, of course, using the depth camera for 3D object recognition and building a robot that moves around the house with a Kinect connected to a laptop. This could be an interesting idea, but it would surely be too much work for no real gain, except a bit of self-satisfaction.
My second thought was to do something with gesture recognition, like a cloud-based gesture database and a bunch of algorithms to recognize a gesture. Yes, this idea sounds promising for a bunch of reasons: first, it does not involve any screwdriver (which I hate). Second, it is a database. Third, if it is done properly, people may use it. Fourth, I don’t actually need a Kinect for 98% of the work! (UPDATE: Check out this amazing free robot simulator by Microsoft and the Channel 9 talk, where you can do all the robotic things without having a Kinect or using any screw or wire.)

Gesture Recognition:
——————————
Gesture recognition techniques are as old as computers, and there has been an enormous amount of research on them. This IEEE Transactions survey paper is a short survey of the most useful and handy techniques of gesture recognition. Although there is a big body of knowledge in the scientific community, I am not going to go that way, so don’t panic. First of all, I am not interested in doing a research degree on computer vision, and second, I prefer a different look at the problem: a developer’s look.

Why not existing gesture recognition techniques:
———————————————————————–
The reason I think I would not need to adopt existing techniques is that those scientists did not have Kinect, so they could not look at the problem the way we can. Second, scientists deliberately make the problem artificially harder to get a better publication; we want to make the problem easier to get the app working.

Why not machine learning?
—————————————
Don’t get me wrong. Machine learning is amazing, but not for a typical development project, for a variety of reasons. First, machine learning is really hard to implement well; even though there are a few projects that can be utilized for this purpose (like Google’s Prediction API [http://code.google.com/apis/predict/]), it is still very hard to get something useful out of it. Second, how would you test a project that depends on some machine learning? Third, how could you debug or fine-tune your solution? You will be very limited to the machine’s suggestions, and believe me, the machine’s suggestions can sometimes really piss you off!

So what?
————
So let’s plan a good approach for a gesture database. I would like to have a simple gesture database that works reasonably well most of the time. I am not planning to beat any existing technique, but I need to work within the following constraints:
1- Gesture data should be recorded in SQL Server.
2- Finding a gesture among tens of thousands of gestures should happen instantly. This implies some sort of index should exist.
3- A developer should be able to define a gesture purely using code (or strings). This is necessary for testing your solution.
4- No crazy algorithms. Beauty is in simplicity.
5- You don’t need a Kinect to work with the product.

My suggestion:
——————————

I already have a suggestion for solving this problem; it is not tested, but I think it can work. Let’s build up a bit of theory around the story.
What is a gesture? A gesture is a bunch of movements of a bunch of body parts which convey a meaning.
What is a gesture, but simpler? A gesture is a pattern of movement of a bunch of points over time, starting from a starting point.
What is a gesture? Please make it simple. A gesture starts with a pose; then each point of the pose animates in a pattern.
What is a pose? If I am standing with my hands up, without moving, I am in a pose called Hands Up!
OK! Then how do you make a gesture out of a pose? If I am in a pose (e.g. hands up) and start moving my hand in a left-right pattern for a while, I am making a waving gesture.
Good! Now I understand gestures.

Let us assume a pose is the location of a bunch of points in space. We define some basic terms for further use:
Space: Let us assume space is the area that a human body fits in with open arms. This is of course relative to the body, but we normalize that area into a cube for every person.

Pose: A collection of named points (like hand, head, elbow, knee, etc.) in space. The points may be defined with absolute positions or with positions relative to each other.

Perspective: This is a key point for the rest of this approach. A pose can have multiple perspectives. Each perspective is a mapping (or projection) of the pose onto one plane of the space. For example, one perspective may show a flattened 2D image of the 3D body by projecting each point of the space onto the plane. We need three 2D perspectives of a pose:

Top perspective (looking at the person from above),
Left perspective (looking at the person from the left), and
Front perspective (looking at the person from the front).

For each 2D perspective, we generate two 1D perspectives:
Horizontal perspective (x position of points), and
Vertical perspective (y position of points)

Using the above perspectives, a shooting-man pose can be expressed with six 1D perspectives (two 1D perspectives times three 2D perspectives).
Each 1D perspective can be described with a simple string:
Shooting person:
Perspective: Left.vertical = “F.H1.H2”
Perspective: Left.horizontal = “H1.H2-F”

Let us use the following conventions:
F stands for “Face”
H1 stands for “right hand”
H2 stands for “left hand”
period (.) stands for a very short distance
dash (-) stands for an average distance
underscore (_) stands for a long distance

Hence, the Left.vertical perspective of the shooting man says: there is a right hand very close to a left hand, and a bit further away there is a head (looking at the person from the left).
The Left.horizontal perspective of the shooting man says: the left hand, the right hand, and the face are around the same point (again, looking from the left).

Bingo! We have a language to define a pose!

Let’s assume that the language is good and the string can really be used to define a pose. Of course it is not tested yet, but I would be really thankful if someone experimented with a few real pictures.

Now that we can define poses, how can we find a similar pose from a database quickly?
Assuming the pose language is useful, this is not an issue at all. The problem is similar to the fuzzy string matching problem: a pose is in fact a string, and you want to find other strings approximately similar to it (for example, if the positions of the left and right hands of the shooting person are swapped, it is still a shooting pose).
It is not really a hard problem, trust me. It can be done using the basic q-gram technique. The only difference is that we may need a specific equality comparison logic, which can easily be implemented with a CLR type definition.
Here is the equality comparison logic:
H1.H2 is equal to H2.H1 (if two points are very close together, we can use them interchangeably)
H1-H2 is not equal to H2-H1 (if the points are distant enough, they can’t be interchanged)
The user may define options for the equality of objects; for example, if left hand versus right hand does not matter to us, we can say Options.LeftAndRightHandsAreEqual!

I think that is enough for now. Using the above technique we can have a pose database and a fast method to search for a given pose among millions of poses and find similar poses in a matter of milliseconds. We can do this for all the perspectives and shortlist the intersection for a much better search quality.
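Here is a rough, untested sketch of how that could look in SQL Server. All of the names (PoseGram, the token 2-grams like 'F.H1') are my own invention in the spirit of the q-gram technique, not a tested design:

-- One row per (pose, perspective, gram).
CREATE TABLE PoseGram(
    poseId      int         NOT NULL,
    perspective varchar(20) NOT NULL,  -- e.g. 'Left.vertical'
    gram        varchar(10) NOT NULL,  -- e.g. 'F.H1' or 'H1.H2'
    PRIMARY KEY (perspective, gram, poseId)
)

-- Query pose: Left.vertical = "F.H1.H2" broken into token 2-grams.
DECLARE @query TABLE(perspective varchar(20), gram varchar(10))
INSERT INTO @query VALUES ('Left.vertical', 'F.H1'), ('Left.vertical', 'H1.H2')

-- Rank stored poses by how many grams they share with the query pose.
SELECT     p.poseId, COUNT(*) AS sharedGrams
FROM       @query q
INNER JOIN PoseGram p ON p.perspective = q.perspective AND p.gram = q.gram
GROUP BY   p.poseId
ORDER BY   sharedGrams DESC

The custom equality rules (H1.H2 = H2.H1, optional left/right symmetry) would sit in the gram generation step, normalizing each gram before it is stored or queried.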
The next challenge (apart from implementing and testing the technique) is to define the gesture on top of the pose.

Ever wondered how text search works? Let’s do it!

You have certainly seen fuzzy text searches everywhere. For example, you type “stck” but you actually mean “stack”! Ever wondered how this stuff works?

There are plenty of algorithms for fuzzy text matching, each with its own pros and cons. The most famous ones are edit distance and q-gram. I want to focus on q-grams today and implement a sample.

Basically, q-grams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple: the “q” in q-gram is replaced with a number, as in 2-gram, 3-gram, or even 4-gram.

2-gram means that every word is broken into a set of two-character grams. “Stack” will be broken into the set {“st”, “ta”, “ac”, “ck”} and “database” into {“da”, “at”, “ta”, “ab”, “ba”, “as”, “se”}.

Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example, if the user mistyped “stck”, a search for the literal string “stck” will not match “stack” because the “a” is missing, but the 2-gram set {“st”, “tc”, “ck”} has 2 grams in common with the 2-gram set of “stack”! Bingo, we found a pretty close match. It has nothing in common with the 2-gram set of “database” and only 1 gram in common with the 2-gram set of “star”, so we can easily suggest to the user that he probably meant “stack” first, or “star” second.
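Generating the grams can even be done in T-SQL itself. The helper below is my own hedged sketch (the function name and the numbers-table trick are not part of any library):

-- Break a word into its consecutive 2-grams.
CREATE FUNCTION dbo.TwoGrams (@word varchar(100))
RETURNS TABLE
AS
RETURN
(
    SELECT SUBSTRING(@word, n.i, 2) AS twog
    FROM (SELECT TOP (LEN(@word) - 1)
                 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
          FROM sys.all_objects) AS n   -- any catalog view with enough rows
)

-- SELECT twog FROM dbo.TwoGrams('stack')  -- st, ta, ac, ck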

Now let’s implement it using SQL Server. Assume a hypothetical words dataset; you need a many-to-many relationship between 2-grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

The Grams table should be clustered on twog first and then wordId, for performance. When you query a word (e.g. stack), you put its grams in a temp table. First, let’s create a few million dummy records.

--make millions of 2grams
DECLARE @i int = 0
WHILE (@i < 5000000)
BEGIN
    -- a random 2gram (two random lowercase letters)
    DECLARE @rnum1 char = CHAR(CAST(RAND()*26 AS int) + 97)
    DECLARE @rnum2 char = CHAR(CAST(RAND()*26 AS int) + 97)
    INSERT INTO Grams (twog, wordId) VALUES (@rnum1 + @rnum2, CAST(RAND()*100000 AS int))
    SET @i = @i + 1
END

Now let’s query the word “stack”, which will be broken into the 2-grams {‘st’, ‘ta’, ‘ac’, ‘ck’}.

DECLARE @word TABLE(twog char(2)) -- 'stack'
INSERT INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

SELECT     wordId, COUNT(*) AS matchingGrams
FROM       @word w
INNER JOIN Grams g ON w.twog = g.twog
GROUP BY   wordId
ORDER BY   matchingGrams DESC -- closest candidates first

You should make sure that SQL Server uses a bunch of clustered index seeks (or lookups) when running this query. That should be the natural choice, but sometimes statistics may be corrupted or out of date and SQL Server may decide that a full scan is cheaper. This usually happens when it does not know the cardinality of the left-side table; for example, SQL Server may assume that the @word table is massive and that millions of lookups are going to be more expensive than one full index scan.
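If the optimizer does fall back to a scan, one hedged workaround (on SQL Server 2008 and later) is to insist on the seek with a table hint:

SELECT     wordId, COUNT(*) AS matchingGrams
FROM       @word w
INNER JOIN Grams g WITH (FORCESEEK) ON w.twog = g.twog  -- force index seeks on Grams
GROUP BY   wordId

Updating statistics on the Grams table is usually the cleaner fix, though.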

Freebase

Freebase is by far the most flexible database I have heard of. It is a very cool concept indeed, and it allows amazing discoveries. It also provides a free service for querying and inserting data.
The Freebase schema is this: Domain, Type, and Property. That’s it. And with this little schema you can have everything, absolutely everything. Of course, flexibility comes with a big price tag: the cost of managing stuff that we take for granted in our RDBMSs. For example, assuring the integrity of the data sounds very challenging. This doesn’t make the Freebase style the schema of choice for the enterprise. But wait a minute! Can enterprises really not benefit from such a database? Sure they can, and I argue it is a necessity for any big enterprise.
I’ll keep it short for today and will revisit the topic later.

DBLP for SQL Server

DBLP is the mainstream computer science publication database, and anyone can download its dataset from here for free. It is a good benchmark dataset for many algorithmic research purposes; for example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which includes numerous data quality issues.

I have used this dataset multiple times for my own work; however, for someone like me who uses a commercial RDBMS, a little bit of work is required to utilise it. The original dataset, which you can download from the above link, is an XML file suitable for a column-store database (like MonetDB). However, most commercial DBMSs are table-based, and you have to find a way to make this XML suitable for your DBMS.

Although it is not particularly hard (it takes only a few hours to figure out what to do and less than an hour to implement), I have put my solution HERE, which may save you those few hours.

It basically converts dblp’s XML to flat CSVs that you can import into your DBMS. A SQL Server 2008 R2 backup file is also included in the solution for your convenience.

(Update: Sorry I removed the SQL Server backup file because it was slowing down the checkout)

To use the code, first check out the source from the project folder (using svn). There is a project called Dblp_xml2csv, which is a console application. Set it as the startup project and run the console app.

You need to have downloaded the DBLP dataset in XML format. Go to where dblp_xml2csv.exe is built and run the following command:
>dblp_xml2csv.exe dblp-data.xml

Replace dblp-data.xml with the path to your DBLP XML file.

There is no need for the companion XSD, because for performance reasons the XML dataset is treated as a plain text file.

Running the above command will create a couple of .csv files, which you can import into SQL Server, Excel, etc.
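For the SQL Server import step, something like the following should work. This is a hedged sketch: the staging table, its columns, and the file path are hypothetical and depend on which of the generated CSVs you load and how it is delimited:

-- Hypothetical staging table for one of the generated CSV files.
CREATE TABLE dbo.DblpPublication(
    pubKey varchar(200),
    title  nvarchar(1000),
    year   int,
    venue  varchar(200)
)

BULK INSERT dbo.DblpPublication
FROM 'C:\dblp\publications.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')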

The additional stuff in the solution relates to some work I was doing on a DQ estimation technique and probably won’t be of much help to you.

Sampling Data with SQL Server

Larson et al. from Microsoft Research have this paper in ACM SIGMOD 2007. They introduce an interesting sampling mechanism for SQL Server using materialized views (materialized views are the same thing as indexed views). If you create an indexed view over a table, or over a bunch of tables joined together, then using their technique you get a tidy sample of your data which is kept up to date automatically.

What caught my eye was that their base sampling approach is interestingly simple:


CREATE VIEW Sample_TheTableName WITH SCHEMABINDING AS
  SELECT ...,
         __RAND = CAST(ROW_NUMBER() OVER (ORDER BY RAND()) AS tinyint)
  FROM ...

The additional __RAND column imposes a random ordering on the data, which can give us a statistically valid sample.
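As a hedged usage sketch (assuming __RAND comes out roughly uniform over 0-255, which is my reading of the approach rather than something I have verified), an approximately 10% sample could be pulled like this:

-- After creating the unique clustered index that turns the view into an indexed view:
SELECT *
FROM   Sample_TheTableName
WHERE  __RAND < 26   -- about 26/256 of the rows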

Dive into Expression Trees

I am in the process of developing a framework for sampling query results from LINQ queries. It is going to be hard work, and you may ask why in the world anyone would need to sample query results from LINQ queries. Well, this can be particularly useful if you are developing a new database that works purely through LINQ, or if you have heavy data services over various data sources and you want to sample their query results.

Unfortunately, I am not currently working on any core database system or heavy-duty data services (at work), but I have to do this as part of my study. This project, which is going to be an open-source project called [I don’t know yet], is supposed to do the job of evaluating the algorithms that I hopefully (fingers crossed) will publish at VLDB 2011.

So if you continue to visit this blog, you will see posts about these topics:

  • LINQ Expression Trees
  • Writing Provider for LINQ
  • Sampling Techniques
  • Data quality metrics
  • and, Rule based profiling

 

Sampling data quality in distributed SQL Servers (Part 1)

Before going forward with this post, I feel I should say why sampling is so important and why sample data quality. First of all, sampling is heavily used in various query optimization techniques. The key thing a query optimizer needs to know is the selectivity of a query, i.e. how many rows the query returns when run against a table. This is extremely important for deciding the join direction, and a wrong estimate of selectivity can change the actual runtime of the query massively. For example, when joining three huge tables AxBxC together, where B is the many-to-many relation table, the query can be planned as (AxB)xC or Ax(BxC). You may ask what the difference is, but imagine the query returns only one row from table C and one million rows from A. Which direction do you use? Of course Ax(BxC) saves a million lookups.
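To make the example concrete, here is a hypothetical version of that query (the schema is invented purely for illustration). With FORCE ORDER you can pin the written join order and compare the two plans yourself:

-- B is the many-to-many link table; the filter on C matches a single row.
-- Written as C -> B -> A, this is the Ax(BxC) direction: the one row from C
-- filters B early, so only a handful of lookups ever reach A.
SELECT a.*, c.*
FROM   C AS c
       INNER JOIN B AS b ON b.cId = c.cId
       INNER JOIN A AS a ON a.aId = b.aId
WHERE  c.someKey = 'rare value'
OPTION (FORCE ORDER)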

Sampling is an expensive way to estimate selectivity, because the query has to run anyway (albeit over a much smaller set), but it can be the simplest, and sometimes the only, option. Other ways include complex statistical modeling and collecting various statistics about the data, such as histograms.
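As a minimal sketch of the sampling approach (the table and predicate are hypothetical), SQL Server can run the predicate over roughly one percent of the pages and scale the count back up:

SELECT COUNT(*) * 100 AS estimatedMatchingRows
FROM   dbo.A TABLESAMPLE SYSTEM (1 PERCENT)
WHERE  someColumn = 'some value'

TABLESAMPLE is page-level, so the estimate is rough, but it is often good enough for a join-order decision.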

Although in the context of a single database histograms seem more appealing, in distributed databases they are not good, for several reasons that are out of the scope of this post.

I am personally more interested in distributed databases, not in the sense usually discussed in the literature (federated databases), but in the more practical sense of a collaborative enterprise system.