Kaggle is a good start, but needs to be smarter

Kaggle.com is a website that hosts competitions for data scientists. Regardless of what "data scientist" means, and where the line lies between a data scientist (who invents techniques and algorithms) and a data analyst (who uses existing tools and techniques to mine knowledge from data), the site is a good start.

I understand the business idea behind Kaggle is partly about gathering anonymous data analysts to solve data problems for enterprises. In return, individuals get rewards and businesses get their problems solved. An excellent win-win idea, but not so well implemented.

Personally, I tried to engage in the competitions, but I had a hard time motivating myself. Like (possibly) most users of the site, I was not there to win a competition; it would be great if that happened, but what really drew me in was learning from others. Unfortunately, Kaggle has not put much thought into promoting collaboration. I was hoping to team up with experts, but I faced a real competition environment: minimal openness and collaboration. This is a good model for sports competitions, and maybe for the 1% who are good enough and just want to win a competition and get some recognition, but it wastes all the talent that can do good data analysis but doesn't have the skill or will to win a competition.

The market for data analysis is huge, and someone has to start winning it. An average result from a data analysis is often almost as valuable as the best possible result. Remember the renowned Netflix competition: after two years and thousands of competitors, the estimation was only about 10% better. Let's face it: the world's top data analysts wouldn't care about a $2,000 prize or a free trip to a conference. Young and fresh data analysts also won't learn enough on Kaggle. Data analysis is not as rewarding for teenagers as hacking is; they need to understand maths, statistics, and maybe algorithms, so the hacking-competition style doesn't work for data analysis.

I believe Kaggle is not going to win any significant share of the available market for crowd-sourced data analysis. Another start-up with a better approach, targeting average data analysts (or scientists if you like), is going to shine sooner or later.


A gesture recognition database for Kinect!

I was at a demo of the Kinect SDK yesterday, and I was thrilled with the technology. Kinect is a wonderful piece of technology that puts together a lot of cutting-edge techniques and pretty amazing algorithms. The infrared camera does real magic and picks up a 3D image: a good estimation of depth for each pixel.
If you have ever spent time on image processing and finding blobs or objects using pattern recognition techniques, you know what I mean. Having such an accurate estimation of depth for each pixel can revolutionize any object recognition algorithm.
After the session I was thinking hard about how I could do something quick and useful with Kinect! Of course everyone else was probably doing the same, but as Kinect is designed for gaming, people tend to do fun, cool stuff with it. However, I am not a big fan of computer games, so I thought I should do something different.

The first thing that came to my mind was, of course, using the depth camera for 3D object recognition and building a robot that moves around the house with a Kinect connected to a laptop. This could be an interesting idea, but surely it would be too much work for no real gain, except a bit of self-satisfaction.
My second thought was to do something with gesture recognition, like a cloud-based gesture database and a bunch of algorithms to recognize a gesture. Yes, this idea sounds promising for a bunch of reasons: First, it does not involve any screwdriver (which I hate). Second, it is a database. Third, if it is done properly, people may use it. Fourth, I don't actually need a Kinect for 98% of the work! (UPDATE: Check out this amazing free robot simulator by Microsoft and the channel 9 talk, where you can do all the robotic things without having a Kinect or using any screw or wire.)

Gesture Recognition:
——————————
Gesture recognition techniques are as old as computers, and there has been an enormous amount of research on them. This IEEE Transactions survey paper is a short survey of the most useful and handy techniques of gesture recognition. Although there is a big body of knowledge in the scientific community, I am not going to go that way, so don't panic. First of all, I am not interested in doing a research degree in computer vision, and second, I prefer a different look at the problem: a developer's look.

Why not existing gesture recognition techniques:
———————————————————————–
The reason I think I do not need to adopt existing techniques is that those researchers did not have Kinect, so they could not look at the problem the way we can. Second, researchers deliberately make the problem artificially harder to get a better publication; we want to make the problem easier to get the app working.

Why not machine learning?
—————————————
Don't get me wrong. Machine learning is amazing, but not for a typical development project, for a variety of reasons. First, machine learning is really hard to implement well; even though there are a few projects that can be utilized for this purpose (like Google's Prediction API [http://code.google.com/apis/predict/]), it is still very hard to get something useful out of it. Second, how would you test a project that depends on some machine learning? Third, how could you debug or fine-tune your solution? You will be very limited to the machine's suggestions, and believe me, the machine's suggestions can sometimes really piss you off!

So what?
————
So let's plan a good approach for a gesture database. I would like a simple gesture database that works reasonably well most of the time. I am not planning to beat any existing techniques, but I need to work within the following constraints:
1- Gesture data should be recorded in Sql Server.
2- Finding a gesture among tens of thousands of gestures should happen instantly. This implies some sort of index should exist.
3- A developer should be able to define a gesture purely using code (or strings). This is necessary for testing your solution.
4- No crazy algorithm. Beauty is in simplicity.
5- You don't need to have a Kinect to work with the product.

My suggestion:
——————————

I already have a suggestion for solving this problem; it is not tested, but I think it can work. Let's build up a bit of theory around the story.
What is a gesture? A gesture is a bunch of movements of a bunch of body parts that convey a meaning.
What is a gesture, but simpler? A gesture is a pattern of movement of a bunch of points over time, starting from a starting point.
What is a gesture? Please make it simple. A gesture starts with a pose; then each point of the pose animates in a pattern.
What is a pose? If I am standing with my hands up, without moving, I am in a pose called Hands Up!
OK! Then how do you make a gesture out of a pose? If I am in a pose (e.g. hands up) and start moving my hands in a left-right pattern for a while, I am performing a waving gesture.
Good! Now I understand gestures.

Let us assume a pose is the location of a bunch of points in space. We define some basic terms for further use:
Space: Let us define space as the area a human body fits in with open hands. This is of course relative to the body, but we normalize that area into a cube for every person.

Pose: A collection of named points (like hand, head, elbow, knee, etc.) in space. The points may be defined with absolute positions or with positions relative to each other.

Perspective: This is the key point for the rest of this approach. A pose can have multiple perspectives. Each perspective is a mapping (or projection) of the pose onto one plane of the space. For example, one perspective may show a flattened 2D image of the 3D body by projecting each point of the space onto the plane. We need three 2D perspectives of a pose:

Top perspective (looking at the person from above),
Left perspective (looking at the person from the left), and
Front perspective (looking at the person from the front).

For each 2D perspective, we generate two 1D perspectives:
Horizontal perspective (x position of points), and
Vertical perspective (y position of points)

Using the above perspectives, a shooting-man pose can be expressed with six 1D perspectives (two 1D perspectives times three 2D perspectives).
Each 1D perspective can be described with a simple string:
Shooting person:
Perspective: Left.vertical = “F.H1.H2”
Perspective: Left.horizontal = “H1.H2-F”

Let us use the following conventions:
F stands for “Face”
H1 stands for “right hand”
H2 stands for “left hand”
period (.) stands for a very short distance
dash (-) stands for an average distance
underscore (_) stands for a long distance

Hence, the Left.vertical perspective of the shooting man says: looking from the left, the face, the right hand, and the left hand are all at roughly the same height, each a very short vertical distance from the next.
The Left.horizontal perspective of the shooting man says: looking from the left, the right hand and the left hand are very close together along the front-back axis, and the face is an average distance behind them.

Bingo! We have a language to define a pose!
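To make the language concrete, here is a minimal, untested C# sketch of how a single 1D perspective string could be generated from a set of named points normalised into a unit cube. The point names, coordinates, and distance thresholds below are my own assumptions, not a finished design.

using System;
using System.Collections.Generic;
using System.Linq;

class PoseEncoder
{
    // Map the gap between two adjacent points (in a normalised 0..1 cube)
    // onto the three separators of the pose language.
    static string Separator(double gap) =>
        gap < 0.10 ? "." :   // very short distance
        gap < 0.35 ? "-" :   // average distance
                     "_";    // long distance

    // Build one 1D perspective: order the named points along one axis and
    // join their names with a separator that encodes how far apart they are.
    public static string OneDPerspective(
        IDictionary<string, (double X, double Y, double Z)> pose,
        Func<(double X, double Y, double Z), double> axis)
    {
        var ordered = pose.OrderBy(p => axis(p.Value)).ToList();
        var sb = new System.Text.StringBuilder(ordered[0].Key);
        for (int i = 1; i < ordered.Count; i++)
        {
            double gap = axis(ordered[i].Value) - axis(ordered[i - 1].Value);
            sb.Append(Separator(gap)).Append(ordered[i].Key);
        }
        return sb.ToString();
    }
}

// Usage with a made-up shooting-like pose (face and both hands at about the same height):
// var pose = new Dictionary<string, (double X, double Y, double Z)>
// {
//     ["F"]  = (0.50, 0.80, 0.50),
//     ["H1"] = (0.52, 0.78, 0.20),
//     ["H2"] = (0.54, 0.77, 0.18),
// };
// var leftVertical = PoseEncoder.OneDPerspective(pose, p => p.Y); // e.g. "H2.H1.F"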

Let's assume the language is good and the string can really be used to define a pose. Of course it is not tested yet, but I would be really thankful if someone experimented with a few real pictures.

Now that we can define poses, how can we find a similar pose from a database quickly?
Assuming the pose language is useful, this is not an issue at all. The problem is similar to the fuzzy string matching problem. In fact a pose is a string, and you want to find other strings approximately similar to it (for example, if the left and right hands of the shooting person are swapped, it is still a shooting pose).
It is not really a hard problem, trust me. It can be done using the basic q-gram technique. The only difference is that we may need a specific equality comparison logic, which can be easily implemented with a CLR type definition:
Here is the equality comparison logic:
H1.H2 is equal to H2.H1 (if two points are very close together, we can use them interchangeably)
H1-H2 is not equal to H2-H1 (if the points are distant enough, they can't be interchanged)
The user may define options for equality, for example if the distinction between left and right hands does not matter to us, we can say Options.LeftAndRightHandsAreEqual!
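As a rough illustration of that comparison logic (plain C# here rather than an actual Sql Server CLR type), assuming each gram has the shape token + separator + token:

static class PoseGramComparer
{
    static readonly char[] Separators = { '.', '-', '_' };

    public static bool AreEqual(string a, string b)
    {
        var (a1, sepA, a2) = Split(a);
        var (b1, sepB, b2) = Split(b);
        if (sepA != sepB) return false;            // different distances never match
        if (a1 == b1 && a2 == b2) return true;     // exact match
        // points joined by '.' are so close that they are interchangeable
        return sepA == '.' && a1 == b2 && a2 == b1;
    }

    static (string, char, string) Split(string gram)
    {
        int i = gram.IndexOfAny(Separators);
        return (gram.Substring(0, i), gram[i], gram.Substring(i + 1));
    }
}

// PoseGramComparer.AreEqual("H1.H2", "H2.H1") -> true
// PoseGramComparer.AreEqual("H1-H2", "H2-H1") -> false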

I think that is enough for now. Using the above technique we can have a pose database and a fast method to search for a given pose among millions of poses and find similar poses in a matter of milliseconds. We can do this for all the perspectives and intersect the shortlists for much better search quality.
The next challenge (apart from implementing and testing the technique) is to define the gesture on top of the pose.

Ever wondered how text search works? Let’s do it!

You have certainly seen fuzzy text search everywhere. For example, you type "stck" but you actually mean "stack"! Ever wondered how this stuff works?

There are plenty of algorithms for fuzzy text matching, each with its own pros and cons. The most famous ones are edit distance and qgram. I want to focus on qgrams today and implement a sample.

Basically, qgrams are the most suitable fuzzy string matching technique for relational databases. It is pretty simple: the "q" in qgram is replaced with a number, like 2-gram, 3-gram, or even 4-gram.

2-gram means that every word is broken into a set of two-character grams. "Stack" will be broken into the set {"st", "ta", "ac", "ck"}, and "database" will be broken into {"da", "at", "ta", "ab", "ba", "as", "se"}.

Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example, if the user mistyped "stck", a search for "stck" will not match "stack" because the "a" is missing, but the 2-gram set {"st","tc","ck"} has 2 grams in common with the 2-gram set of "stack"! Bingo, we found a pretty close match. It has nothing in common with the 2-gram set of "database" and only 1 gram in common with the 2-gram set of "star", so we can easily suggest to the user that he meant to type "stack" first, or "star" second.
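Before moving to the SQL implementation, here is a tiny C# illustration of the same idea; the helper names are mine:

using System.Collections.Generic;
using System.Linq;

static class TwoGrams
{
    // Break a word into its set of 2-grams, e.g. "stack" -> { "st", "ta", "ac", "ck" }.
    public static HashSet<string> Of(string word) =>
        Enumerable.Range(0, word.Length - 1)
                  .Select(i => word.Substring(i, 2))
                  .ToHashSet();

    // How many 2-grams two words share: the crude similarity score described above.
    public static int Overlap(string a, string b) =>
        Of(a).Intersect(Of(b)).Count();
}

// TwoGrams.Overlap("stck", "stack")    -> 2  ("st" and "ck")
// TwoGrams.Overlap("stck", "star")     -> 1  ("st")
// TwoGrams.Overlap("stck", "database") -> 0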

Now let's implement it using Sql Server. Assume a hypothetical words dataset. You need a many-to-many relationship between 2-grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

The Grams table should be clustered first on twog and then on wordId for performance. When you query a word (e.g. "stack"), you put its grams in a temp table. First, let's create a few million dummy records.

--make millions of 2grams
DECLARE @i int = 0
DECLARE @rnum1 char(1), @rnum2 char(1), @wid int
WHILE (@i < 5000000)
BEGIN
    -- a random 2gram: two random lowercase letters and a random word id
    SET @rnum1 = CHAR(CAST(RAND()*26 AS INT) + 97)
    SET @rnum2 = CHAR(CAST(RAND()*26 AS INT) + 97)
    SET @wid = CAST(RAND()*100000 AS int)
    -- skip duplicates so the (twog, wordId) primary key is not violated
    IF NOT EXISTS (SELECT 1 FROM Grams WHERE twog = @rnum1 + @rnum2 AND wordId = @wid)
        INSERT INTO Grams (twog, wordId) VALUES (@rnum1 + @rnum2, @wid)
    SET @i = @i + 1
END

Now let's query the word "stack", which will be broken into the 2-grams {'st','ta','ac','ck'}.

DECLARE @word TABLE(twog char(2)) -- 2-grams of 'stack'
INSERT INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

SELECT wordId, COUNT(*) AS matches FROM @word w INNER JOIN Grams g ON w.twog = g.twog
GROUP BY wordId
ORDER BY COUNT(*) DESC -- closest matches first

You should make sure that Sql Server uses a bunch of clustered index seeks (or lookups) to run this query. That should be the natural choice, but sometimes statistics are corrupted or out of date and Sql Server may decide that a full scan is cheaper. This usually happens when it does not know the cardinality of the left-side table; for example, Sql Server may assume that the @word table is massive and that millions of lookups are going to be more expensive than a full index scan.

Freebase

Freebase is by far the most flexible database I have heard of. It is a very cool concept indeed, and it allows amazing discoveries. It also provides a free service for querying and inserting data.
The Freebase schema is this: Domain, Type, and Property. That's it. And with this little schema you can have everything, absolutely everything. Of course, flexibility comes with a big price tag: the cost of managing things we take for granted in our RDBMSes. For example, assuring the integrity of the data sounds very challenging. This doesn't make the Freebase style the schema of choice for the enterprise, but wait a minute! Can enterprises really not benefit from such a database? Sure they can, and I argue it is a necessity for any big enterprise.
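To illustrate just how little schema that is, here is a rough C# sketch of the idea; the names, the topic ids, and the single-list storage are made up purely for illustration:

using System.Collections.Generic;
using System.Linq;

// Every fact is just a property/value pair attached to a typed topic.
// Nothing stops you from adding a brand-new property at runtime; equally,
// nothing stops you from adding nonsense, which is exactly the integrity problem.
record Fact(string TopicId, string Domain, string Type, string Property, string Value);

class FreebaseStyleStore
{
    private readonly List<Fact> facts = new List<Fact>();

    public void Assert(Fact f) => facts.Add(f);

    public IEnumerable<string> ValuesOf(string topicId, string property) =>
        facts.Where(f => f.TopicId == topicId && f.Property == property)
             .Select(f => f.Value);
}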
I'll keep it short for today and revisit the topic later.

Essentials of setting up a well-performing production Sql Server instance

When you set up a Sql Server instance in production, it is better to get it right the first time. Obviously, you can't plan for the best performance when you don't know exactly how your data will grow, but there are a few key points that won't change regardless of your data growth pattern, and you had better get them right from day one. Two components of a typical Sql Server system need special attention: first the log file, and second tempdb.

Sql Server writes every transaction to the log file instantly as it happens, and then applies the changes to data pages; note that pages might not be written back to disk for a while. If the server crashes before a page is written down, you can recover from the log file. Tempdb is also very important because Sql Server does all its internal work in tempdb: things like index or table spooling, row versioning, etc. all happen in tempdb. Below is a list of precautions you can take to architect a well-performing mid-size Sql Server database (and by mid-size I mean anything above 1GB; for larger databases, say above 20GB, you also have to think about partitioning, filegroups, etc. as a minimum):

1- Keep the log file on a separate, dedicated, and fast drive. If you are using a virtual host, make sure to get a separate physical drive for ONLY your log file.

2- Run the Sql Server service under a specifically created account, not the generic service or network accounts. Make sure no one else (especially other service accounts) has any read or write access to the drive holding the log file. This ensures the disk head does not jump around, so Sql Server can write its log at the advertised sequential write speed of the disk.

3- Use log shipping and log-based replication with caution, because they read from the log file and may therefore make the disk head dance around instead of running smoothly.

4- Put tempdb on a very fast drive, possibly an SSD, or at least make sure its drive is separate from the database files, the log file, and even the operating system's drive.

5- Make sure you have enough RAM. RAM is cheap, but its effect on performance is bigger than anything else. Run a task that periodically monitors the "SQLServer:Buffer Manager - Page life expectancy" counter, and add more RAM when the value starts heading south; a minimal sketch of such a monitor follows.
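For point 5, here is a minimal C# sketch of such a monitoring task; the counter category name assumes a default Sql Server instance (a named instance would use "MSSQL$<InstanceName>:Buffer Manager"), and the 300-second threshold is only the old rule of thumb, not a hard rule:

using System;
using System.Diagnostics;
using System.Threading;

class PleMonitor
{
    static void Main()
    {
        using (var ple = new PerformanceCounter("SQLServer:Buffer Manager",
                                                "Page life expectancy"))
        {
            while (true)
            {
                // How long, in seconds, a page is expected to stay in the buffer pool.
                long seconds = ple.RawValue;
                Console.WriteLine("{0:t}  PLE = {1}s", DateTime.Now, seconds);
                if (seconds < 300)
                    Console.WriteLine("  -> consider adding RAM or checking for memory pressure");
                Thread.Sleep(TimeSpan.FromMinutes(5));
            }
        }
    }
}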

Obviously, the above considerations will not make your slow, dying database rock and roll, but they will ensure that environmental effects on performance are minimal. Then you can use your smart SQL techniques to make the database fast.

Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both of the above queries return a join of cars and resellers for V8 cars, and obviously both will return the same results. But does that mean there is no difference between these two ways of limiting the results with conditions?

Although for an inner join you can use both approaches interchangeably, for an outer join a subtle difference can catch you out. What would you expect to be the outcome of the following query?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

If you expect to get only the cars with a V8 engine, left joined to their resellers, you won't be happy with the result: every car is returned, and for the cars that are not V8s the reseller columns simply come back as NULL! You have to move c.Engine = 'V8' to the WHERE clause to guarantee the intended behaviour:


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

The reason for this behaviour is that Sql Server first applies the join predicates and then adds back the left-side rows that have no right-side match. To understand the behaviour better, think about the way you write a left join in Linq.


var res = from c in Cars
          join r in Resellers on c.CarId equals r.CarId into rs
          from reseller in rs.DefaultIfEmpty()
          select ...;

A left join in Sql Server is calculated the same way: you first run the join between Cars and Resellers, and then select null whenever the resellers collection is empty for a given car (i.e. rs.DefaultIfEmpty()).

Top K Query Processing, or FULL SCAN IS THE DEVIL! avoid it

Almost always, users don't want to see all the results coming back from a query. The result set is usually restricted by conditions, and the first page (or first few pages) of the response is all the user wants to see. If you asked me to describe database optimization in one sentence, it would be: avoid full table/index scans. This is the single key to highly performing database applications. However, avoiding full scans is not trivial at all; in fact, it is a hard problem. If your table has n records, a full table scan has complexity O(n), a join between two tables with full scans has complexity O(n^2), and so on. It can get out of control pretty easily.
Have you been in a situation where a tiny 10GB database with just a few million records takes 10 minutes to respond to a query?
Ever wondered how Google returns search results from its super-massive tables for millions of users on each keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be brought to its knees when complex queries full-scan massive tables. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever it takes to avoid full scans on large data sets, especially if the query runs frequently. Obviously there is no problem with a query that takes 5 minutes once every weekend, but a frequently used query should never take more than a few seconds. There is a lot you can do to avoid full scans, and it is well covered in query processing resources. Read this book to learn some techniques for doing such things in Ms Sql Server 2008.

Despite all the powerful tools and smartness that SQL Server or any other commercial DBMS provides, there are cases where top-k query processing is simply not possible by writing SQL alone. That is when you need to know about the algorithms. Yes, some computer science fun! Top-k query processing is a widely studied subject, and there are heaps of methods and techniques that can drastically improve application performance. Unfortunately, most commercial DBMSs do not support even the easiest top-k query processing techniques and algorithms, so you have to code them yourself if you decide to use them. A few top-k techniques like TA and PRA are quite quick and easy to develop, but some others are harder and much more complex. This 2008 ACM survey is an excellent source for gaining a good understanding of top-k query processing, thanks to the guys at the University of Waterloo, ON, Canada.
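Since TA is mentioned as one of the quick-and-easy techniques, here is a minimal, untested C# sketch of the idea for a weighted-sum score. The list and lookup shapes are my own assumptions (in a database, each sorted list would typically be a descending index on one column, and the random access a key lookup):

using System;
using System.Collections.Generic;
using System.Linq;

static class ThresholdAlgorithm
{
    public static List<(int Id, double Score)> TopK(
        IList<(int Id, double Value)>[] sortedLists, // one list per attribute, sorted descending by Value
        double[] weights,                            // positive weights, e.g. a and b
        Func<int, int, double> randomAccess,         // (listIndex, id) -> that attribute's value for id
        int k)
    {
        var seen = new Dictionary<int, double>();    // id -> full aggregated score of items seen so far
        for (int depth = 0; depth < sortedLists.Min(l => l.Count); depth++)
        {
            double threshold = 0;
            for (int i = 0; i < sortedLists.Length; i++)
            {
                var (id, value) = sortedLists[i][depth];   // sorted access
                threshold += weights[i] * value;           // best score any unseen item could still reach
                if (!seen.ContainsKey(id))
                {
                    double score = 0;                      // random access for the other attributes
                    for (int j = 0; j < sortedLists.Length; j++)
                        score += weights[j] * randomAccess(j, id);
                    seen[id] = score;
                }
            }
            var topK = seen.OrderByDescending(p => p.Value).Take(k).ToList();
            // Stop as soon as the k-th best item seen cannot be beaten by anything not yet seen.
            if (topK.Count == k && topK[k - 1].Value >= threshold)
                return topK.Select(p => (p.Key, p.Value)).ToList();
        }
        // Ran out of one of the lists: everything relevant has been seen.
        return seen.OrderByDescending(p => p.Value).Take(k)
                   .Select(p => (p.Key, p.Value)).ToList();
    }
}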

To be precise, my message in this post is that there is no excuse for long-running queries!

Are you a super smart senior consultant? See if you can pass my interview.

I have been looking around the market to change my job in the last few weeks and have come across very interesting stuff that is worth writing a book about, or at least a short story.

I have seen a Christian who wanted to develop the ultimate database inspired by Jesus, I have been to interviews where the company did not actually have a position yet, and I have faced people who think they are super smart senior consultants because they know how to stop an object from being finalized by the garbage collector.

If you are one of those super smart senior consultants who knows everything, see if you can answer this very basic, classic database question. If you can't, you should seriously rethink your title. Seriously, I can't believe someone could develop business applications for more than a decade and not be able to solve this problem in an interview:

Q. We have a couple of tables in a database. The first table has column X and the second table has column Y, and the first table has a foreign key relationship to the second. What do you do to calculate "SELECT TOP 1 * FROM X JOIN Y … ORDER BY a*X+b*Y DESC" without a full scan on either table, assuming a and b are arbitrary positive numbers provided by the user each time? You don't necessarily have to solve it with SQL alone; you are allowed to use C#. The above query is only there to define the problem.

Justification: why do I think this is a valid interview question? Two reasons. 1) Think how many times you have faced a client complaining about performance, and how many times you were in a situation where knowing how to keep an object from being finalized made the customer happy. 2) How easy is it to pick up the knowledge of algorithms for fast data processing on site, in contrast to some rare piece of technical terminology? Google "avoiding an object from being finalized" and you get your answer in a few seconds, but I would be interested to know how you would use Google to solve this problem.

Automatic Mid-tier Cache!

I have been thinking about this for a while. I started working on a Silverlight project last year, a nifty little business app. One of those apps that just sits there and works, you know: a web server, an average DB with a few million records, and a Silverlight client with a bunch of forms, grids, and menus.

Everything was straightforward and brainless: just follow a pattern and do the job. However, I was deeply dissatisfied with one aspect of the project: the caching. What I could not accept was that in the 21st century, when you have Entity Framework and IQueryable, you should still manually cache your data where appropriate and do all the pointless work of cache invalidation, loading, etc. Apart from the pain of working on something that should have been automated, I wouldn't trust a programmer to decide which parts of the data should be cached and which shouldn't. Not that I don't believe they could do a good job of it; they simply don't have enough information (at development time) to decide.

A caching strategy should be based on user behaviour and is subject to change with the passage of time. For example, at some stage lots of QLD pharmacies are queried, but the next week NSW users decide to get ready for their conferences and start hammering the system for NSW pharmacies.

Let me be clear about my expectations of a caching system. It should have the following characteristics:

  1. It should know what users are going to query a lot and cache that (and only that) part of the database.
  2. It should be able to re-use the caches. For example, if query 1 is "Give me all QLD pharmacies" and the next query 2 is "Give me all QLD Chemist Warehouses", the cache manager should be smart enough to run query 2 over the results of query 1, which were retrieved a few minutes ago.
  3. It should optimize the indexes for performance based on the user queries.
  4. It should change the cache when user behaviour changes.
  5. It should call back to the database only if there is absolutely no way of answering the query from the cache.

The above requests seem like a lot, but not really in 2011. All of this is possible; in fact, DBMSs have been doing this kind of thing for ages. We also have IQueryable, which makes it even easier to build a decent caching system.

So let me write a few examples:

Q1: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").Select(…)

Q2: Pharmacies.Join( … Address …).Join( … State …).Where( a => a.State == "QLD" && a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").GroupBy(…).Where( pg => pg.Count() > 4).Select(…)

Q4: PharmacyStaff.Where(ps => ps.Position == "Manager").Select(…)

Q5: Pharmacies.Join( … Address …).Join( … State …).Join( … PharmacyStaff …).Where( s => s.State == "QLD" && s.Position == "Manager").Select(…)

Users log in to our system and do things that cause the above queries to be issued. Normally they will all be issued against the database, but that means our caching strategy is as stupid as a donkey. What I would really expect is that only Q1 and Q4 are run against the database. Q2, Q3, and Q5 are all subsets of Q1 (or of Q1 joined with Q4); since we already have those results, it is such a waste to run these new queries against the database. Why not look at the expression tree and figure out that Q2 forms a query which is a subset of Q1? Then rewrite the queries as below:

Q1: not changed…

Q2: Q1.Where(a => a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Q1.GroupBy(…).Where(…)

Q4: not changed…

Q5: Q1.Join(…Q4…).Select(…)

Check out the above queries. Aren't they much better? We don't expect the user or programmer to waste their time translating those queries; the caching system should do that. It should be an IQueryable that reads the expression tree and translates it into a new expression tree that uses the existing data in the cache whenever there is no need to query the database.
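To make this less abstract, here is a deliberately naive C# sketch of the Q1/Q2 reuse for one hard-coded query shape. The Pharmacy type and the database callback are made up, and the whole point of the real project is to do this generically from the expression tree instead of by hand:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape; the real project would use EF entities.
record Pharmacy(string Name, string State, int PostCode);

class NaiveMidTierCache
{
    // state -> materialized results of the "all pharmacies in this state" query (Q1)
    private readonly Dictionary<string, List<Pharmacy>> cache = new Dictionary<string, List<Pharmacy>>();
    private readonly Func<string, List<Pharmacy>> queryDatabase; // stand-in for the real DB call

    public NaiveMidTierCache(Func<string, List<Pharmacy>> queryDatabase)
    {
        this.queryDatabase = queryDatabase;
    }

    // Q1: goes to the database only on a cache miss.
    public IEnumerable<Pharmacy> PharmaciesInState(string state)
    {
        if (!cache.TryGetValue(state, out var rows))
            cache[state] = rows = queryDatabase(state);
        return rows;
    }

    // Q2: a postcode range within the same state is a subset of Q1,
    // so it is answered with LINQ-to-Objects over the cached rows.
    public IEnumerable<Pharmacy> PharmaciesInPostcodeRange(string state, int from, int to) =>
        PharmaciesInState(state).Where(p => p.PostCode > from && p.PostCode < to);
}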

This especially makes sense in the CLOUD, where you have to pay for querying your SQL Azure database.

Enough talking about dreams; let's get realistic! I did a bit of research and, as I expected, no such caching manager exists (if you know of one, tell me and save my hair). So I decided to do it myself. Check out the Auto-mid-tier-cache project, which I have already started. I haven't gone far with it yet; it is just a proof of concept and it implements no IQueryable. It uses a set of objects I defined for relational algebra operators. It does very basic view-matching to find which query is a subset of which other, and it is able to translate queries to run against either the database or the cache.

I ran it, it worked fine, and a bunch of benchmarks showed its effectiveness. What is left now is to complete the view-matching and write an IQueryable on top of it. A lot of work, but it is worth it.

I forgot to say that you can limit the cache size by setting a cost upper-bound. The next issue is that it does not keep itself up to date, but that is really another story.

Column Store Databases

Most common RDBMSs, e.g. SQL Server, Oracle, etc., store data in rows. A bunch of rows forms a page, and a page is effectively the unit block that can be loaded into memory. This model is helpful when a large amount of data must be stored on disk: because disk operations are slow, you want to keep related information physically as close together as possible.

This also has a dark side: redundancy in the data. Most of the data we use comes from a limited domain, for example lists of cities, post codes, countries, names, area codes, etc. We usually don't care much about storage, since storage is the cheapest thing to buy these days, yet the redundancy comes at the price of performance.

If you want to count the number of distinct countries in your database of 10,000,000 records, you need to pass through all 10M rows. And if you are trying to run expensive categorization, ontology extraction, etc. on multiple columns, you have to deal with tons of duplicate values, which slow you down.

Also, if you want to change the schema of a data set dynamically, you have to deal with extra complexity. That is when column store databases come to the rescue with dramatic performance improvements. Yet there are certain tasks that will be very slow on such databases.

Column store databases like MonetDB store each data domain in a column, and maintain relationships as pointers into that data. This is much like the traditional way of storing data in memory, where your actual data objects live somewhere and a List<T> is organised as a list of pointers to them: you keep just one copy of the month names and everything else points to that.
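Here is a tiny C# sketch of that dictionary-encoding idea (my own illustration, not MonetDB's actual storage format):

using System.Collections.Generic;

// A dictionary-encoded "Country" column: one copy of every distinct value,
// and each row stores only a small integer pointing into that dictionary.
class EncodedColumn
{
    private readonly List<string> dictionary = new List<string>();           // distinct values
    private readonly Dictionary<string, int> codes = new Dictionary<string, int>();
    private readonly List<int> rows = new List<int>();                       // one code per row

    public void Append(string value)
    {
        if (!codes.TryGetValue(value, out int code))
        {
            code = dictionary.Count;
            dictionary.Add(value);
            codes[value] = code;
        }
        rows.Add(code);
    }

    // COUNT(DISTINCT country) is just the dictionary size: no scan over 10M rows needed.
    public int DistinctCount => dictionary.Count;

    public string this[int row] => dictionary[rows[row]];
}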

UPDATE: MonetDB has an interesting property in that it is designed to break the columns into sizes that fit into the CPU cache. The "memory wall" is a big problem for most modern DBMSs, as random access to large pages of data which do not fit in the CPU cache significantly reduces the performance of data processing. Column store databases work very well with bulk operations and an optimised set of relational algebra operators the authors call BQA (Bulk Query Algebra). Boncz's group at CWI, Amsterdam, who developed MonetDB, have an interesting little paper that describes the history and future of their work. I suggest you take a look at their paper (download it from Google Scholar).

Although it provides flexibility and a good performance boost in memory, the story turns around on disk. Indeed, the performance of reading a whole relation is as if the relation were joined to another table for every single one of its columns.

I think column storage is a very good candidate for caching data. I have been playing with this concept a bit in the auto-midtier-cache project; you can take a look at that too.