Kaggle is good start, but needs to be smarter

Kaglle.com is a website that hosts competitions for data scientists. Regardless of what data scientist means, and where is the line between data scientist (who invents techniques and algorithms) and data analyst (who uses existing tools and techniques to mine knowledge from data), the site is a good start.

I understand the business idea behind kaggle is partly collecting anonymous data analysts to solve data problems for enterprise. In return, individuals get rewards and businesses get their problem solved. Excellent win-win idea, but not so well implemented.

Personally, I tried to engage in the competitions, but I had a hard time motivating myself. Like (possibly) most users of the site, I was not there to win a competition, it would be great if that happens, but what would turn me on was learning from others. Unfortunately, kaggle did not put specific thoughts to promote collaboration. I was hoping to team up with experts, but I faced a real competition environment. Minimum openness and collaboration. This is a good model for sports competitions, and maybe the 1% whom are good enough and just want to win a competition and get some recognition, but wastes all the talents that can do good data analysis but don’t have the skill or will to win a competition.

The market for data analysis is huge, and someone has to start winning it. An average result from the data analysis is actually as valuable as the best possible result. Remember, the renowned Netflix competition, after 2 years and thousand of competitors, the estimation was only 10% better. Let’s face it. World’s top data analysts wouldn’t care about $2,000 prize or a free trip to a conference. Young and fresh data analysts also won’t get enough learning in Kaggle. Data analysis is not as rewarding as hacking is for teenagers, they should understand math, statistics, and maybe algorithms, so hacking competition style doesn’t work for data analysis.

I believe Kaggle is not going to win any significant share in the available market for crowd sourced data analysis. Another start-up with a better approach targeting average data analysts (or scientists if you like) is going to shine sooner or later.

Advertisements

A gesture recognition database for Kinnect!

I was at a demo for Kinnect SDK yesterday. I was thrilled with the technology. Kinnect is a wonderful piece of technology putting together a lot of cut edge technologies and pretty amazing algorithms. The infrared camera does a real magic and picks up a 3D image! A good estimation of depth for each pixel.
If you have ever spent time on image processing and finding blobs or objects using pattern recognition techniques, you know what I mean. Having such an accurate estimation of depth for each pixel can revolutionize any object recognition algorithm.
After the session I was deep thinking how can I do something quick and useful with kinnect! Of course every one else should have been doing the same, but as kinnect is designed for gaming, people tent to do some fun cool stuff with it. However, I am not a bug fan of computer games, so I thought I should do something different.

First thing came into my mind was of course using the depth camera for 3D object recognition and do a robot that moves around the house with a kinnect connected to the laptop. This could be an interesting idea, but sure it would be too much work and no real gain, except a bit of self satisfaction.
My second thought was to do something with the gesture recognition. Like having a cloud based gesture database and a bunch of algorithms to realize a gesture. Yes, this idea sounds promising for a bunch of reasons: First, it does not involve any screw driver (which I hate). Second, it is a database. Third, if it is done properly, people may use it. Third, I don’t actually need a kinnect for 98% of the work! (UPDATE: Check out this amazing free robot simulator by Microsoft and the channel 9 talk, where you can do all robotic things without having kinnect or using any screw or wire.)

Gesture Recognition:
——————————
Gesture recognition techniques are as old as computers, there has been enormous amount of research on it. This IEEE Transactions survey paper is a short survey about the most useful and handy techniques of gesture recognition. Although, there is a big body of knowledge there in the scientific community, I am not going to go that way. So don’t panic. First of all I am not interested to do a research degree on computer vision, and second I prefer a different look to the problem. A developer’s look.

Why not existing gesture recognition techniques:
———————————————————————–
The reason I think I would not need to adopt existing techniques is that, these scientists, did not have kinnect, so they could not look at the problem the way we can. Second, scientists deliberately make the problem artificially harder to get a better publication. We want to make the problem easier to get the app working.

Why not machine learning?
—————————————
Don’t take me wrong. Machine learning is amazing, but not for a typical development project for a variety of reasons: First, machine learning is really hard to implement well, even though there are a few projects that can be utilized for this purpose (like google’s prediction API [http://code.google.com/apis/predict/]), it is still very hard to get something useful out of it. Second, how would you test you project that is depended to some machine learning? Third, how could you debug or fine tune your solution? You will be very limited to machine’s suggestion and believe me, machine’s suggestions can sometime really piss you off!

So what?
————
So lets plan a good approach for a gesture database. I would like to have a simple gesture database that works reasonably well at most of the times. Not planning to beat any existing techniques, but I need to work within the following constraints:
1- Gesture data should be recorded in a Sql Server.
2- Finding a gesture between tens of thousands of gestures should happen instantly. This infers some sort of index should exist.
3- A developer should be able to define a gesture purely using code (or strings). Necessary for testing your solution.
4- No crazy algorithm. Beauty is in simplicity.
5- You don’t need to have kinnect to work with the product.

My suggestion:
——————————

I have already got a suggestion for solving this problem which is not tested but I think can work. Let’s build up a bit of theory around the story.
What is a gesture? Gesture is a bunch of movements of a bunch of body elements which infer a meaning.
What is a gesture but make it simpler? Gesture is pattern of movement of a bunch of point in a timely fashion starting from a starting point.
What is a geture? Please make in simple. Gesture starts with a pose. Then each point of the pose animate in a pattern.
What is a pose? If I am standing with my hands up, without moving, I am in a pose called Hands Up!.
Ok! Then how you make a gesture out of a pose? If I am in a pose (e.g. hands up) and start moving my hand in a left-right pattern for a while I am having a waving gesture.
Good! Now I understand gesture.

Let us assume a pose is location of a bunch of points in space. We define some basic terms for further use:
Space: Let us assume space as the area that a man body fits with open hands. This is of course relative to the body but we normalize that area for every person into a cube.

Pose: A collection of named points (like hand, head, elbow, knee, etc.) in space. The point may be defined with absolute positions or with relative positions from each other.

Perspective: This is a key point for the rest of this approach. A pose can have multiple perspectives. Each perspective is mapping (or reflection) of the pose on one plate of the space. For example one perspective may show a flattened 2D image of a 3D body by correlating each point from the space to the plate. We need three 2D perspectives from a pose:

Top perspective (looking to the person from above).
Left perspective (looking to the person from left) and,
Front perspective (looking to the person from front).

For each 2D perspective, we generate two 1D perspectives:
Horizontal perspective (x position of points), and
Vertical perspective (y position of point)

Using above perspectives a shooting man pose can be expressed with a 6 1D perspectives. (2 1D perspective time three 2D perspectives).
Each 1D perspective can be described with a simple string:
Shooting person:
Perspective: Left.vertical = “F.H1.H2”
Perspective: Left.horizontal = “H1.H2-F”

Let us use the following conventions:
F stands for “Face”
H1 stands for “right hand”
H2 stands for “left hand”
period(.) stands for very short distance
dash(-) stands for average distance
underline(_) stands for long distance

Hence, Left.vertical perspective of the shooting man says: There is a right hand very close to left hand then a bit further there is a head (if you look from the above).
Left.horizontal perspective of the shooting man says: There is left hand, right hand, and a face around the same point if you are looking from the front.

Bingo! We have a language to define a pose!

Let’s assume that the language is good and the string can be really used to define a pose. Of course it is not tested yet but I would be really thankful if someone experiments with a few real pictures.

Now that we can define poses, how can we find a similar pose from a database quickly?
Assuming the pose language is useful, this is not an issue at all. The problem is similar to fuzzy string matching problem. In fact a pose is a string, and you may want to find other string approximately similar to this one (for example, if the position of left and right hands of the shooting person is different it is still a shooting position).
It is not really a hard problem, trust me. It can be used using basic q-gram technique. The only difference is that we may need to have a specific equality comparison logic which can be easily implemented with a Clr type definition:
Here is the equality comparison logic:
H1.H2 is equal to H2.H1 (if two points are very close together we can use them interchangeably)
H2-H1 is not equal to H2-h1 (if the points are distant enough, they can’t be interchangeable)
User may define options for equality of objects, for example if the left hand and right hand does not matter to us, we can say Options.LeftAndRightHandsAreEqual!

I think it is enough for now. Using above technique we can have a pose database and a fast method to search a given pose within millions of poses and find other similar poses in a matter of milliseconds. We can do this for all the perspectives and shortlist the intersections for a much better search quality.
The next challenge (apart implementing and testing the technique) is to define the gesture after the pose.

Ever wondered how text search works? Let’s do it!

You have certainly seen the fuzzy text searches everywhere. For example you type “stck” but you actually mean “stack”! Ever wondered how does this stuff work?

There are plenty of algorithms to do fuzzy text matching, each with its own pro and cons. The most famous ones are edit distance and qgram. I want to focus on qgrams today and implement a sample.

Basically qgrams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple. “q” in qgram will be replaced with a number like 2-gram or 3-gram or even 4-gram.

2-gram means that every word is broken into a set of two character grams. “Stack” will be broken into a set of {“st”, “ta”, “ac”, “ck”} or “database” will be broken into {“da”,”at”,”ta”,”ba”,”as”,”se”}.

Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example if user mistyped “stck”, any search for “stck” will not match “stack” because “a” is missing, but the 2-gram set {“st”,”tc”,”ck”} has 2 rows in common with the 2-gram set of stack! Bingo we found a pretty close match. It has nothing in common with the 2-gram set of database and only 1 in common with the 2-gram set of “star” so we can easily suggest the user that he meant to type: first “stack” or second, “star”.

Now let’s implement it using Sql Server: Assume a hypothetical words dataset. You need to have a many to many relationship between 2grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

Grams table should be clustered on first twog, and then the wordId for performance. When you query a word (e.g. stack) you put the grams in a temp table. First lets create a few million dummy records.

--make millions of 2grams
 DECLARE @i int =0
 WHILE (@i<5000000)
 BEGIN
-- a random 2gram
 declare @rnum1 char = CHAR(CAST(RAND()*28 AS INT)+97)
 declare @rnum2 char = CHAR(CAST(RAND()*28 AS INT)+97)
 INS... INTO Grams (twog, wordId) VALUES ( @rnum1 + @rnum2, CAST(RAND()*100000 AS int))
 END

Now lets query the word “stack” which will be broken to: {‘st’,’ta’,’ac’,’ck’} two grams.

DECLARE @word TABLE(twog char(2)) -- 'stack'
 INS... INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

select wordId, count(*) from @word w inner join Grams g ON w.twog = g.twog
 GROUP BY wordId

You should make sure that Sql Server uses a bunch of clustered index seeks (or loockups) for running this query. It should be the natural choice but sometimes statistics may be corrupted or out of date and SqlServer may decide that a full scan is cheaper. This usually happens if it does not know the cardinality of left side table, for example SqlServer may assume that @word table is massive and millions of loockups is going to be more expensive than a full index scan.

Freebase

Freebase is by far the most flexible database I have heard of. It is a very cool concept indeed, and it allows amazing discoveries. It also provides a free sevice for querying and inserting data.
Freebase schema is this: Domain, Type, and property. That’s it. And with this little schema you can have everything, absoloutly everything. Ofcouse flexibility comes with a big price tag on cost of managing stuff that we take for granted in our rdbmses. For example assurance of the integrity of data sounds to be very challenging. This doesnt make freebase style the scema of choice for enterprise, but wait a minte! Really interprises can not benefit from such database? Sure they can and I argue it is a necessity for any big enterprise.
I keep it short for today and will revisit the topic later.

Essentials of setting up a good performing production Sql Server instance

To set up a Sql Server in production, it is better to get it right for the first time. Obviously, you can’t plan for the best performance when you don’t know how exactly your data will grow, but there are a few key points that won’t change regardless of your data growth pattern. You better get them right from the first day. Two components of a typical sql server system need special attention: First is the log file, and second is the tempdb.

Sql Server writes any transaction on the log file instantly as it happens, it then writes them back to pages. Note that pages might not be written back to disk for a while. If server crashes before writing down a page, you can recover from the log file. Tempdb is also very important bacause sql server does all its internal jobs on tempdb. Stuff like index or table spooling, row versioning, etc all happen at tempdb. Below are the list of cautions you may take to architect a good performing mid size sql server database: (and by mid size I mean anything above 1GB. For larger databases -say above 20GB- you also have to think about partitioning, filegroups, etc as minimum).

1- Keep the log file in a sperate dedicated and fast drive. If you are using a virtual host make sure to get a separate physical drive for ONLY your log file.

2- Run sql service using a specificly created account not the generic service or network accounts. Make sure no one (specially service accounts) has any read or write access to the hard drive with the log file. This will asure that the disk head will not mock around, so that Sql Server can write its log files using the advertised hard disk write speed.

3- Use log shipping and log based replication with caution, because they read from the log file, hence they may make hard disk’s head to dance instead of running.

4- Put the tempdb on a very fast har drive. Possibly an SSD or at least make sure its hard drive is seperated from the database file, log file, and even operating system’s drive.

5- Make sure you have enough RAM. RAMs are cheap but their effect on performance is more than anything. Run a task that monitors the “SqlServer – BufferManagement: Page Life Expectancy” counter periodically, and add more RAM as the value starts heading south.

Obviously, above considerations will not make your slow dead database to rock and roll but it will assure you that effect of the environmental on performance is minimum. Then you can use your smart Sql tachniques to make the db fast and performing.

Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = "V8"

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both above queries return a join of cars and resellers for V8 cars. And obviously both queries will return same results. But does it mean that there is no difference between these two way of limiting the results by conditions?

Although for inner join you can use both approaches interchangeably, for outer join a subtle difference can catch you. What would you expect to be the outcome of the following query?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = "V8"

If you expect to get all the cars with V8 engine left joined to the resellers, you won’t be happy with the result because if there is a car that is not a V8 but has no resellers, will be reflected in the results! You have to move c.Engine=’V8′ to the WHERE clause to guarantee the correct behaviour..


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = "V8"

Reason for this behaviour is that Sql Server will first apply the outer join predicates and then reverts rows that have no right side representation. To understand the condition better, think about the way you write a left join in Linq.


var res = from c in Cars
join rs in Resellers on c.CarId equals rs.CarId
from r in rs.DefaultIfEmpty()
select ...;

Left join in Sql server is also calculated the same way. In the above example you first run the join between Cars and Resellers, then select Null when the resellers collection is empty for the join (i.e. rs.DefaultIfEmpty()).

Top K Query Processing, or FULL SCAN IS THE DEVIL! avoid it

Almost always, users wouldn’t want to see all the results coming back from query. The query result set is usually restricted by conditions and first (or first few) page(s) of response is all user wants to see. If you ask me to describe db optimization in a sentence, I would say the sentence is: avoid full table/index scans. This is the single key to highly performing db applications. However, avoiding full scans is not trivial at all. In fact, it is a hard problem. If your table has n records, full table scan has complexity of O(n). A join between two tables with full scans will have the complexity of O(n^2) and so on. It can get out of control pretty easily.
Have you been in a situation that a tiny 10GB database with just a few million records takes 10 minutes to respond to a query?
Ever wondered how google returns search results from its super massive tables for millions of user on eack keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be on its knees when a complex query full scans massive tables for a bunch of queries. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever to avoid full scans on large data sets specially if the query is run frequently. Obviously there is no problem to run a query that takes 5 minutes once every weekend but never a frequently used query should take more that a few seconds. There are heaps you can do to avoid full scans and it is well covered under query processing resources. Read this book to learn some techniques to do such stuff in Ms Sql Server 2008.

Despite all the powerfull tools and smartness that SQL Server or any other commercial DBMS provides, there are cases that top k query processing is not possible at all with just writing SQL. That is when you need to know about the algorithms. Yes! some computer science fun! Top k query processing is a widely studied subject, and there are heaps of methods and techniques that can super drastrically improve the application performance! Unfortunately most comercial DBMSs do not support even easiest top k query processing techniques and algorithms. Hence, you have to code them yourself if you decided to use them. A few top k query processing techniques like TA and PRA are quiet quick and easy to develop but some others are harder and much more complex. This 2008 Acm Survey is an excellent source to gain good understanding of top k query processing thanks to the guys at Waterloo university, ON, Canada.

To be precise. my message on this post is that there is no excuse for long running queries!