Kaggle is a good start, but needs to be smarter

Kaggle.com is a website that hosts competitions for data scientists. Regardless of what "data scientist" means, and where the line is between a data scientist (who invents techniques and algorithms) and a data analyst (who uses existing tools and techniques to mine knowledge from data), the site is a good start.

I understand the business idea behind Kaggle is partly to gather anonymous data analysts to solve data problems for enterprises. In return, individuals get rewards and businesses get their problems solved. An excellent win-win idea, but not so well implemented.

Personally, I tried to engage in the competitions, but I had a hard time motivating myself. Like (possibly) most users of the site, I was not there to win a competition; it would be great if that happened, but what really motivated me was learning from others. Unfortunately, Kaggle has not put much thought into promoting collaboration. I was hoping to team up with experts, but instead I faced a genuinely competitive environment with minimal openness and collaboration. This is a good model for sports competitions, and maybe for the 1% who are good enough and just want to win a competition and get some recognition, but it wastes all the talent that can do good data analysis yet doesn't have the skill or will to win a competition.

The market for data analysis is huge, and someone has to start winning it. An average result from data analysis is often nearly as valuable as the best possible result. Remember the renowned Netflix competition: after two years and thousands of competitors, the estimation was only about 10% better. Let's face it: the world's top data analysts wouldn't care about a $2,000 prize or a free trip to a conference. Young and fresh data analysts also won't get enough learning out of Kaggle. Data analysis is not as immediately rewarding as hacking is for teenagers; it requires an understanding of math, statistics, and maybe algorithms, so the hacking-competition style doesn't work for data analysis.

I believe Kaggle is not going to win any significant share of the available market for crowd-sourced data analysis. Another start-up with a better approach, targeting average data analysts (or scientists, if you like), is going to shine sooner or later.


A gesture recognition database for Kinect!

I was at a demo of the Kinect SDK yesterday, and I was thrilled with the technology. Kinect is a wonderful piece of hardware that puts together a lot of cutting-edge technologies and pretty amazing algorithms. The infrared camera does real magic and picks up a 3D image: a good estimation of depth for each pixel.
If you have ever spent time on image processing and finding blobs or objects using pattern recognition techniques, you know what I mean. Having such an accurate estimation of depth for each pixel can revolutionize any object recognition algorithm.
After the session I was thinking hard about how I could do something quick and useful with Kinect. Of course everyone else must have been doing the same, but since Kinect is designed for gaming, people tend to do fun, cool stuff with it. However, I am not a big fan of computer games, so I thought I should do something different.

The first thing that came into my mind was, of course, using the depth camera for 3D object recognition and building a robot that moves around the house with a Kinect connected to a laptop. This could be an interesting idea, but surely it would be too much work for no real gain, except a bit of self-satisfaction.
My second thought was to do something with gesture recognition, like a cloud-based gesture database and a bunch of algorithms to recognize a gesture. Yes, this idea sounds promising for a bunch of reasons: First, it does not involve any screwdriver (which I hate). Second, it is a database. Third, if it is done properly, people may actually use it. Fourth, I don't actually need a Kinect for 98% of the work! (UPDATE: Check out this amazing free robot simulator by Microsoft and the Channel 9 talk, where you can do all the robotic things without having a Kinect or using any screw or wire.)

Gesture Recognition:
——————————
Gesture recognition techniques are as old as computers, and there has been an enormous amount of research on them. This IEEE Transactions survey paper is a short survey of the most useful and handy gesture recognition techniques. Although there is a big body of knowledge in the scientific community, I am not going to go that way, so don't panic. First of all, I am not interested in doing a research degree in computer vision, and second, I prefer a different look at the problem: a developer's look.

Why not existing gesture recognition techniques:
———————————————————————–
The reason I think I would not need to adopt existing techniques is that those scientists did not have Kinect, so they could not look at the problem the way we can. Second, scientists deliberately make the problem artificially harder to get a better publication; we want to make the problem easier to get the app working.

Why not machine learning?
—————————————
Don't get me wrong. Machine learning is amazing, but not for a typical development project, for a variety of reasons: First, machine learning is really hard to implement well; even though there are a few projects that can be utilized for this purpose (like Google's Prediction API [http://code.google.com/apis/predict/]), it is still very hard to get something useful out of it. Second, how would you test a project that depends on some machine learning? Third, how could you debug or fine-tune your solution? You will be very limited by the machine's suggestions, and believe me, the machine's suggestions can sometimes really piss you off!

So what?
————
So let's plan a good approach for a gesture database. I would like to have a simple gesture database that works reasonably well most of the time. I am not planning to beat any existing techniques, but I need to work within the following constraints:
1- Gesture data should be recorded in SQL Server.
2- Finding a gesture among tens of thousands of gestures should happen instantly. This implies some sort of index should exist.
3- A developer should be able to define a gesture purely using code (or strings). This is necessary for testing your solution.
4- No crazy algorithm. Beauty is in simplicity.
5- You don't need to have a Kinect to work with the product.

My suggestion:
——————————

I already have a suggestion for solving this problem, which is not tested, but I think it can work. Let's build up a bit of theory around the story.
What is a gesture? A gesture is a bunch of movements of a bunch of body parts which convey a meaning.
What is a gesture, but simpler? A gesture is a pattern of movement of a bunch of points over time, starting from a starting point.
What is a gesture? Please make it simple. A gesture starts with a pose. Then each point of the pose animates in a pattern.
What is a pose? If I am standing with my hands up, without moving, I am in a pose called Hands Up!
OK! Then how do you make a gesture out of a pose? If I am in a pose (e.g. hands up) and start moving my hands in a left-right pattern for a while, I am making a waving gesture.
Good! Now I understand gestures.

Let us assume a pose is the location of a bunch of points in space. We define some basic terms for further use:
Space: Let us define space as the area that a human body fits in with open hands. This is of course relative to the body, but we normalize that area into a cube for every person.

Pose: A collection of named points (like hand, head, elbow, knee, etc.) in space. The points may be defined with absolute positions or with positions relative to each other.
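To make Space and Pose concrete, here is a minimal sketch (my own untested code with hypothetical type names, not part of the actual design): a pose held as a dictionary of named 3D points, normalized into the unit cube described under Space so that bodies of different sizes become comparable.

// A minimal sketch of pose normalization into the unit cube (hypothetical names).
using System;
using System.Collections.Generic;
using System.Linq;

public class Point3 { public double X, Y, Z; }

public static class PoseNormalizer
{
    // Maps every named point into the [0,1] cube defined by the body's own bounding box.
    // Assumes the pose contains at least one point.
    public static Dictionary<string, Point3> Normalize(Dictionary<string, Point3> pose)
    {
        double minX = pose.Values.Min(p => p.X), maxX = pose.Values.Max(p => p.X);
        double minY = pose.Values.Min(p => p.Y), maxY = pose.Values.Max(p => p.Y);
        double minZ = pose.Values.Min(p => p.Z), maxZ = pose.Values.Max(p => p.Z);

        Func<double, double, double, double> scale =
            (v, min, max) => max > min ? (v - min) / (max - min) : 0.5;

        return pose.ToDictionary(
            kv => kv.Key,
            kv => new Point3
            {
                X = scale(kv.Value.X, minX, maxX),
                Y = scale(kv.Value.Y, minY, maxY),
                Z = scale(kv.Value.Z, minZ, maxZ)
            });
    }
}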

Perspective: This is a key point for the rest of this approach. A pose can have multiple perspectives. Each perspective is a mapping (or projection) of the pose onto one plane of the space. For example, one perspective may show a flattened 2D image of a 3D body by projecting each point from the space onto the plane. We need three 2D perspectives of a pose:

Top perspective (looking at the person from above),
Left perspective (looking at the person from the left), and
Front perspective (looking at the person from the front).

For each 2D perspective, we generate two 1D perspectives:
Horizontal perspective (x position of points), and
Vertical perspective (y position of points).

Using the above perspectives, a shooting-man pose can be expressed with six 1D perspectives (two 1D perspectives times three 2D perspectives).
Each 1D perspective can be described with a simple string:
Shooting person:
Perspective: Left.vertical = “F.H1.H2”
Perspective: Left.horizontal = “H1.H2-F”

Let us use the following conventions:
F stands for “Face”
H1 stands for “right hand”
H2 stands for “left hand”
period (.) stands for a very short distance
dash (-) stands for an average distance
underline (_) stands for a long distance

Hence, the Left.vertical perspective of the shooting man says: the face, the right hand, and the left hand are all very close to each other (vertical positions, seen from the left).
The Left.horizontal perspective of the shooting man says: the right hand is very close to the left hand, and then the face is an average distance away (horizontal positions, seen from the left).

Bingo! We have a language to define a pose!
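To show how such strings could be produced, here is a rough sketch (my own untested code, thresholds are guesses): it takes the named 1D positions of one perspective (already normalized into [0,1]) and emits a pose string using the conventions above.

// A rough sketch of turning one 1D perspective into a pose string:
// '.' = very short gap, '-' = average gap, '_' = long gap. Thresholds would need tuning.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class PerspectiveEncoder
{
    // points: named 1D positions normalized into [0,1],
    // e.g. { "F", 0.90 }, { "H1", 0.82 }, { "H2", 0.80 } for Left.vertical of a shooting pose.
    public static string Encode(Dictionary<string, double> points)
    {
        var ordered = points.OrderBy(kv => kv.Value).ToList();
        var sb = new StringBuilder(ordered[0].Key);

        for (int i = 1; i < ordered.Count; i++)
        {
            double gap = ordered[i].Value - ordered[i - 1].Value;
            char separator = gap < 0.1 ? '.' : gap < 0.3 ? '-' : '_';
            sb.Append(separator).Append(ordered[i].Key);
        }
        return sb.ToString();
    }
}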

Let's assume that the language is good and the string can really be used to define a pose. Of course, it is not tested yet, but I would be really thankful if someone experimented with a few real pictures.

Now that we can define poses, how can we find a similar pose from a database quickly?
Assuming the pose language is useful, this is not an issue at all. The problem is similar to the fuzzy string matching problem. In fact, a pose is a string, and you may want to find other strings approximately similar to it (for example, if the left and right hands of the shooting person are swapped, it is still a shooting pose).
It is not really a hard problem, trust me. It can be solved using a basic q-gram technique. The only difference is that we may need a specific equality comparison logic, which can be easily implemented with a CLR type definition.
Here is the equality comparison logic:
H1.H2 is equal to H2.H1 (if two points are very close together, we can use them interchangeably).
H1-H2 is not equal to H2-H1 (if the points are distant enough, they are not interchangeable).
The user may define options for equality of points; for example, if the distinction between left and right hands does not matter to us, we can say Options.LeftAndRightHandsAreEqual!
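To make that concrete, here is a rough, untested sketch of q-gram matching over pose strings (my own code, not a CLR UDT yet). The interchangeability rule is applied as a canonicalization step: parts joined by '.' are sorted so H2.H1 and H1.H2 compare equal, while parts separated by '-' or '_' keep their order.

// Fuzzy pose matching: canonicalize, then compare q-gram sets with Jaccard similarity.
using System;
using System.Collections.Generic;
using System.Linq;

public static class PoseMatcher
{
    // Sort the part names inside every '.'-connected cluster.
    public static string Canonicalize(string pose)
    {
        var result = new List<string>();
        int start = 0;
        for (int i = 0; i <= pose.Length; i++)
        {
            if (i == pose.Length || pose[i] == '-' || pose[i] == '_')
            {
                var cluster = pose.Substring(start, i - start).Split('.');
                Array.Sort(cluster, StringComparer.Ordinal);
                result.Add(string.Join(".", cluster));
                if (i < pose.Length) result.Add(pose[i].ToString());
                start = i + 1;
            }
        }
        return string.Concat(result);
    }

    // Character q-grams of the canonical string; these sets could be persisted
    // and indexed in SQL Server for fast candidate lookup.
    public static HashSet<string> QGrams(string s, int q = 2)
    {
        var grams = new HashSet<string>();
        string padded = "#" + s + "#";   // padding so boundaries contribute grams too
        for (int i = 0; i + q <= padded.Length; i++)
            grams.Add(padded.Substring(i, q));
        return grams;
    }

    // Jaccard similarity between the q-gram sets of two pose strings.
    public static double Similarity(string a, string b)
    {
        var ga = QGrams(Canonicalize(a));
        var gb = QGrams(Canonicalize(b));
        int common = ga.Intersect(gb).Count();
        return (double)common / (ga.Count + gb.Count - common);
    }
}

With this, "H2.H1-F" and "H1.H2-F" compare as identical, while "H1-H2" and "H2-H1" do not, and the q-gram sets stored per perspective can be indexed so candidate poses are short-listed before computing the exact similarity.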

I think that is enough for now. Using the above technique, we can have a pose database and a fast method to search for a given pose among millions of poses and find similar poses in a matter of milliseconds. We can do this for all the perspectives and shortlist the intersection for much better search quality.
The next challenge (apart from implementing and testing the technique) is to define the gesture on top of the pose.

A little puzzle

A chicken and a half lay an egg and a half in a day and a half. How many eggs would one chicken lay in three days?

Hint: 3 is not the correct answer.

Top-K Query Processing, or FULL SCAN IS THE DEVIL! Avoid it

Almost always, users don't want to see all the results coming back from a query. The query result set is usually restricted by conditions, and the first page (or first few pages) of the response is all the user wants to see. If you ask me to describe database optimization in a sentence, it would be: avoid full table/index scans. This is the single key to highly performant database applications. However, avoiding full scans is not trivial at all; in fact, it is a hard problem. If your table has n records, a full table scan has complexity O(n). A join between two tables with full scans has complexity O(n^2), and so on. It can get out of control pretty easily.
Have you ever been in a situation where a tiny 10 GB database with just a few million records takes 10 minutes to respond to a query?
Ever wondered how Google returns search results from its massive tables for millions of users on each keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be brought to its knees when complex queries full-scan massive tables. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever it takes to avoid full scans on large data sets, especially if the query runs frequently. Obviously there is no problem running a query that takes 5 minutes once every weekend, but a frequently used query should never take more than a few seconds. There is plenty you can do to avoid full scans, and it is well covered in query processing resources. Read this book to learn some techniques for doing this in MS SQL Server 2008.

Despite all the powerful tools and smartness that SQL Server or any other commercial DBMS provides, there are cases where top-k query processing is not possible just by writing SQL. That is when you need to know about the algorithms. Yes, some computer science fun! Top-k query processing is a widely studied subject, and there are heaps of methods and techniques that can drastically improve application performance. Unfortunately, most commercial DBMSs do not support even the easiest top-k query processing techniques and algorithms, so you have to code them yourself if you decide to use them. A few top-k techniques, like TA and PRA, are quite quick and easy to develop, but others are harder and much more complex. This 2008 ACM survey is an excellent source for gaining a good understanding of top-k query processing, thanks to the guys at the University of Waterloo, ON, Canada.
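As an illustration, here is a minimal sketch (my own code, not production quality) of the classic Threshold Algorithm (TA) mentioned above, for two score lists sorted descending and non-negative scores aggregated by sum; any monotone aggregation (e.g. a weighted sum) works the same way.

// Threshold Algorithm sketch: sorted access over both lists, random access to fill in
// the missing score, and early termination once the k-th best score reaches the threshold.
using System;
using System.Collections.Generic;
using System.Linq;

public static class ThresholdAlgorithm
{
    // listX, listY: (id, score) pairs sorted by score descending.
    // scoreX, scoreY: random-access lookups (e.g. a primary-key seek in SQL Server).
    public static List<int> TopK(
        IList<(int Id, double Score)> listX,
        IList<(int Id, double Score)> listY,
        Func<int, double> scoreX,
        Func<int, double> scoreY,
        int k)
    {
        var best = new Dictionary<int, double>();   // id -> aggregated score
        int depth = Math.Max(listX.Count, listY.Count);

        for (int i = 0; i < depth; i++)
        {
            // Sorted access, round-robin over both lists; random access fills the other score.
            foreach (var id in SeenAt(listX, i).Concat(SeenAt(listY, i)))
                if (!best.ContainsKey(id))
                    best[id] = scoreX(id) + scoreY(id);

            // Threshold: the best aggregated score any unseen object could still achieve.
            double threshold =
                (i < listX.Count ? listX[i].Score : 0) +
                (i < listY.Count ? listY[i].Score : 0);

            var topK = best.OrderByDescending(kv => kv.Value).Take(k).ToList();
            if (topK.Count == k && topK.Last().Value >= threshold)
                return topK.Select(kv => kv.Key).ToList();   // safe to stop early
        }

        return best.OrderByDescending(kv => kv.Value).Take(k).Select(kv => kv.Key).ToList();
    }

    private static IEnumerable<int> SeenAt(IList<(int Id, double Score)> list, int i)
    {
        if (i < list.Count) yield return list[i].Id;
    }
}

In a database setting, each "list" is an index ordered by one score column, sorted access is an ordered index scan, and random access is a key lookup, so TA typically touches only a small prefix of each index instead of scanning whole tables.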

To be precise, my message in this post is that there is no excuse for long-running queries!

Are you a super smart senior consultant? See if you can pass my interview.

I have been looking around the market to change my job in the last few weeks and have come across very interesting stuff that is worth writing a book about, or at least a short story.

I have seen a Christian who wanted to develop the ultimate database inspired by Jesus, have been to interviews where the company did not actually have a position yet, and have faced people who think they are super smart senior consultants because they know how to stop an object from being finalized by the garbage collector.

If you are one of those super smart senior consultants who knows everything, see if you can answer this very basic, classic database question. If you can't, you should seriously rethink your title. Seriously, I can't believe someone could be developing business applications for more than a decade and not be able to solve this problem in an interview:

Q. We have a couple of tables in a database. The first table has a column X and the second table has a column Y. The first table has a foreign key relationship to the second table. How do you calculate "SELECT TOP 1 * FROM X JOIN Y … ORDER BY a*X + b*Y DESC" without a full scan on any table, assuming a and b are arbitrary positive numbers provided by the user each time? You don't necessarily have to solve it with only SQL; you are allowed to use C#. The above query is only there to define the problem.

Justification: Why do I think this is a valid interview question? Two reasons: 1) Think how many times you have faced a client complaining about performance, versus how many times you were in a situation where knowing how to stop an object from being finalized made the customer happy. 2) How easy is it to pick up the knowledge of fast data processing algorithms on site, in contrast to some rare technical terminology? Google "avoiding an object from being finalized" and you get your answer in a few seconds, but I would be interested to know how you would use Google to solve this problem.

DBLP for SQL Server

DBLP is the mainstream computer science publication database. Everybody can download its dataset from here for free. It is a good benchmark dataset for many algorithmic research purposes. For example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which includes numerous data quality issues.

I have used this dataset multiple times for my own work; however, for someone like me who uses a commercial RDBMS, a little bit of work is required to utilise it. The original dataset, which you can download from the above link, is an XML file suited to a column-store database (like MonetDB). However, most commercial DBMSs are table stores, and you need to find a way to make this XML suitable for your DBMS.

Although it is not particularly hard, taking only a few hours to figure out what to do and less than an hour to implement, I have put my solution HERE, which may save you a few hours.

It basically converts DBLP's XML to flat CSVs that you can import into your DBMS. A SQL Server 2008 R2 backup file is also included in the solution for your convenience.

(Update: Sorry, I removed the SQL Server backup file because it was slowing down the checkout.)

To use the code, first check it out from the project folder (using SVN). There is a project called Dblp_xml2csv, which is a console application. Set this project as the startup project and run the console app.

You need to have downloaded the DBLP dataset in XML format. Go to where dblp_xml2csv.exe is built and run the following command:
>dblp_xml2csv.exe dblp-data.xml

Replace dblp-data.xml with the path to your DBLP XML file.

There is no need for the companion XSD, because for performance reasons the XML dataset is treated as a plain text file.

Running the above command will create a couple of .csv files. You can then import the CSV files into SQL Server, Excel, etc.
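This is not the actual Dblp_xml2csv code, but a small sketch of the text-streaming idea described above: read the huge XML line by line (no XSD, no DOM), collect a few fields per publication, and emit one CSV row per element. Tag handling and the field list are deliberately simplified.

// Sketch only: stream dblp.xml as text and write a minimal key,title,year CSV.
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class DblpXmlToCsv
{
    public static void Convert(string xmlPath, string csvPath)
    {
        using var reader = new StreamReader(xmlPath);
        using var writer = new StreamWriter(csvPath);
        writer.WriteLine("key,title,year");

        string key = null, title = null, year = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var keyMatch = Regex.Match(line, "<article[^>]*key=\"([^\"]+)\"");
            if (keyMatch.Success) key = keyMatch.Groups[1].Value;

            var titleMatch = Regex.Match(line, "<title>(.*?)</title>");
            if (titleMatch.Success) title = titleMatch.Groups[1].Value.Replace(",", " ");

            var yearMatch = Regex.Match(line, "<year>(\\d{4})</year>");
            if (yearMatch.Success) year = yearMatch.Groups[1].Value;

            if (line.Contains("</article>") && key != null)
            {
                writer.WriteLine($"{key},{title},{year}");
                key = title = year = null;
            }
        }
    }
}

The actual tool produces a couple of CSV files rather than one; the sketch only shows the general streaming idea.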

The additional stuff in the solution is related to some work I was doing on a data quality estimation technique and will probably not be very helpful to you.

View Matching

View matching is a fairly old technique, first seriously utilised at the beginning of the millennium. Lars et al. from Microsoft Research exploited this technique for SQL Server 2000.

The idea is to utilise existing materialised views (or indexed views in SQL Server) for query optimisation. For example, if you query "SELECT * FROM Cars WHERE Brand='BMW' AND Price BETWEEN 40000 AND 70000" and there is already an indexed view over Cars called v_LuxuryCars, defined as "SELECT * FROM Cars WHERE Price > 40000", the query optimiser can exploit v_LuxuryCars to execute the query. This view can be much smaller than the base table, so the query can run faster. The real beauty of exploiting materialised views becomes obvious in more complex queries, though.

The challenge is how to efficiently figure out whether a view can be used for the query; that is, whether the query predicate is a subset of the view definition predicate (as in the example above: Price BETWEEN 40000 AND 70000 is a subset of Price > 40000). This is not straightforward, since the predicates can be really complicated. For example, the query may have the predicate A=2 AND B=2, while the view definition has the predicate A=B; in that case the inference of the subset/superset relationship between the predicates is not obvious.
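To give a feel for the simplest possible case, here is a toy illustration of my own (not SQL Server's actual view-matching logic): deciding whether a single-column range predicate of a query is contained in the range predicate of a view. If it is, the view can answer the query, with the query's own (tighter) predicate re-applied on top of it.

// Toy range-predicate subsumption check (open/closed boundary details ignored for brevity).
using System;

public class Range
{
    public double? Low;    // null = unbounded below
    public double? High;   // null = unbounded above
    public Range(double? low, double? high) { Low = low; High = high; }

    // True when every value satisfying 'query' also satisfies 'view'.
    public static bool Subsumes(Range view, Range query) =>
        (view.Low == null || (query.Low != null && query.Low >= view.Low)) &&
        (view.High == null || (query.High != null && query.High <= view.High));
}

public static class Demo
{
    public static void Main()
    {
        var viewPredicate  = new Range(40000, null);    // v_LuxuryCars: Price > 40000
        var queryPredicate = new Range(40000, 70000);   // Price BETWEEN 40000 AND 70000
        Console.WriteLine(Range.Subsumes(viewPredicate, queryPredicate));   // True
    }
}

Real view matching must also handle joins, grouping, column equivalence classes (like the A=B example above), and the computation of residual predicates, which is where the complexity comes from.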

However, this problem is well studied for relational algebra, and even though Microsoft has been very active in developing view matching techniques, it is not used in the SQL Server core, possibly for performance reasons.

The reason I am interested in view matching is that I have started a project for view matching over the IQueryable interface. I will keep you posted on the progress.

What have the Brisbane floods got to do with databases? And why computer science can save lives!

In the last few days, Brisbane and Ipswich have faced devastating floods, in some areas up to 21 m. This means that some houses have been more than 17 metres under water! I have been isolated on a virtual island with no utilities for more than three days, and I have been extremely lucky that the flood water did not reach my house. The disaster was devastating across the state of Queensland, over an area as large as England. One-third of the city of Ipswich was under water. Unfortunately, the disaster has claimed 12 lives so far, but without technology this could have been far more.

At the University of Queensland, we have been negotiating a database project which, had it been done, could have made the destructive effect of the flood far smaller. The problem (basically a data quality related issue, and very hard to solve) is the old challenge of schema matching.

There are thousands of water monitoring systems all over Australia. They are all managed separately and do not have much in common in terms of IT. One might have an Excel sheet filled in manually, another may use a Linux tool to generate a bunch of txt files from measurements, and there are many Access databases, Oracle, SQL Server, etc., all with different designs and schemas. It is an extremely sparse sensor network. Obviously, standardising all the schemas is possible, but that is out of our hands and might be a very lengthy and expensive process. So what we could do here at UQ was to develop a smart system that looks at the thousands of different databases, matches their schemas, and provides a holistic, unified view over everything. This is a classic problem which has been worked on (in the computer science community, not industry) for three decades.
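To make "schema matching" concrete, here is a deliberately naive sketch of my own (not the proposed UQ system): pair up the columns of two tables by name similarity, which is roughly where most automatic matchers start before also using data types, value distributions, and constraints.

// Toy schema matcher: match columns across two tables by name similarity.
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveSchemaMatcher
{
    public static List<(string A, string B, double Score)> Match(
        IEnumerable<string> columnsA, IEnumerable<string> columnsB, double threshold = 0.6)
    {
        var matches = new List<(string A, string B, double Score)>();
        foreach (var a in columnsA)
            foreach (var b in columnsB)
            {
                double score = NameSimilarity(a, b);
                if (score >= threshold) matches.Add((a, b, score));
            }
        return matches.OrderByDescending(m => m.Score).ToList();
    }

    // 1 - normalized Levenshtein distance over lower-cased names.
    private static double NameSimilarity(string x, string y)
    {
        x = x.ToLowerInvariant(); y = y.ToLowerInvariant();
        if (x.Length == 0 || y.Length == 0) return 0;
        var d = new int[x.Length + 1, y.Length + 1];
        for (int i = 0; i <= x.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= y.Length; j++) d[0, j] = j;
        for (int i = 1; i <= x.Length; i++)
            for (int j = 1; j <= y.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1));
        return 1.0 - (double)d[x.Length, y.Length] / Math.Max(x.Length, y.Length);
    }
}

// Example: Match(new[] { "SiteID", "WaterLevel_m" }, new[] { "site_id", "water_level" })
// pairs SiteID with site_id and WaterLevel_m with water_level.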

Apparently this project did not go anywhere, but had it been won and finished gracefully, all the water movements could have been monitored within hours, flash floods could have been predicted, and lives could have been saved!

How to simulate bag access in Windows Azure table storage (Part 1)

It is nice to hear that Microsoft is providing table storage. Hopefully we can get it for non-Azure platforms as well. The idea is fast and scalable access to persisted objects without the limitations of the tabular world. No doubt relational databases are amazing and allow super complex queries and transactions. The downside is their complexity of design and usage: it tends to be extremely hard to provide truly scalable relational data while still satisfying service level agreements on response time, availability, etc.

Efforts to develop non-relational, non-schema-bound data stores are as old as databases, and in the cloud era they make so much sense. For example, Mnesia is a lovely database designed to work with Erlang, with a LINQ-like query language. Suffice it to say it was developed in the 80s, is easy to scale, and provides 100% uptime (you get a mechanism for hot patching). I also read about another database (RavenDB) a few days ago, which is built on a similar motivation.

One important thing to remember when working with non-relational databases is that they are not relational. Thus, you don't run SQL scripts against them, and there are no joins, no views, no foreign keys, and no primary keys; these terms make sense for tabular data. Databases like table storage are semi-structured data stores. Structured data is tabular, and relational data stores hold it. Semi-structured data is XML, JSON, or any other form of persisted object. Unstructured data is the web, free-form text, etc.

Mnesia (as a pioneer of table-storage-like databases) stores data in sets and bags. A set is a table in which each record has a unique key. Fair enough; we are used to working with tables with primary keys, which is the same thing. But a bag is a table in which many records can share a key, hence there might be no way to access a single row of the table, because it does not have a unique key. (You may say now, WTF? What happened to my candidate keys and primary keys? And my answer is: wait a minute. We are not in the relational world, so none of those terms exist here.)

So what is the value of having a row in a table which we cannot access directly? It does, of course, have some value. Bearing in mind again that table storage is not relational, a good design paradigm is to NEVER query anything except the key (and, of course, the partition key for table storage). Any other query (one not bounded to the partition key in table storage) is similar to a full table scan in your SQL Server database, and a full table (or index) scan is THE killer. You can never become scalable if a single operation does a full table scan over your growing data.
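One plausible way to simulate a bag on top of table storage's unique (PartitionKey, RowKey) requirement, sketched here ahead of the details in the next part and using my own naming, is to put the shared "bag key" in the partition key and generate an artificial unique row key per entry; retrieving the whole bag is then a single partition-key-bounded query.

// Sketch only (no storage SDK calls shown): a Mnesia-style bag entry for table storage.
// All entries sharing a logical bag key live in one partition; the row key is made unique
// with a GUID, so individual rows stay addressable even though the logical key repeats.
using System;

public class BagEntry
{
    public string PartitionKey { get; set; }   // the shared, non-unique "bag" key
    public string RowKey       { get; set; }   // artificial unique suffix
    public string Payload      { get; set; }   // the persisted object (e.g. JSON)

    public static BagEntry Create(string bagKey, string payload) => new BagEntry
    {
        PartitionKey = bagKey,
        RowKey = Guid.NewGuid().ToString("N"),
        Payload = payload
    };
}

// Reading the whole bag is then a query bounded to one partition key
// (e.g. "PartitionKey eq 'sensor-42'"), which table storage serves efficiently,
// in contrast to an unbounded query that would scan the entire table.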

To be continued…