Woohoo! Carry Your Shelf on a Flash Drive!

Ever fancied carrying your shelf on a flash drive or putting it in your Dropbox?

Portable Version is a portable tool to synchronize any folder with your “Version Controlled” folder (currently only TFS).

Want to take your solution home, put it on Dropbox, and work freely from anywhere? Then check back in and resolve conflicts when you return?

I initially wrote this little tool for my own use but you may also benefit from it.

A gesture recognition database for Kinect!

I was at a demo of the Kinect SDK yesterday and was thrilled with the technology. Kinect is a wonderful piece of hardware that puts together a lot of cutting-edge technology and some pretty amazing algorithms. The infrared camera does real magic and picks up a 3D image: a good estimation of depth for each pixel.
If you have ever spent time on image processing, finding blobs or recognizing objects with pattern recognition techniques, you know what I mean. Having such an accurate estimation of depth for each pixel can revolutionize any object recognition algorithm.
After the session I was deep in thought about how I could do something quick and useful with Kinect! Of course everyone else was probably thinking the same, but as Kinect is designed for gaming, people tend to do fun, cool stuff with it. However, I am not a big fan of computer games, so I thought I should do something different.

The first thing that came to my mind was, of course, using the depth camera for 3D object recognition and building a robot that moves around the house with a Kinect connected to a laptop. That could be an interesting idea, but it would surely be too much work for no real gain, except a bit of self-satisfaction.
My second thought was to do something with gesture recognition, like a cloud-based gesture database and a bunch of algorithms to recognize a gesture. Yes, this idea sounds promising for a bunch of reasons: First, it does not involve any screwdrivers (which I hate). Second, it is a database. Third, if it is done properly, people may actually use it. Fourth, I don’t actually need a Kinect for 98% of the work! (UPDATE: Check out this amazing free robot simulator by Microsoft and the Channel 9 talk, where you can do all the robotic things without having a Kinect or touching any screws or wires.)

Gesture Recognition:
——————————
Gesture recognition techniques are as old as computers, and there has been an enormous amount of research on them. This IEEE Transactions survey paper is a short survey of the most useful and handy gesture recognition techniques. Although there is a big body of knowledge in the scientific community, I am not going to go that way, so don’t panic. First of all, I am not interested in doing a research degree in computer vision, and second, I prefer a different look at the problem. A developer’s look.

Why not existing gesture recognition techniques:
———————————————————————–
The reason I think I don’t need to adopt existing techniques is that these scientists did not have a Kinect, so they could not look at the problem the way we can. Second, scientists deliberately make the problem artificially harder to get a better publication; we want to make the problem easier to get the app working.

Why not machine learning?
—————————————
Don’t get me wrong. Machine learning is amazing, but not for a typical development project, for a variety of reasons. First, machine learning is really hard to implement well; even though there are a few projects that can be utilized for this purpose (like Google’s Prediction API [http://code.google.com/apis/predict/]), it is still very hard to get something useful out of it. Second, how would you test a project that depends on some machine learning? Third, how could you debug or fine-tune your solution? You will be very limited to the machine’s suggestions and, believe me, the machine’s suggestions can sometimes really piss you off!

So what?
————
So let’s plan a good approach for a gesture database. I would like a simple gesture database that works reasonably well most of the time. I am not planning to beat any existing techniques, but I need to work within the following constraints:
1- Gesture data should be recorded in Sql Server.
2- Finding a gesture among tens of thousands of gestures should happen instantly. This implies some sort of index should exist.
3- A developer should be able to define a gesture purely using code (or strings). This is necessary for testing your solution.
4- No crazy algorithms. Beauty is in simplicity.
5- You don’t need to have a Kinect to work with the product.

My suggestion:
——————————

I already have a suggestion for solving this problem; it is not tested, but I think it can work. Let’s build up a bit of theory around the story.
What is a gesture? A gesture is a bunch of movements of a bunch of body parts that convey a meaning.
What is a gesture, but simpler? A gesture is a pattern of movement of a bunch of points over time, starting from a starting point.
What is a gesture? Please make it simple. A gesture starts with a pose. Then each point of the pose animates in a pattern.
What is a pose? If I am standing with my hands up, without moving, I am in a pose called Hands Up!
OK! Then how do you make a gesture out of a pose? If I am in a pose (e.g. hands up) and start moving my hands in a left-right pattern for a while, I am making a waving gesture.
Good! Now I understand gestures.

Let us assume a pose is the location of a bunch of points in space. We define some basic terms for further use:
Space: Let us define space as the area that a human body fits in with open hands. This is of course relative to the body, but we normalize that area for every person into a cube.

Pose: A collection of named points (like hand, head, elbow, knee, etc.) in space. The points may be defined with absolute positions or with positions relative to each other.

Perspective: This is the key to the rest of this approach. A pose can have multiple perspectives. Each perspective is a mapping (or projection) of the pose onto one plane of the space. For example, one perspective may show a flattened 2D image of the 3D body by projecting each point from the space onto the plane. We need three 2D perspectives of a pose:

Top perspective (looking at the person from above),
Left perspective (looking at the person from the left), and
Front perspective (looking at the person from the front).

For each 2D perspective, we generate two 1D perspectives:
Horizontal perspective (x positions of the points), and
Vertical perspective (y positions of the points).

Using the above perspectives, a shooting-man pose can be expressed with six 1D perspectives (two 1D perspectives times three 2D perspectives).
Each 1D perspective can be described with a simple string:
Shooting person:
Perspective: Left.vertical = “F.H1.H2”
Perspective: Left.horizontal = “H1.H2-F”

Let us use the following conventions:
F stands for “Face”
H1 stands for “right hand”
H2 stands for “left hand”
period (.) stands for a very short distance
dash (-) stands for an average distance
underscore (_) stands for a long distance

Hence, the Left.vertical perspective of the shooting man says: looking at the person from the left, there is a face, a right hand very close to it, and a left hand very close to that.
The Left.horizontal perspective of the shooting man says: looking at the person from the left, the right hand and the left hand are very close together, and the face is an average distance away from them.

Bingo! We have a language to define a pose!

Let’s assume the language is good and the string can really be used to define a pose. Of course it is not tested yet, but I would be really thankful if someone experimented with a few real pictures.

Now that we can define poses, how can we find a similar pose from a database quickly?
Assuming the pose language is useful, this is not an issue at all. The problem is similar to the fuzzy string matching problem: a pose is in fact a string, and you want to find other strings approximately similar to it (for example, if the positions of the left and right hands of the shooting person are slightly different, it is still a shooting pose).
It is not really a hard problem, trust me. It can be solved using the basic q-gram technique. The only difference is that we may need a specific equality comparison logic, which can easily be implemented with a CLR type definition:
Here is the equality comparison logic:
H1.H2 is equal to H2.H1 (if two points are very close together, we can use them interchangeably)
H2-H1 is not equal to H1-H2 (if the points are distant enough, they can’t be interchanged)
The user may define options for equality, for example if the distinction between the left hand and the right hand does not matter to us, we can set Options.LeftAndRightHandsAreEqual!
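
To make the comparison concrete, here is a minimal, untested C# sketch of that logic; all the names are mine (hypothetical), not part of any existing library. It normalizes a 1D perspective string so that points separated by a period are order-insensitive, while points separated by a dash or underscore keep their order:

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PoseComparer
{
    // Normalize a 1D perspective string (e.g. "F.H1.H2") so that tokens joined by '.'
    // (very close points) are sorted and therefore interchangeable, while '-' and '_'
    // (average / long distances) keep their original order.
    public static string Normalize(string perspective)
    {
        // Split on '-' and '_' but keep the separators so the distances survive.
        var parts = Regex.Split(perspective, "([-_])");
        var normalized = parts.Select(p =>
            p == "-" || p == "_"
                ? p
                : string.Join(".", p.Split('.').OrderBy(t => t, StringComparer.Ordinal)));
        return string.Concat(normalized);
    }

    public static bool AreEquivalent(string left, string right) =>
        Normalize(left) == Normalize(right);
}

// PoseComparer.AreEquivalent("H1.H2-F", "H2.H1-F") -> true  (close points swap freely)
// PoseComparer.AreEquivalent("H2-H1",   "H1-H2")   -> false (distant points do not)

The same normalization could live inside a SQL CLR user-defined type or function so the comparison happens right in the database.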

I think that is enough for now. Using the above technique we can have a pose database and a fast method to search for a given pose among millions of poses and find similar poses in a matter of milliseconds. We can do this for all the perspectives and shortlist the intersection for much better search quality, as the snippet below sketches.
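
As a rough illustration of that shortlisting step (all the ids below are made up), the candidate sets coming back from each perspective’s q-gram search can simply be intersected:

using System.Collections.Generic;
using System.Linq;

// Hypothetical candidate pose ids returned by the q-gram search for three perspectives.
var byFrontVertical  = new HashSet<int> { 12, 40, 77, 102 };
var byLeftHorizontal = new HashSet<int> { 12, 77, 300 };
var byTopHorizontal  = new HashSet<int> { 12, 77, 500 };

// The shortlist keeps only the poses that matched in every perspective.
var shortlist = byFrontVertical
    .Intersect(byLeftHorizontal)
    .Intersect(byTopHorizontal)
    .ToList(); // -> { 12, 77 }
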
The next challenge (apart from implementing and testing the technique) is to define the gesture on top of the pose.

Ever wondered how text search works? Let’s do it!

You have certainly seen fuzzy text search everywhere. For example, you type “stck” but you actually mean “stack”! Ever wondered how this stuff works?

There are plenty of algorithms for fuzzy text matching, each with its own pros and cons. The most famous ones are edit distance and q-grams. I want to focus on q-grams today and implement a sample.

Basically, q-grams are the most suitable fuzzy string matching technique for relational databases. It is pretty simple: the “q” in q-gram is replaced with a number, as in 2-gram, 3-gram, or even 4-gram.

2-gram means that every word is broken into a set of two-character grams. “Stack” will be broken into the set {“st”, “ta”, “ac”, “ck”}, and “database” will be broken into {“da”, “at”, “ta”, “ab”, “ba”, “as”, “se”}.

Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example, if the user mistyped “stck”, a plain search for “stck” will not match “stack” because the “a” is missing, but the 2-gram set {“st”, “tc”, “ck”} has 2 rows in common with the 2-gram set of “stack”! Bingo, we found a pretty close match. It has nothing in common with the 2-gram set of “database” and only 1 gram in common with the 2-gram set of “star”, so we can easily suggest to the user that they probably meant “stack” first, or “star” second.
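
Before going to the database, here is a tiny C# sketch of the idea (the names are mine, not from any library): break both strings into 2-grams and count how many they share.

using System;
using System.Collections.Generic;
using System.Linq;

static class TwoGrams
{
    // Break a word into its set of 2-character grams, e.g. "stack" -> { "st", "ta", "ac", "ck" }.
    public static HashSet<string> Of(string word) =>
        Enumerable.Range(0, Math.Max(0, word.Length - 1))
                  .Select(i => word.Substring(i, 2))
                  .ToHashSet();

    // Number of shared 2-grams; the higher the count, the closer the match.
    public static int Overlap(string a, string b) => Of(a).Intersect(Of(b)).Count();
}

// TwoGrams.Overlap("stck", "stack")    -> 2  (shares "st" and "ck")
// TwoGrams.Overlap("stck", "star")     -> 1  (shares "st")
// TwoGrams.Overlap("stck", "database") -> 0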

Now let’s implement it using Sql Server. Assume a hypothetical words dataset. You need a many-to-many relationship between 2-grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

The Grams table should be clustered first on twog and then on wordId for performance. When you query a word (e.g. “stack”) you put its grams in a temp table. First, let’s create a few million dummy records.

-- make millions of 2-grams
DECLARE @i int = 0
WHILE (@i < 5000000)
BEGIN
    -- a random 2-gram
    DECLARE @rnum1 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    DECLARE @rnum2 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    INSERT INTO Grams (twog, wordId) VALUES (@rnum1 + @rnum2, CAST(RAND()*100000 AS int))
    SET @i += 1
END

Now let’s query the word “stack”, which is broken into the 2-grams {‘st’, ’ta’, ’ac’, ’ck’}.

DECLARE @word TABLE (twog char(2)) -- 2-grams of 'stack'
INSERT INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

-- words sharing the most 2-grams with 'stack' come first
SELECT g.wordId, COUNT(*) AS matches
FROM @word w
INNER JOIN Grams g ON w.twog = g.twog
GROUP BY g.wordId
ORDER BY COUNT(*) DESC

You should make sure that Sql Server uses a bunch of clustered index seeks (or lookups) to run this query. That should be the natural choice, but sometimes statistics are corrupted or out of date and Sql Server may decide that a full scan is cheaper. This usually happens when it does not know the cardinality of the left-side table; for example, Sql Server may assume that the @word table is massive and that millions of lookups are going to be more expensive than a full index scan.

Freebase

Freebase is by far the most flexible database I have heard of. It is a very cool concept indeed, and it allows amazing discoveries. It also provides a free service for querying and inserting data.
The Freebase schema is this: Domain, Type, and Property. That’s it. And with this little schema you can have everything, absolutely everything. Of course, flexibility comes with a big price tag: the cost of managing things we take for granted in our RDBMSs. For example, assuring the integrity of the data sounds very challenging. This doesn’t make the Freebase style the schema of choice for the enterprise, but wait a minute! Can enterprises really not benefit from such a database? Sure they can, and I argue it is a necessity for any big enterprise.
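
To give a feel for how little schema that really is, here is a hypothetical C# sketch (my own names, not Freebase’s actual API or data) of storing facts as domain/type/property assertions:

using System.Collections.Generic;

var facts = new List<Assertion>
{
    new("music", "artist", "Pink Floyd", "name",   "Pink Floyd"),
    new("music", "artist", "Pink Floyd", "formed", "1965"),
    new("music", "album",  "The Wall",   "artist", "Pink Floyd"),
};
// New "tables" or "columns" appear simply by asserting new Type or Property values;
// the trade-off is that nothing stops you from asserting nonsense, which is exactly
// the data integrity challenge mentioned above.

// One row per asserted fact: which domain and type the subject belongs to,
// which property is being set, and what its value is.
record Assertion(string Domain, string Type, string Subject, string Property, string Value);
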
I will keep it short for today and revisit the topic later.

Essentials of setting up a well-performing production Sql Server instance

To set up a Sql Server instance in production, it is better to get it right the first time. Obviously, you can’t plan for the best performance when you don’t know exactly how your data will grow, but there are a few key points that won’t change regardless of your data growth pattern, and you had better get them right from day one. Two components of a typical Sql Server system need special attention: first the log file, and second tempdb.

Sql Server writes every transaction to the log file instantly as it happens, and then writes the changes to pages. Note that pages might not be written back to disk for a while; if the server crashes before a page is written down, you can recover from the log file. Tempdb is also very important because Sql Server does all its internal work in tempdb: things like index or table spooling, row versioning, etc. all happen in tempdb. Below is a list of precautions you can take to architect a well-performing, mid-size Sql Server database (and by mid-size I mean anything above 1GB; for larger databases, say above 20GB, you also have to think about partitioning, filegroups, etc. as a minimum).

1- Keep the log file on a separate, dedicated, fast drive. If you are using a virtual host, make sure you get a separate physical drive for ONLY your log file.

2- Run the SQL Server service using a specifically created account, not the generic service or network accounts. Make sure nothing else (especially other service accounts) has any read or write access to the drive holding the log file. This will ensure the disk head does not muck around, so that Sql Server can write its log at the advertised sequential write speed of the disk.

3- Use log shipping and log-based replication with caution, because they read from the log file and can make the disk head dance around instead of running.

4- Put tempdb on a very fast drive, possibly an SSD; at the very least make sure its drive is separate from the database file, the log file, and even the operating system’s drive.

5- Make sure you have enough RAM. RAM is cheap, but its effect on performance is bigger than anything else. Run a task that periodically monitors the “SQLServer:Buffer Manager – Page life expectancy” counter, and add more RAM as the value starts heading south (see the sketch below).
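
Here is a minimal sketch of such a monitoring task in C#; it assumes the Microsoft.Data.SqlClient package and a connection string of your own, but the DMV query itself is standard:

using System;
using Microsoft.Data.SqlClient; // assumes the Microsoft.Data.SqlClient NuGet package

// Hypothetical connection string; point it at the instance you want to watch.
const string connectionString = "Server=.;Database=master;Integrated Security=true;TrustServerCertificate=true";

const string query = @"
    SELECT cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Manager%'
      AND counter_name = 'Page life expectancy'";

using var connection = new SqlConnection(connectionString);
connection.Open();
using var command = new SqlCommand(query, connection);
var pageLifeExpectancy = (long)command.ExecuteScalar();

// The old rule of thumb is roughly 300 seconds; a value that stays well below that
// suggests the buffer pool is under memory pressure and more RAM would help.
Console.WriteLine($"Page life expectancy: {pageLifeExpectancy} seconds");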

Obviously, the above considerations will not make a slow, dead database rock and roll, but they will ensure that the environment’s effect on performance is minimal. Then you can use your smart SQL techniques to make the database fast.

Prevent Unauthorized Modifications in SAF

There is currently an issue with SAF (Simple Authorization Framework) when it is used for entities in Entity Framework. The issue is that when SAF reads an entity and tests it against the user principal, it sets the properties the user has no read access to, to null. There are also properties the user has no edit access to; Saf can notify the client about them, but at this stage it does nothing to prevent an unauthorized edit on objects with partial permissions.

For example, if you modify a book object where you can edit the title but are not allowed to edit the price, there is no check on the server to make sure you haven’t changed the price. I call this partial edit access. If you don’t have edit access to the book object at all, then Saf prevents you from changing anything.

Supporting partial edit access for Entity Framework is not hard, but you should be very cautious about doing so because of the possible performance implications. In a 3-tier architecture you don’t keep the object context alive for the whole session, so the original object is not kept anywhere when the service is called back for an update. The way EF ensures optimistic concurrency is that it compares all values inside the SQL update statement for the fields with ConcurrencyMode set to Fixed, to ensure the data has not been modified in the background; if there has been any modification to the data, a conflict exception is thrown and the user has the chance to resolve the conflict.

However, we don’t have the luxury of implementing the authorization checks inside the SQL statement, because Saf does not know how to create SQL queries. Instead, we have to load the actual object from the data store and ensure that the modifications are valid. This is possible by using the GetObjectByKey method on the ObjectContext:


var oc = ...; // the object context
var storeObject = oc.GetObjectByKey(((IEntityWithKey)modifiedObject).EntityKey);
// Compare the properties of storeObject with modifiedObject to ensure no unauthorized
// change has happened.

The above code can compare the store values with the current values for the properties the user has no edit permission on. Obviously there is a performance hit with this approach, because an extra query is sent to the database for every modification that involves partial edit or view access on an object. There is also another issue: the logic here conflicts with the optimistic concurrency check. For example, if the user has no view access to the price of the book, Saf will put null in the price field. When the data is sent back to the server, the context assumes the object has been modified and tries to write back the null value. If we replace the null with the current store value, the unauthorized property changes are reverted, but what happens if someone else changed those properties in the meantime? There should be a conflict, but we have overwritten that conflict with our authorization logic.

Therefore, I am still not sure whether I should implement this as part of the Saf solution.

An alternative solution is to check the modified properties using code similar to this:

var unauthorizedProps = ...; // list of unauthorized property names
var modifiedProps = objContext.ObjectStateManager
    .GetObjectStateEntry(modifiedObj)
    .GetModifiedProperties()
    .ToList();

// Mark the whole object unchanged, then re-mark only the authorized properties as modified.
objContext.ObjectStateManager.ChangeObjectState(modifiedObj, EntityState.Unchanged);
foreach (var prop in modifiedProps.Except(unauthorizedProps))
{
    objContext.ObjectStateManager.GetObjectStateEntry(modifiedObj).SetModifiedProperty(prop);
}

The above code finds all the modified properties, filters out the unauthorized ones, marks the entity as unchanged, and then marks only the authorized properties as modified. This performs much better than the previous approach. One of these approaches may eventually find its way into Saf if there is no better way to manage this issue.

Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both of the above queries return a join of cars and resellers for V8 cars, and obviously both queries return the same results. But does that mean there is no difference between these two ways of limiting the results with conditions?

Although for an inner join you can use both approaches interchangeably, for an outer join a subtle difference can catch you out. What would you expect to be the outcome of the following query?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

If you expect to get all the cars with a V8 engine left-joined to their resellers, you won’t be happy with the result, because a car that is not a V8 will still appear in the output (with NULL reseller columns): the predicate only filters the join, not the cars. You have to move c.Engine = 'V8' to the WHERE clause to guarantee the correct behaviour.


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

The reason for this behaviour is that Sql Server first applies the join predicates and then adds back the left-side rows that have no right-side match. To understand the behaviour better, think about the way you write a left join in LINQ.


var res = from c in Cars
          join re in Resellers on c.CarId equals re.CarId into rs
          from r in rs.DefaultIfEmpty()
          select ...;

A left join in Sql Server is calculated the same way. In the above example you first run the join between Cars and Resellers (grouped into rs), and then select null whenever the resellers group for a car is empty (i.e. rs.DefaultIfEmpty()).

A little puzzle

A chicken and a half lay an egg and a half in a day and a half. How many eggs would one chicken lay in three days?

Help: 3 is not the correct answer.

Top K Query Processing, or FULL SCAN IS THE DEVIL! Avoid it.

Almost always, users don’t want to see all the results coming back from a query. The query result set is usually restricted by conditions, and the first page (or first few pages) of the response is all the user wants to see. If you asked me to describe database optimization in one sentence, it would be: avoid full table/index scans. This is the single key to highly performing database applications. However, avoiding full scans is not trivial at all; in fact, it is a hard problem. If your table has n records, a full table scan has complexity O(n). A join between two tables with full scans has complexity O(n^2), and so on. It can get out of control pretty easily.
Have you been in a situation where a tiny 10GB database with just a few million records takes 10 minutes to respond to a query?
Ever wondered how Google returns search results from its supermassive tables for millions of users on each keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be on its knees when a complex query full-scans massive tables. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever it takes to avoid full scans on large data sets, especially if the query runs frequently. Obviously there is no problem with a query that takes 5 minutes once every weekend, but a frequently used query should never take more than a few seconds. There is a lot you can do to avoid full scans, and it is well covered in the query processing literature. Read this book to learn some techniques for doing this in Ms Sql Server 2008.

Despite all the powerful tools and smartness that SQL Server or any other commercial DBMS provides, there are cases where top-k query processing is simply not possible by writing SQL alone. That is when you need to know about the algorithms. Yes, some computer science fun! Top-k query processing is a widely studied subject, and there are plenty of methods and techniques that can drastically improve application performance. Unfortunately, most commercial DBMSs do not support even the easiest top-k query processing techniques and algorithms, so you have to code them yourself if you decide to use them. A few top-k techniques like TA and PRA are quite quick and easy to develop (see the sketch below), but others are harder and much more complex. This 2008 ACM survey is an excellent source for gaining a good understanding of top-k query processing, thanks to the guys at the University of Waterloo, ON, Canada.
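
For a flavour of what these algorithms look like, here is a minimal, untested C# sketch of the Threshold Algorithm (TA) for a top-1 query over a monotone score a*X + b*Y; it assumes the rows are already available sorted descending by X and by Y (think: two indexes) plus random access to the other attribute by key. All names are mine, not from any library:

using System;
using System.Collections.Generic;

static class ThresholdAlgorithm
{
    // Top-1 by score = a*X + b*Y (a, b > 0) without scanning all rows.
    public static (int id, double score) TopOne(
        (int id, double x)[] byXDesc,               // rows sorted by X descending
        (int id, double y)[] byYDesc,               // rows sorted by Y descending
        IReadOnlyDictionary<int, double> xById,     // random access: X by row key
        IReadOnlyDictionary<int, double> yById,     // random access: Y by row key
        double a, double b)
    {
        var best = (id: -1, score: double.NegativeInfinity);

        for (int depth = 0; depth < byXDesc.Length; depth++)
        {
            // Sorted access on both lists in lock-step.
            var (idX, x) = byXDesc[depth];
            var (idY, y) = byYDesc[depth];

            // Random access to complete each newly seen row, then score it.
            var scoreX = a * x + b * yById[idX];
            if (scoreX > best.score) best = (idX, scoreX);

            var scoreY = a * xById[idY] + b * y;
            if (scoreY > best.score) best = (idY, scoreY);

            // No unseen row can score higher than the threshold at this depth,
            // so once the best row seen so far reaches it we can stop early.
            var threshold = a * x + b * y;
            if (best.score >= threshold) break;
        }

        return best;
    }
}

With X and Y each served by an index (sorted access) and key lookups serving as random access, the loop typically stops after touching only a small prefix of each list, which is exactly the point.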

To be precise, my message in this post is that there is no excuse for long-running queries!

Are you a super smart senior consultant? See if you can pass my interview.

I have been looking around the market to change my job over the last few weeks and have come across very interesting stuff that is worth writing a book about, or at least a short story.

I have seen a Christian who wanted to develop the ultimate database inspired by Jesus, have been to interviews where the company did not actually have a position yet, and I have met people who think they are super smart senior consultants because they know how to prevent an object from being finalized by the garbage collector.

If you are one of those super smart senior consultants who knows everything, see if you can answer this very basic, classic database question. If you can’t, you should seriously rethink your title. Seriously, I can’t believe someone could be developing business applications for more than a decade and not be able to solve this problem in an interview:

Q. We have a couple of tables in a database. The first table has column X and the second table has column Y. The first table has a foreign key relationship to the second table. How would you calculate “SELECT TOP 1 * FROM X JOIN Y … ORDER BY a*X + b*Y DESC” without a full scan on any table, assuming a and b are arbitrary positive numbers provided by the user each time? You don’t necessarily have to solve it with only SQL; you are allowed to use C#. The above query is only there to define the problem.

Justification: why do I think this is a valid interview question? Two reasons: 1) Think about how many times you have faced a client complaining about performance, and how many times you have been in a situation where knowing how to stop an object from being finalized made the customer happy. 2) How easy is it to pick up knowledge of fast data-processing algorithms on site, compared with some rare piece of technical terminology? Google “avoid an object being finalized” and you will get your answer in a few seconds, but I would be interested to know how you would use Google to solve this problem.