Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both queries return a join of cars and resellers for V8 cars, and both return the same results. But does that mean there is no difference between these two ways of limiting the results?

Although for an inner join you can use both approaches interchangeably, for an outer join a subtle difference can catch you. What would you expect to be the outcome of the following query?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

If you expect to get all the V8 cars left joined to their resellers, you won't be happy with the result: a car that is not a V8 will still appear in the output, just with NULL reseller columns, because a LEFT JOIN keeps every row of the left table regardless of the join predicate. You have to move c.Engine = 'V8' to the WHERE clause to guarantee the correct behaviour.


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'
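
To see the difference concretely, suppose Cars holds a V8 (CarId 1) and a V6 (CarId 2), and only the V8 has a reseller (a made-up data set, just for illustration). With c.Engine = 'V8' in the ON clause, the LEFT JOIN still returns both cars; the V6 simply gets NULL reseller columns, because failing the join predicate is not the same as being filtered out:

CarId  Engine  ResellerId
1      V8      10
2      V6      NULL

With the condition in the WHERE clause instead, the V6 row is filtered out after the join, which is usually what you meant:

CarId  Engine  ResellerId
1      V8      10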

The reason for this behaviour is that SQL Server first applies the join predicate, and then adds back the left-side rows that found no match on the right, padding them with NULLs. To understand the behaviour better, think about the way you write a left join in LINQ.


var res = from c in Cars
          join r in Resellers on c.CarId equals r.CarId into rs // group join: all resellers per car
          from r in rs.DefaultIfEmpty() // null reseller when the car has no resellers
          select ...;

A left join in SQL Server is calculated the same way. In the above example, you first run the join between Cars and Resellers, then select NULL when the resellers collection is empty for a given car (i.e. rs.DefaultIfEmpty()).

Automatic Mid-tier Cache!

I have been thinking about this for a while. I started work on a Silverlight project last year, a nifty little business app. One of those apps that just sits there and works, you know: a web server, an average DB with a few million records, and a Silverlight client with a bunch of forms, grids, and menus.

Everything was straightforward and brainless: just follow a pattern and do the job. However, I was deeply dissatisfied with one aspect of the project: the caching. What I could not accept was that in the 21st century, when you have Entity Framework and IQueryable, you should still manually cache your data where appropriate and do all the pointless work of cache invalidation, loading, and so on. Apart from the pain of working on something that should have been automated, I wouldn't trust a programmer to decide which parts of the data should be cached and which shouldn't. Not that I don't believe they could do a good job of it; they simply don't have enough information (at development time) to decide.

A caching strategy should be based on user behaviour, which is subject to change over time. For example, at some stage lots of QLD pharmacies are queried, but next week NSW users decide to get ready for their conferences and start hammering the system for NSW pharmacies.

Let me be clear about my expectations of a caching system. It should have the following characteristics (a sketch of one possible API follows the list):

  1. It should know what users are going to query a lot and cache that (and only that) part of the database.
  2. It should be able to re-use the caches. For example, if I say 1. "Give me all QLD pharmacies", and the next user says 2. "Give me all QLD Chemists Warehouses", the cache manager should be smart enough to run the new query 2. over the results of query 1., which were retrieved a few minutes ago.
  3. It should optimize the indexes for performance based on the user queries.
  4. It should change the cache when user behaviour changes.
  5. It should call back to the database only if there is absolutely no way of answering the query from the cache.
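
To make these expectations concrete, here is a minimal sketch of the surface such a cache could expose. Everything below is hypothetical, my own naming rather than any existing API: the client hands over an ordinary IQueryable, and the cache decides whether to answer it from memory, rewrite it over an already-cached result, or fall through to the database.

using System.Linq;

public interface IAutoMidTierCache
{
    // Returns an equivalent query, re-targeted at cached data when this
    // query (or a superset of it) has been answered recently; otherwise
    // the returned query still runs against the database.
    IQueryable<T> Answer<T>(IQueryable<T> query);

    // Requirement-driven eviction: an upper bound on the cost (size) the
    // cache may spend on materialised results.
    long CostUpperBound { get; set; }
}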

The above requests may seem like a lot, but not really in 2011. All of this is possible; in fact, DBMSs have been doing this kind of thing for ages. We also have IQueryable, which makes it even easier to build a decent caching system.

So let me write a few examples:

Q1: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").Select(…)

Q2: Pharmacies.Join( … Address …).Join( … State …).Where( a => a.State == "QLD" && a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").GroupBy(…).Where( pg => pg.Count() > 4).Select(…)

Q4: PharmacyStaff.Where(ps => ps.Position == "Manager").Select(…)

Q5: Pharmacies.Join( … Address …).Join( … State …).Join(…PharmacyStaff…).Where( s => s.State == "QLD" && s.Position == "Manager" ).Select(…)

Users log in to our system and do things that cause the above queries to be issued. Normally they would all be run against the database, but that means our caching strategy is as stupid as a donkey. What I would really expect is that only Q1 and Q4 are run against the database. Q2, Q3, and Q5 can all be derived from Q1 and Q4, so if we already have those results, it is such a waste to run the new queries against the database. Why not look at the expression tree and figure out that Q2 forms a query which is a subset of Q1? Then change the queries as below:

Q1: not changed…

Q2: Q1.Where(a => a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Q1.GroupBy(…).Where(…)

Q4: not changed…

Q5: Q1.Join(…Q4…).Select(…)

Check out the above queries. Aren't they much better? We shouldn't expect the user or the programmer to waste time translating those queries; the caching system should do it. It should be an IQueryable that reads the expression tree and translates it into a new expression tree that uses existing data in the cache whenever there is no need to query the database.
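
As a taste of what that translation could look like, here is a minimal LINQ-to-Objects sketch. It deliberately skips the hard part, proving that Q2 is a subset of Q1, and assumes we already know it is; it merely re-targets the expression tree of an incoming query from the database source to the cached rows. All type and member names are invented for illustration.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical entity for the demo.
class Pharmacy { public int Id; public string State; public int PostCode; }

// Swaps the ConstantExpression that represents the query's root source.
class SourceSwapper : ExpressionVisitor
{
    readonly object _from; readonly IQueryable _to;
    public SourceSwapper(object from, IQueryable to) { _from = from; _to = to; }

    protected override Expression VisitConstant(ConstantExpression node) =>
        ReferenceEquals(node.Value, _from) ? Expression.Constant(_to) : base.VisitConstant(node);
}

static class CacheRewriter
{
    // Re-target 'query' (written against 'source') so it runs over 'cache'.
    public static IQueryable<T> OverCache<T>(IQueryable<T> query, IQueryable<T> source, IEnumerable<T> cache)
    {
        var cached = cache.AsQueryable();
        var rewritten = new SourceSwapper(source, cached).Visit(query.Expression);
        return cached.Provider.CreateQuery<T>(rewritten);
    }
}

class Demo
{
    static void Main()
    {
        // Stand-in for the database table.
        IQueryable<Pharmacy> db = new[]
        {
            new Pharmacy { Id = 1, State = "QLD", PostCode = 4005 },
            new Pharmacy { Id = 2, State = "QLD", PostCode = 4100 },
            new Pharmacy { Id = 3, State = "NSW", PostCode = 2000 },
        }.AsQueryable();

        // Q1 ran earlier; its results now sit in the cache.
        var qldCache = db.Where(p => p.State == "QLD").ToList();

        // Q2 arrives, written against the database as usual...
        var q2 = db.Where(p => p.State == "QLD" && p.PostCode > 4000 && p.PostCode < 4079);

        // ...but since Q2 is a subset of Q1, we answer it from the cache.
        var answered = CacheRewriter.OverCache(q2, db, qldCache);
        Console.WriteLine(answered.Count()); // 1 (only PostCode 4005 qualifies)
    }
}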

This especially makes sense in the cloud, where you have to pay for querying your SQL Azure database.

Enough talking about dreams; let's get realistic! I did a bit of research and, as I expected, no such caching manager exists (if you know of one, tell me and save my hair). So I decided to do it myself. Check out the Auto-mid-tier-cache project, which I have already started. I haven't gone far with it yet. It is just a proof of concept and implements no IQueryable. It uses a set of objects I defined myself for relational algebra operators. It does the very basics of view matching to find which query is a subset of which other, and it is able to translate queries to run against the database or the cache alternatively.

I ran it and it worked fine, and a bunch of benchmarks demonstrated its effectiveness. What is left now is to complete the view matching and write an IQueryable on top of it. A lot of work, but it is worth it.

I forgot to say that you can limit the cache size by setting a cost upper bound. Another issue is that the cache does not keep itself up to date, but that is really another story.

Building an IQueryable Relational Algebra Provider

On the journey of implementing the view matching technique for IQueryable, we first need to simplify the language. The only language I can implement query optimisation with is relational algebra. What I am going to explore now is the possibility of implementing a LINQ relational algebra provider as a common language for every query optimiser and, hence, for the view matcher I am going to build.

We want to reduce everything down to three relational algebra operations:

  • Selection
  • Projection
  • Join

Subqueries are also an important topic, but for simplicity let's not support them.

Initially we should prepare a framework to convert as much as possible into the above three operators. Then we need to implement a few basic rewrite rules (a sketch follows the list):

  • Selection: Commutativity, Selection Pushing, Selection Splitting
  • Projection: Projection Pushing
  • Join: Commutativity, Non-associativity
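
To give a flavour of what that framework could look like, here is a tiny sketch of the operator tree and one of those rules, selection splitting. The types are illustrative stand-ins I made up, not the project's real ones.

using System;

// Minimal operator tree for the three operators.
abstract class Rel { }
class Table : Rel { public string Name; public override string ToString() => Name; }
class Selection : Rel { public Pred P; public Rel In; public override string ToString() => $"SELECT[{P}]({In})"; }
class Projection : Rel { public string Cols; public Rel In; public override string ToString() => $"PROJECT[{Cols}]({In})"; }
class Join : Rel { public Rel L; public Rel R; public string On; public override string ToString() => $"({L} JOIN[{On}] {R})"; }

// Predicates: an atomic comparison or a conjunction of two predicates.
abstract class Pred { }
class Atom : Pred { public string Text; public override string ToString() => Text; }
class And : Pred { public Pred L; public Pred R; public override string ToString() => $"{L} AND {R}"; }

static class Rules
{
    // Selection splitting: SELECT[p AND q](R)  =>  SELECT[p](SELECT[q](R))
    public static Rel SplitSelection(Rel node)
    {
        if (node is Selection s && s.P is And a)
            return new Selection { P = a.L, In = new Selection { P = a.R, In = s.In } };
        return node;
    }
}

class RaDemo
{
    static void Main()
    {
        Rel q = new Selection
        {
            P = new And { L = new Atom { Text = "State = 'QLD'" }, R = new Atom { Text = "PostCode > 4000" } },
            In = new Table { Name = "Pharmacies" }
        };
        Console.WriteLine(Rules.SplitSelection(q));
        // Prints: SELECT[State = 'QLD'](SELECT[PostCode > 4000](Pharmacies))
    }
}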

Let’s leave the details to another post.

View Matching

View matching is a fairly old technique, first seriously utilised at the beginning of the millennium. Larson et al. from Microsoft Research exploited this technique for SQL Server 2000.

The idea is to utilise existing materialised views (or indexed views, in SQL Server) for query optimisation. For example, if you query "SELECT * FROM Cars WHERE Brand='BMW' AND Price BETWEEN 40,000 AND 70,000" and there is already an indexed view over Cars called v_LuxuryCars, defined as "SELECT * FROM Cars WHERE Price > 40,000", the query optimiser can exploit v_LuxuryCars to execute the query. This view can be much smaller than the base table, so the query can run faster. The real beauty of exploiting materialised views becomes obvious in more complex queries, though.

The challenge here is how to efficiently figure out whether a view can be used for a query; essentially, whether the query predicate is a subset of the view-definition predicate (as in the given example: Price BETWEEN 40,000 AND 70,000 is contained in Price > 40,000). This is not straightforward, since predicates can be really complicated. For example, the query may have the predicate A=2 AND B=2 while the view definition has the predicate A=B; in that case the sub/super relationship between the predicates is not obvious.
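
In its very simplest form, a single numeric column with range predicates, the subsumption test is just interval containment. Here is a sketch with made-up names; it glosses over strict versus inclusive bounds (which actually matter at the 40,000 boundary above), and a case like A=2 AND B=2 versus A=B needs real predicate inference rather than this.

using System;

// One-column range predicates as intervals. Contains answers the question
// "can every row the query needs be found in the view?".
readonly struct PricePredicate
{
    public readonly decimal Lo, Hi;
    public PricePredicate(decimal lo, decimal hi) { Lo = lo; Hi = hi; }
    public bool Contains(PricePredicate query) => Lo <= query.Lo && query.Hi <= Hi;
}

class ViewMatchDemo
{
    static void Main()
    {
        var view  = new PricePredicate(40_000m, decimal.MaxValue); // v_LuxuryCars: Price > 40,000
        var query = new PricePredicate(40_000m, 70_000m);          // Price BETWEEN 40,000 AND 70,000
        Console.WriteLine(view.Contains(query)); // True: rewrite the query to scan the view
    }
}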

However, this problem is well studied for relational algebra, and even though Microsoft has been very active in developing view matching techniques, it is not used in the SQL Server core, possibly for performance reasons.

The reason I am interested in view matching is that I have started a project for view matching over the IQueryable interface. I will keep you posted about the progress.