Building an IQueryable Relational Algebra Provider

In the journey of implementing view matching technique for IQueryable, we first need to simplify the language. The only language I am able to implement query optimisation with, is Relational Algebra. What I am going to explore now, is the possibility of implementation of a LINQ relational algebra provider as a common language for every query optimiser and hence, the view matcher I am going to make.

We want to reduce everything down to three relational algebra operation:

  • Selection
  • Projection
  • Join

Subqueries are also an important topic, but for simplicity let not support it.

Initially we should prepare a framework to convert as much as possible into the above three formats. Then we need to implement some few basic operators:

  • Selection: Commutativity, Selection Pushing, Selection Splitting
  • Projection: Projection Pushing
  • Join: Commutativity, Non-associativity

Let’s leave the details to another post.


Column Store Databases

Most RDBMS systems that are common, i.e. SQL Server, Oracle, etc. store data in rows. Indeed a bunch of rows form a page and a page is virtually the unit block that can be loaded into memory. This model is helpful when large amount of data should be store on disk. Due to slow speed operation of disk you want to keep related information physically as close as possible.

This also has a dark side to it by making redundancy in data. One obvious thing is that most of the data that we use comes from a limited domain, for example list of cities, post codes, countries. Names, area codes, etc. We usually don’t care about storage much since storage is the cheapest thing to get these days, yet it comes with the price of performance.

If you want to count the number of distinct countries in your database of 10,000,000 records, you need to pass through the whole 10M. Then if you are trying to run expensive categorization, ontology extraction, etc on multiple column you have to deal with tons of duplicate values which slow you down.

Also if you want to change schema of a data sets dynamically, you have to deal with extra complexity. That is when column store data stores come to rescue with dramatic performance improvement. Yet there are certain tasks that will be very slow on such data bases.

Column store databases like MonetDB store data domain in column, and maintain relationships as pointers to these data. This is much like traditional way of storing data in memory where you have your actual data objects somewhere but you organise a List<T> as a list of pointers to your data. You just keep one copy of month names and everything else is pointing to that.

UPDAE — MonetDB has an interesting property that it is designed to break the columns into sizes that fit into the CPU cache. The “Memory Wall” is a big problem for most modern DBMSs as random access to large pages of data which do not fit in the CPU cache significantly reduces the performance of the data processing. As column store databases work very well with bulk operations, and optimised set of relational algebra operators called BQA (Bulk Query Algebra).  Bonk’s group at CWI, Amsterdam who have developed MonetDB have an interesting little paper that describes the history and future of their work. I suggest you should take a look at their paper (Download it from google scholar).

Although it provides flexibility and good performance hit in memory, the story turns back on disk. Indeed the performance for getting data from one relation is as if you had the relation joined to another table for every single column of it.

I think column storage is a very good candidate for caching data. I have been playing with this concept in the auto-midtier-cache project a bit, you can also take a look.