What has Brisbane floods got to do with Database? And why computer science can save lives!

Last few days Brisbane and Ipswich faced devastating flood in some areas up to 21 m. This means that some houses have been more than 17 meters under water! I have been isolated in a virtual island with no utilities for more than three days, and I have been extremely lucky that flood water did not reach my house. The disaster was devastating in the scale of the Queensland state (as large as England) . One-third of Ipswich city was under water. Unfortunately, the disaster claimed 12 lives so far, but in the absence of technology, this could be far more.

In the University of Queensland, we have been dealing with negotiations of a database project which if has been done, the destructive effect of the project could be far less. The problem ( basically a data quality related issue and very hard to solve ) is the old challenge of schema matching.

There are thousands of water monitoring systems in Australia  all over the land. They are all managed separately  and do not have much in common in terms of  IT. One might have an excel sheet filled manually, another may use a linux tool  to generate a bunch of txt files from measurements, many Access databases or ORACLE and SQL Server etc. All with different designs an schemas. It is an extremely sparse sensor network system. Obviously standardising all the schemas is possible, but is out of our hands and might be a very lengthy and expensive process. So what we could do here in UQ was to develop a smart system that looks at all different thousands of databases and matches their schemas and provides a holistic unified view over everything. This is a classic problem which have been worked on (in Computer Science community not industry) for three decades.

Apparently this project did not go anywhere, but having it won and finished gracefully, all the water movements could be monitored in the matter of hours and flash floods could have been predicted and lives could have been saved!


How to simulate bag access in Windows azure table storage? (Part 1)

It is nice to hear that Microsoft is providing table storage. Hopefully we can get it for non-azure platforms as well. The idea is fast and scalable access to persisted objects without limitations of tabular world. No doubt that relational databases are amazing and let for super complex queries and transactions to happen. Downside is their complexity of design and usage. It tends to be extremely hard to provide real scalable relational data yet satisfying service level agreements on response time, availability etc.

Efforts on developing non-relational non-schema bound data sets are as old as databases, and in the cloud era, they make so much sense. For example Mnesia is a lovely database designed to work with Erlang with a LINQ-like query language. Enough to say it is developed in 80’s and is easy to scale, and provides 100% uptime (you get a mechanism to do hot patching). I also read about this database (RavenDB) a few days ago which is based on a similar motive.

One important thing to remember when working with non-relational databases, is that they are not relational. Thus, you don’t run SQL scripts against them and there is no join, no views, no foreign keys and primary keys. These terms make sense for tabular data. Databases like table storage are semi-structured data storage. Structured is tabular and relational data storage store them. Semi structure is XML, JSON, or any other form of persisted object. Unstructured is web and free-form text, etc.

Mnesia (as a pioneer of table-storage like databases) stores data in set’s and bags. A set is a table, which each record has a unique key. Fair enough, we are used to work with table with primary key which is the same. But a bag, is a table in which many records can share a key, hence there might be no way to access a single row of a table because it does not have a unique key (You may say now, WTF? what happens to my candidate keys and primary keys – and my answer is wait a minute. We are not in relational world, so non of these terms exist here).

So what is the value of having a row in a table which we can not access it directly? It of course has some value. Bearing in mind again that table storage is not relational, a good design paradigm is to NEVER query anything except the key (and of course partition key for table storage). Any other query (which is not bounded to partition key for table storage) is similar to a full table scan in you SQL Server database and full table (or index) scan is is THE killer. You can never become scalable if you have a single operation with full table scan over your growing data.

to be continued…