Pro-active SQL Injection Defence

This link has a guide for hackers on performing SQL injection attacks. Here is a brief summary of the different types of attack against SQL Server:

1. Commenting: -- or /* */
Code: "SELECT * FROM Field WHERE Id=" + id + " AND TYPE=-1"
Attacked: SELECT * FROM Field WHERE Id=1; SHUTDOWN; -- AND TYPE=-1

2. Stacking: …; …
Code: "SELECT * FROM Field WHERE Id=" + id
Attacked: SELECT * FROM Field WHERE Id=1; SHUTDOWN;

3. Always-true condition: 1=1
Code: "SELECT * FROM SecretStuff WHERE user = '" + UserId + "'"
Attacked: SELECT * FROM SecretStuff WHERE user = 'baduser' OR 1=1 --'

4. Addition of a column: … + SomeOtherColumn + …
Code: "SELECT Name, BadPractice = '" + someParam + "' FROM Users"
Attacked: "SELECT Name, BadPractice = ' ' + Password + ' ' FROM Users"
Attacked: "SELECT Name, BadPractice = ' ' + (SELECT Password FROM Users WHERE User = 'Admin') + ' ' FROM Users"

5. Union: … UNION …
Code: "SELECT Name, Family FROM Customers WHERE CustomerId = " + id
Attacked: "SELECT Name, Family FROM Customers WHERE CustomerId = 1 UNION SELECT UserName, Password FROM Users"

Pro-active prevention:

Use parameterized queries instead of directly building queries as strings.
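
For example, here is a minimal sketch of the parameterized version of the first query above (the connection string and id value are placeholders, not code from the post):

    using System.Data.SqlClient;   // the classic provider; Microsoft.Data.SqlClient works the same way

    var connectionString = "...";  // placeholder
    var id = 1;                    // would normally come from user input
    using var conn = new SqlConnection(connectionString);
    using var cmd = new SqlCommand("SELECT * FROM Field WHERE Id = @id AND TYPE = -1", conn);
    cmd.Parameters.AddWithValue("@id", id);  // "1; SHUTDOWN; --" would travel as data, never as SQL text
    conn.Open();
    using var reader = cmd.ExecuteReader();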

Re-active prevention, for when the pro-active approach is not possible (e.g. lots of legacy code) or just as additional security. This check happens right before running the query:

a. First take out all the quoted literals and bracketed identifiers: '…' and […]

b. Run the following regex against the remainder of the query: new Regex(@"--|\/\*|\*\/|\W+UNION\s+", RegexOptions.IgnoreCase), which:

Checks the query for comments before running it; catches attack 1.

Expects UNION only when you know you actually want a UNION; catches attack 5.

c. Count the number of semicolons, which:

Checks the number of stacked queries against the expected count; catches attack 2.

In the swenging project, I have written a function that does the above tasks (as a sample of using swe), in the context of the swe state machine. It can be used as follows:

Assert.AreEqual(
    ValidationResult.Comment, // a comment injection was found
    ValidateQuery(@"SELECT STUFF from Where I shouldn't' /* some Comment *."));

etc.
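
For reference, here is a minimal sketch of the kind of check described above. It is not the actual swenging implementation, just an illustration of steps a to c (strip the literals and identifiers, then look for comments, UNION, and unexpected stacking):

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    Console.WriteLine(QueryChecker.ValidateQuery("SELECT STUFF FROM Somewhere /* some Comment */")); // Comment

    enum ValidationResult { Ok, Comment, Union, StackedQueries }

    static class QueryChecker
    {
        // a. quoted literals '...' and bracketed identifiers [...]
        static readonly Regex Literals = new(@"'[^']*'|\[[^\]]*\]");
        // b. comment markers, and UNION, in what remains
        static readonly Regex Comments = new(@"--|/\*|\*/");
        static readonly Regex Union = new(@"\bUNION\b", RegexOptions.IgnoreCase);

        public static ValidationResult ValidateQuery(string query, int expectedSemicolons = 0)
        {
            var stripped = Literals.Replace(query, "");
            if (Comments.IsMatch(stripped)) return ValidationResult.Comment;
            if (Union.IsMatch(stripped)) return ValidationResult.Union;
            // c. compare the number of semicolons with what we expect for this query
            if (stripped.Count(c => c == ';') > expectedSemicolons) return ValidationResult.StackedQueries;
            return ValidationResult.Ok;
        }
    }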


Ever wondered how text search works? Let’s do it!

You have certainly seen fuzzy text search everywhere: you type "stck" but you actually mean "stack"! Ever wondered how this stuff works?

There are plenty of algorithms for fuzzy text matching, each with its own pros and cons. The most famous ones are edit distance and qgrams. I want to focus on qgrams today and implement a sample.

Basically, qgrams are the most suitable fuzzy string matching approach for relational databases, and the idea is pretty simple: the "q" in qgram is replaced with a number, as in 2-gram, 3-gram, or even 4-gram.

2-gram means that every word is broken into a set of two-character grams: "stack" is broken into the set {"st", "ta", "ac", "ck"}, and "database" into {"da", "at", "ta", "ab", "ba", "as", "se"}.
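
In C#, the splitting step might look like this (a hypothetical helper for illustration, not code from the post):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Break a word into its set of 2-grams.
    static IEnumerable<string> TwoGrams(string word) =>
        Enumerable.Range(0, Math.Max(0, word.Length - 1))
                  .Select(i => word.Substring(i, 2))
                  .Distinct();

    Console.WriteLine(string.Join(",", TwoGrams("stack")));                    // st,ta,ac,ck
    Console.WriteLine(string.Join(",", TwoGrams("database")));                 // da,at,ta,ab,ba,as,se
    Console.WriteLine(TwoGrams("stck").Intersect(TwoGrams("stack")).Count());  // 2 grams in common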

Once the words are broken into 2-grams, we can search the database for a set of values instead of a single string. For example, if the user mistyped "stck", a direct search for "stck" will not match "stack" because the "a" is missing, but the 2-gram set {"st", "tc", "ck"} has 2 grams in common with the 2-gram set of "stack". Bingo, we found a pretty close match! It has nothing in common with the 2-gram set of "database" and only 1 gram in common with the 2-gram set of "star", so we can easily suggest to the user that they probably meant "stack" first, then "star".

Now let's implement it using SQL Server. Assume a hypothetical words dataset; you need a many-to-many relationship between 2-grams and words.

CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))

The Grams table should be clustered on twog first and then wordId, for performance. When you query a word (e.g. "stack"), you put its grams in a temp table. First, let's create a few million dummy records.

-- make millions of 2-grams
DECLARE @i int = 0
WHILE (@i < 5000000)
BEGIN
    -- a random 2-gram (26 lowercase letters; 'a' is CHAR(97)) and a random wordId
    DECLARE @rnum1 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    DECLARE @rnum2 char = CHAR(CAST(RAND()*26 AS INT) + 97)
    DECLARE @wordId int = CAST(RAND()*100000 AS int)

    -- skip duplicates so the (twog, wordId) primary key is not violated
    IF NOT EXISTS (SELECT 1 FROM Grams WHERE twog = @rnum1 + @rnum2 AND wordId = @wordId)
        INSERT INTO Grams (twog, wordId) VALUES (@rnum1 + @rnum2, @wordId)

    SET @i = @i + 1
END

Now let's query the word "stack", which is broken into the 2-grams {'st', 'ta', 'ac', 'ck'}.

DECLARE @word TABLE(twog char(2)) -- the 2-grams of 'stack'
INSERT INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')

SELECT g.wordId, COUNT(*) AS matches
FROM @word w INNER JOIN Grams g ON w.twog = g.twog
GROUP BY g.wordId
ORDER BY COUNT(*) DESC -- closest matches first

You should make sure that SQL Server uses a bunch of clustered index seeks (or lookups) to run this query. That should be the natural choice, but sometimes statistics are corrupted or out of date and SQL Server may decide that a full scan is cheaper. This usually happens when it does not know the cardinality of the left-side table; for example, SQL Server may assume that the @word table is massive and that millions of lookups would be more expensive than a full index scan.

Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both of the above queries return a join of cars and resellers for V8 cars, and obviously both return the same results. But does that mean there is no difference between these two ways of limiting the results?

Although for an inner join you can use the two approaches interchangeably, for an outer join a subtle difference can catch you out. What would you expect the outcome of the following query to be?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

If you expect to get only the V8 cars left joined to their resellers, you won't be happy with the result: a car that is not a V8 will still show up in the results, just with NULLs in the reseller columns! You have to move c.Engine = 'V8' to the WHERE clause to guarantee the intended behaviour.


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

The reason for this behaviour is that SQL Server first applies the join predicate and then adds back the left-side rows that have no right-side match, filling the right side with NULLs. To understand this better, think about the way you write a left join in LINQ.


var res = from c in Cars
          join re in Resellers on c.CarId equals re.CarId into rs
          from r in rs.DefaultIfEmpty()   // r is null when the car has no reseller
          select ...;

A left join in SQL Server is calculated the same way: you first run the join between Cars and Resellers, then select NULL when the resellers collection is empty for a given car (i.e. rs.DefaultIfEmpty()).

Top K Query Processing, or FULL SCAN IS THE DEVIL! avoid it

Almost always, users don't want to see all the results coming back from a query. The result set is usually restricted by conditions, and the first page (or first few pages) of the response is all the user wants to see. If you asked me to describe DB optimization in one sentence, it would be: avoid full table/index scans. This is the single key to highly performing DB applications. However, avoiding full scans is not trivial at all; in fact, it is a hard problem. If your table has n records, a full table scan has complexity O(n), a join between two tables with full scans has complexity O(n^2), and so on. It can get out of control pretty easily.
Have you ever been in a situation where a tiny 10GB database with just a few million records takes 10 minutes to respond to a query?
Ever wondered how Google returns search results from its massive tables for millions of users on each keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be on its knees when complex queries full scan massive tables. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever it takes to avoid full scans on large data sets, especially if the query runs frequently. Obviously there is no problem with a query that takes 5 minutes once every weekend, but a frequently used query should never take more than a few seconds. There is a lot you can do to avoid full scans, and it is well covered in query processing resources. Read this book to learn some techniques for doing this in MS SQL Server 2008.

Despite all the powerful tools and smartness that SQL Server or any other commercial DBMS provides, there are cases where top-k query processing is simply not possible by writing SQL alone. That is when you need to know about the algorithms. Yes, some computer science fun! Top-k query processing is a widely studied subject, and there are heaps of methods and techniques that can drastically improve application performance. Unfortunately, most commercial DBMSs do not support even the simplest top-k query processing techniques, so you have to code them yourself if you decide to use them. A few top-k techniques like TA and PRA are quite quick and easy to develop, but others are harder and much more complex. This 2008 ACM survey, by the guys at the University of Waterloo, ON, Canada, is an excellent source for gaining a good understanding of top-k query processing.

To be precise, my message in this post is that there is no excuse for long-running queries!

Are you a super smart senior consultant? See if you can pass my interview.

I have been looking around the market to change my job in the last few weeks and have come across very interesting stuff that is worth writing a book about, or at least a short story.

I have seen a Christian who wanted to develop the ultimate database inspired by Jesus, have been to interviews where the company did not actually have a position yet, and have faced people who think they are super smart senior consultants because they know how to stop an object being finalized by the garbage collector.

If you are one of those super smart senior consultants who knows everything, see if you can answer this very basic, classic database question. If you can't, you should seriously rethink your title. Seriously, I can't believe someone could be developing business applications for more than a decade and not be able to solve this problem in an interview:

Q. We have a couple of tables in a database. The first table has a column X and the second table has a column Y, and the first table has a foreign key relationship to the second. How do you calculate "SELECT TOP 1 * FROM X JOIN Y … ORDER BY a*X + b*Y DESC" without a full scan on either table, assuming a and b are arbitrary positive numbers provided by the user each time? You don't necessarily have to solve it with SQL only; you are allowed to use C#. The above query is only there to define the problem.

Justification: why do I think this is a valid interview question? Two reasons: 1) Think how many times you have faced a client complaining about performance, versus how many times you were in a situation where knowing how to stop an object from being finalized made the customer happy. 2) How easy is it to pick up the knowledge of algorithms for fast data processing on site, compared with some rare piece of technical terminology? Google "avoid an object being finalized" and you get your answer in a few seconds, but I would be interested to know how you would use Google to solve this problem.
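
For what it is worth, here is a minimal C# sketch of one possible answer, along the lines of the Threshold Algorithm (TA) mentioned earlier. It assumes two hypothetical inputs sorted descending (e.g. read off indexes on X and Y) and cheap per-row lookups for the other column (index seeks); it is an illustration of the idea, not a ready-made solution:

    using System;
    using System.Collections.Generic;

    static class TopK
    {
        // xSorted: (rowId, x) ordered by x descending (read off an index on X).
        // ySorted: (rowId, y) ordered by y descending (index on Y).
        // lookupX / lookupY: fetch the other column for a given row with an index seek.
        public static (int rowId, double score) TopOne(
            IEnumerable<(int rowId, double x)> xSorted,
            IEnumerable<(int rowId, double y)> ySorted,
            Func<int, double> lookupX,
            Func<int, double> lookupY,
            double a, double b)
        {
            using var xe = xSorted.GetEnumerator();
            using var ye = ySorted.GetEnumerator();
            int bestId = -1;
            double bestScore = double.NegativeInfinity;

            // Round-robin sorted access on both lists; stop as soon as no unseen row can win.
            while (xe.MoveNext() && ye.MoveNext())
            {
                var (xId, x) = xe.Current;
                var (yId, y) = ye.Current;

                double scoreX = a * x + b * lookupY(xId);   // random access for this row's Y
                if (scoreX > bestScore) { bestScore = scoreX; bestId = xId; }

                double scoreY = a * lookupX(yId) + b * y;   // random access for this row's X
                if (scoreY > bestScore) { bestScore = scoreY; bestId = yId; }

                // Threshold: with a and b positive, no row not yet seen under
                // sorted access can score more than a*x + b*y.
                if (bestScore >= a * x + b * y)
                    break;
            }
            return (bestId, bestScore);
        }
    }

With sorted access coming from the two indexes, the loop usually stops after touching a small fraction of the rows, which is exactly the point of the question.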

Automatic Mid-tier Cache!

I have been thinking about this for a while. Last year I started to work on a Silverlight project which was a nifty little business app. One of those apps that just sits there and works, you know: a web server, an average DB with a few million records, and a Silverlight client with a bunch of forms, grids, and menus.

Everything was straightforward and brainless: just follow a pattern and do the job. However, I was deeply dissatisfied with one aspect of the project: the caching. What I could not accept was that in the 21st century, when you have Entity Framework and IQueryable, you should still manually cache your data where appropriate and do all the pointless work of cache invalidation, loading, etc. Apart from the pain of working on something that should have been automated, I wouldn't trust a programmer to decide which parts of the data should be cached and which shouldn't. Not that I don't believe they could do a good job of it; they simply don't have enough information (at development time) to decide.

A caching strategy should be based on user behaviour, and it is subject to change over time. For example, at some stage lots of QLD pharmacies are queried, but next week NSW users decide to get ready for their conferences and start hammering the system for NSW pharmacies.

Let me be clear about my expectations of a caching system. It should have the following characteristics:

  1. It should know what users are going to query a lot and cache that (and only that) part of the database.
  2. It should be able to re-use the caches. For example, if I say 1. "Give me all QLD pharmacies", and the next person says 2. "Give me all QLD Chemist Warehouses", the cache manager should be smart enough to run the new query 2 over the results of query 1, which were retrieved a few minutes ago.
  3. It should optimize the indexes for performance based on the user queries.
  4. It should change the cache when user behaviour changes.
  5. It should call back to the database only if there is absolutely no way of answering the query from the cache.

The above requests seem like a lot, but not really in 2011. All of this is possible; in fact, DBMSs have been doing this kind of thing for ages. We also have IQueryable, which makes it even easier to build a decent caching system.

So let me write a few examples:

Q1: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").Select(…)

Q2: Pharmacies.Join( … Address …).Join( … State …).Where( a => a.State == "QLD" && a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Pharmacies.Join( … Address …).Join( … State …).Where( s => s.State == "QLD").GroupBy(…).Where( pg => pg.Count() > 4).Select(…)

Q4: PharmacyStaff.Where( ps => ps.Position == "Manager").Select(…)

Q5: Pharmacies.Join( … Address …).Join( … State …).Join( … PharmacyStaff …).Where( s => s.State == "QLD" && s.Position == "Manager").Select(…)

Users log in to our system and do stuff that causes the above queries to be issued. Normally they will all be issued against the database, but that means our caching strategy is as stupid as a donkey. What I would really expect is that only Q1 and Q4 are run against the database. Q2 and Q3 are subsets of Q1, and Q5 can be built from Q1 and Q4, so if we already have the results of Q1, it is such a waste to run these new queries against the database. Why not look at the expression tree and figure out that Q2 forms a query which is a subset of Q1? Then rewrite the queries as below:

Q1: not changed…

Q2: Q1.Where( a => a.PostCode > 4000 && a.PostCode < 4079).Select(…)

Q3: Q1.GroupBy(…).Where(…)

Q4: not changed…

Q5: Q1.Join(…Q4…).Select(…)

Check out the above queries. Aren't they much better? We don't expect the user or the programmer to waste their time translating those queries; the caching system should do that. It should be an IQueryable that reads the expression tree and translates it into a new expression tree that uses the existing data in the cache when there is no need to query the database.
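
Just to make the idea concrete, here is a heavily simplified sketch (hypothetical types and names, nothing like the real thing): if a previously cached query is known to be a superset of the incoming one, run the residual predicate over the cached rows in memory instead of going to the database. The actual view matching over expression trees is the hard part that this sketch hand-waves away with a key and a delegate.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class NaiveQueryCache<T>
    {
        private readonly Dictionary<string, List<T>> _cache = new();

        // supersetKey       : identifies a broader, previously seen query (e.g. "QLD pharmacies")
        // databaseQuery     : how to run that broader query against the database if it is not cached yet
        // residualPredicate : the extra filtering the incoming query adds on top of the broader one
        public IEnumerable<T> GetOrQuery(
            string supersetKey,
            Func<IQueryable<T>> databaseQuery,
            Func<T, bool> residualPredicate)
        {
            if (!_cache.TryGetValue(supersetKey, out var rows))
            {
                rows = databaseQuery().ToList();   // one round trip for the broad query
                _cache[supersetKey] = rows;
            }
            return rows.Where(residualPredicate);  // Q2/Q3-style queries answered from memory
        }
    }

With something like this in place, Q2 becomes a cache hit over the rows that Q1 already pulled down.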

This especially makes sense in the cloud, where you have to pay for querying your SQL Azure database.

Enough talking about dreams, let's get realistic! I did a bit of research and, as I expected, no such caching manager exists (if you know of one, tell me and save my hair). So I decided to do it myself. Check out the Auto-mid-tier-cache project, which I have already started. I haven't gone far with it yet; it is just a proof of concept and implements no IQueryable. It uses a set of objects I defined myself for relational algebra operators. It does very basic view matching to find which query is a subset of which other, and it is able to translate queries to run against either the database or the cache.

I ran it and it worked fine, and a bunch of benchmark analysis showed its effectiveness. What is left now is to complete the view matching and write an IQueryable on top of it. A lot of work, but it is worth it.

I forgot to say that you can limit the cache size by setting a cost upper bound. The next issue is that it does not keep itself up to date, but that is really another story.

Column Store Databases

Most common RDBMSs, e.g. SQL Server, Oracle, etc., store data in rows. A bunch of rows forms a page, and a page is effectively the unit block that can be loaded into memory. This model is helpful when large amounts of data have to be stored on disk: because disk access is slow, you want to keep related information physically as close together as possible.

This also has a dark side: redundancy in the data. Most of the data we use comes from a limited domain, for example lists of cities, post codes, countries, names, area codes, etc. We usually don't care much about storage since it is the cheapest thing to get these days, yet the redundancy comes at the price of performance.

If you want to count the number of distinct countries in your database of 10,000,000 records, you need to pass through the whole 10M rows. And if you are trying to run expensive categorization, ontology extraction, etc. on multiple columns, you have to deal with tons of duplicate values, which slows you down.

Also, if you want to change the schema of a data set dynamically, you have to deal with extra complexity. That is when column store databases come to the rescue with dramatic performance improvements. Yet there are certain tasks that will be very slow on such databases.

Column store databases like MonetDB store the data domain per column and maintain relationships as pointers to that data. This is much like the traditional way of storing data in memory, where you have your actual data objects somewhere but organise a List&lt;T&gt; as a list of pointers to them: you keep just one copy of the month names and everything else points to that.
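
A tiny C# analogy for the idea (purely illustrative, not how MonetDB is implemented internally):

    using System;
    using System.Linq;

    // The column's domain is stored once...
    var months = new[] { "Jan", "Feb", "Mar" };
    // ...and each row just points into it.
    var monthColumn = new byte[] { 0, 0, 1, 2, 1, 0 };

    // Counting distinct values only touches the tiny domain array, not every row.
    Console.WriteLine(months.Length);                                          // 3
    // Reconstructing the row values means following the pointers.
    Console.WriteLine(string.Join(",", monthColumn.Select(i => months[i])));   // Jan,Jan,Feb,Mar,Feb,Jan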

UPDATE: MonetDB has an interesting property: it is designed to break columns into chunks that fit into the CPU cache. The "memory wall" is a big problem for most modern DBMSs, because random access to large pages of data that do not fit in the CPU cache significantly reduces the performance of data processing. Column store databases, on the other hand, work very well with bulk operations and an optimised set of relational algebra operators called BQA (Bulk Query Algebra). Boncz's group at CWI, Amsterdam, who developed MonetDB, have an interesting little paper that describes the history and future of their work. I suggest you take a look at it (download it from Google Scholar).

Although it provides flexibility and good performance in memory, the story turns around on disk. Indeed, the cost of reading data from one relation is as if the relation were joined to another table for every single one of its columns.

I think column storage is a very good candidate for caching data. I have been playing with this concept a bit in the auto-midtier-cache project; you can take a look there as well.

DBLP for SQL Server

DBLP is the mainstream computer science publication database. Anyone can download its dataset from here for free. It is a good benchmark dataset for many algorithmic research purposes; for example, if you work on an entity resolution algorithm, you can benchmark it against this dataset, which includes numerous data quality issues.

I have used this dataset multiple times in my own work; however, for someone like me who uses a commercial RDBMS, a little bit of work is required to utilise it. The original dataset, which you can download from the above link, is an XML file suitable for a column store database (like MonetDB). However, most commercial DBMSs are row stores, and you have to find a way to make this XML suitable for your DBMS.

Although it is not particularly hard (it takes only a few hours to figure out what to do and less than an hour to implement), I have put my solution HERE, which may save you those few hours.

It basically converts DBLP's XML to flat CSVs that you can import into your DBMS. A copy of a SQL Server 2008 R2 backup file is also included in the solution for your convenience.

(Update: Sorry I removed the SQL Server backup file because it was slowing down the checkout)

To use the code, first check it out from the project folder (using SVN). There is a project called Dblp_xml2csv, which is a console application. Set it as the startup project and build the console app.

You need to have downloaded the DBLP dataset in XML format. Go to where dblp_xml2csv.exe is built and run the following command:
>dblp_xml2csv.exe dblp-data.xml

Replace dblp-data.xml with the path to your DBLP XML file.

There is no need for the companion XSD, because for performance reasons the XML dataset is treated as a plain text file.

Running the above command will create a couple of .csv files, which you can import into SQL Server, Excel, etc.

The additional stuff in the solution relates to some work I was doing on a data quality estimation technique and probably won't be of much help to you.

View Matching

View matching is a fairly old technique, first seriously utilised at the beginning of the millennium. Larson et al. from Microsoft Research exploited this technique for SQL Server 2000.

The idea is to utilise existing materialised views (indexed views in SQL Server) for query optimisation. For example, if you query "SELECT * FROM Cars WHERE Brand='BMW' AND Price BETWEEN 40000 AND 70000" and there is already an indexed view over Cars called v_LuxuryCars, defined as "SELECT * FROM Cars WHERE Price > 40000", the query optimiser can exploit v_LuxuryCars to execute the query. The view can be much smaller than the base table, so the query can run faster. The real beauty of exploiting materialised views becomes obvious in more complex queries, though.

The challenge here is figuring out efficiently whether a view can be used to answer the query, i.e. whether the query predicate is contained in the view definition predicate (as in the given example, where PRICE BETWEEN 40K AND 70K is contained in PRICE > 40K). This is not straightforward, since predicates can be really complicated. For example, the query may have the predicate A=2 AND B=2 while the view definition has the predicate A=B; in that case the containment relationship between the predicates is not obvious.
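
As a toy illustration of the simplest case only (a single numeric range on one column, with inclusive bounds; real view matching has to reason about conjunctions, joins, and cases like A=2 AND B=2 versus A=B), a containment check could look like this:

    using System;

    // A query range is answerable from a view range if it lies entirely inside it.
    bool IsContainedIn((double Low, double High) query, (double Low, double High) view) =>
        query.Low >= view.Low && query.High <= view.High;

    var q = (Low: 40000.0, High: 70000.0);                   // PRICE BETWEEN 40000 AND 70000
    var v = (Low: 40000.0, High: double.PositiveInfinity);   // v_LuxuryCars, treating its bound as inclusive
    Console.WriteLine(IsContainedIn(q, v));                  // True: the view can answer the query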

However, this problem is well studied for relational algebra, and even though Microsoft has been very active in developing view matching techniques, it is not used heavily in the SQL Server core, possibly for performance reasons.

The reason I am interested in view matching is that I have started a project for view matching over the IQueryable interface. I will keep you posted on the progress.

Sampling Data with SQL Server

Larson et al. from Microsoft Research have this paper in ACM SIGMOD 2007. They introduce an interesting sampling mechanism for SQL Server using materialized views (materialized views being the same thing as indexed views). If you create an indexed view over a table, or a bunch of tables joined together, using their technique, you get a tidy sample of your data which is kept up to date automatically.

What caught my eye was that their basic sampling approach is interestingly simple:


CREATE VIEW Sample_TheTableName WITH SCHEMABINDING AS
  SELECT ...,
         __RAND = CAST(ROW_NUMBER() OVER (ORDER BY RAND()) AS tinyint)
  FROM ...

The additional __RAND column gives a random ordering over the data, which can be used to draw a statistically valid sample.