Preventing unauthorized modifications in SAF

There is currently an issue with SAF (Simple Authorization Framework) when it is used with Entity Framework entities. When SAF reads an entity and tests it against the user principal, it nulls out the properties the user has no read access to. There are also properties the user has no edit access to; SAF can notify the client about them, but at this stage it does nothing to prevent unauthorized edits on objects with partial permissions.

For example, if you modify a book object where you can edit the title but are not allowed to edit the price, there is no check on the server to make sure you haven't changed the price. I call this partial edit access. If you don't have edit access to the book object at all, SAF prevents you from changing anything.

Supporting partial edit access for Entity Framework is not hard, but you should be very cautious about doing so because of the possible performance implications. In a 3-tier architecture you don't keep the object context alive for the whole session, so the original object is not kept anywhere when the service is called back for an update. The way EF ensures optimistic concurrency is that, inside the SQL UPDATE statement, it compares the values of all fields with ConcurrencyMode set to Fixed to ensure the data has not been modified in the background. If there has been any modification, a conflict exception is thrown and the user has the choice to resolve the conflict.
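As a minimal sketch of how such a conflict surfaces and can be resolved (the context and entity names here are illustrative, assuming a column with ConcurrencyMode set to Fixed):


try
{
    oc.SaveChanges();
}
catch (OptimisticConcurrencyException)
{
    // The row was changed in the background; here we let the store win,
    // so the user can re-apply their edit on top of fresh data.
    oc.Refresh(RefreshMode.StoreWins, modifiedObject);
}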

However, we don't have the luxury of embedding the authorization checks in the SQL statement, because SAF does not know how to create SQL queries. Instead, we have to load the actual object from the data store and ensure that the modifications are valid. This is possible using the GetObjectByKey method on the ObjectContext:


var oc = ...; // the object context
var storeObject = oc.GetObjectByKey(((IEntityWithKey)modifiedObject).EntityKey);
// Compare the properties of storeObject with modifiedObject to ensure
// no unauthorized change has happened

The code above can compare the store values with the current values for the properties the user has no edit permission on. Obviously there is a performance hit with this approach, because an extra query is sent to the database for every modification that involves partial edit or view access on an object. There is also another issue: the logic here conflicts with the logic of the optimistic concurrency check. For example, if the user has no view access to the price of the book, SAF will put null in the price field. When the data is sent back to the server, the context assumes the object is modified and tries to write back the null value. If we replace the null with the current store value, the unauthorized properties are changed back, but what happens if someone else has changed those properties in the meantime? There should be a conflict, but we have overwritten that conflict with our authorization logic.
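For the comparison step itself, something like the following sketch would do, assuming storeObject was loaded through a second ObjectContext (so it still reflects the database values) and unauthorizedProps is a hypothetical list of property names the user may not edit:


foreach (var propName in unauthorizedProps)
{
    var propInfo = storeObject.GetType().GetProperty(propName);
    var storeValue = propInfo.GetValue(storeObject, null);
    var clientValue = propInfo.GetValue(modifiedObject, null);
    if (!object.Equals(storeValue, clientValue))
    {
        // The client has changed a property it has no edit access to.
        throw new UnauthorizedAccessException(
            "Unauthorized change to property '" + propName + "'.");
    }
}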

Therefore, I am still not sure if I should implement this as part of the SAF solution.

An alternative solution is to check the modified properties using code similar to this:

var unauthorizedProps = ...; // list of unauthorized property names
var entry = objContext.ObjectStateManager.GetObjectStateEntry(modifiedObj);
var modifiedProps = entry.GetModifiedProperties().ToList();
// Mark the whole object unchanged, then re-mark only the authorized properties
objContext.ObjectStateManager.ChangeObjectState(modifiedObj, EntityState.Unchanged);
foreach (var prop in modifiedProps.Except(unauthorizedProps))
{
    entry.SetModifiedProperty(prop);
}

The code above finds all the modified properties, sets the entity to unchanged, and then marks just the authorized properties as modified again. This performs much better than the previous approach. One of these approaches may eventually find its way into SAF if there is no better way to manage this issue.

Where should I write join conditions? In the ON clause or in the WHERE clause?

What is the difference between these two statements?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

And the following query?


SELECT * FROM Cars c
INNER JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

As you can see, both of the above queries return a join of cars and resellers for V8 cars, and obviously both queries return the same results. But does that mean there is no difference between these two ways of limiting the results?

Although for an inner join you can use both approaches interchangeably, for an outer join a subtle difference can catch you out. What would you expect the outcome of the following query to be?


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId AND c.Engine = 'V8'

If you expect to get only the V8 cars left joined to the resellers, you won't be happy with the result: every car appears in the output, because a car that is not a V8 still satisfies the outer join and simply gets NULL reseller columns. You have to move c.Engine = 'V8' to the WHERE clause to guarantee the correct behaviour.


SELECT * FROM Cars c
LEFT JOIN Resellers r ON c.CarId = r.CarId
WHERE c.Engine = 'V8'

The reason for this behaviour is that SQL Server first applies the join predicate and then adds back the left-side rows that have no right-side match, filling the right side with NULLs. To understand the behaviour better, think about the way you write a left join in LINQ.


var res = from c in Cars
          join r in Resellers on c.CarId equals r.CarId into rs
          from reseller in rs.DefaultIfEmpty()
          select ...;

A left join in SQL Server is calculated the same way: you first run the join between Cars and Resellers, then select null when the resellers collection is empty for a given car (i.e. rs.DefaultIfEmpty()).
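To see the two SQL variants side by side in LINQ, here is a sketch over hypothetical in-memory cars and resellers collections with CarId, Engine and Model properties:


// ON-clause style: the engine condition is part of the join predicate,
// so every car survives and non-V8 cars just get a null reseller.
var onClauseStyle = from c in cars
                    join r in resellers on c.CarId equals r.CarId into rs
                    from reseller in rs.Where(x => c.Engine == "V8").DefaultIfEmpty()
                    select new { c.Model, Reseller = reseller };

// WHERE-clause style: non-V8 cars are filtered out entirely,
// which is usually what you mean.
var whereClauseStyle = from c in cars
                       where c.Engine == "V8"
                       join r in resellers on c.CarId equals r.CarId into rs
                       from reseller in rs.DefaultIfEmpty()
                       select new { c.Model, Reseller = reseller };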

A little puzzle

A chicken and a half lays an egg and a half in a day and a half. How many eggs would one chicken lay in three days?

Help: 3 is not the correct answer.

Top K Query Processing, or FULL SCAN IS THE DEVIL! Avoid it

Almost always, users don't want to see all the results coming back from a query. The result set is usually restricted by conditions, and the first page (or first few pages) of the response is all the user wants to see. If you asked me to describe database optimization in one sentence, it would be: avoid full table/index scans. This is the single key to highly performing database applications. However, avoiding full scans is not trivial at all; in fact, it is a hard problem. If your table has n records, a full table scan has complexity O(n), a join between two tables with full scans has complexity O(n^2), and so on. It can get out of control pretty easily.

Have you ever been in a situation where a tiny 10 GB database with just a few million records takes 10 minutes to respond to a query? Ever wondered how Google returns search results from its supermassive tables for millions of users on each keystroke? No matter how expensive your hardware is and how much processing power you have, your machine will be on its knees when a complex query full scans massive tables. You have to know your data very well, and you have to know your users (or use cases) very well. Do whatever it takes to avoid full scans on large data sets, especially if the query runs frequently. Obviously there is no problem with a query that takes 5 minutes once every weekend, but a frequently used query should never take more than a few seconds. There is a lot you can do to avoid full scans, and it is well covered in query processing resources. Read this book to learn some of these techniques in MS SQL Server 2008.

Despite all the powerful tools and smartness that SQL Server or any other commercial DBMS provides, there are cases where top-k query processing is not possible at all with just SQL. That is when you need to know about the algorithms. Yes, some computer science fun! Top-k query processing is a widely studied subject, and there are plenty of methods and techniques that can drastically improve application performance. Unfortunately, most commercial DBMSs do not support even the simplest top-k query processing techniques, so you have to code them yourself if you decide to use them. A few top-k techniques like TA and PRA are quite quick and easy to develop, but some others are harder and much more complex. This 2008 ACM survey, thanks to the guys at the University of Waterloo, ON, Canada, is an excellent source for gaining a good understanding of top-k query processing.
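To give a flavour, here is a minimal sketch of TA (the threshold algorithm) for a monotone score a*x + b*y over two attributes. It assumes you can scan each attribute in descending order (e.g. via an index) and fetch an item's full record by id; all names here are illustrative:


record Item(int Id, double X, double Y);

static List<Item> TopK(IList<Item> byXDesc, IList<Item> byYDesc,
                       IDictionary<int, Item> byId, double a, double b, int k)
{
    var seen = new HashSet<int>();
    var best = new SortedSet<(double Score, int Id)>(); // current top-k, worst first
    for (int pos = 0; pos < byXDesc.Count; pos++)
    {
        // Sorted access: one step down each list, then random access for the
        // full record so we can compute the exact score.
        foreach (var c in new[] { byXDesc[pos], byYDesc[pos] })
        {
            if (!seen.Add(c.Id)) continue;
            var item = byId[c.Id];
            best.Add((a * item.X + b * item.Y, item.Id));
            if (best.Count > k) best.Remove(best.Min);
        }
        // Threshold: the best score any unseen item could still achieve.
        double threshold = a * byXDesc[pos].X + b * byYDesc[pos].Y;
        if (best.Count == k && best.Min.Score >= threshold)
            break; // no unseen item can enter the top-k, so stop early
    }
    return best.Reverse().Select(t => byId[t.Id]).ToList();
}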

To be precise, my message in this post is that there is no excuse for long-running queries!

Are you a super smart senior consultant? See if you can pass my interview.

I have been looking around the market to change my job in the last few weeks and have come across some very interesting stuff that is worth writing a book about, or at least a short story.

I have seen a Christian who wanted to develop the ultimate database inspired by Jesus, I have been to interviews where the company did not actually have a position yet, and I have faced people who think they are super smart senior consultants because they know how to prevent an object from being finalized by the garbage collector.

If you are one of those super smart senior consultants who knows everything, see if you can answer this very basic, classic database question. If you can't, you should seriously rethink your title. Seriously, I can't believe someone could develop business applications for more than a decade and not be able to solve this problem in an interview:

Q. We have a couple of tables in a database. The first table has a column X, the second table has a column Y, and the first table has a foreign key relationship to the second. How would you calculate "SELECT TOP 1 * FROM X JOIN Y … ORDER BY a*X + b*Y DESC" without a full scan on any table, assuming a and b are arbitrary positive numbers provided by the user each time? You don't necessarily have to solve it with only SQL; you are allowed to use C#. The query above is only there to define the problem.

Justification: why do I think this can be a valid interview question? Two reasons. 1) Think how many times you have faced a client complaining about performance, versus how many times you were in a situation where knowing how to prevent an object from being finalized made the customer happy. 2) How easy is it to pick up the knowledge of algorithms for fast data processing on site, in contrast to some rare piece of technical terminology? Google "avoiding an object from being finalized" and you will have your answer in a few seconds, but I would be interested to know how you would use Google to solve this problem.