## An easy but effective Data Quality algorithm for abnormality detection in real numbers

Most commercial DQ profiling tools focus on cleansing “Dimension” data (as the term is used in Data Warehouse terminology). However, the quality of facts is almost as important as that of dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data which is infinitely better than nothing! (Anything is infinitely better than nothing, mathematically speaking.)

The problem is called abnormality detection; to be more specific, I am talking about outlier detection. What does that mean to the user?

Example: real-estate data. There have been reports of an American citizen receiving a \$200,000 tax bill for their 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you will have to trust me on this. If you don’t want to trust me (which is the right thing to do), imagine similar problems that I am sure you have encountered.

The tax man has clearly issued an outlier for that specific sub-class of houses, i.e. 3-4 bedroom houses in average suburbs. For such a house, the tax should be somewhere between \$700 and \$2000. A \$200K tax is obviously a significant anomaly, but a good application should point out the outlier even if the tax is only slightly out of range, e.g. \$2500 for a house that should sit near the lower end of that range.

Solution: write a little algorithm that learns the distribution of “fact” values conditioned on several other dimensions. Excuse me for using Data Warehouse terminology (fact and dimension) instead of the usual machine-learning terminology (e.g. features). I think the DW terminology makes more sense here, and I don’t want to justify it, so go ahead and replace the terms with your favorite ones.

Algorithm:

1. Discover facts and dimensions from the data types and their distributions; e.g. a column where count(distinct(fact)) / count(fact) is close to 1 is a fact candidate. Alternatively, you can ask the user to identify them.
2. Filter out times and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek or IsHoliday, can be very useful though.
3. For every fact:
   a. For every dimension:
      i. Measure the statistical distribution of the data filtered by that dimension. Specifically measure Count, Mean, Variance, Min, and Max.
      ii. Store the above statistics in a file along with the selected dimensions, IF Count > 30 and sqrt(variance) << max − min.
      iii. Recurse to (3.a) and include more dimensions in the condition.
4. Print out the rules discovered in 3.a.ii.

The above algorithm is not optimized for performance, but in the case of DQ, who cares about performance? Just run it on your Hadoop cluster.
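The per-fact loop (step 3) can be sketched in a few lines of Python. This is a minimal sketch, not the original implementation: the column names, the toy data, and the `spread_ratio` threshold standing in for “sqrt(variance) << max − min” are all illustrative assumptions.

```python
import itertools
import statistics

def mine_rules(rows, dims, fact, min_count=30, spread_ratio=0.25):
    rules = []
    for r in range(1, len(dims) + 1):
        # Step 3.a.iii: "recurse" by trying every combination of dimensions.
        for combo in itertools.combinations(dims, r):
            groups = {}
            for row in rows:
                key = tuple(row[d] for d in combo)
                groups.setdefault(key, []).append(row[fact])
            for key, values in groups.items():
                if len(values) <= min_count:  # step 3.a.ii: need Count > 30
                    continue
                std = statistics.pstdev(values)
                lo, hi = min(values), max(values)
                # Keep the rule only when sqrt(variance) << max - min,
                # i.e. the group is tight except for rare extremes.
                if std < spread_ratio * (hi - lo):
                    rules.append({"where": dict(zip(combo, key)),
                                  "count": len(values),
                                  "mean": statistics.mean(values),
                                  "std": std, "min": lo, "max": hi})
    return rules

# Toy data: 39 ordinary tax bills and one outlier in the same sub-class.
rows = [{"suburb": "average", "bedrooms": 3, "tax": 1000}] * 39
rows.append({"suburb": "average", "bedrooms": 3, "tax": 5000})
for rule in mine_rules(rows, ["suburb", "bedrooms"], "tax"):
    print(rule["where"], round(rule["mean"], 1), round(rule["std"], 1))
```

A real run would of course use a much larger `min_count` group and push the grouping down into the database or cluster rather than into Python dictionaries.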

The output of the above algorithm is a set of rules like: for tax, given 3-bedroom and house and average suburb, the variance is … and the mean is …, which means values are expected to fall between, say, 700 and 1500. Your application can read these rules and apply them to the data, or to the user interface, to help users fix/avoid outliers.
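Applying a stored rule is the easy part. Below is a sketch of one common way to do it, flagging a value that falls outside mean ± 3·std; the rule values, the record fields, and the 3-sigma cut-off are made-up examples, not from the original post.

```python
# A rule as the mining step might have stored it (values are illustrative).
rule = {"where": {"bedrooms": 3, "suburb": "average"},
        "mean": 1100.0, "std": 200.0}

def is_outlier(record, rule, k=3.0):
    # The rule only applies when all its dimension conditions match.
    if any(record.get(d) != v for d, v in rule["where"].items()):
        return False
    # Flag values far from the learned mean relative to the learned spread.
    return abs(record["tax"] - rule["mean"]) > k * rule["std"]

print(is_outlier({"bedrooms": 3, "suburb": "average", "tax": 200_000}, rule))
print(is_outlier({"bedrooms": 3, "suburb": "average", "tax": 1200}, rule))
```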

Evaluation: I can’t publish customers’ data, but I’ll run it on some public data later, if people ask.

## Coffee shop closing down

Based on a recent law, every coffee shop operating in Tehran has to install cameras monitored by the government. These emotional photos depict the last day at Cafe Prague, which refused to install cameras and had to close down.

## Quickly Count Number of Lines of Code in Your .Net Project

This New Year holiday, my wife was busy working and getting ready for her IELTS exam, so there was no vacation. I took the chance to work on a project I had been dreaming about all last year.

The holidays are over and I have to go back to my day job on Monday, so I decided to count the number of lines I’ve written this Christmas. This could give me an indication of how much fun I’ve had these days. So I opened the PowerShell console and ran the following commands. My project doesn’t have much generated code, so the result should be a rough estimate of reality.

```
cd \ProjectDirectory
$cs = ls -Recurse -Filter *.cs | gc | where { $_.Trim().Length -gt 5 }
$sql = ls -Recurse -Filter *.sql | gc | where { $_.Trim().Length -gt 5 }
$view = ls -Recurse -Filter *.cshtml | gc | where { $_.Trim().Length -gt 5 }
$cs.Count + $sql.Count + $view.Count

1234980
```

Of course the above number is just a joke. I couldn’t possibly write a million lines in a couple of weeks even if I’d had my keyboard connected to a treadmill!

Wish you all a successful and merry 2013.

## Easiest way to remember Covariance and Contravariance

I have seen people struggle to remember the definitions of covariance and contravariance. Below are a few lines of pseudo-code that help you keep them in mind. Just ignore the fancy names, but remember that in .NET we use <in T> and <out T> instead of the fancy names when declaring the kind of variance allowed in generics. That should be enough 100% of the time.

```
in: CollectionAdder<Cat> is CollectionAdder<Animal> // Can add cats through an adder of animals; every cat is an animal
in: CollectionAdder<Dog> is CollectionAdder<Animal>
BUT:
CollectionAdder<Animal> is NOT CollectionAdder<Cat> // Can not add animals to a list of cats (animals can be dogs)

out: CollectionGetter<Cat> is NOT CollectionGetter<Animal> // Can not read cats out of a list of animals (animals can be dogs)
BUT:
CollectionGetter<Animal> is CollectionGetter<Cat> // Can read a bunch of animals out of a list of cats; all cats are animals, by the way
CollectionGetter<Animal> is CollectionGetter<Dog>
```

## Do I have an evil eye? (A Bayesian proof that I don’t)

My mom says that I have an evil eye. Her claim is based on some evidence. One piece of evidence is as follows:

Once I was in a car on a highway with my family. A nice car overtook us, speeding. I said something about the car, and a few minutes later we saw that it had had an accident at the side of the highway.

So does that mean I have an evil eye? Isn’t the probability of someone talking about a car, and that car crashing a few minutes later, rare enough to be considered weird?

I would argue NOT! And prove it mathematically. Read on.

Having no additional knowledge, the probability of me talking about a car and that car crashing within a few minutes is close to zero. However, there are a few more points to consider, and the situation is not that simple. I would re-phrase the problem as:
“Given a speeding car that passed me and then had an accident, what is the probability that I talked about it?”

Let’s gather some information. Please note that we limit the world to the cars that have passed me on the highway during my life:

First, the car that had the accident was speeding. Let’s say the probability that a car that has an accident on the highway was also speeding is about 0.4. It is reasonable to assume that most cars that crash on highways were speeding. We call this P(S|X) = 0.4 (X for accident and S for speeding).

Second, I talked about a car that overtook us on the highway, speeding. Actually, I know this part! The probability of me saying something about a car that overtakes us on the highway, speeding, is about 0.1. Let’s call this P(T|S) = 0.1 (T for talk and S for speeding).

The last thing I need is the probability that a car was speeding, given that I talked about it on the highway: P(S|T), which from experience I can say is about 0.15.

So, the final question is formulated as: what is the probability that I spoke about a car, given that it was speeding and has had an accident?

Using Bayes’ rule, we come up with the following formula:

P(T|S,X) = [P(T|S) · P(S|X,T)] / [P(T|S) · P(S|X,T) + P(~T|S) · P(S|X,~T)]. Approximating P(S|X,T) ≈ P(S|X) · P(S|T) = 0.4 × 0.15, and P(S|X,~T) ≈ P(S|X) · (1 − P(S|T)) = 0.4 × 0.85, this gives:

P(T|S,X) = (0.1 × 0.4 × 0.15) / (0.1 × 0.4 × 0.15 + 0.9 × 0.4 × 0.85) ≈ 0.019
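Plugging the post’s three estimates into the formula above is a quick sanity check of the arithmetic (using the same factoring of the conditional probabilities):

```python
# Sanity check of the arithmetic, using the post's estimates.
p_t_given_s = 0.10  # P(T|S): I talk about a car, given it is speeding
p_s_given_x = 0.40  # P(S|X): a crashed car was speeding
p_s_given_t = 0.15  # P(S|T): a car I talked about was speeding

num = p_t_given_s * p_s_given_x * p_s_given_t
den = num + (1 - p_t_given_s) * p_s_given_x * (1 - p_s_given_t)
print(round(num / den, 3))  # 0.019
```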

So, the probability of the event my mom considered extremely rare is actually more than 1.9%! Not so rare all of a sudden!

## An IoC Container for Dart (with Autofac like API)

I enjoy throwing Darts, and have been using it quite a bit recently. The OO nature of Dart makes it suitable for most OO design patterns. I use the inversion of control pattern almost everywhere, but Dart did not have a good IoC container library that I knew of. Hence, in order to give something back to the community, I wrote a lightweight IoC container based on the API of my favorite one, “Autofac”. The main reason for this choice was that a C# programmer who has already worked with Autofac can start using “Autodart” in a few minutes!

Below is a sample of how Autodart can be used:

```
class Baz {
}

class Bar {
  Baz baz;
  Bar(this.baz);
}

class Foo {
  Bar bar;
  Foo(this.bar);
}

void main() {
  var b = new AutoBuilder();

  b.register((c, p) => new Baz())
   .As("Baz")
   .singleInstance();

  b.register((c, p) => new Bar(c.resolve("Baz")))
   .As("Bar")
   .singleInstance(); // Shared, so the asserts below hold

  b.register((c, p) => new Foo(c.resolve("Bar")))
   .As("Foo"); // Default is .instancePerDependency()

  var c = b.build();

  var foo = c.resolve("Foo") as Foo;
  var foo2 = c.resolve("Foo") as Foo;

  assert(foo != foo2);
  assert(identical(foo.bar, foo2.bar));
  assert(identical(foo.bar.baz, foo2.bar.baz));

  /*
  Other supported API includes:
  b.register(..).As(..).externallyOwned();
  c.resolveLazy("Foo") as Lazy<Foo>;
  c.dispose(); // Disposes all disposables in the scope
  */
}
```

Autodart supports the following scopes: SingleInstance (singleton), InstancePerLifetimeScope (nested scopes), and InstancePerDependency.

It also defines the Disposable concept, and manages the scopes for Disposable objects.

Shortcomings: there is no auto-wiring, as Dart does not yet support reflection. Also, there is still no “Type” type in Dart, so string type names are used as keys. I will update the library as soon as Dart adds support for reflection.

## Pro-active SQL injection defence

This link has a guide for hackers on performing SQL injection attacks. I briefly summarize the different types of attacks on SQL Server:

1. Commenting: -- /* */
Code: "SELECT * FROM Field WHERE Id=" + id + " AND TYPE=-1"
Attacked: SELECT * FROM Field WHERE Id=1; SHUTDOWN; -- AND TYPE=-1

2. Stacking: ...; ...
Code: "SELECT * FROM Field WHERE Id=" + id
Attacked: SELECT * FROM Field WHERE Id=1; SHUTDOWN;

3. Always-true condition: 1=1
Code: "SELECT * FROM SecretStuff WHERE user = '" + UserId + "'"
Attacked: SELECT * FROM SecretStuff WHERE user = 'baduser' OR 1=1 --'

4. Addition of a column: ... + SomeOtherColumn + ...
Code: "SELECT Name, BadPractice = '" + someParam + "' FROM Users"
Attacked: SELECT Name, BadPractice = ' ' + Password + ' ' FROM Users
Attacked: SELECT Name, BadPractice = ' ' + (SELECT Password FROM Users WHERE User = 'Admin') + ' ' FROM Users

5. Union: ... UNION ...
Code: "SELECT Name, Family FROM Customers WHERE CustomerId = " + id
Attacked: SELECT Name, Family FROM Customers WHERE CustomerId = 1 UNION SELECT UserName, Password FROM Users

Pro-active prevention:

Use parameterized queries instead of building queries directly as strings.
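For instance, here is what parameterization looks like with Python’s built-in sqlite3 module (the table, columns, and values are made up for illustration; the same idea applies to SqlCommand parameters in .NET):

```python
import sqlite3

# An in-memory database with a toy Users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (UserId TEXT, Secret TEXT)")
conn.execute("INSERT INTO Users VALUES (?, ?)", ("alice", "s3cret"))

# A classic always-true injection attempt (attack 3 above).
user_id = "baduser' OR 1=1 --"

# The placeholder binds the payload as a literal value, never as SQL.
rows = conn.execute(
    "SELECT Secret FROM Users WHERE UserId = ?", (user_id,)
).fetchall()
print(rows)  # [] -- no row matches the literal string, nothing leaks
```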

Re-active prevention, for when pro-active is not possible (e.g. lots of legacy code) or just as additional security. This check happens right before running the query:

a. First take out all the quoted literals and identifiers: '...' and [...]

b. Run the following regex against the remainder of the query: new Regex("--|\/\*|\*\/|\W+UNION\s+", RegexOptions.IgnoreCase). This checks for comments before running (catching 1, as well as the trailing comment in 3), and flags a UNION anywhere you don’t expect one (catching 5).

c. Count the semicolons and compare against the number of stacked queries you expect (catching 2).
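The a/b/c steps above can be sketched as follows. This is a rough illustration in Python rather than the original C# function, and the stripping regex is deliberately simplistic (it does not handle escaped quotes inside literals):

```python
import re

# a) Remove quoted literals '...' and bracketed identifiers [...].
STRIP = re.compile(r"'[^']*'|\[[^\]]*\]")
# b) Comment markers, or a UNION where none is expected.
SUSPICIOUS = re.compile(r"--|/\*|\*/|\bUNION\b", re.IGNORECASE)

def looks_injected(query, expected_statements=1):
    bare = STRIP.sub("", query)          # step a
    if SUSPICIOUS.search(bare):          # step b
        return True
    # c) More separators than expected means stacked queries.
    return bare.count(";") > expected_statements - 1
```

A query that legitimately contains a UNION or several statements would need its own expected pattern; this check is a coarse safety net, not a parser.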

In the swenging project, I have written a function that does the above tasks (as a sample of using swe), in the context of the swe state machine. It can be used as follows:

```
Assert.AreEqual(
    ValidationResult.Comment, // A comment injection was found
    ValidateQuery(@"SELECT STUFF from Where I shouldn't' /* some Comment */"));
```
