Hidden Markov Models, Bayesian Statistics, and Quantum Computing together can estimate inventories

In production systems, organisations that produce things, like mines, factories, refineries, etc. a whole system exists dubbed inventory. The inventory is a graph of materials and objects moving through the systems and changing face. For example, in a coal mine, dirt is digged out of pits in units called lots. The lot moves around the mine, gets merged, crushed, dried, and cleaned. The lot moves from the ground to trucks and process points, ports, railways, etc. where most likely it changes shape and characteristics. Same thing happens in manufacturing, chemicals, water procurements, and many other industries.

It is crucial and critical for the business to know their inventory at each stage. They usually have manual or automatic processes to measure characteristics of every part of the inventory. For example weightometers are fitted all over the system which generate data, and they may manually keep record of what happens on the ground. The data generated is almost always limited, of low quality and is not directly a reflection of the reality. In fact, computer science should come in and start filling the gaps, improve data quality and estimate inventories much better than the most skilled human can do by looking at the inventories and excel sheets. In this post I want to reflect on how three techniques, two from AI, and one from Quantum Computing can provide a set of tools to just do that.

Hidden Markov Models (HMM) are used to extract state machine in temporal data. For example, if we track Joe movements, the fact that Joe  usually gets out of the kitchen, go to the Mary’s desk for five minutes and come back to his own desk, is a Markov chain of states hidden in the data. In inventories HMM can be very well used to extract sate machines. I refer to this paper for a good tutorial on HMMs. In the inventory settings, HMMs can be used to extract movements of materials and objects and the change in their characteristics. For example coal goes to the process point A and becomes dry and the dry coal goes to the crusher and becomes grained, and then goes to another process points and becomes pure by loosing ash. Or again coal goes into a process point and ash comes out. Such MMs can be extracted by looking at the data.

Bayesian statistics is simply the implications of the Bayes theorem: P(A|B) = P(B|A)P(A)/P(B). In computer science a mixture of Bayesian statistics and HMM is used for improving data quality in time series (read a good paper here). It is mainly done by forming correlation tables. For example, Joe always goes from Mary’s office to his own, so the correlation of Joe location between Mary’s office and Joe’s office is high, but he never goes from Mary’s office to the Jack’s office, so the correlation of Joe’s location between Mary’s and Jack’s office is zero. Using this knowledge, if we are in doubt if Joe is currently in his own office or in Jack’s office, we can check to see where has he been before. If he has been in mary’s office beforehand, we can estimate that Joe is probably in his own office. In literature this is called probabilistic easing. The probabilistic easing, however, does not produce an exact value of the location and characteristics of objects, instead it creates a set of probable location with their probability. For example in a coal mine, we may estimate that the lot X is out of the crusher with 20% probability and it is merged with lot Y with 80% probability. This dramatically extends our knowledge of the inventory, however, it is probabilistic, which is ok, because life and reality are probabilistic.

Quantum mechanics may seem irrelevant to a coal mine. However, Quantum mechanics is extremely beautiful, and I love to find a use for some of the formulas and methods coming out of Quantum computing in unrelated fields. There is one specific method in Quantum computing that might be relevant to our settings. It can become very useful to improve our estimations for eased data generated from the previous steps.

The problem of the eased data is of course that it is probabilistic (its both curse and blessing). In reality, inventories get measured occasionally and this measurements infer valuable information that are ignored in the easing approaches. For example, the easing method reports a specific inventory of coal to be with 60% probability high in Ash. The engineers on the ground sample some of the coal in this inventory and send it to lab, and the result comes that the coal has exactly 20% ash. The easing approach uses this fact merely to improve the easing from this point forward. However, quantum algorithms tell us that this fact contains much more information.

In quantum mechanics there is a state which is called entanglement. Entanglement is a very unintuitive phenomenon but we know that it happens. It means once two atoms are entangled, their electron are in the same superpositions, no matter how far apart those atoms are from each other. By the way, superposition is nothing new in this post. Remember we said that easing method says that Joe is with 10% probability in Jack’s office and 90% probability in his own office, this is called super position in quantum mechanics. Just replace Joe with your favorite electron. Now that we figure out the similarity with eased state of inventory and quantum state of matter, we can utilize the work in quantum computing to improve our knowledge of the state of inventory.

In quantum computing, once entangled atoms pass through quantum gates, their state changes, and we can compute their superpositions, which is something similar to Bayesian easing. However, in quantum computing, once we measure the sate of one of the atoms, we break the engagement, and the atom collapses to the new state which we measured. This is very similar to when we send the coal samples to lab and realize the exact state of our coal. In fact, the eased probabilistic state collapses to one value that has come out of lab. In quantum computing however, we use this data to calculate new superposition for all other atoms. For example, if the measurement of the output of the first quantum gate comes out as 001, we know that other gates can not be in an state which is inconsistent with the first gate being in 001. We can re-estimate the superposition of other atoms with some linear algebra. We can do exactly the same in inventory. For example, we can apply this new knowledge to back track and re-estimate the eased sate of all other bits of coal in the system but limit easing only to states that are consistent with the new knowledge out of lab.

This may sound a bit superficial and hard to implement, but remember that the Bayesian easing based on HMM is not my invention and people have been using it successfully. Adding the quantum flavour is not hard even because it is also implemented and used by many others in the computer science field. I don’t see a feasibility problem here. There might be a marketing problem but if you are interesting in implementing above, and you are research student, feel free to drop me an email.

Cost of war in Electric Cars unit

Cost of Iraq and Afghanistan war in about a decade? $1.4E12

Cost of an average electric car today? $4E4

Considering 50% discount reduction when ordered in tens of millions!

Cost of war ~ 70M electric cars. (Half of the U.S. fleet)

Note: 98% of oil is burned in vehicles.

 

Conclusion:

If the money spent on war was spent on electric cars, U.S. would be independent from oil import.

Electric cars would have been affordable already. Therefore, rest of the world would reduce oil consumption.

Therefore, funding for terrorist would have gone. Saudi arabia, Iran, Qatar, etc. wouldn’t have enough surplus to fund terrorism.

And world would have been a much nicer place to live.

An easy but effective Data Quality algorithm for abnormality detection in real numbers

Most of the focus of the DQ profiling commercial tools is around cleansing “Dimension” data (as referred in Data Warehouse terminology). However, the quality of facts is almost as important as dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data which is infinitely better than nothing! – Anything is infinitely better than nothing, mathematically speaking.

The problem is called, abnormality detection. To be more specific, I am talking about outlier detection. What does that mean to the user?

Example: Real-estate data. There have been reports of an American citizen receiving a $200,000 tax bill for its 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you should trust me on this. If you don’t want to trust me (which is the right thing to do) imagine similar problems that I am sure you have encountered.

The tax man has clearly issued an outlier for the specific sub-class of houses, e.g. a 3-4 bedroom in an average suburb. For such house, the tax should be something between $700-$2000. A $200K tax is a significant number obviously, but a good application should point out the outlier even if the tax is slightly out of order, e.g. $2500 for a house which should be taxed a bit less in that range.

Solution: Write a little algorithm, that learns the distribution of “fact” values in regards to the condition over several other dimensions. Excuse me for using the Data Warehouse methodology (Fact, and Dimension) instead of the usual machine learning methodology (e.g. features). I think the DW methodology makes more sense here, and I don’t want to justify it, so go ahead and replace the terms with your favorite ones.

Algorithm:

1 – Discover facts, and dimensions. From data-type and their distribution, i.e. count(distinct(fact)) / count(fact) is close to 1. Alternatively, you can ask user to identify this.
2 – Filter out time and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek, or IsHoliday, can be very useful though.
3 – For every Fact do:
3 – a ) For every Dimension do:
3 – a – i ) Measure the statistical distribution of filtered data limited with the dimension. Specifically measure Count, Mean, Variance, Min, and Max.
3 – a – ii ) Store the above statistics in a file along with the selected dimensions, IF Count > 30 and the sqrt(variance) << max – min.
3 – a – iii ) Recurse to (3 – a) and include more dimensions in the condition.
4 – Print out the rules discovered in 3 – a – ii.

Above algorithms is not optimized for performance, but in the case of DQ, who cares about performance? Just run in your Hadoop cluster :).

Output of the above algorithm is a set of rules like: For tax: 3-bedroom, and house, and average suburb, variance is … and mean is …, which means the values are expected to be between 700, and 1500. Your application can read these rules and apply it the data, or user interface to help users fix/avoid their outliers.

Evaluation: Can’t publish customer’s data, but I’ll do it on some public data, later. Only if people ask.

 

Cofee shop closing down.

Based on a recent law, every cofee shop operating in Tehran has to install cameras monitored by government. These emotional photos depict the last day at Cafe Perague, which did not accept to install cameras and had to close down.

Quickly Count Number of Lines of Code in Your .Net Project

This new year holidays, my wife was busy working and getting ready for her IELTS exam, so there was no vacation. I took this chance to work on a project I have been dreaming about last year long.

The holidays are over and I have to go back to my daily job from monday, so I decided to count the number of lines I’ve been writing this christmas. This could give me in indication of how much fun I have these days. So I opened the powershell console and ran the following commands. My project doesn’t have much generated code, so I can say the results seem to be a rough estimate of reality.


cd \ProjectDirectory
$cs = ls -recurse -filter *.cs | gc | where {$_.Trim().Length -gt 5}
$sql = ls -recurse -filter *.sql | gc | where {$_.Trim().Length -gt 5}
$view = ls -recurse -filter *.cshtml | gc | where {$_.Trim().Length -gt 5}
$cs.Count + $sql.Count + $view.Count

1,234,980

Of course the above number is just a joke. I couldn’t possibly write a million lines in a couple of weeks even if I’ve had my keyboard connected to the treadmill!

Wish you all a successful and merry 2013.

Easiest way to remember Covariance and Contravariance

I have seen people struggling to remember the definition of co and contra variance. Below are a few lines of code that help you keep it in mind. Just ignore the fancy names but remember that in .Net we use <in T> and <out T> instead of the fancy names when defining the type of variance allowed in generics. That should be enough for 100% of times.


in: CollectionAdder<Cat> is CollectionAdder<Animal> // Can add cats to a list of animals
 CollectionAdder<Dog> is CollectionAdder<Animal>
BUT:
 CollectionAdded<Animal> is NOT CollectionAdder<Cat> // Can not add animals to a list of cats (animals can be dogs)

out: CollectionGetter<Cat> is NOT CollectionGetter<Animal> // Can not read cats out of a list of animals (animals can be dogs)
BUT:
 CollectionGetter<Animal> is CollectionGetter<Cat> // Can read a bunch of animals out of a list of cats, all cats are animals by the way
 CollectionGetter<Animal> is CollectionGetter<Dog>

Do I have an evil eye? (A bayesian proof that I don’t)

My mom says that I have an evil eye. Her claim is based on some evidences. One of the evidences are as follows:

Once I was in car, in a highway with my family. A nice car took us over speeding. I said something about the car, and a few minutes later we saw the car had an accident at the side of the highway.

So does that mean I have an evil eye? Isn’t the probability of someone talking about a car and the car crash a few minute later rare enough to be considered weird?

I would argue NOT! And prove it mathematically. Read on.

Having no additional knowledge, the P of me talking about a car and the car crashes in a few minutes is close to zero. However, there are a few more points to consider and the situation is not that simple. I would re-phrase the problem as:
“Having a speeding car that passed me and has had an accident, what is the probability that I have talked about it?”

Let’s gather some information. Please note that we limit the world to the cars that have passed me in the highway during my life:

First, the car that has had an accident has been speeding. Let’s say that P of a car that has accident in highway, have also been speeding is about 0.4. It is reasonable to assume most of the cars that crash in highways have been speeding. We call this P(S|X) = 0.4 (X for accident and S for speeding).

Second, I have talked about a car that took us over in highway, speeding. Actually I know this part! The P of me saying something about a car that takes us in a highway speeding is about 0.1. Let’s call this P(T|S) = 0.1. (T for talk and S for speeding).

Last thing I need is the P of a car being speeding having I talked about it in highway. P(S|T) which by experience I can say it is about 0.15.

So, the final question is formulated as what is the probability of me, speaking about a car having it has been speeding, and have had an accident?

Using the Bayes rule, we come up with the following formula:

P(T|S,X) = [P(T|S).P(S|X,T)] / [P(T|S).P(S|X,T) + P(~T|S).P(S|~T,X)] = [0.1*0.4*0.15]/[0.1*0.4*0.15+0.9*0.4*0.85] = 0.019

So, the probability of the accident which my mom considered extremely rare is actually more than 1.9%! Not so rare all of a sudden!

An IoC Container for Dart (with Autofac like API)

I enjoy throwing Darts, and have been using it quiet a bit recently. The OO nature of Dart makes it suitable for most OO design patterns. I use the inversion of control patter almost everywhere, but Dart did not have a good IoC container library that I knew of. Hence, in order to give something back to the community I wrote a lightweight IoC container based on the API of my favorite “Autofac”. The main reason for this choice was that a C# programmer who has already worked with Autofac can start using “Autodart” in a few minutes!

Below is a sample of how Autodart can be used:

class Baz{
}

class Bar{
  Baz baz;
  Bar(this.baz);
}

class Foo{
  Bar bar;
  Foo(this.bar);
}
void main(){
  var b = new AutoBuilder();
  b.register((c, p) => new Baz())
    .As("Baz")
    .singleInstance();

  b.register((c, p) => new Bar(c.resolve("Baz")))
    .As("Bar")
    .instancePerLifetimeScope();

  b.register((c, p) => new Foo(c.resolve("Bar")))
    .As("Foo"); //Default is .instancePerDependency()

  var c = b.build();

  var foo = c.resolve("Foo") as Foo;
  var foo2 = c.resolve("Foo") as Foo;

  assert(foo != foo2);
  assert(foo.bar === foo2.bar);
  assert(foo.bar.baz === foo2.bar.baz);

  /*
    Other supported API include:
    b.register(..).As(..).externallyOwned();
    c.resolveLazy("Foo") as Lazy<Foo>;
    var childScope = c.createLifetimeScope();
    c.dispose(); //Disposes all disposables in the scope
  */
}

Autodart supports the following scopes: SingeInstance (Singleton), InstancePerLifetimeScope (Nester Scopes), and InstancePerDependency.

It also defines the Disposable concept, and manages the scopes for Disposable objects.

Shortcomings: There is no auto wiring, as Dart does not yet support reflection. Also there is still no “Type” type in Dart, so string type names are used as keys. I would update the library as soon as dart adds support for reflection.

Pro-active Sql Injection defence

This link has a guide for hackers to do SQL Injection attacks. I summarize different type of attacks to SQL Server briefly:

1.Commenting : — /* */
Code: “SELECT * FROM Field WHERE Id=” + id + ” AND TYPE=-1″
Attacked: SELECT * FROM Field WHERE Id=1; SUTDOWN; — AND TYPE=-1

2. Stacking : …; …
Code: “SELECT * FROM Field WHERE Id=” + id
Attacked: SELECT * FROM Field WHERE Id=1; SHUTDOWN;

3. Always True Condition : 1=1
Code: “SELECT * FROM SecretStuff WHERE user = ‘” + UserId + “‘”
Attacked: SELECT * FROM SecretStuff WHERE user = ‘baduser’ OR 1=1 –‘

4. Addition of Column: … + SomeOtherColumn + …
Code: “SELECT Name, BadPractice = ‘” + someParam + “‘ FROM Users ”
Attacked: “SELECT Name, BadPractice = ‘ ‘ + Passwod +’ ‘ FROM Users ”
Attacked: “SELECT Name, BadPractice = ‘ ‘ + (SELECT Password From Users WHER User = ‘Admin’) +’ ‘ FROM Users ”

5. Union : … UNION …
Code: “SELECT Name, Family FROM Customers WHERE CustomerId = ” + id
Attacked: “SELECT Name, Family FROM Customers WHERE CustomerId = 1 UNION Select UserName, Password FROM Users”

Pro-active prevention:

User parametered queries instead of directly building queries as strings.

Re-active prevention when pro-active is not possible (e.g. lots of legacy code) or just for an additional security. This check happens right before running the query:

a. First take out all the quotes and identifiers: ‘…’ and […]

b. Run the following regex against the remaining of the query: new Regex(“–|\/\*|\*\/|\W+UNION\s+”, RegexOptions.IgnoreCase) which does:

Check queries for comments before running, catches 1

Expect union only when you know you want union. Catches 3

c. Count number of semicolons which does

Check the number of stacked queries and confirm with expected. Catches 2

In the swenging project, I have written a function that does above tasks (as a samples for using swe), in the context of swe state machine. It can be used as follows:

Assert.AreEqual(
ValidationResult.Comment, //A commont injection found
ValidateQuery(@"SELECT STUFF from Where I shouldn't' /* some Comment *."));

etc..

Dart, an effort from google to liberate the Web

I enjoy writing javascript code, but to be honest, the fact that javascript has no types, and anything imaginable is a valid code in javascript – therefore not much productivity tooling is possible – makes me not a fan of it for serious development.

I may dislike JS for serious development sometimes, but there are times that I truly hate JS!

When I say I sometimes hate JS, I am not joking, and it is not related to the features and looks of the language either. The sole reason for my feeling is that JS is the “Dictator” of the web. It imposes itself on you! It doesn’t care if you believe it is the right tool for your job or not, you “have to” love it! It kicked flash out, killed other competitors and now, it is the sole ruler of the web. The great dictator of the web if you like. And I hate dictators, even if they are not evil.

As is the case in any other dictatorships, there are liberating movements here and there. Google have been a pioneer of such movement from the GWT times. Microsoft tried to challenge JS power with Silverlight, a move that was repressed quickly. Google acted smarter, it did not try to challenge the power of JS, it worked with it and tried to make JS to bend to democratic “reforms”. By chromium’s V8 and other efforts to improve performance of JS, it became possible to compile to JS.

I don’t think compiling JS would completely solve the problem (I would much rather we had some VM specifications for browser instead of a language), but it is the best hope for now. The proof is that in 2012, close to two decades from the first version of JS, we still don’t have a complex JS based product that works fine. Even the most expensive and sought after products fail supporting different browsers gracefully. Yahoo mail has a big team of best JS devs in the world working on it, and yet after years, it still sucks in Chrome.

Dart however, is a lovely language. It stands somewhere between strongly typed languages like Java and C#, and java script. It takes best of the two worlds and has a modern, but familiar syntax, which is quick to learn and coding in Dart needs close to zero boilerplating.

  main() {
    print("Hello world! From Dart!");
  }

What makes Dart different, is that it compiles to JS. It is designed to compile to JS from scratch. Hence, you could imagine it would do the job better than others.

Other things that make Dart different are that first, it is supported by google, and second it has a couple of nice IDEs from start. Google support makes me believe Dart has a future (a bright one maybe) and good tooling makes it useful from the earliest version.

In the next posts, I will talk about some libraries I have developed for Dart. And in the future I would use Dart more often.