Time to take functional programming seriously.
Sweet spot for buying a used Honda Civic! Buy 2009, sell in 2 years! Would depreciate only $30/month.
In production systems, that is, organisations that produce things, such as mines, factories and refineries, there is a whole subsystem called inventory. The inventory is a graph of materials and objects moving through the system and changing form. For example, in a coal mine, dirt is dug out of pits in units called lots. A lot moves around the mine and gets merged, crushed, dried, and cleaned. It moves from the ground to trucks, process points, ports, railways, etc., where it most likely changes shape and characteristics. The same thing happens in manufacturing, chemicals, water procurement, and many other industries.
It is critical for a business to know its inventory at each stage. Businesses usually have manual or automatic processes to measure the characteristics of every part of the inventory. For example, weightometers fitted all over the system generate data, and staff may manually keep records of what happens on the ground. The data generated is almost always limited and of low quality, and it does not directly reflect reality. Computer science should come in here to fill the gaps, improve data quality, and estimate inventories much better than the most skilled human can by looking at the inventories and Excel sheets. In this post I want to reflect on how three techniques, two from AI and one from quantum computing, can provide a set of tools to do just that.
Hidden Markov Models (HMMs) are used to extract state machines from temporal data. For example, if we track Joe’s movements, the fact that Joe usually leaves the kitchen, goes to Mary’s desk for five minutes, and comes back to his own desk is a Markov chain of states hidden in the data. In inventories, HMMs can be used very effectively to extract state machines. I refer to this paper for a good tutorial on HMMs. In the inventory setting, HMMs can be used to extract the movements of materials and objects and the changes in their characteristics. For example, coal goes to process point A and becomes dry, the dry coal goes to the crusher and becomes grained, and then it goes to another process point and becomes pure by losing ash. Or, again, coal goes into a process point and ash comes out. Such Markov models can be extracted by looking at the data.
Bayesian statistics is simply the set of implications of Bayes’ theorem: P(A|B) = P(B|A)P(A)/P(B). In computer science, a mixture of Bayesian statistics and HMMs is used to improve data quality in time series (read a good paper here). It is mainly done by forming correlation tables. For example, Joe always goes from Mary’s office to his own, so the correlation of Joe’s location between Mary’s office and Joe’s office is high, but he never goes from Mary’s office to Jack’s office, so the correlation of Joe’s location between Mary’s and Jack’s offices is zero. Using this knowledge, if we are in doubt whether Joe is currently in his own office or in Jack’s office, we can check where he was before. If he was in Mary’s office beforehand, we can estimate that Joe is probably in his own office. In the literature this is called probabilistic easing. Probabilistic easing, however, does not produce an exact value for the location and characteristics of objects; instead it produces a set of probable locations with their probabilities. For example, in a coal mine, we may estimate that lot X is out of the crusher with 20% probability and that it has been merged with lot Y with 80% probability. This dramatically extends our knowledge of the inventory. It is probabilistic, which is OK, because life and reality are probabilistic.
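To make the Joe example concrete, here is a minimal sketch of one easing step in Python. It is only an illustration under stated assumptions: the room names, the transition numbers and the `ease` function are all invented for this example, and the "correlation table" is read as a table of transition probabilities.

```python
# Hypothetical rooms and numbers, chosen only to illustrate the idea.
ROOMS = ["mary", "joe", "jack"]

# TRANSITION[a][b]: assumed probability that Joe moves from room a to room b.
# Joe never goes from Mary's office to Jack's, so that entry is zero.
TRANSITION = {
    "mary": {"mary": 0.1, "joe": 0.9, "jack": 0.0},
    "joe": {"mary": 0.3, "joe": 0.6, "jack": 0.1},
    "jack": {"mary": 0.0, "joe": 0.5, "jack": 0.5},
}


def ease(prior, likelihood):
    """One step: predict with the transition table, then apply Bayes' theorem."""
    # Prediction: where could Joe be now, given where he was before?
    predicted = {
        b: sum(prior[a] * TRANSITION[a][b] for a in ROOMS) for b in ROOMS
    }
    # Bayes: weight each room by how well it explains the new observation.
    unnormalised = {b: predicted[b] * likelihood[b] for b in ROOMS}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation is inconsistent with every reachable room")
    return {b: unnormalised[b] / total for b in ROOMS}


# Joe was last seen in Mary's office.
prior = {"mary": 1.0, "joe": 0.0, "jack": 0.0}
# An ambiguous reading: the sensor cannot tell Joe's office from Jack's.
likelihood = {"mary": 0.0, "joe": 0.5, "jack": 0.5}

posterior = ease(prior, likelihood)
print(posterior)
```

With these made-up numbers the ambiguous reading is resolved entirely in favour of Joe's own office, because the transition from Mary's office to Jack's has probability zero.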
Quantum mechanics may seem irrelevant to a coal mine. However, quantum mechanics is extremely beautiful, and I love finding uses in unrelated fields for some of the formulas and methods coming out of quantum computing. There is one specific method in quantum computing that might be relevant to our setting. It can be very useful for improving the estimates in the eased data generated by the previous steps.
The problem with the eased data is, of course, that it is probabilistic (it’s both a curse and a blessing). In reality, inventories get measured occasionally, and these measurements carry valuable information that is ignored by the easing approaches. For example, the easing method reports that a specific inventory of coal is high in ash with 60% probability. The engineers on the ground sample some of the coal in this inventory and send it to the lab, and the result comes back that the coal has exactly 20% ash. The easing approach uses this fact merely to improve the easing from this point forward. However, quantum algorithms tell us that this fact contains much more information.
In quantum mechanics there is a phenomenon called entanglement. Entanglement is very unintuitive, but we know that it happens. It means that once two atoms are entangled, their states stay correlated, no matter how far apart those atoms are from each other. By the way, superposition is nothing new in this post. Remember we said that the easing method says that Joe is in Jack’s office with 10% probability and in his own office with 90% probability; this is the analogue of what quantum mechanics calls superposition. Just replace Joe with your favorite electron. Now that we have figured out the similarity between the eased state of an inventory and the quantum state of matter, we can use the work in quantum computing to improve our knowledge of the state of the inventory.
In quantum computing, once entangled atoms pass through quantum gates, their state changes, and we can compute their superpositions, which is something similar to Bayesian easing. However, in quantum computing, once we measure the state of one of the atoms, we break the entanglement, and the atom collapses to the state we measured. This is very similar to when we send the coal samples to the lab and learn the exact state of our coal. In effect, the eased probabilistic state collapses to the one value that has come out of the lab. In quantum computing, however, we use this data to calculate a new superposition for all the other atoms. For example, if the measurement of the output of the first quantum gate comes out as 001, we know that the other gates cannot be in a state which is inconsistent with the first gate being in 001. We can re-estimate the superposition of the other atoms with some linear algebra. We can do exactly the same in inventory. For example, we can apply this new knowledge to backtrack and re-estimate the eased state of all the other bits of coal in the system, but limit the easing to states that are consistent with the new knowledge from the lab.
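The "keep only the consistent states" step can be sketched in a few lines of Python. Note that this is plain classical conditioning used as an analogue of the measurement update, not a quantum computation. The two lots, their joint probabilities and the function names `collapse` and `marginal` are assumptions made up for the illustration.

```python
# Hypothetical joint belief over the ash level of two lots (X, Y) that were
# split from the same parent lot, so their states are correlated.
joint = {
    ("high", "high"): 0.5,
    ("high", "low"): 0.1,
    ("low", "high"): 0.1,
    ("low", "low"): 0.3,
}


def collapse(joint, index, measured):
    """Keep only joint states consistent with a measurement, then renormalise."""
    consistent = {s: p for s, p in joint.items() if s[index] == measured}
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("measurement contradicts every state in the belief")
    return {s: p / total for s, p in consistent.items()}


def marginal(joint, index):
    """Probability of each value of one lot, summed over the other lot."""
    out = {}
    for state, p in joint.items():
        out[state[index]] = out.get(state[index], 0.0) + p
    return out


# Belief about lot Y before any lab result.
before = marginal(joint, 1)
# The lab reports that lot X (index 0) is low in ash; re-estimate lot Y.
after = marginal(collapse(joint, 0, "low"), 1)

print("Y before:", before)
print("Y after: ", after)
```

In this toy belief, lot Y was never sampled, yet the lab result for lot X moves the estimate that Y is low in ash from 40% to 75%.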
This may sound a bit superficial and hard to implement, but remember that Bayesian easing based on HMMs is not my invention, and people have been using it successfully. Adding the quantum flavour is not hard either, because it has also been implemented and used by many others in the computer science field. I don’t see a feasibility problem here. There might be a marketing problem, but if you are interested in implementing the above and you are a research student, feel free to drop me an email.
Cost of the Iraq and Afghanistan wars over about a decade? $1.4E12
Cost of an average electric car today? $4E4
Assuming a 50% discount when ordered in tens of millions!
Cost of war ~ 70M electric cars. (Half of the U.S. fleet)
Note: 98% of oil is burned in vehicles.
If the money spent on war had been spent on electric cars, the U.S. would be independent of oil imports.
Electric cars would have been affordable already. Therefore, the rest of the world would have reduced its oil consumption.
Therefore, funding for terrorism would have dried up. Saudi Arabia, Iran, Qatar, etc. wouldn’t have enough surplus to fund terrorism.
And the world would have been a much nicer place to live.
Most commercial DQ (data quality) profiling tools focus on cleansing “Dimension” data (as it is called in Data Warehouse terminology). However, the quality of facts is almost as important as that of dimensions. In this article I want to suggest a heuristic for identifying outliers in fact data which is infinitely better than nothing! (Anything is infinitely better than nothing, mathematically speaking.)
The problem is called abnormality detection. To be more specific, I am talking about outlier detection. What does that mean to the user?
Example: Real-estate data. There have been reports of an American citizen receiving a $200,000 tax bill for their 3-4 bedroom house in an average suburb. I couldn’t find the original article, so you will have to trust me on this. If you don’t want to trust me (which is the right thing to do), imagine similar problems that I am sure you have encountered.
The tax man has clearly issued an outlier for this specific sub-class of houses, e.g. a 3-4 bedroom house in an average suburb. For such a house, the tax should be somewhere between $700 and $2000. A $200K tax is obviously a significant number, but a good application should point out the outlier even if the tax is only slightly out of order, e.g. $2500 for a house which should be taxed a bit less than that.
Solution: Write a little algorithm that learns the distribution of “fact” values conditioned on several other dimensions. Excuse me for using Data Warehouse terminology (Fact and Dimension) instead of the usual machine learning terminology (e.g. features). I think the DW terminology makes more sense here, and I don’t want to justify it, so go ahead and replace the terms with your favorite ones.
1 – Discover facts and dimensions from the data types and their distributions, e.g. for a fact column, count(distinct(fact)) / count(fact) is close to 1. Alternatively, you can ask the user to identify them.
2 – Filter out time and dates, as they don’t help us much in this setting. A recurring date dimension, like DayOfWeek, or IsHoliday, can be very useful though.
3 – For every Fact do:
3 – a ) For every Dimension do:
3 – a – i ) Measure the statistical distribution of the fact over the data filtered by the dimension. Specifically measure Count, Mean, Variance, Min, and Max.
3 – a – ii ) Store the above statistics in a file along with the selected dimensions, IF Count > 30 and the sqrt(variance) << max – min.
3 – a – iii ) Recurse to (3 – a) and include more dimensions in the condition.
4 – Print out the rules discovered in 3 – a – ii.
The above algorithm is not optimized for performance, but in the case of DQ, who cares about performance? Just run it on your Hadoop cluster :).
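For readers who prefer code, steps 3 and 4 could look roughly like the Python sketch below. It is one possible reading of the steps, not a reference implementation. The function name `discover_rules`, the parameters `spread_ratio` and `max_depth`, and the demo rows are all assumptions: "<<" is interpreted as "smaller than `spread_ratio` times the range", and the recursion in 3 – a – iii is replaced by enumerating combinations of dimensions.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, pvariance


def discover_rules(rows, fact, dimensions,
                   min_count=30, spread_ratio=0.2, max_depth=2):
    """Collect statistics of `fact` for each combination of dimension values."""
    rules = []
    # Step 3-a-iii: instead of recursing, enumerate 1, 2, ... dimensions.
    for depth in range(1, max_depth + 1):
        for dims in combinations(dimensions, depth):
            groups = defaultdict(list)
            for row in rows:
                groups[tuple(row[d] for d in dims)].append(row[fact])
            for key, values in groups.items():
                # Step 3-a-i: count, mean, variance, min and max.
                low, high = min(values), max(values)
                variance = pvariance(values)
                # Step 3-a-ii: Count > 30 and sqrt(variance) << max - min,
                # with "<<" read as "below spread_ratio times the range".
                tight = variance ** 0.5 < spread_ratio * (high - low)
                if len(values) > min_count and tight:
                    rules.append({
                        "fact": fact,
                        "condition": dict(zip(dims, key)),
                        "count": len(values),
                        "mean": mean(values),
                        "variance": variance,
                        "min": low,
                        "max": high,
                    })
    return rules


# Invented demo data: 40 ordinary tax bills, one absurd one, a few apartments.
rows = [{"type": "house", "suburb": "average", "tax": 1000} for _ in range(40)]
rows.append({"type": "house", "suburb": "average", "tax": 200000})
rows += [{"type": "apartment", "suburb": "average", "tax": 500} for _ in range(5)]

rules = discover_rules(rows, "tax", ["type", "suburb"])

# Step 4: print the discovered rules.
for rule in rules:
    print(rule["condition"], rule["count"], rule["min"], rule["max"])
```

The thresholds 30 and 0.2 are only defaults for the sketch; in practice they would need tuning per dataset, and the apartment group is skipped here simply because it has too few rows.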
The output of the above algorithm is a set of rules like: For tax: 3-bedroom, and house, and average suburb, variance is … and mean is …, which means the values are expected to be between 700 and 1500. Your application can read these rules and apply them to the data, or to the user interface, to help users fix or avoid their outliers.
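Applying such a rule could be as simple as the sketch below. The rule values are invented, and turning mean and variance into an expected range as "mean plus or minus k standard deviations" is my assumption for the example (with k = 2 the hypothetical numbers give the 700 to 1500 range mentioned above). The names `expected_range` and `is_outlier` are hypothetical.

```python
from math import sqrt

# A hypothetical rule in the shape described above; the numbers are invented.
rule = {
    "fact": "tax",
    "condition": {"bedrooms": 3, "type": "house", "suburb": "average"},
    "mean": 1100.0,
    "variance": 40000.0,
}


def expected_range(rule, k=2.0):
    """Bounds as mean +/- k standard deviations; k is an assumption."""
    std = sqrt(rule["variance"])
    return rule["mean"] - k * std, rule["mean"] + k * std


def is_outlier(row, rule, k=2.0):
    """True when the row matches the rule's condition but its fact is out of range."""
    if any(row.get(d) != v for d, v in rule["condition"].items()):
        return False  # the rule says nothing about this row
    low, high = expected_range(rule, k)
    return not (low <= row[rule["fact"]] <= high)


bill = {"bedrooms": 3, "type": "house", "suburb": "average", "tax": 2500}
print(expected_range(rule))
print(is_outlier(bill, rule))
```

A user interface could run the same check at data-entry time and ask for confirmation before accepting a flagged value.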
Evaluation: I can’t publish customers’ data, but I’ll run it on some public data later, if people ask.
Based on a recent law, every coffee shop operating in Tehran has to install cameras monitored by the government. These emotional photos depict the last day at Cafe Prague, which refused to install cameras and had to close down.