Greetings and welcome to my 33rd interview with a thought leader in the “connected technology” space. This month, we’ve got the distinct pleasure of talking to Allan Mitchell. Allan is a SQL Server MVP, speaker and both joint owner and Integration Director of the new consulting shop, Copper Blue Consulting. Allan’s got excellent experience in the ETL space and has been an early adopter of and contributor to Microsoft StreamInsight.
On to the questions!
Q: Are the current data integration tools that you use adequate for scenarios involving “Big Data”? What do you do in scenarios when you have massive sets of structured or unstructured data that need to be moved and analyzed?
A: Big Data. My favorite definition of big data is:
“Data so large that you have to think about it. How will you move it, store it, analyze it or make it available to others.”
This does of course make it subjective to the person with the data. What is big for me is not always big for someone else. Objectively, however, according to a study by the University of Southern California, in 2000 digital media accounted for just 25% of all the information in the world. By 2007, it accounted for 94%. It is estimated that 4 exabytes (4 x 10^18 bytes) of unique information will be generated this year, more than in the previous 5,000 years. So Big Data should be firmly on the roadmap of any information strategy.
Back to the question: I do not always have the luxury of big bandwidth, so moving serious amounts of data over the network is prohibitive in terms of speed and resource utilization. If the data is that large, I am a fan of taking a backup and restoring it on another server, because this method tends to invite less trouble.
Werner Vogels, CTO of Amazon, says that DHL is still the preferred way for customers to move “huge” data from a data center and put it onto Amazon’s cloud offering. I think this shows we still have some way to go. Research is taking place, however, that will support the movement of Big Data. NTT Japan, for example, have tested a fiber optic cable that pushes 14 trillion bits per second down a single strand of fiber, the equivalent of 2,660 CDs per second. Although this is not readily available at the moment, the technology will be in place eventually.
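To put rough numbers on the bandwidth point, here is a minimal back-of-the-envelope sketch in Python. The 1 PB dataset, the 100 Mbit/s link and the 700 MB CD capacity are illustrative assumptions rather than figures from the interview, and the calculation ignores protocol overhead and contention, so real transfers would be slower.

```python
PETABYTE = 10**15          # decimal petabyte, in bytes (assumed dataset size)
CD_BYTES = 700 * 10**6     # assumed nominal CD capacity, in bytes
SECONDS_PER_DAY = 86_400


def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to push size_bytes down a link, with no overhead."""
    return size_bytes * 8 / link_bits_per_second


def cds_per_second(link_bits_per_second: float, cd_bytes: float = CD_BYTES) -> float:
    """How many CDs' worth of data the link carries each second."""
    return link_bits_per_second / 8 / cd_bytes


if __name__ == "__main__":
    office_link = 100e6    # assumed 100 Mbit/s connection
    ntt_fiber = 14e12      # the 14 trillion bits per second quoted above

    days = transfer_seconds(PETABYTE, office_link) / SECONDS_PER_DAY
    print(f"1 PB over 100 Mbit/s: about {days:.0f} days")
    print(f"1 PB over 14 Tbit/s: about {transfer_seconds(PETABYTE, ntt_fiber):.0f} seconds")
    print(f"14 Tbit/s is about {cds_per_second(ntt_fiber):.0f} CDs per second")
```

With these assumptions the office link needs years rather than days, which is why shipping disks by courier still wins. The CDs-per-second result lands in the same ballpark as the quoted figure; the exact number depends on the CD capacity assumed.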
Analysis of large datasets is interesting. As TS Eliot wrote in his poem, ‘Where is the knowledge we have lost in information?’ There seems little point in storing PBs of data if no one can use or analyze it. Storing for storing’s sake seems a little strange. Jim Gray talked about this in his book “The Fourth Paradigm”, a must-read for people interested in the data explosion. Visualizing data is one way of accessing the nuggets of knowledge in large datasets. For example, new demands to analyze social media data mean that visualizing Big Data is going to become more relevant; there is little point in storing lots of data if it cannot be used.
Q: As the Microsoft platform story continues to evolve, where do you see a Complex Event Processing engine sitting within an enterprise landscape? Is it part of the Business Intelligence stack because of its value in analytics, or is it closer to the middleware stack because of its event distribution capabilities?
A: That is a very good question and I think the answer is “it depends.”
Event distribution could lead us into one of your passions, BizTalk Server (BTS). BTS does a very good job of doing messaging around the business and has the ability to handle long running processes. StreamInsight, of course, is not really that type of application.
I personally see it as an “Intelligence” tool. StreamInsight has some very powerful features in its temporal algebra, and the ability to do “ETL” in close to real-time is a game changer. If you choose to load a traditional data warehouse (ODS, DW) with these events then that is fine, and lots of business benefit can be gained. A key use for me of such a technology is the ability to react to events in real-time. Being able to respond to something that is happening, when it is happening, is a key feature in my eyes. The response could be a piece of workflow, for example, or it could be a human interaction. Waiting for the overnight ETL load to tell you that your systems shut down yesterday because of overheating is not much help. What you really want is to be able to notice the rise in temperature over time as it is happening, and deal with it there and then.
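The overheating scenario can be sketched as a sliding time window over a stream of readings. What follows is plain Python, not the StreamInsight API; the function name, the 60-second window and the 5-degree threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, temperature in degrees C)


def rising_temperature_alerts(
    readings: Iterable[Reading],
    window_seconds: float = 60.0,  # assumed window length
    max_rise: float = 5.0,         # assumed alert threshold
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, rise) whenever the latest reading exceeds the
    coolest reading still inside the window by at least max_rise."""
    window: Deque[Reading] = deque()
    for timestamp, temperature in readings:
        window.append((timestamp, temperature))
        # Evict readings that are older than the window allows.
        while timestamp - window[0][0] > window_seconds:
            window.popleft()
        rise = temperature - min(temp for _, temp in window)
        if rise >= max_rise:
            yield timestamp, rise


if __name__ == "__main__":
    stream = [(0.0, 20.0), (20.0, 21.0), (40.0, 23.0), (60.0, 26.0), (80.0, 27.0)]
    for timestamp, rise in rising_temperature_alerts(stream):
        print(f"t={timestamp:.0f}s: temperature up {rise:.1f} degrees within the window")
```

Because the function is a generator, each alert is produced as soon as the offending reading arrives, rather than after the whole batch has been loaded, which is the contrast with the overnight ETL run.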
Q: With StreamInsight 1.2 out the door and StreamInsight Austin on the way, what are additional capabilities that you would like to see added to the platform?
A: I would love to see some abstraction away from the execution engine and the host. Let me explain.
Imagine a fabric. Imagine StreamInsight plugged into the fabric on one side and hardware plugged in on the other. The fabric would take the workload from StreamInsight and partition it across the hardware nodes plugged in to the fabric. Those hardware nodes could be a mix of hardware, from a big server to a netbook (think Teradata). StreamInsight would be unaware of what is going on and wouldn’t care even if it did know. You could then have scale-out of operators within a graph across hardware nodes, à la MapReduce. I think the scale-out story for StreamInsight needs strengthening and clarifying.
Q [stupid question]: When I got to work today, I realized that I barely remembered my driving experience. Ignoring the safety implications, sometimes we simply slip into auto-pilot when doing the same thing over and over. What is an example of something in your professional or personal life that you do without even thinking about it?
A: On a personal level I am a keen follower of rules around the English language. I find myself correcting people, senior people, in meetings. This leads to some interesting “moments”. The things I most often respond to are:
1. Splitting of infinitives
2. Ending sentences with prepositions
On a professional level I always follow the principle laid down in Occam’s razor (lex parsimoniae):
“Frustra fit per plura quod potest fieri per pauciora”
“When you have two competing theories that make exactly the same predictions, the simpler one is the better.”
There is of course a more recent version of Occam’s Razor: K.I.S.S. (keep it simple stupid)!
Thanks Allan for participating!