Oct 12 2009

MSFT/Danger Fiasco: Cloudy with a Chance of Negligence?

Published by David HM Spector at 10:44 PM under Microsoft, Technology, Web2.0

The fact that it seems most if not all Sidekick customers are in extreme danger of losing all their data is being cited by all sorts of technical pundits as an example of the “dangers of cloud computing.” Others are now warning that Microsoft’s black eye will taint the still nascent cloud computing business and scare the bejesus out of IT execs…

I would like to call BS on this and fast: Microsoft is not running a cloud computing operation – just because all Sidekick data was stored on servers at Microsoft doesn’t make the Danger service a “cloud” service that can be lumped in with the likes of Amazon Web Services (EC2/S3/EBS et al), SalesForce, or any of the dozens of other cloud computing and storage services.

Microsoft committed what appears to be one of the most egregious, sloppy, incompetent and probably criminally negligent acts in modern computing services history. I say “appears” because no one (probably not even Microsoft) knows the true timeline of what actually happened yet. It appears from what can be gathered from the leaks and other “insider” reports that both the Microsoft staff and their outsourced SAN vendor Hitachi failed to follow the most basic of data operations management practices: backing up critical systems.

Backups and disaster recovery are the mother’s milk of anyone working in data intensive, mission critical businesses. I worked on Wall St for the better part of 20 years and the most mission critical systems are actually NOT those running in production – the real “mission critical” systems are the backup mechanisms that allow you to fail over in case of a production failure. Anyone running a large enterprise knows that systems fail all the time. It’s your ability to recover from a failure, ANY failure, that proves how well you have written your “run book,” and how your teams respond to even the smallest outage shows how well engineered your environment actually is.

What makes Microsoft’s Sidekick/Danger environment different from “the cloud”? Well, first off, we have come to think of the cloud as a distributed set of systems spread out across the Internet. SalesForce.com (SFDC) and Amazon Web Services (AWS) are both actual cloud services. They exist in multiple redundant data centers: You can see their respective documentation of how (in general) these systems are constructed; they have what’s known in the biz as “geographic and route diversity” (read: their data centers are in different places and interconnected by several different ISPs). They have multiple, redundant backups of their (and your) data. And, most importantly for the purposes of this discussion, they allow you to back up all of your data to your own off-line storage if you wish (either down to your personal computer or to another online provider).

In contrast, according to the reports, the Microsoft/Danger service was hosted in a single Microsoft data center, on a single SAN (Storage Area Network) that, apparently, wasn’t ever backed up. (How do I know that, you might ask? Well, clearly if it was ever backed up they could at least restore users’ data back to some date in the past when the last backup was done – since they cannot, it was clearly NEVER backed up – Oh, and don’t buy the “rogue disgruntled employee who wiped the backup tapes” rumor – that’s just childish and poorly executed ass-covering – and if it were true Microsoft has an even larger and scarier issue than just this fiasco…think about it…).

What Microsoft failed to do was apply the basic data hygiene (read: common sense operational practices) that they recommend to every single one of their enterprise customers. They compounded their mistake by making their customers dependent upon them to the point where they could not even back up their own data on their own personal computers.
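
To make "basic data hygiene" concrete: a backup only counts if someone has actually restored from it and checked the result. Here is a minimal sketch in Python of that kind of check. It assumes backups are plain directory copies, and the function names (checksum, snapshot, backup_is_restorable) are made up for illustration; nothing here reflects how the Danger or Microsoft systems actually worked.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file under root (by relative path) to its checksum."""
    return {
        str(p.relative_to(root)): checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup_is_restorable(source: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare it to the source.

    Returns False if there is no backup at all, or if the restored copy
    differs from the live data in any file.
    """
    if not backup.is_dir():
        return False  # the "assumed we had backups" case
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        # Assumption: a directory copy stands in for a real restore procedure.
        shutil.copytree(backup, restored)
        return snapshot(restored) == snapshot(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "live"
        live.mkdir()
        (live / "contacts.txt").write_text("alice\nbob\n")
        backup = Path(tmp) / "backup"
        print("before backup:", backup_is_restorable(live, backup))
        shutil.copytree(live, backup)
        print("after backup: ", backup_is_restorable(live, backup))
```

A real run book would compare against a snapshot taken at backup time rather than against live data, since production keeps changing, but the principle is the same: test the restore, not just the backup job.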

So, this isn’t a story about the supposed dangers surrounding or failure of “the cloud,” but rather a story of arrogance, hubris, disregard for customers, and at the end of the day what amounts to a simple, and what will surely be for Microsoft an extremely costly, bit of professional negligence.

[Update: Newer reports are now adding an interesting twist to this story - the new bit is that Microsoft may have been trying to migrate the Sidekick service onto some flavor of Microsoft platform they'd developed to replace the software/services they got when they bought Danger (makers of the Sidekick). If this is true they committed even more cardinal computing sins: not only didn't they have backups, but 1) they didn't have a back-out strategy for this "upgrade," and 2) they upgraded (again, without backups) their production system rather than running a "new" (replacement/duplicate) system in parallel with their old one. I mean this is one of the richest tech companies in the world - they couldn't spend a few million to replicate the hardware to install their new software on..? Oh, Puh-leeeze...

Think about that for a second: If this new bit is true, they assumed they had backups (no one bothered to verify this) AND they were upgrading their LIVE system to a new collection of software that had never been run in production with NO WAY to go back if it didn't work!

Wow. I mean, WOW. In most of the business world they wouldn't find enough of the team that pulled such a disastrous stunt to identify even with DNA analysis.

If these new allegations are true there's gonna be a lot of blood and money on the floor -- and there will need to be a criminal investigation into all of Microsoft's operations: How much of Microsoft's operation runs in this way? A sad but true fact is that a LOT of the world's critical infrastructure runs on all the various flavors of Windows, MS SQL Server, and all of their other products. Could the ability of Microsoft to maintain and release new versions of Windows disappear in a botched SAN upgrade with no backups? Are we one poorly planned internal Microsoft upgrade away from oblivion? I'm not sure I want to know...]


One Response to “MSFT/Danger Fiasco: Cloudy with a Chance of Negligence?”

  1. pmshah says:

    I am really surprised at such criminal negligence. I as an engineer with zero background in IT did better than this in my working environment. I had my critical data – essentially accounting data – not backed up but replicated on 5 different computers within my network setup. I could run the package from any one of these PCs. I had 6 additional replications – one for each day of the week and then had copies of this data carried home every day of the week on 6 separate DVDs – one for each working day of the week.

    My daughter worked as a Unix administrator in their securities trading division for a multinational bank in New York. Their main job did not start until the trading hour was over and they got busy backing up huge amounts of data to multiple offshore locations.

    But then what do you expect from a company which has the PC market by the … you know what.
