An Engineer

An Instance of Perspective

Posts Tagged ‘Backup’

Danger data loss gives hosted services a bad name

Last week we learned that Danger, a subsidiary of Microsoft, has lost huge amounts of customer data. Danger makes the Sidekick smartphone and offers a service that synchronizes the phone (contacts, photos, etc.) to hosted servers, a.k.a. the cloud. The critics wanted to know, “Why was there no backup?” And of course there was the inevitable refrain that if you want to keep your data, you should back it up yourself and not rely on cloud services.

I think this is entirely wrong. Using a cloud service should free the consumer from having to do backups. Most times when you use a cloud service, backup is not even possible. How do you back up your Gmail account? How about your Facebook account?

The whole reason to use a hosted service is to free you of having to deal with the muck of running your own servers and doing backup. It makes sense precisely because building a reliable service is so difficult.

I am not on the inside at Danger, but I think I know how this happened. While people think that Danger lost a lot of data, the truth is that they lost very little. The type of data they hosted (contacts, text emails) is small compared to photo and video data, of which they had relatively little. My guess is that they lost under 10 terabytes of data. It might have been under a terabyte. And when you don’t host huge amounts of data, you might be tempted to just put it on RAID’ed servers and try to do nightly backups. It turns out that it’s very hard not to lose all your data when you use RAID.

RAID pretty much requires you to run nightly backups or rely on a proprietary replication scheme. RAID is sold as being completely reliable, but anyone who has used RAID knows this is far from true. Double disk failures are more common than expected, especially when the drives come from the same lot. Sometimes you lose the whole RAID chain. Replacing disks is a manual process, and sometimes people replace the wrong disk. Corruption of a RAID volume is not unheard of.

Then there is the backup window, which becomes longer each day until you finally start spending more hours backing up the data than there are hours in the day (been there, done that). And while backups are running, the performance of the RAID is significantly degraded. In sum, RAID does not scale. Any service that gets large enough eventually abandons RAID for some distributed solution that scales better and, coincidentally, is a lot more resistant to losing all the data at once.
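To put rough numbers on the backup window, here is a minimal sketch. The starting size, daily growth, and throughput are made-up illustrative figures, not measurements from Danger or Phanfare, and the function names are hypothetical.

```python
def backup_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy data_tb terabytes at a sustained throughput."""
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / throughput_mb_per_s / 3600


def days_until_window_exceeds_day(
    start_tb: float, growth_tb_per_day: float, throughput_mb_per_s: float
) -> int:
    """Count the days until a full backup takes longer than 24 hours."""
    if growth_tb_per_day <= 0:
        raise ValueError("growth must be positive, or the window never grows")
    days = 0
    while backup_hours(start_tb + growth_tb_per_day * days, throughput_mb_per_s) <= 24:
        days += 1
    return days


# Assumed figures: 1 TB today, 0.1 TB of new data per day, 100 MB/s sustained.
print(round(backup_hours(1, 100), 2))              # hours for tonight's full backup
print(days_until_window_exceeds_day(1, 0.1, 100))  # days until it no longer fits in a day
```

Incremental backups push the date out, but a full restore still has to move all of the data, so the same arithmetic eventually applies.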

Phanfare uses Amazon S3 for storage of photos and videos. Amazon is fairly vague on how it works, and we are under NDA, but the basic story is that it works much like other modern distributed file systems. It keeps multiple copies on multiple servers, geographically distributed, and has a scheme for replicating data when it programmatically detects that a copy of an object has been lost.
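The general idea of detecting lost copies and re-replicating can be sketched in a few lines. This is emphatically not how S3 is implemented, since Amazon does not publish that; it is a toy model, and the replication factor of 3, the server names, and the `repair` function are all assumptions made for illustration.

```python
TARGET_COPIES = 3  # assumed replication factor, not a documented S3 value


def repair(
    placement: dict[str, set[str]],
    healthy: set[str],
    target: int = TARGET_COPIES,
) -> dict[str, set[str]]:
    """Return a new placement where each recoverable object has `target` live copies.

    placement maps an object key to the servers believed to hold a copy.
    healthy is the set of servers that are currently reachable.
    """
    repaired = {}
    for key, servers in placement.items():
        live = servers & healthy  # copies that still exist
        if not live:
            repaired[key] = set()  # every copy is gone; nothing left to copy from
            continue
        candidates = sorted(healthy - live)  # healthy servers without a copy yet
        missing = max(0, target - len(live))
        repaired[key] = live | set(candidates[:missing])
    return repaired


placement = {"a.jpg": {"s1", "s2", "s3"}, "b.jpg": {"s1"}}
healthy = {"s2", "s3", "s4", "s5"}  # s1 has failed
print(repair(placement, healthy))
```

In this toy run, a.jpg regains a third copy on another server, while b.jpg, whose only copy lived on the failed server, is lost. That is the case multiple geographically distributed copies are meant to make vanishingly rare.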

As such, there are no backups of Phanfare. Yup, that’s right. We don’t back up the image and video data. It’s on Amazon S3, and that system uses an approach to persistence that is fundamentally different from the approach that bit Danger in the you-know-what.

Truth is, backups serve two purposes in a modern system. They help ensure that you don’t lose data to a system problem. And they serve as a checkpoint against the human error of deliberately deleting data.

The problem with S3 is that when you give it the command to delete a file, it gets deleted, reliably. There is no going back to last night’s checkpoint. To combat this issue, we don’t actually run deletes at the moment end users delete their images. We wait a while. And we have a trash can system to make absolutely sure you want to delete data. Waiting on the deletes is really to protect against a systemic failure on our part (rogue code that deletes files).
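A trash can with deferred deletes might look something like the sketch below. It is not Phanfare’s actual code: the `TrashCan` class, its method names, and the 30-day grace period are assumptions for illustration (the post only says “a while”), and `delete_fn` stands in for whatever call really removes the object from storage.

```python
from datetime import datetime, timedelta


class TrashCan:
    """Holds delete requests for a grace period before running them for real."""

    def __init__(self, delete_fn, grace=timedelta(days=30)):
        self._delete_fn = delete_fn  # hypothetical hook that performs the real delete
        self._grace = grace          # assumed grace period
        self._trashed = {}           # object key -> time it was moved to the trash

    def trash(self, key, now):
        """Record a user delete without touching the stored object."""
        self._trashed[key] = now

    def restore(self, key):
        """Take an object back out of the trash; True if it was there."""
        return self._trashed.pop(key, None) is not None

    def purge(self, now):
        """Really delete everything whose grace period has passed."""
        expired = [k for k, t in self._trashed.items() if now - t >= self._grace]
        for key in expired:
            self._delete_fn(key)
            del self._trashed[key]
        return expired


deleted = []
can = TrashCan(deleted.append)
start = datetime(2009, 10, 1)
can.trash("a.jpg", start)
can.trash("b.jpg", start)
can.restore("b.jpg")                        # the user changed their mind
print(can.purge(start + timedelta(days=29)))  # still inside the grace period
print(can.purge(start + timedelta(days=30)))  # grace period over
```

Because the irreversible call only happens in `purge`, a bug in the code that handles user deletes can fill the trash but cannot destroy data until the grace period runs out.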

We still use some RAID storage at Phanfare for the relational database systems holding metadata. The web service caches this data using memcache. (Another rule of large-scale systems is that relational databases don’t scale either.) At some point, we will scale past being able to use RAID and caching for that. Until that point, we do have to perform old-school backups of the relational database to a secondary data center. And I worry a lot more about those than I do about the image and video data at Amazon S3.
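For readers unfamiliar with the pattern, caching database reads usually follows the “cache-aside” shape sketched below. This is a generic illustration, not Phanfare’s code: a plain dict stands in for memcache, and the `MetadataStore` class and the `load_from_db` and `write_to_db` callables are hypothetical.

```python
class MetadataStore:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical database read function
        self._cache = {}                   # a dict standing in for memcache
        self.db_reads = 0                  # counts how often the database is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]        # cache hit: no database work
        value = self._load_from_db(key)    # cache miss: go to the database
        self.db_reads += 1
        self._cache[key] = value
        return value

    def update(self, key, value, write_to_db):
        write_to_db(key, value)            # the database stays the source of truth
        self._cache.pop(key, None)         # drop the stale cached copy


db = {"album:1": "Vacation"}
store = MetadataStore(db.__getitem__)
store.get("album:1")
store.get("album:1")
print(store.db_reads)  # the second read was served from the cache
```

The cache takes read load off the database, but it does nothing for durability: the relational database remains the only real copy, which is why it still needs backups.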

The whole Danger incident sends the wrong message. Companies are much better than consumers at keeping data reliably. That Danger dropped the ball should not indict the whole industry. Instead, consumers should demand that companies be more transparent about how they keep data safe.

In recognition that the ultimate risk is always that you do not know all the risks, we also offer a DVD subscription service that returns your data to you incrementally over time, automatically, so that both we and you have it.

Written by erlichson

October 11, 2009 at 9:52 am