We need an open sync infrastructure!

tl;dr: There is an approach - let's build RFC 6578 into mod_dav!

After a pointed article on The Verge, the realization has spread on a broader front over the last few weeks that iCloud has definitely not delivered on its promise of magical sync.

On the other hand, the somewhat unexpected discontinuation of Google Reader showed how quickly an "infrastructure" can break away when it is not really infrastructure at all, but merely a small project of a company like Google that, for whatever reasons, grew tired of it.

The good thing about these events is that one thing is becoming clearer: we need an infrastructure for sync. Based on standards, with an open implementation, and not dependent on any particular provider.

Sync can be roughly split into two tasks:

  • Efficiently distributing the produced data to all participants (especially in terms of data transfer)
  • Consistently merging the data from the various parties into "one truth"

How difficult the second task is depends entirely on what you want. Apple wanted nothing less than the ultimate solution and set out to synchronize complex object graphs (CoreData) generically. For now, they have not managed to do that with usable stability.

But if all you want is "just" a bunch of files accessible everywhere, like Dropbox, then the second task is almost trivial and everything becomes much simpler. Plain file sync apparently even works quite well in iCloud.

For most apps, the effort needed to arrive at a "truth" that is sufficiently consistent for their context lies somewhere in between. It would be nice if, over time, convenient frameworks emerged for this or that use case, but this part does not belong to a generic sync infrastructure.

The first task, however, is what the sync infrastructure has to provide - an infrastructure you should be able to run anywhere from very small on your own server up to massively scaled at a cloud provider. Just as simple, reliable and interchangeable as a web server.

That takes more than just a server, because as an app developer you don't want to have to fetch files from somewhere, or scan directory trees for changes. You want a local mirror of the data, and to be notified when something changes (much like hardly anyone wants to open raw TCP sockets just to speak HTTP). For that, you need a suitable client framework.
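
Just to illustrate what I mean (everything below is hypothetical, no such SDK exists today), the app-facing side of such a framework would not have to be much bigger than this:

    # Purely hypothetical sketch of the app-facing side of such a framework;
    # none of these names exist in any real SDK.
    class SyncedFolder:
        """Keeps a local mirror of a remote collection and reports changes."""

        def __init__(self, server_url: str, local_path: str):
            self.server_url = server_url
            self.local_path = local_path
            self._observers = []

        def on_change(self, callback):
            """Register a callback invoked with the paths that changed remotely."""
            self._observers.append(callback)

        def start(self):
            """Start syncing in the background; polling, push and conflict
            handling are implementation details the app never has to see."""
            ...

    folder = SyncedFolder("https://dav.example.com/notes/", "/Users/me/notes")
    folder.on_change(lambda paths: print("changed:", paths))
    folder.start()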

Dropbox is now seizing the moment and offering exactly such a framework with its new Sync SDK. But that is not a standards-based open infrastructure.

WebDAV (RFC 4918), on the other hand, is open and standardized. And indeed, some app developers already use it for their sync needs. But WebDAV has one problem when it comes to sync: with a larger number of objects (files), it is not efficient at finding changes.

Unless you have an implementation of RFC 6578: "Collection Synchronization for Web Distributed Authoring and Versioning (WebDAV)".
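
For a flavor of what that report buys you, here is a minimal client-side sketch in Python (using the requests library; the collection URL, credentials and properties are made up, and real code needs proper error handling and support for truncated result sets on top): instead of crawling the whole tree, the client sends its last sync token and gets back only what changed, plus a new token.

    # Minimal client-side sketch of an RFC 6578 "sync-collection" REPORT.
    import requests
    import xml.etree.ElementTree as ET

    COLLECTION_URL = "https://dav.example.com/files/"   # hypothetical WebDAV collection
    AUTH = ("alice", "secret")                          # hypothetical credentials

    def sync_collection(sync_token=""):
        """Return (changed_hrefs, new_sync_token). An empty token means
        "initial sync, send me everything"."""
        body = (
            '<?xml version="1.0" encoding="utf-8" ?>'
            '<D:sync-collection xmlns:D="DAV:">'
            f'<D:sync-token>{sync_token}</D:sync-token>'
            '<D:sync-level>1</D:sync-level>'
            '<D:prop><D:getetag/></D:prop>'
            '</D:sync-collection>'
        )
        response = requests.request(
            "REPORT", COLLECTION_URL, data=body.encode("utf-8"), auth=AUTH,
            headers={"Content-Type": "application/xml; charset=utf-8"})
        response.raise_for_status()

        # The multistatus answer lists only new/changed/removed members,
        # plus a fresh sync token for the next round.
        ns = {"D": "DAV:"}
        tree = ET.fromstring(response.content)
        changed = [r.findtext("D:href", namespaces=ns)
                   for r in tree.findall("D:response", ns)]
        new_token = tree.findtext("D:sync-token", namespaces=ns)
        return changed, new_token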

Who has one? Apple. Why? Because they need it in iCloud. Right now, that may not exactly be the best advertisement. But anyone somewhat familiar with Apple technologies knows that Apple has applied exactly this pattern very successfully many times: take solid standards, develop them further in the open, and then use them as the basis for proprietary "magic" (e.g. HTTP Live Streaming, Bonjour/multicast DNS, CalDAV). I am convinced that WebDAV collection sync is not the problem in iCloud.

My conclusion:

We (by which I mean the indie dev community) should join forces to get RFC 6578 implemented in mod_dav and thereby roll it out to every Apache web server on this planet within a reasonable time frame. And, at the other end, to build sync frameworks that offer exactly what the new Dropbox Sync API offers, but with WebDAV as the backend.

So that not only we developers, but also the users of our apps, can once again choose where their data lives.

Is the time finally ripe for this?

Secure Cloud Storage with deduplication?

Last week, Dropbox's little problem made me realize how much I was trusting them to do things right, just because they once (admittedly, long ago) wrote that they'd be using encryption in such a way that they could not access my data even if they wanted to. The auth problem showed that today's Dropbox reality couldn't be further from that - apparently the keys are lying around and are in no way tied to user passwords.

So, following my own advice to take care of my data, I tried to better understand how the other services offering functionality similar to Dropbox's actually work.

Reading through their FAQs, there are a lot of impressive-sounding crypto acronyms, but usually no explanation of the logic of how things work - simple questions like what data is encrypted with which key, in what place, and which bits are stored where remain unanswered.

I'm not a professional in security, let alone cryptography. But I can follow a logical chain of arguments, and I think providers of allegedly secure storage should be able to explain their stuff in a way that can be followed by logical thinking.

Failing to find these explanations, I tried the reverse. Below, I try to follow the logic to find out whether and how adding sharing and deduplication to the obvious basic setup (private storage for a single user) will or will not compromise security. Please correct me if I'm wrong!

Without any sharing or upload optimizing features, the solution is simple: My (local!) cloud storage client encrypts everything with a long enough password I chose and nobody else knows, before uploading. Result: nobody but myself can decrypt the data [1].

That's basically how encrypted disk images, password wallets, keychain apps etc. work. More precisely, many of them use two-stage encryption: they generate a large random key to encrypt the data, and then use my password to encrypt that random key. The advantage is performance: the workhorse algorithm (the one that encrypts the actual data) can be one that needs a large key to be reasonably safe but runs efficiently. The algorithm that encrypts the large random key with my (usually shorter) password doesn't need to be fast, and thus can be very elaborate, to gain better security from a shorter secret.
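
As a rough sketch of such a two-stage scheme, assuming Python with the cryptography package (names and parameters are mine for illustration, not taken from any real product):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.scrypt import Scrypt

    def encrypt_file(plaintext: bytes, password: bytes) -> dict:
        # Stage 1: a large random key does the heavy lifting on the actual data.
        data_key = AESGCM.generate_key(bit_length=256)
        data_nonce = os.urandom(12)
        ciphertext = AESGCM(data_key).encrypt(data_nonce, plaintext, None)

        # Stage 2: a deliberately slow KDF stretches my (shorter) password into
        # a key-encryption key, which only has to protect the small data key.
        salt = os.urandom(16)
        kek = Scrypt(salt=salt, length=32, n=2**15, r=8, p=1).derive(password)
        key_nonce = os.urandom(12)
        wrapped_key = AESGCM(kek).encrypt(key_nonce, data_key, None)

        # Everything in this dict may go to the provider; without the password
        # (or the plain data key) it is just noise.
        return {"salt": salt, "key_nonce": key_nonce, "wrapped_key": wrapped_key,
                "data_nonce": data_nonce, "ciphertext": ciphertext}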

Next step is sharing. How can I make some of my externally stored files accessible to someone else, while still keeping them entirely secret from the cloud storage provider who hosts the data?

With two-stage encryption, there's a way: I can share the large random key for each file being shared (of course, each stored file needs to have its own random key). The only thing I need to make sure of is that nobody but the intended recipient obtains that key along the way. This is easier said than done. If the storage provider manages the transfer, say by offering a convenient button to share a file with user xyz, this inevitably means I must trust the service about the identity of xyz. The best they can provide is an exchange that does not require the key to be present in decrypted form in the provider's system at any time. That may be an acceptable compromise in many cases, but I need to be aware that I need a proof of identity established outside the storage provider's reach to really rule out that they could access my shared files. For instance, I could send the key in a GPG-encrypted email.
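
To make the sharing step concrete, here is a continuation of the sketch from above (again only an illustration):

    # To share one stored file, I hand over only its random data key (for
    # instance inside a GPG-encrypted mail), never my password, and the
    # encrypted blob at the provider stays exactly as it is.
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def decrypt_shared_file(stored: dict, shared_data_key: bytes) -> bytes:
        # `stored` is the dict returned by encrypt_file() above; the recipient
        # needs nothing else besides the data key I gave them out of band.
        return AESGCM(shared_data_key).decrypt(
            stored["data_nonce"], stored["ciphertext"], None)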

So the bottom line for sharing is: correctly implemented, it does not affect the security of the non-shared files at all, and if the key is exchanged securely and directly between the sharing parties, it even works without giving the provider any way to access the shared data. With the one-button sharing convenience we're used to, the weak point is that the provider (or a malicious hacker inside the provider's system) could technically forge identities and receive access to shared data in place of the person I actually wanted to share with. Not likely, but possible.

The third step is deduplication. This is important for the providers, as it saves them a lot of storage, and it is a convenience for users because if they store files that other users already have, the upload time is near zero (because there is no upload at all - the data is already there).

Unfortunately, the explanations around deduplication get really foggy. I haven't found a logically complete explanation from any of the cloud storage providers so far. I see two things that must be managed:

First, for deduplication to work, the service provider needs to be able to detect duplicates. If the data itself is encrypted with user-specific keys, the same file from different users looks completely different at the provider's end. So neither the data itself nor a hash over that data can be used to detect duplicates. What some providers seem to do is calculate a hash over the unencrypted file. But I don't really understand why many heated forum discussions seem to focus on whether that's ok or not. Because IMHO the elephant in the room is the second problem:

If files are indeed stored encrypted with a secret only the user has access to, deduplication is simply not possible, even if detection of duplicates can be made to work by sharing plaintext hashes. The only one who can decrypt a given file is the one who has the key. The second user who tries to upload a given file does not have (and must not have a way to obtain!) the first user's key for that file, by definition. So even if encrypted data for that file is already there, it does not help the second user. Without the key, the data stored by the first user is just garbage to him.

How can this be solved? IMHO all attempts based on implicitly sharing the key when duplicates are detected are fundamentally flawed, because we inevitably run into the proof-of-identity problem shown above with user-initiated sharing, which becomes totally unacceptable here as it would affect all files, not only explicitly shared ones.

I see only one logical way to do deduplication without giving the provider a way to read your files: by shifting from proof-of-identity for users to proof-of-knowledge for files. If I can present proof that I had a certain file in my possession, I should be able to download and decrypt it from the cloud, even if it was not me but someone else who actually uploaded it in the first place. Still, everyone else, including the storage provider itself, must not be able to decrypt that file.

I now imagine the following: instead of encrypting the files with a large random key (see above), my cloud storage client would calculate a hash over my file and use that hash as the key to encrypt the file, then store the result in the cloud. So the only condition for getting that file back would be having had access to the original unencrypted file at some point. I myself would qualify, of course, but anyone (totally unrelated to me) who has ever seen the same file could calculate the same hash and would qualify as well. However, for anyone to whom the file was a secret, it remains a secret [2].
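
In code, the idea could look roughly like this (the scheme is known as convergent encryption; this is a sketch, not a description of what any provider actually does):

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def convergent_encrypt(plaintext: bytes):
        key = hashlib.sha256(plaintext).digest()      # the key comes from the content itself
        nonce = b"\x00" * 12                          # deterministic on purpose: identical
                                                      # files must produce identical blobs
        ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
        dedup_id = hashlib.sha256(ciphertext).hexdigest()  # what the provider can dedupe on
        return dedup_id, ciphertext, key              # I keep the key; the provider sees the rest

    # Anyone who ever had the plaintext can recompute `key` and decrypt the
    # stored blob; for everyone else, including the provider, it stays opaque.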

I wonder if that's what cloud storage providers claiming to do global deduplication actually do. But even more, I wonder why so few of them spell it out clearly. It need not be on the front page of their offerings, but a logically conclusive explanation of what happens inside their service is something that should be in every FAQ, in one piece, and not just bits spread across forum threads, mixed with a lot of guesses and wrong information!

 

[1] Of course, this is possible only within the limits of the crypto algorithms used. But these are widely published and reviewed, as are their actual implementations. They are not absolutely safe, and implementation errors can make them additionally vulnerable. But we can safely trust that there are a lot of real cryptography experts' eyes on these basic building blocks of data security. So my working assumption is that the encryption methods used are, per se, good enough. The interesting questions are what trade-offs are made to implement convenience features.

[2] Note that all this does not help with another fundamental weakness of deduplication: if someone wants to find out who has a copy of a known file, and can gain access to the storage provider's data, he'll get that answer out of the same information which is needed for deduplication. If that's a concern, there's IMHO no way except not doing deduplication at all.