Facebook has no need for deleting data

April 6, 2009

Niall Kennedy has written an interesting post about Facebook’s data storage. They’ve written a proprietary filesystem to store photos in order to cut costs (up to now they’ve apparently been adding a $2 million NetApp storage system every week).

It turns out they’ve decided they don’t need all the features you’d find in a traditional file system (emphasis mine):

Traditional file systems are governed by the POSIX standard governing metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system written once and never deleted, with access granted to the world, has little need for such overhead.

It would be nice if someone from Facebook could confirm that they do, in fact, have the ability to physically delete a photo or other item of data, and that deletion does, in fact, happen on the back end when you request it.

From what we understand of Facebook’s architecture, it probably doesn’t. When you post something, it gets copied and broadcast to your friends’ feeds; the data is out there forever. Even when you delete an account, your details aren’t fully removed. Surely, if nothing else, this is a legal minefield for the company?

The mechanics of “open”

March 9, 2009

Since we started Elgg, I’ve always kept a very open philosophy about how the software should work. From the human perspective, we wanted it to be as inclusive as possible, with an easy-to-use interface and innards that allowed you to do very technical things (like, in Elgg 0.x, republishing aggregated RSS) with very little know-how. From the organizational perspective, we didn’t want there to be a barrier to entry; we released it under the GNU General Public License and allowed anyone to download and install it for free. And technically, we allowed anyone to augment, extend and replace its functionality, maintained an open architecture and embraced technologies like FOAF, RSS and so on.

That was five years ago. The world is only now beginning to catch up.

The Silicon Valley Web community is buzzing with “open” ideas: data portability, the open stack, the open mesh, OpenID, OAuth, and so on. There have been two Data Sharing Summits and a bunch of Identity Workshops, and efforts are crystallizing around open activity streams, contacts sharing, and virtually anything else you might want to transfer between web applications. David Recordon, co-creator of OpenID and all-round cheerleader for openness, has predicted that Facebook won’t be a walled garden by 2010.

This is fantastic stuff, which I intend to get even more involved with as the year progresses. Good work is happening all round, and even sleepy behemoths like Microsoft are beginning to take notice.

What worries me slightly is that the work is centered on the Silicon Valley community, and within that is largely built with public-facing commercial websites in mind. Those sites (like Digg, MySpace, the SixApart properties and so on) are awesome without a doubt, but the potential of social technologies extends well beyond the commercial web. People are beginning to use them on intranets, within universities, across governmental departments and so on – places that could use the same approaches, but that need to be represented in the discussions.

Their exclusion is not the fault of the people producing the standards and doing this great work; they’re very happily welcoming anyone with a productive contribution to the table. Instead, it falls to those organizations to realize what they’re missing out on and begin to pay more attention to cutting-edge technology. The Obama administration is certainly waking up to this, but others – notably the UK government – are extremely reluctant to embrace anything open at all.

The technology is falling into place to allow for an open, transparent, knowledge-orientated economy. Now it’s time to look at what else is needed.

User control on the open web

February 21, 2009

Data portability and the open data movement (“the open web” for simplicity’s sake) revolve around the idea that you should be able to take your data from one service to another without restriction, as well as control who gets to see it and how. Very simply, it’s your data, so you should have the ability to do what you like with it. That means that, for example, if you want to take your WordPress blog posts and import them into MovableType (WordPress’s competitor), you should be able to. Or you should be able to take your activity from Facebook and include it in your personal website, or export your Gmail contacts for backup or transfer to a rival email service.
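To make this concrete, here’s a minimal Python sketch of one half of that scenario: reading post titles and bodies out of a WordPress WXR export file so that another tool could import them. The file name is hypothetical and real exports carry many more fields, but the principle is the point: your posts are yours to take with you.

    # Read posts out of a WordPress WXR export (an RSS-based XML format).
    # Illustrative only: real exports also include comments, categories and metadata.
    import xml.etree.ElementTree as ET

    NAMESPACES = {"content": "http://purl.org/rss/1.0/modules/content/"}

    def read_wordpress_export(path):
        """Yield (title, body) pairs from a WordPress export file."""
        tree = ET.parse(path)
        for item in tree.getroot().iter("item"):
            title = item.findtext("title", default="(untitled)")
            body = item.findtext("content:encoded", default="", namespaces=NAMESPACES)
            yield title, body

    # Hypothetical file name; a real importer would map these onto its own post format.
    for title, body in read_wordpress_export("wordpress-export.xml"):
        print(title, "-", len(body), "characters")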

You can do this on your desktop: for example, you can open a Word document in hundreds of word processors, and Macs will happily talk to Windows machines on a network. Allowing this sort of data transport is good for the web in the same way it’s good for offline software: it forces companies to compete on features rather than on the number of people they can lock into their services. It also ensures that if a service provider goes out of business, a user’s data on that service doesn’t have to disappear with it.

In 2007, before the open web hit most people’s radars, Marc Canter organised the first Data Sharing Summit, a communal discussion between all the major Silicon Valley players as well as many outside companies who flew in specially to participate (I attended, representing Elgg). One of the major outcomes was the importance of user control: the user owns their data. Marc, Joseph Smarr, Robert Scoble and Michael Arrington co-signed a Bill of Rights for the Social Web which laid these principles out. It wasn’t all roses: most of the large companies present took issue with the Bill of Rights and, as I noted in my write-up for ZDNet at the time, preferred the term “data control” to “data ownership”. The implication was simple: users didn’t own the data they added to those services.

Since then, the open web has been accelerating as both an idea and a practical reality. Initiatives like Chris Saad’s Dataportability.org and Marc Canter’s Open Mesh treatise, as well as useful blunders like Facebook’s recent Terms of Service mis-step, have drawn public attention to its importance. Facebook in particular forces you to license your content to it indefinitely, and disables (rather than deletes) your account details when you choose to leave the site. Once you enter something into Facebook, you should assume it’s there forever, no matter what you do. These practices had been in place for some time with little complaint, but when the company overreached with its licensing terms, it made international headlines across the mainstream press: control over your data is now a mainstream issue.

Meanwhile, the technology has been improving and approaches have been consolidating. The Open Stack is a collection of real-world technologies that can be applied to web services today to provide a base level of openness, and new developments are emerging rapidly. Chris Messina is leading development around activity streams portability, which will allow you to subscribe to friends on other services and see what they’re up to. The data portability aspect of the open web is fast becoming a reality: you will be able to share and copy your data.
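As a rough illustration of what subscribing to a friend’s activity might look like with today’s building blocks, here’s a short Python sketch that polls an Atom/RSS feed of someone’s activity and prints recent entries. The feed URL is made up, and proper activity streams layer richer verb-and-object semantics on top of plain feeds like this one.

    # Poll a friend's activity feed and print the latest entries.
    # Requires the third-party feedparser library (pip install feedparser).
    import feedparser

    def latest_activity(feed_url, limit=5):
        feed = feedparser.parse(feed_url)
        for entry in feed.entries[:limit]:
            yield entry.get("published", "unknown date"), entry.get("title", "(no title)")

    # Hypothetical URL for a friend's activity feed on another service.
    for when, what in latest_activity("https://example.com/friend/activity.atom"):
        print(when, "-", what)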

Your data will be out there. So, what happens next?

The same emerging open web technologies which allow you to explicitly share your data from one service to another will also allow tools to be constructed cheaply out of functionality provided by more than one provider. Even today, a web tool might have a front end that connects behind the scenes to Google (perhaps for search or positioning information), Amazon (for storage or database facilities), and maybe three other services. This is going to drive innovation over the next few years, but let’s say a user on such a composite service wants to delete their account. Can they reliably assume that all the component services will respect their wishes and remove the data as requested?
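To illustrate the problem, here is a deliberately naive Python sketch of a composite service fanning a deletion request out to its component providers. Every endpoint here is invented – and that’s exactly the issue: there is no standard deletion API, and nothing forces a provider to honour the request.

    # Fan a user's deletion request out to each component provider.
    # All endpoints are hypothetical; uses the third-party requests library.
    import requests

    COMPONENT_DELETE_ENDPOINTS = [
        "https://search-provider.example/api/users/{user_id}",
        "https://storage-provider.example/api/users/{user_id}",
    ]

    def delete_account_everywhere(user_id):
        """Ask each component service to delete the user's data; report refusals."""
        failures = []
        for template in COMPONENT_DELETE_ENDPOINTS:
            url = template.format(user_id=user_id)
            response = requests.delete(url, timeout=10)
            if response.status_code not in (200, 202, 204):
                failures.append(url)  # nothing compels the provider to comply
        return failures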

As web tools become more sophisticated, access control also becomes an issue. When you publish on the web, you might not want the entire world to read your content; you could be uploading a document that you’d like to restrict to your company or some other group. How do these access restrictions persist on component services?
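One way to picture the question: imagine the access metadata travelling with the content, as in the tiny Python sketch below. The field names are mine, purely for illustration – the hard part isn’t expressing the restriction, it’s that nothing guarantees a downstream component service will enforce it.

    # A shared object carrying its own access restriction.
    from dataclasses import dataclass

    @dataclass
    class SharedDocument:
        body: str
        audience: str = "public"  # e.g. "public", "company:example-corp"

    doc = SharedDocument(body="Q3 plans", audience="company:example-corp")
    # Once doc is copied to a component service, does `audience` still mean anything?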

One solution could be some kind of licensing, but this veers dangerously close to Digital Rights Management, the hated technology that has crippled most online music services and players for so long and inhibited innovation in the sector. Dare Obasanjo, who works for Microsoft and is usually a good source of intelligent analysis, recently had this to say:

[..] I’ve finally switched over to agreeing that once you’ve shared something it’s out there. The problem with [allowing content to be deleted] is that it is disrespectful of the person(s) you’ve shared the content with. Looking back at the Outlook email recall feature, it actually doesn’t delete a mail if the person has already read it. This is probably for technical reasons but it also has the side effect of not deleting a message from someone’s inbox that they have read and filed away. [..] Outlook has respected an important boundary by not allowing a sender to arbitrarily delete content from a recipient’s inbox with no recourse on the part of the recipient.

The trouble is that many services make money by selling data about you, either directly or indirectly, and these are unlikely to relinquish your data (or information derived from it) without some kind of pressure. I agree with Dare completely on the social level, for content that has been shared explicitly. Certainly, this model has worked very well for email, and people like Plaxo’s John McCrea are hailing the fall of ‘social DRM’. However, content that is shared behind the scenes via APIs, and content that is shared inadvertently when agreeing to perform an action over something like OAuth or OpenID, need to obey a different model.

The only real difference between data shared as a deliberate act and data shared behind the scenes is user interface. Everyone wants the user to have control over data sharing via a clear user interface. Should they also be able to enforce what’s done with that data once it transfers to a third-party service, or should they trust that the service is going to do the right thing?

The open web isn’t just for trivial information. It’s one thing to control what happens to my Dopplr information, or my blog posts, or my Flickr photographs. I really don’t mind too much where those things go, and I’d imagine that most people would agree (although some won’t). Those aren’t, however, the only things the web is being used for: there are support communities for medical disorders, academic resources, bill management services, managed intranets and more out there on the web, and these too will begin to harness the benefits of the open web. All of them need to be careful with their data – some for legal reasons, some for ethical reasons. Nonetheless, they could all benefit from being able to share data securely and in a controlled way.

To aid discussion, I propose the following two categories of shared data (a rough data model in code follows the list):

  • Explicit shares – information that a user asks specifically to share with another person or service. Examples:
    • Atomic objects like blog posts, contacts or messages
    • Collections like activity streams
  • Implicit shares – information that is shared behind the scenes as a result of an explicit share, or to provide some kind of federated functionality. Examples:
    • User information or shadow accounts transferred or created as a result of an OpenID or OAuth login
    • User settings
    • User contact details, friend lists, or identifiers

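Here’s the rough data model I have in mind, sketched in Python. The names and fields are mine rather than any standard; the point is simply that implicit shares should be just as visible, and just as controllable, as explicit ones.

    # A toy model of the two categories of shared data proposed above.
    from dataclasses import dataclass
    from enum import Enum

    class ShareKind(Enum):
        EXPLICIT = "explicit"  # the user asked for this share directly
        IMPLICIT = "implicit"  # created behind the scenes (e.g. an OAuth login)

    @dataclass
    class Share:
        kind: ShareKind
        what: str       # e.g. "blog post", "activity stream", "shadow account"
        recipient: str  # the person or service receiving the data

    shares = [
        Share(ShareKind.EXPLICIT, "blog post", "friend@example.com"),
        Share(ShareKind.IMPLICIT, "shadow account", "relying-party.example"),
    ]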
For the open web to work, both clearly need to be allowed. At a very base level, though, I think that users need to be made aware of implicit shares in a clear, non-technical way. (OpenID and OAuth both allow the user to grant and revoke access to functionality, but they don’t control what happens to data once access has been granted; anything already transferred is likely to be kept.) Services also need to provide a facility for reliably controlling this data. Just as I can license a photograph under Creative Commons, allowing it to be shared while restricting anyone’s ability to use it for commercial gain, I need to be able to say that services can only use my data for a limited time, or for limited purposes. I’m not calling for DRM, but rather for a published best practice that services would follow and publicly commit to.
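To show what such a best practice might look like in machine-readable form, here’s a small sketch: a service declaring what it will do with data it receives, much as a Creative Commons badge declares reuse terms. Every field is invented for illustration – no such standard exists yet, which is rather my point.

    # A hypothetical machine-readable data-use declaration for a service.
    import json

    data_use_policy = {
        "service": "relying-party.example",
        "retention_days": 30,          # delete received data after this period
        "purposes": ["display"],       # e.g. no resale, no profiling
        "honors_deletion_requests": True,
    }

    print(json.dumps(data_use_policy, indent=2))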

Without this, the usefulness of the open web will be limited to certain kinds of use cases – which is a shame, because if it’s allowed to reach its full potential, it could provide a new kind of social computing that will almost certainly change the world.
