At the Hadrian Hotel

Monday, January 14, 2008

Computing in the Cloud: Possession and ownership of data

The first panel on Monday afternoon concerned the possession and ownership of data. Moderated by Ed Felten, the panelists were Tim Lee, Joel Reidenberg, and Marc Rotenberg.

The panelists presented their positions, responded to each other, and answered audience questions. I don't feel that I can do justice to everything they said, so I'll just give a little flavor of it. When the video recording of the session becomes available, I'll add a link at the bottom of this post.

Tim Lee started things off with the position that privacy is governed by a series of trade-offs. Some data sharing is a prerequisite for any useful online service, and users are generally willing to give up some privacy in return for a valuable service; some users will share even more information for more value. Tim also spoke a bit about the history of browser cookies, GMail, and the Facebook news feed. All three were initially viewed negatively by at least some segment of the user community, and in all three cases users became more accepting as they learned how the technologies worked, what "opt-out" options existed, and what benefits they could derive from them. A key point Tim made is that having private companies collect data about you is less troubling than having the government do the same. If you don't like the policies of a particular service provider, you can choose not to use that provider, since there are others around with different policies. There are no such choices available when it comes to the government.

Joel Reidenberg focused on three sets of implications: ownership of data, values embedded in the architecture, and irony. Data ownership really comes down to how you get to use the bits and bytes. Fair information practice standards provide a control here; however, if data usage is based on a user consent model and the user doesn't understand what they are consenting to, how can the model be effective? Joel also raised the question of whether data on social networking sites is public or private. Despite what many users may think, the data is generally public and can be accessed by anybody (including law enforcement). Next, Joel talked about how privacy values are embedded in the architecture of a given technology. With the Facebook Beacon fiasco, we got to see "how the data mining sausage was made," and it bothered quite a few people. We got to see what was going on behind the scenes in a way that was quite graphic when compared to GMail's ad scanning. Joel said that data privacy rules have to focus on effective transparency and proposed that a data usage rule set should travel along with data wherever it goes. Finally, he spoke of the irony that cloud computing actually opens the door for privacy enhancement: centralized data holders are easier to find, regulate, and prosecute. However, we will need more cooperation between lawmakers and standards bodies if we are to have effective data privacy standards and rules.

Marc Rotenberg gave an introduction to privacy culture. He presented the concept of fair information practices where the entity that collects data on individuals takes on obligations for security, accuracy and rights of access, among others. The custodian of the data has the responsibility to prevent "bad things" from happening to the data. Privacy people by and large believe that technology can be a solution to privacy problems, but the techniques need to be evaluated: having secure encryption keys will protect your data, but having a key escrow system will erode that protection in at least some (if not all) cases. Anonymity is critical to privacy. A person's actual identity should not be required to determine if they have the credentials to use a given service. Also, there is a paradox in that much of privacy is about transparency. Imposing obligations on custodians to be more open and accountable about the data they collect makes it easier to ensure that the data will only be used in known ways. The greater the secrecy about how data is being collected, the greater the possibility that it can be used in negative ways without people learning about it.
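Marc's point about key escrow is easy to make concrete. Here is a minimal sketch of my own (not something presented at the workshop), written in TypeScript against Node's built-in crypto module: an escrowed copy of a symmetric key decrypts a message just as well as the original, so the protection that encryption provides is only as strong as your confidence in everyone who holds a copy of the key.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// The user's secret key, and a copy handed to a hypothetical escrow agent.
const userKey = randomBytes(32);
const escrowedCopy = Buffer.from(userKey);

// Encrypt a message with AES-256-GCM using the user's key.
const iv = randomBytes(12);
const cipher = createCipheriv("aes-256-gcm", userKey, iv);
const ciphertext = Buffer.concat([cipher.update("my private mail", "utf8"), cipher.final()]);
const authTag = cipher.getAuthTag();

// Anyone holding a copy of the key -- the escrow agent included -- can decrypt.
const decipher = createDecipheriv("aes-256-gcm", escrowedCopy, iv);
decipher.setAuthTag(authTag);
const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
console.log(plaintext.toString("utf8")); // "my private mail"
```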

Because I was trying to pay attention to what was being said while also taking notes, I feel I may have given short shrift to all three presenters, and I encourage you to watch the video of the panel (once it becomes available - please check back for an update). That way you'll hear first-hand what they had to say, and you'll also get to hear the lively debate that took place during the rebuttal and audience question portion.

UPDATE: The video recordings from the workshop are now available at the Princeton UChannel.

Computing in the Cloud Workshop

Today is day one of the Computing in the Cloud Workshop being presented by Princeton's Center for Information Technology Policy. After opening remarks by H. Vincent Poor, Dean of the School of Engineering and Applied Science, Ed Felten got things rolling with "Computing in the Cloud: What, How and Why."

Starting with definitions of Cloud Computing from John Markoff, Wikipedia, the MIT Technology Review and Eric Schmidt, Ed went on to expand on them and delve into the history and some of the implications. It's all about location, but why does it matter where the data and software actually reside? Possession of data implies control, and control implies power. Whoever owns the systems on which data resides has the ultimate control over how that data is retained and who has access to it. If, for example, all of your email is in your Google Mail account, how confident are you that what you delete is actually gone forever? Are you confident that your data on a third-party server will not be accessible by anybody else, except as you decide? If the government presents a subpoena to the holder of your data, what, if anything, will be released?

Ed also gave a broad overview of how we got here, talking about the swing back and forth between centralized computing and a more distributed model. Early on, computers were big and expensive, so there was an economic incentive to have the users come to the computer. This was followed by timesharing, where users had terminals at remote locations (such as their office), but the actual computer was still in a large, air-conditioned room somewhere. In the late 1970s and early 1980s, PCs and Sun workstations (for example) became available at a low enough cost that individual users could have local computing. This gave users more autonomy and the potential for a richer user interface, but at the loss of the lower cost per user, expert management, and higher utilization that centralized computing facilities could provide.

During the 1980s and 1990s, the client-server computing model gained popularity but was soon overtaken by the World Wide Web. This swing of the pendulum took us back to a more centralized model of computing, where all of the data and manipulation took place on a remote computer and the results were displayed locally, much as in early timesharing. In the early 2000s, the web browser became more like a computing device as AJAX and other programming models came into existence. As in the client-server model, some computation takes place on the remote "back end" and some on the local computer. This brings with it all of the complexities of client-server computing along with those inherent in trying to shoe-horn a computing engine into the browser. In addition, these applications are typically written in multiple programming languages, such as SQL for database access, PHP for page generation, and combinations of HTML, XML and JavaScript for local processing and display.
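To give a rough feel for that split, here is a small sketch of my own (not anything shown at the workshop) of the browser-side half of such an application, written in TypeScript. It assumes a hypothetical /results.json endpoint whose contents are produced by the server-side PHP/SQL layer; the browser simply fetches fresh data and updates part of the page without a full reload.

```typescript
// Browser-side half of an AJAX-style application: ask the back end for fresh
// data and update one element of the page in place.
function refreshResults(): void {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", "/results.json"); // the database work happens on the back end
  xhr.onload = () => {
    if (xhr.status === 200) {
      const data = JSON.parse(xhr.responseText) as { updatedAt: string; rows: string[] };
      const target = document.getElementById("results");
      if (target !== null) {
        // Local processing and display happen in the browser.
        target.textContent = `Updated ${data.updatedAt}: ${data.rows.join(", ")}`;
      }
    }
  };
  xhr.send();
}

// Poll every 30 seconds, roughly how a live results page might stay current.
setInterval(refreshResults, 30000);
```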

The modern tools and infrastructure available today make many interesting "real-time" applications possible. For example, during the Iowa Caucuses, the Democratic Party was able to use infrastructure from Amazon to present an "Iowa Democratic Party Caucus Results" web page that was kept updated as results came in and was not adversely affected by the amount of traffic the page received. Today's tools also allow the creation of sites such as Facebook and eBay, which would not have been easily created in the past.

With disk storage prices dropping and as a side-effect of the AJAX-type programming model, data is continually building up in remote data centers. It is in the best interest of the data center owner to hold onto that data for as long as possible as there is probably some value that can be extracted from it along the way.

There are additional concerns and implications to having your data on somebody else's server. How portable is your data? Can you easily extract it and move it to another provider if you so choose? Does your current provider have data retention policies that meet your needs? When you access your data, how secure and private is the connection between your computer and the provider's site? If a provider holds lots of customer data and makes it difficult for customers to move that data elsewhere, it gains market power.

Concerns such as those above can be addressed in a number of ways. If a cloud computing provider is a "community," then the members of that community have a say in how their data is managed. A provider may also decide that it won't "be evil," and if you trust it to follow through, you may feel more secure about your data. There are also the options of ex post regulations, which would control how a provider that has already amassed data must manage it, or ex ante agreements, where the provider makes promises up front about how it will deal with data on its servers.

A number of the above issues and concerns were addressed in the first afternoon panel discussion: "Possession and Ownership of Data."

UPDATE: The video recordings of the workshop are now available at the Princeton UChannel.