I’m really interested in privacy and in how my data are collected, used, stored, and just in general exist. While I think of my data as mine, the more I think about it, the hazier it gets. I think defining ownership more clearly will help me think about what’s possible, and then about what’s appropriate and reasonable to do with data.

Limits to using my data

I was reading about how Google is changing Chromium to remove and restructure some features it wants to be Chrome-exclusive. It surprised me that they would restrict access to users’ data and only allow access to sync data from the Chrome application, not from other Chromium browsers. I consider my data part of me, and I distinguish between the digital exhaust that’s captured about me and the data that I create and want to manage in the way I see fit. If I push sync data to Google from Chrome, shouldn’t I be able to pull sync data from Google to anywhere? Not for long. I’m guessing that they are changing this because syncing from Chrome makes them more money than syncing from non-Chrome.

It’s a bit unclear to me what the legal requirements are on Google for providing access to these data. Google has an extensive privacy web site and privacy policy. But it doesn’t clearly separate the data that I put into Google (e.g., Sync, YouTube videos, docs and whatnot) from what Google generates related to me (e.g., analytics, behavior). They do refer to “your data” and “your information,” so it seems like there is some distinction, but I’m not sure what’s what.

I think restricting sync access this way is perfectly legal, but I’m not sure if it’s ethical and right.

What kinds of “my data”

Data are slippery. It’s not just the digital vs. physical aspect; pretty much any use of data generates more data. Some of it is important for different things, and most of it I don’t care about. I remember in the mp3 era when people were concerned that making copies of songs would hurt CD sales, but that was at least tangible: an mp3 was a single song that you could listen to, and it had a sort of real-world analog that let us approximate value. So a lump of mp3 data became worth $1, and if I bought it, I owned it and could think about a collection of items. Of course, $1 is kind of arbitrary and maybe it could be $2 or some other amount, but at least it’s something specific with a rationale and a market behind it. I don’t actually agree with this valuation, but I use it here because some people do, in that they sell and buy songs around this amount.

But what is the value of this markdown file? It’s not zero, but maybe it’s near-zero, or only near-zero to me. It’s valuable enough to me that I want to pay to host it, and I’ll back it up so it’s still around in 10 years.

I think there are three kinds of data that are mine:

  1. Data I’ve bought, which at least had a specific value at some point in time
  2. Data I’ve consciously created that has some value to me
  3. Data created indirectly by me doing stuff

What kind of data is Chrome Sync?

So I guess my config data from Chrome is a mix of #2 and #3 since it’s bookmarks I’ve made, passwords I’ve saved, but it’s also browsing history, cookies, caches, etc. that are made by me, but not really consciously. I may not even know what exists specifically, but I want it safe and don’t want it used against me.

According to Google’s privacy pages, these data are mine, but there’s also more data created by Google related to me that’s not mine. So I guess Sync data are mine and Google’s. And I don’t think this is unique to Google and I expect every other web site and remote system does the same. I should point out that this isn’t a criticism of Google and if anything they are at least conscientious about how they manage data and what they do with it.

This seems like there’s a fourth kind of data where it’s not actually mine, but it’s about me. I assume it’s stored tied to my account, or my IP, or something else that makes it mine and not someone else’s. So these data would be subject to initiatives like Right to be Forgotten through GDPR or other laws. I think GDPR calls this “personal data” and it seems similar to what I would get if I used Google Data Liberation Front’s Takeout service (a great example of how Google thinks about data pretty generously).

So I can get my data from Chrome Sync through Google Takeout, but I can’t pull it in real time through their API into any client I want.

Accessibility is Important

Practically speaking, I assume that any data that leaves a system under my control is copied forever and used to the extent of its value to someone. While it’s cool that companies operate legally, I’m not quite sure what the legal limit is for how my data can be used. And I’m pretty sure that any of my data can be “anonymized” and sold in many ways. So I mentally figure that “possession is nine-tenths of the law” and that anyone with my data will either eventually use it in a way that I don’t approve of, or it will leak to someone who will use it in ways I don’t approve of.

I don’t depend on any remote data without backups and backups of those backups, but in this case, if I want the data that I’m storing with Google, I can’t get it in the way I want. I can only get it in the way they want. While it’s weird that they will only allow API access from me to my data from certain programs, I guess that’s their prerogative. It does make me not want to use them for important things.

This means that I not only need to know what’s mine, but I also need to be able to access it in the format and medium that I reasonably prefer (i.e., a printout in RDF FedExed to me == unreasonable, small JSONs over HTTPS == reasonable).

What if I am data?

I like to read science fiction and particularly like Greg Egan’s books, especially Diaspora and Permutation City. These books deal with digital consciousness and simulated consciousness. I know it’s a bit reductio ad absurdum, but for argument’s sake, I like to think through the end consequences of what this means. In a scenario where, in the far future, I’m running in a computer, data rights become much more important. If I had my druthers, I’d want to run all the systems myself so I can guarantee integrity, but I’m guessing running a human will be pretty resource intensive, and if it comes down to no consciousness or running in some shared cloud space, some will choose the cloud.

It seems like a bad thing if cloud vendors can choose to not allow I/O unless I use the specific programs they want me to use. Will I be forced to run Chrome for my consciousness to remember what links I’ve clicked on? My kid frequently asks for root access to install “cheat blocker” software for games. Will I only be allowed to access my data if I run special software (that just so happens to make money by showing ads)?

[Image: Black Mirror's San Junipero, a giant computer installing new consciousnesses]

Black Mirror had an episode called San Junipero where people were stored and lived digitally. That seems like a cloud data scenario, so now I’m thinking of what kind of data use agreement I would require signing before I would upload. That’s probably the most important data use agreement I can think of.

Roles and Responsibilities

Sometimes I have to work on creating data use agreements. It’s hard and makes me really appreciate lawyers and ethicists and all the people who professionally work on this stuff to try to make it as good as it can be.

DUAs are contracts, I think. They have lots of clauses, and the roles seem to vary from document to document. But it seems like the important roles fall into three groups:

  • Owner
  • User
  • Everyone in between

User is kind of clear as you just say what the allowed uses are and who can do what.

Owner is a little harder. It is really cloudy and I think I want to avoid it altogether from an operational standpoint. With the above classifications, it’s hard enough to determine what data I individually own, but when there are lots of people, who is the owner? If there’s a class of people like “everyone who walked in front of my Wyze mailbox cam,” who is the owner? I recorded it, the people walked in a public place, and I don’t know who they are. So maybe the argument is that I own the overall data, but the individuals covered have some claim to ownership over their parts. I don’t live in a GDPR jurisdiction, but I think if I did, I would need to be able to remove them from the data.

Even if I wanted to de-identify, I’d have to at least hold the data long enough to run computer vision on it and blur out their faces. So owner is an imprecise term here.

I think I like the term “steward” better since it seems closer to the real functions: a steward has responsibilities and is accountable for the data, but the data aren’t actually theirs. So this fits the people in between the owner and the user, and I think it is probably more accurate operationally.

This also helps as I think users can become stewards and we can end up with a giant chain of stewards who all need to protect the data, even as it is mixed with others and used in various ways. Even something as simple as copying a CSV file to my hard drive so I can plot a chart makes me a steward until I properly dispose of the data. And if it gets backed up and copied all over then those services must be stewards as well, turtles all the way down style.

Rights and Duties

Maybe another way to think about this is that owners have rights, even if the specific owner isn’t known. And users and stewards have duties to the owner.

If we have a right to privacy then stewards and users would have a duty to protect my privacy, even if they don’t know who I am.

While privacy isn’t completely settled, I think it’s the closest to a data right that exists. I think that I should have other rights when it comes to data.

Accessibility would be one. I think I should have the right to access my data. Generally, I would want it at all times in all formats, but practically there needs to be some guidance to prevent information blocking, where a provider technically allows access but makes it hard or expensive. In healthcare this is a big problem, and if you’ve ever tried to retrieve your medical data to give to a new doctor, it’s a giant pain. Things like Project Blue Button are helping, but it’s still really hard. Healthcare is way more important than bookmarks or hypothetical, transhumanist scenarios, but it is an example where a data right to accessibility would have immediate benefits.

Ecto Gammat is a concept that I think would be another. In Luc Besson’s The Fifth Element, Leeloo reacts to Dallas’ kiss by almost killing him and saying “Ecto Gammat,” which means “Never without my permission” (in the Celestial language or whatever fifth elements speak). It’s not that she didn’t want a kiss, and actually they fall in love, but that it was a deadly offense without permission and something that no one should ever do. I think this relates to data in that it’s not that I’m against these activities, but I don’t want my data used without my permission. That goes for commercialization, for life-saving research, for anything. Frequently I’ll hear how data are used for something “great,” like targeting ads to me so they are more helpful. I want to be able to choose whether that happens or not. I don’t want someone assuming it and using my data without my knowledge and permission. This is particularly important because it applies even after de-identification, both because I think many approved de-identification schemes can still identify me, and because I’m not sure I like my data being fed into and washed in a river that makes money for random purposes.

[Image: Fifth Element frame with Leeloo pointing a gun at Dallas and saying "Ecto Gammat," meaning "Never without my permission"]

I’m not sure of the proper name for this, but this concept from sci-fi seems to fit. I tried to think of something like “non-commercialization,” but in many cases I’m fine with commercialization. Maybe eventually we could have Creative Commons style labels for what’s allowed or not. Or maybe a cool blockchain type notation where I’m cool with commercialization as long as some femtocent goes to a charity of my choosing, even better if I can keep track of how much good it causes. I think there are some projects working on this, specifically Tim Berners-Lee’s (no relation) Inrupt Data Pods, but I don’t think any are practically functional yet.

There’s also a catch-22: if my data are properly anonymized, how can a data steward or user keep track of my intentions? And even if I’m one in a million, I think over time complementary analysis will be able to figure out which one is me by comparing all those haystacks for the single straw in common.
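The haystack comparison can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the three datasets and the `p01`-style person labels are hypothetical, and a real linkage attack would match on quasi-identifiers (zip code, birth year, timestamps) rather than on shared labels. The point is just that each release alone hides me in a crowd, while their intersection may not.

```python
def candidates_in_common(releases):
    """Intersect the candidate sets derived from several 'anonymized' releases."""
    remaining = set(releases[0])
    for release in releases[1:]:
        remaining &= set(release)
    return remaining


# Hypothetical data: each set is the crowd of people consistent with one release.
gym_checkins = {"p01", "p02", "p03", "p07", "p09"}
pharmacy_visits = {"p02", "p03", "p05", "p09"}
transit_taps = {"p03", "p04", "p06", "p08"}

suspects = candidates_in_common([gym_checkins, pharmacy_visits, transit_taps])
print(suspects)  # {'p03'}
```

No single set singles anyone out, but only one person appears in all three, which is the single straw in common.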

Compatible data

These rights could also help figure out what data are compatible. So if I want to exert permissions on my Chrome Sync data, but Google wants to exert other permissions, maybe those don’t jibe, and that will help me decide not to use that service. Or not to store that kind of data, because it can’t be mixed together.
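As a rough sketch of what such a compatibility check might look like, assume both sides publish their permissions as simple labels, in the spirit of Creative Commons tags. The label names below are invented for illustration and do not come from any real policy or standard.

```python
# Hypothetical use labels; not taken from any real policy or standard.
MY_PERMITTED_USES = frozenset({"sync_to_my_devices", "backup", "export_on_request"})
SERVICE_INTENDED_USES = frozenset({"sync_to_my_devices", "backup", "ad_targeting"})


def conflicting_uses(permitted, intended):
    """Return the uses a service intends that the data subject has not permitted."""
    return set(intended) - set(permitted)


conflicts = conflicting_uses(MY_PERMITTED_USES, SERVICE_INTENDED_USES)
if conflicts:
    print("Not compatible, unpermitted uses:", sorted(conflicts))
else:
    print("Compatible")
```

If the set of conflicts is empty, the service’s intended uses fit inside what I’ve permitted. Otherwise I know exactly which uses to object to, or which data to keep out of that service.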

And maybe clearer accessibility helps me choose which services to use, or design my own services. I like how Amazon has Glacier and Glacier Deep Archive, which have really fine-grained and clear terms that help me decide on price vs. accessibility. It’s easy for me to not put data there that I want in real time. If Sync were marked up to clearly show that the data take a long time to get out, then I could have avoided this dilemma of how to disentangle my data from Google’s service.

Wrapping up

I’m not that much clearer now than when I started, but I think the concept of ownership is not a good metaphor for data. While maybe data ownership exists in some Platonic ideal sense, practically it’s not a good metaphor as people and organizations who legally can do things frequently aren’t the owner, or don’t logically make sense as the owner.

If a hospital has my medical record in their system, do they own it? Are blood pressure measurements theirs or mine? Are doctors’ notes theirs or mine? I think it takes too much effort to try to answer these questions. Maybe it’s better to focus on the practical elements of who are the users (me, doctors, pharmacies, etc) and who is the steward (hospital) and their common duties to uphold data rights.

This also gets closer to the natural tension of data: digital things can be infinitely copied, so they don’t lend themselves to being strictly owned. And I think systems that work with what the technology naturally wants to do are easier and better to build and use.

I think it means that we’re at the whim of remote data stewards until some rights and regulatory framework exists. I’m looking for ways to move this forward, but not holding my breath. Until then, I’ll favor remote data stewards where at least I’m the user, as opposed to advertisers, so at least there’s some economic incentive to keeping me happy.

Future Blog Ideas

  • HIPAA de-identification standards seem like they could be linked to social media datasets to reidentify in a way that would be hard for me, as a patient, to detect
  • How to reconcile data rights with the costs associated with data? If data must be accessible, who gets charged for that access? Should there be some compulsory license for me pulling Chrome Sync data out?
