Monday, December 21, 2009

Netflix: Outing the Gay and Lesbian community since 2006.

Privacy policies - almost nobody reads them. When it comes to social networks and online services, nearly all of them give the service provider the right to release "anonymized" data. Several outlets reported today that a class action suit has been filed against Netflix because the data it releases can actually be traced back to individual users. I first read of it in Wired's Threat Level blog, but one of the most detailed stories is at Ars Technica.

It seems the problem stems from a contest Netflix launched in 2006. It released two sets of data for contestants to work with. The goal was to design an algorithm that predicted the ratings a person would give other movies, based on the ratings in the data sets, 10% better than Netflix's own system. The problem is, video rental data is legally among the most protected in the U.S. The allegation is that by releasing the "anonymized" data, Netflix violated those laws. One of the plaintiffs is an in-the-closet lesbian mother who fears that the released data could out her and hurt her ability to support her family. She has good reason to be concerned. The Netflix contest took place a few months after "anonymized" data from AOL was used by reporters to identify AOL users. So it really wasn't very surprising that just a few weeks after Netflix started its contest, researchers were able to identify Netflix users - along with their political leanings and sexual orientation. Oops.
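To see how "anonymized" ratings get matched to real people, here's a toy sketch of a linkage attack. All the names, movies, and ratings are invented, and this is a drastically simplified version of what the researchers actually did - but the core idea is the same: your pattern of ratings is itself a fingerprint, so anyone who rates a few of the same movies publicly (say, on IMDb under their real name) can be linked back.

```python
# Toy linkage attack (all data invented): match "anonymous" rating
# histories against ratings the same people posted publicly.

# The "anonymized" release: opaque user IDs, (movie, stars) pairs.
anonymized = {
    "user_817": {("Movie A", 5), ("Movie B", 1), ("Movie C", 4)},
    "user_203": {("Movie D", 3), ("Movie E", 2)},
}

# Public profiles where people rated some of the same movies
# under their real names (think IMDb reviews).
public_profiles = {
    "jane_doe": {("Movie A", 5), ("Movie C", 4)},
    "john_roe": {("Movie D", 3), ("Movie E", 2), ("Movie F", 5)},
}

def best_match(anon_ratings, profiles):
    """Score each named profile by how many (movie, stars) pairs
    it shares with the anonymous history; return the top match."""
    scores = {name: len(anon_ratings & rated)
              for name, rated in profiles.items()}
    return max(scores, key=scores.get)

for anon_id, ratings in anonymized.items():
    print(anon_id, "->", best_match(ratings, public_profiles))
```

With only a handful of movies each, the overlap already points to a unique name. The real attack was statistical - it weighted rare movies more heavily and tolerated fuzzy dates - but even this crude overlap count shows why "we removed the names" is not anonymization.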

The second part of the lawsuit seeks to prevent the launch of the next contest. Living proof that stupidity is a lifelong problem (and corporations can live a long time), Netflix wants to provide more "anonymized" data this time - and that data will include zip code, age, and gender. Combine those with the movie ratings and ID numbers and you have more than enough to identify Netflix customers. Again.
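Why are zip code, age, and gender enough? Because together they form a quasi-identifier: for most people that combination is unique, so joining the "anonymized" release against any public dataset carrying the same fields (a voter roll, say) hands back the names. A minimal sketch, with entirely invented data:

```python
# Illustrative sketch (all records invented): re-identify an
# "anonymized" release by joining on the quasi-identifier
# (zip code, age, gender) against a public, named dataset.

# "Anonymized" release: no names, just an opaque ID plus demographics.
anonymized_ratings = [
    {"id": 101, "zip": "68114", "age": 34, "gender": "F", "stars": 5},
    {"id": 102, "zip": "68114", "age": 34, "gender": "M", "stars": 2},
    {"id": 103, "zip": "10027", "age": 51, "gender": "F", "stars": 4},
]

# Public dataset (e.g. a voter roll) with names and the same fields.
public_records = [
    {"name": "Alice Example", "zip": "68114", "age": 34, "gender": "F"},
    {"name": "Bob Example",   "zip": "68114", "age": 34, "gender": "M"},
    {"name": "Carol Example", "zip": "10027", "age": 51, "gender": "F"},
]

def reidentify(anon_rows, public_rows):
    """Join on (zip, age, gender); a unique match re-identifies."""
    index = {}
    for p in public_rows:
        key = (p["zip"], p["age"], p["gender"])
        index.setdefault(key, []).append(p["name"])
    matches = {}
    for row in anon_rows:
        names = index.get((row["zip"], row["age"], row["gender"]), [])
        if len(names) == 1:  # exactly one candidate: got them
            matches[row["id"]] = names[0]
    return matches

print(reidentify(anonymized_ratings, public_records))
```

Every "anonymous" ID in the toy release maps straight to a name - no cleverness required, just a dictionary join. That's why handing out zip, age, and gender alongside the ratings is handing out identities.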

The bad thing about all of this...well, one of the bad things, is that it has been obvious for years that the traditional 'scrubbing' of data is no longer adequate to anonymize it. Mark Dixon looks into the history of re-identification and concludes that if data keeps being handled the way it is now, every time a company releases "anonymized" data it is releasing re-identifiable data.

Unless you are up for canonization by the Catholic Church, that should scare the bejeezus out of you.