A research site at America Online recently posted three months of search records for 500,000 people (over 20 million searches) on the Internet. The data was discovered over the weekend, and news of it has quickly spread across the blogosphere and into the mainstream media. AOL rapidly removed the data from its site, but the cat’s already out of the bag – the files were copied and have been replicated all over the Internet.
Anyone can download the 439 MB file, just like I did last night. People are already poring through the data, finding some very disturbing search patterns among a number of AOL’s users. In theory, there is no personally identifiable information in the database, but if people ran searches that reveal things about themselves, it often becomes easy to figure out who they are. In many ways, this is a worse privacy loss than the theft of a Veterans Administration employee’s laptop earlier this spring would have been, had the data on that laptop been compromised.
This inadvertent disclosure makes a public debate necessary on the retention and use of search data by private companies, and on the propriety of its use by government agencies. In January we learned that Google refused a DOJ subpoena to supply the government with exactly this kind of data – a request with which Yahoo!, AOL and MSN complied. These companies are compiling petabytes of search data on their servers, effectively archiving the collective subconscious of hundreds of millions of people.
This information clearly has value from a marketing and business intelligence perspective, which is why the search companies are retaining it. But this data then becomes an overly tempting target for homeland security and counterterrorism officials. Should they be able to access it? Under what conditions? Who should be allowed to see it? And what is the actual value of the search information? We need to answer these questions, and in doing so develop a clear framework to guide how and when such information should be available to government officials, rather than continuing along in the legal and policy vacuum that the United States is in today.
We need a framework that allows narrow access to this search data in cases where a person or group is under investigation for activities related to terrorism, counterintelligence, or WMD proliferation. But I would forbid access to this data for the purpose of conducting wide-ranging analysis – looking for needles in the haystack – because the benefits would not be nearly commensurate with the massive privacy hit. The search companies also need to be more responsible in their use of this data: they should develop policies and systems for destroying it after a finite period of time (1-2 years), and give users the ability to clear and remove personally identifiable search histories from company servers.
This assessment is based in part on some cursory analysis of the AOL data last night. In cases where I found “suspicious” searches, I could never be certain about the actual intent of the search. This inability to divine intent from searches will naturally lead to a high percentage of false positives. For example, anyone who works in the homeland security field, as I do, is likely to run searches related to terrorist tactics, infrastructure protection, etc. If flagged, these searches would all be false positives, and they would likely drown out any “real” terrorist search activity. Efforts to investigate these searches would therefore be expensive, and less productive than traditional means of intelligence and investigation.
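To make the false-positive concern concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is hypothetical and chosen only to show the arithmetic: the count of genuine targets, the hit rate, and the false-alarm rate are assumptions, not figures drawn from the AOL data, and the function name `flagged_precision` is made up for this illustration.

```python
def flagged_precision(population, true_targets, hit_rate, false_alarm_rate):
    """Estimate how many flagged users are real targets versus false positives.

    population:       total number of users whose searches are screened
    true_targets:     users with genuinely malicious intent (hypothetical)
    hit_rate:         fraction of true targets the screen flags
    false_alarm_rate: fraction of innocent users the screen also flags
    """
    innocent = population - true_targets
    true_hits = true_targets * hit_rate
    false_hits = innocent * false_alarm_rate
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision


# Hypothetical inputs: 500,000 users, 5 real targets, a screen that
# catches every real target but also flags 0.1% of innocent users.
true_hits, false_hits, precision = flagged_precision(
    population=500_000,
    true_targets=5,
    hit_rate=1.0,
    false_alarm_rate=0.001,
)
print(f"real: {true_hits:.0f}, false: {false_hits:.0f}, precision: {precision:.2%}")
```

Under these assumed inputs, roughly 500 innocent users are flagged alongside the 5 real ones, so about 99 of every 100 leads would be dead ends even with a screen that never misses. Different assumptions shift the ratio, but the pattern holds whenever real targets are rare relative to the population being searched.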
If the federal government is allowed unfettered access to this data, we run the risk of creating a new Orwellism – Searchcrime – and an inefficient tool for fighting the war on terror.