Homeland security’s exabyte problem
“There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
That’s a quote attributed to Eric Schmidt, who was Google’s chief executive officer when he said it.
An exabyte is one quintillion bytes. I can’t get my brain around one quintillion, let alone five of them. I fear the zettabyte (one sextillion bytes) is right around the corner. I don’t know what any of that means. But I’m persuaded it’s a very big number.
Secretary Napolitano has talked about “the big data problem”: the volume, variety, velocity and veracity of data generated by homeland security activities exceed the ability of the enterprise to understand what the data means. Hence “big data.”
—————————————
Writing in November’s issue of Foreign Policy, Uri Friedman offers a brief history of how we got to the big data problem. Here are some excerpts, liberally adapted from Friedman’s article.
—————————————
- 1887-1890: Modern data processing age begins. Herman Hollerith invents a machine that reads holes punched into paper cards.
- 1935-1937: Passage of the Social Security Act prompts the government to start keeping records on 26 million Americans and 3 million employers.
- 1943: The British code breakers at Bletchley Park invent Colossus, the first programmable electronic computer; Colossus can read 5,000 characters a second.
- 1961: The U.S. National Security Agency starts using computers to collect and analyze signals intelligence.
- 1965-1966: The national government considers transferring all government records, including over 700 million tax records and almost 200 million fingerprints, to a single data center.
- 1974: The Privacy Act is enacted, limiting the personal information government can share.
- 1989: Tim Berners-Lee develops the idea of the World Wide Web. “The information contained would grow past a critical threshold…so that the usefulness [of] the scheme would in turn encourage its increased use.” Or, “If you build it, they will use it in unimaginable ways.”
- 1996: Bill Clinton claims “We are developing a supercomputer that will do more calculating a second than a person with a handheld calculator can do in 30,000 years.”
- 1997: Michael Cox and David Ellsworth use the term “big data” for the first time. “Data sets [they were working with] are generally quite large, taxing the capabilities of main memory, local disk, and even remote disk.… We call this the problem of big data.”
- 2002: John Poindexter leads a Department of Defense effort to combine government data sets into one “grand database” that would sift through communications, criminal, educational, financial, medical, and travel records to identify suspicious people.
- 2004: The 9/11 commission calls for a unified network-based information and intelligence sharing system.
- 2007-2008: Social networks proliferate. Wired magazine writes about the end of theory, “a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear” on analyzing and understanding the data generated by the explosion of network activities.
- 2009: The Indian government approves a plan to fingerprint, photograph, and take an iris scan of its 1.2 billion people, and assign each person a 12-digit identification number, creating the world’s largest biometric database. The nominal purpose of the program is to “improve the delivery of government services and reduce corruption.”
- 2009: The Obama administration starts data.gov in support of its open government initiative. The website reportedly has more than 445,000 data sets.
- 2009: The United Nations announces plans to create an alert system that captures “real time data on the impact of the economic crisis on the poorest nations.” The aim is to predict “everything from spiraling prices to disease outbreaks by analyzing data from” mobile phones, social networks, and related sources.
- 2010: Each day, the National Security Agency intercepts and stores over 1.7 billion emails, phone calls and other communications. Walmart claims to hold over 460 terabytes of information about its customers’ shopping and related habits.
- 2011: IBM’s Watson computer defeats two humans on the television show Jeopardy. The computer system can scan 200 million pages of information – that’s four terabytes of data – in a few seconds.
- 2012: The Obama administration announces a “big data research and development initiative” to respond to a U.S. government report that calls for every federal agency to have a big data strategy. The National Association of State Chief Information Officers makes the same argument for state agencies.
- 2012: Facebook has more than 900 million users, who post 300 million photographs every day, along with 3.2 billion new comments and “likes.”
- 2012: Hillary Clinton announces a public-private partnership called “Data 2X” to collect statistics on the political, economic, and social status of women and girls around the world. “Data not only measures progress – it inspires.… Once you start measuring problems, people are more inclined to take action to fix them because nobody wants to end up at the bottom of a list of rankings.”
- 2018: Everyone knows everything all the time about everyone else, but there remains a great deal of confusion and uncertainty. (I made some of that up.)
—————————————
Meanwhile, back in the King James Version of Ecclesiastes, the poet writes about the big data problem this way:
“And further, my son, be admonished by these: of making many books there is no end; and much study is a weariness of the flesh.”
Or maybe, at the risk of triggering a Palin response, the big data problem is not as new as the buzz suggests.
“The thing that hath been, it is that which shall be; and that which is done is that which shall be done: and there is no new thing under the sun.”
I am naively optimistic humans will — somehow — learn how to engage with big data, as we did with the piddling 285 terabytes of data produced by the printing press.







