The world of Big Data is a world of pervasive data collection and aggressive analytics. Some see the future and cheer it on; others rebel. Behind it all lurks a question most of us are asking—does it really matter? I had a chance to find out recently, as I got to see what Acxiom, a large-scale commercial data aggregator, had collected about me.
At least in theory, large-scale data collection matters quite a bit. Large data sets can be used to create social network maps and can form the seeds for link analysis of connections between individuals. Some see this as a good thing; others as a bad one—but whatever your viewpoint, we live in a world that sees increasing power and utility in Big Data’s large-scale data sets.
Of course, much of the concern is about government collection. But it is difficult to assess just how useful governmental data collection is because, of course, most governmental data collection projects are classified. The good news, however, is that we can begin to test the utility of large-scale data collection in the private sector arena—a useful private sector analog just became publicly available, and it is both moderately amusing and instructive to use it as a lens for thinking about Big Data.
Acxiom is one of the largest commercial, private sector data aggregators around. It collects and sells large data sets about consumers (sometimes even to the government). And for years it did so quietly, behind the scenes—as one writer put it, “mapping the consumer genome.” Some saw this as rather ominous; others as just curious. But it was, for all of us, mysterious.
Until now. In September the data giant made a portion of its data set available to the public. It created a new website—Aboutthedata.com—where a consumer (like you or me) could go to see what data the company had collected about him or her. Of course, in order to access the data about yourself you first had to verify your own identity [I had to send in a photocopy of my driver’s license], but once you had done so, you could see, in broad terms, what the company thought it knew about you … and how close that knowledge was to reality.
I was curious, so I thought I would go and see what it was they knew about me and how accurate they were. The results were by turns interesting, illuminating, and mundane. Herewith a few observations:
To begin with, the fundamental purpose of the data collection is to sell me things—that’s what potential sellers want to know about potential buyers and what, say, Amazon might want to know about me. So I first went and looked at a category called “Household Purchase Data”—in other words what I had bought recently.
It turns out that I buy … well … everything. I buy food, beverages, art, computing equipment, magazines, men’s clothing, stationery, health products, electronic products, sports and leisure products, and so forth. In other words, my purchasing habits were, to Acxiom, just an undifferentiated mass. Save for the notation that I had bought an antique in the past and that I have purchased “High Ticket Merchandise,” it seems that almost everything I buy is something that most any moderately well-to-do consumer would buy.
I do suppose that the wide variety of purchases I made is, itself, the point—by purchasing so widely I self-identify as a “good” consumer. But if that’s the point then the data set seems to miss the mark on “how good” I really am. Under the category of “total dollars spent” for example, it said that I had spent just $1898 in the past two years. Without disclosing too much about my spending habits in this public forum I think it is fair to say that this is a significant underestimate of my purchasing activity.
The next data category, “Household Interests,” was equally unilluminating. Acxiom correctly said I was interested in computers, arts, cooking, reading, and the like. It noted that I was interested in children’s items (for my grandkids) and in beauty items and gardening (both my wife’s interests, probably confused with mine). Here, as well, there was little differentiation, and I assume that the breadth of my interests is what matters rather than the details. So, as a consumer, examining what was collected about me seemed to disclose only a fairly anodyne level of detail.
[Though I must object to the suggestion that I am an Apple user. Anyone who knows me knows I prefer the Windows OS. I assume this was also the result of confusion within the household and a reflection of my wife’s Apple use. As an aside, I was invited throughout to correct any data that was in error. This I chose not to do, as I did not want to validate data for Acxiom—that’s their job, not mine—and I had no real interest in enhancing their ability to sell me to other marketers. On the other hand, I also did not take the opportunity they offered to opt out of their data system completely, on the theory that a moderate amount of data in the world about me may actually lead to my being offered some things I want to purchase.]
Things became a bit more intrusive (and interesting) when I started to look at my “Characteristic Data”—that is data about who I am. Some of the mistakes were a bit laughable—they pegged me as of German ethnicity (because of my last name, naturally) when, with all due respect to my many German friends, that isn’t something I’d ever say about myself. And they got my birthday wrong—lord knows why.
But some of their insights were at least moderately invasive of my privacy, and highly accurate. Acxiom “inferred” for example, that I’m married. They identified me accurately as a Republican (but notably not necessarily based on voter registration—instead it was the party I was “associated with by voter registration or as a supporter”). They knew there were no children in my household (all grown up) and that I run a small business and frequently work from home. And they knew which sorts of charities we supported (from surveys, online registrations and purchasing activity). Pretty accurate, I’d say.
Finally, it was completely unsurprising that the most accurate data about me related to the most easily measurable and widely reported aspect of my life (at least in the digital world)—namely my willingness to dive into the digital financial marketplace. Acxiom knew that I had several credit cards and used them regularly. It had a broadly accurate understanding of my household’s total income range [I’m not saying!].
They also knew all about my house—which makes sense since real estate and liens are all matters of public record. They knew I was a home owner and what the assessed value was. The data showed, accurately, that I had a single family dwelling and that I’d lived there longer than 14 years. It disclosed how old my house was (though with the rather imprecise range of having been built between 1900 and 1940). And, of course, they knew what my mortgage was, and thus had a good estimate of the equity I had in my home.
So what did I learn from this exercise?
In some ways, very little. Nothing in the database surprised me and the level of detail was only somewhat discomfiting. Indeed, I was more struck by how uninformative the database was than how detailed it was—what, after all, does anyone learn by knowing that I like to read? Perhaps Amazon will push me book ads, but they already know I like to read because I buy directly from them. If they had asserted that I like science fiction novels or romantic comedy movies, that level of detail might have demonstrated a deeper grasp of who I am—but that I read at all seems pretty trivial information about me.
I do, of course, understand that Acxiom has not completely lifted the curtains on its data holdings. All we see at About The Data is summary information. You don’t get to look at the underlying data elements. But even so, if that’s the best they can do ….
In fact, what struck me most forcefully was (to borrow a phrase from Hannah Arendt) the banality of it all. Some, like me, see great promise in big data analytics as a way of identifying terrorists or tracking disease. Others, with greater privacy concerns, look at big data and see Big Brother. But when I dove into one big data set (albeit only partially), held by one of the largest data aggregators in the world, all I really was, was a bit bored.
Maybe that’s what they wanted as a way of reassuring me. If so, Acxiom succeeded, in spades.
Paul Rosenzweig is a former Carnegie Fellow for the Medill National Security Journalism Initiative. He is author of “Cyber Warfare: How Conflicts in Cyberspace Are Challenging America and Changing the World” and former deputy assistant secretary of the Department of Homeland Security.
The expanding use of drones over U.S. airspace has become a fast-growing national security topic and privacy concern. We asked our colleague Paul Rosenzweig, who co-authored a recent Heritage Foundation paper on drones, to weigh in.
Flying drones—unmanned aerial vehicles—have been made famous by their use in the war on terrorism, notably through lethal strikes on terrorist targets in Iraq and Afghanistan. Such drones are a small fraction of those used by the United States today. There are thousands of drones, used for a wide variety of purposes, from scientific research to military operations.
Drones, mostly without a weapons capability, are employed by the government and the private sector. Because of the drones’ wide-reaching surveillance capabilities, however, even unarmed drones could threaten personal privacy and civil liberties. As the FAA develops regulations for the operation of drones in domestic skies, it must take into account constitutional concerns and privacy rights.
In a research paper for the Heritage Foundation (see bottom of story), three colleagues and I gave some thought to what the right set of rules and regulations might look like. Here, in summary, are some thoughts about how drones should be used domestically.
First, we should consider the redlines or limitations:
On the other hand, there are plenty of situations where domestic drone use is both sensible and practical and where our effort should be to authorize and regulate the use, rather than limit it or restrict it. For example, consider these uses:
The list can go on. And of course in any such listing there will be value judgments and choices that need to be made. So the fundamental recommendation that the paper makes is for Congressional engagement. After all, these are interesting, difficult and indeterminate choices we will be making – isn’t that what our elected representatives are for?
Good thing then that a new DARPA project has the same name!
DARPA (the Defense Advanced Research Projects Agency) recently announced that it would be funding a project known as ADAMS (Anomaly Detection at Multiple Scales). According to the Homeland Security Newswire, “Researchers in a 2-year, $9 million project will create a suite of algorithms that can detect multiple types of insider threats by analyzing massive amounts of data — including email, text messages and file transfers — for unusual activity.”
That sounds a lot like the old Total Information Awareness program, a controversial data-analysis program shelved by DARPA at Congressional insistence in 2003.
ADAMS seems like a new form of government data mining, one that may also be subject to Congressional scrutiny. To learn more about data mining, visit “The Data Minefield,” Medill’s web-based resource bank for journalists and citizens who want a practical understanding of the issues.
The government says that it didn’t need one. It argues that a person has no reasonable expectation of privacy in his travel on public roads: after all, the police could have tailed Jones in an unmarked vehicle without needing a warrant. Jones argues, however, that GPS tracking devices are uniquely intrusive — that they allow the government to collect a large volume of geo-location tracking data and use it to build a “mosaic” picture of a person, learning, for example, what church he goes to, what bar he drinks at, and whether or not he is a regular gym attendee.
The arguments before the Court reflected the challenge that the Justices will face in drawing lines. They were concerned, for example, by the government’s acknowledgement that the logic of its argument would place no Constitutional limits on the ability to put a GPS in literally every car in America — including even the cars driven by the Justices.
On the other hand, Jones’ attorney had difficulty drawing a firm line between visual surveillance and GPS surveillance. Did the Fourth Amendment require the government to use an inefficient method, the Justices asked? Repeatedly the Court returned to the idea that the resolution of the problem might lie in legislation rather than in a constitutional rule.
The government lost the case in the appellate court and the Supreme Court will issue its opinion before June of next year. Observers in the Court (including me) are deeply uncertain as to the ultimate result.
The Medill National Security Journalism Initiative has recently completed a project to develop resources for journalists on all sorts of issues related to the phenomenon of data mining — we call it “The Data Minefield.” At the Minefield, reporters can learn the science of data mining; review new technologies; hear from the experts on how to find stories; and browse through a collection of resources about government data mining projects. The Jones case is just one example. If you are interested in the topic, or just curious, The Data Minefield is for you.
My wife and I are on holiday in New Zealand and earlier today we took a domestic flight from Wellington to Nelson. It was a short commuter hop — 30 minutes, across the strait separating the North and South Islands. On the whole an utterly unremarkable experience, just like any number of flights we’ve taken before.
Save for one thing — no security. We arrived for the flight with our e-tickets in hand, scanned them at a kiosk, dropped our bags on the conveyor and walked to the gate. No ID check; no metal detector; no X-ray of our carry-on bags. Probably no X-ray of the checked luggage either, but we couldn’t tell for sure. We scanned our boarding passes again at the gate, but no ID check. Nothing. In short, it felt like something from before 9/11 — and possibly even from before the 1980s and the advent of hijacking.
It was so remarkable precisely because it was so unusual. But as I sat thinking about it, I realized that it really was the reflection of a wise policy. Had we been boarding an international flight, we’d have gotten the whole nine yards — ID, bag check, metal detector, etc. But New Zealand has made the risk assessment that its inter-island domestic flights are not realistic targets. Though in theory just as vulnerable as any other flight, they simply do not face a real threat of attack.
And so they don’t waste time and money protecting something that doesn’t need protecting. It’s comforting to see that kind of rationality in the world — even if we had to come all the way to New Zealand to see it.
In the end, maybe more than we realize. Today’s cloud systems use “thin clients” — simple interfaces like Google’s Chrome system — with minimal independent computing power. All of the data, software, operating systems, and processing resources are stored in the cloud, managed by a cloud system administrator.
If that sounds familiar, it should. We are, quickly, recapturing the system configuration of the early 1980s, when dumb terminals (little more than a screen and a keyboard) connected to a mainframe maintained by a systems administrator. The administrator made the resource allocation decisions, prioritized work and controlled access to the processing systems. So the translation is clear: thin client = dumb terminal; cloud = mainframe.
That centralized system of control is fundamentally authoritarian. Today’s internet structure empowers individuals. On my laptop I have more processing power and data storage capacity than was imaginable in the mainframe era. From here I can link to the web and communicate with anyone. I choose my own software, save my own data, and innovate as and when I please.
In a cloud system, as in the old mainframe system, I make none of those decisions: my software is provided by the system administrator, who stores my data and controls which new innovations are made available. That is a fundamentally authoritarian model, in which I lose much of the independence that has made the web a fountain of innovation and invention.
In a liberal western democracy, perhaps that is not a problem — after all, I don’t have to choose Google as my cloud provider if I don’t want to. But in more authoritarian states, the trend toward the cloud will make citizens even less able to control their own destiny. The internet empowered the liberty of dissent; we should be concerned that the cloud may take it away.
But what’s really a mess is how our Representatives (and, sometimes, the press) report these sorts of numbers. They are always portrayed as absolute values and in that abstract context they seem immense. Who, after all, could approve of 2,500 mistakes per year?
But the abstract context is just that — abstract. Numbers have meaning only in a concrete context. So how about this for context: domestically, there are approximately 2 million enplanements (passengers boarding aircraft) every day. That’s roughly 700 million passengers a year, or 7 billion passengers over the 10 years for which the security-breach data are reported (and bear in mind that this counts every security breach, however minor). Set 25,000 breaches against 7 billion passengers and you get an error rate of roughly 0.0004 percent. In what human endeavor is that considered poor performance?
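The back-of-the-envelope arithmetic is easy to check. A minimal sketch, using the round figures cited above (the 365-day year and the exact ten-year totals are my own assumptions, not official statistics):

```python
# Back-of-the-envelope check of the breach-rate arithmetic.
# All figures are the approximations cited in the text, not official data.
enplanements_per_day = 2_000_000
enplanements_per_year = enplanements_per_day * 365   # roughly 730 million
total_enplanements = enplanements_per_year * 10      # roughly 7.3 billion over 10 years

breaches_per_year = 2_500
total_breaches = breaches_per_year * 10              # 25,000 breaches reported

rate = total_breaches / total_enplanements
print(f"breaches per enplanement: {rate:.2e}")
print(f"as a percentage: {rate:.5%}")
```

However the rounding is done, the rate works out to a few breaches per million passengers, which is the point the paragraph is making.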
So … absolute numbers mislead. What we really need to know is what types of breaches occurred, how systematic the vulnerabilities are, and what, if anything, TSA is doing to fix the issues identified. Those are interesting questions. Talking about numbers out of context is not.
]]>When disaster strikes, a large number of resources need to be mobilized. The larger the disaster, the more resources are needed, and the greater the need for coordination. But given how infrequent large-scale disasters are (thankfully!) we don’t have a lot of practice with that sort of coordination.
The Federal government runs a robust training and exercise program that models disaster response by having all the players respond to a hypothetical disaster. They run both small regional programs and, annually, a National Level Exercise that models a major catastrophe. This year, NLE 2011 is an exercise that asks “what would happen if we had a major earthquake along the New Madrid fault line in the Midwest?” The three-day exercise is scheduled to begin today.
The exercise is designed to test a range of government and private sector functions including: Communications, Logistics, Mass Care, Medical Surge Capacity, Evacuation, Emergency Public Information and Warning, and the activation of an Emergency Operations Center. Participants will include the Federal government, eight states, dozens of local governments and the private sector. When exercises like this are done right, they identify gaps and overlaps, produce operational familiarity and develop useful relationships.
As useful as planning and exercises are, sometimes reality intrudes. Many states in the Midwest are in the middle of dealing with a historic flood of the Mississippi. Others are still recovering from recent devastating tornadoes. So even though “the show must go on,” it will have to go on without them.
Instead of joining the national exercise with first responders simulating their activities, Tennessee, Alabama, and Mississippi will have to play the game virtually, sitting around a table making decisions; they’re busy dealing with real life.
The rest of the states will play the game, responding to a simulated disaster, as if an earthquake had actually occurred.
Meanwhile their colleagues in the South will go about the grim business of responding to the real disasters of tornadoes and flooding — and in their own way, demonstrating why planning and exercises are so vital.