Thursday, June 01, 2006

NSA Data Mining

Stunningly, Schneier has come out against the NSA data mining project built on the phone call database. He may have an accurate analysis somewhere, but it isn't in this article. Personally, I think his analysis is far from imaginative and pretty much assumes that the NSA is full of really dopey people.
Data mining works best when you're searching for a well-defined profile, a reasonable number of attacks per year, and a low cost of false alarms. Credit-card fraud is one of data mining's success stories: All credit-card companies mine their transaction databases for data for spending patterns that indicate a stolen card.

Many credit-card thieves share a pattern -- purchase expensive luxury goods, purchase things that can be easily fenced, etc. -- and data mining systems can minimize the losses in many cases by shutting down the card. In addition, the cost of false alarms is only a phone call to the cardholder asking him to verify a couple of purchases. The cardholders don't even resent these phone calls -- as long as they're infrequent -- so the cost is just a few minutes of operator time.

Terrorist plots are different; there is no well-defined profile and attacks are very rare. This means that data-mining systems won't uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

and
Let's look at some numbers. We'll be optimistic -- we'll assume the system has a one in 100 false-positive rate (99 percent accurate), and a one in 1,000 false-negative rate (99.9 percent accurate). Assume 1 trillion possible indicators to sift through: that's about 10 events -- e-mails, phone calls, purchases, Web destinations, whatever -- per person in the United States per day. Also assume that 10 of them actually indicate terrorists plotting.

This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Clearly ridiculous.

This isn't anything new. In statistics, it's called the "base rate fallacy," and it applies in other domains as well. And this is exactly the sort of thing we saw with the National Security Agency (NSA) eavesdropping program: The New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm, at enormous cost in money and civil liberties.
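The arithmetic in the quoted passage is easy to check. Here is a minimal sketch using only the figures Schneier gives. The one assumption I am adding is that the 1 trillion indicators are an annual total, which is what his per-day numbers imply (roughly 300 million people times 10 events times 365 days). The function and variable names are mine, purely for illustration.

```python
# Figures from the quoted passage. Treating the 1 trillion indicators as a
# yearly total is an assumption implied by the per-day numbers in the quote.
INDICATORS_PER_YEAR = 1_000_000_000_000
REAL_PLOT_INDICATORS = 10
FALSE_POSITIVE_RATE = 1 / 100    # innocent events wrongly flagged
FALSE_NEGATIVE_RATE = 1 / 1000   # real plot indicators missed


def base_rate_summary(total, real, fp_rate, fn_rate):
    """Return the alarm counts for a single screening pass."""
    innocent = total - real
    false_alarms = innocent * fp_rate
    true_hits = real * (1 - fn_rate)
    return {
        "false_alarms_per_year": false_alarms,
        "true_hits_per_year": true_hits,
        "false_alarms_per_true_hit": false_alarms / true_hits,
        "false_alarms_per_day": false_alarms / 365,
        # Chance that any given alarm points at a real plot.
        "precision": true_hits / (true_hits + false_alarms),
    }


summary = base_rate_summary(
    INDICATORS_PER_YEAR,
    REAL_PLOT_INDICATORS,
    FALSE_POSITIVE_RATE,
    FALSE_NEGATIVE_RATE,
)
for name, value in summary.items():
    print(f"{name}: {value:,.4g}")
```

Run as written, this gives roughly 1 billion false alarms per real hit and about 27 million alarms per day, which matches the quoted figures. So the numbers themselves hold up; my disagreement is with what he assumes happens to those alarms next.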

I think Schneier has made a basic logical mistake. He appears to assume that the output of the initial data mining pass is the end product, which I find ludicrous. In every intelligence field you start with a raw product and put it through a refining process. That yields a finer product, but still not the final analysis. The refined data is then checked against other databases covering terrorists and affiliated organizations, people, criminals, etc. Only what survives that cross-checking goes on to human analysts. At that point the highest-probability cases are analyzed, and any action is decided on the basis of more information than just a telephone call.
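To illustrate why a refining process changes the picture, here is a rough sketch of what repeated screening does to Schneier's own numbers. Everything beyond his figures is my assumption: I am pretending each later stage is statistically independent of the one before it and has the same error rates as the first, which real cross-checks against other databases would not exactly satisfy. The function name `staged_filter` is hypothetical.

```python
def staged_filter(total, real, fp_rate, fn_rate, stages):
    """Return (false_alarms, true_hits) left after `stages` passes.

    Assumes every pass is independent and has identical error rates,
    which is an illustrative simplification, not a claim about any
    real system.
    """
    innocent = total - real
    false_alarms = innocent * fp_rate ** stages
    true_hits = real * (1 - fn_rate) ** stages
    return false_alarms, true_hits


# Figures from the quoted article.
TOTAL = 1_000_000_000_000
REAL = 10
FP_RATE = 1 / 100
FN_RATE = 1 / 1000

for stages in (1, 2, 3):
    false_alarms, true_hits = staged_filter(TOTAL, REAL, FP_RATE, FN_RATE, stages)
    print(
        f"{stages} stage(s): {false_alarms:,.0f} false alarms, "
        f"{true_hits:.2f} real hits, "
        f"{false_alarms / true_hits:,.0f} false alarms per real hit"
    )
```

Under these assumptions each extra pass cuts the false alarms by about a factor of 100 while losing very few real hits. The pile is still large after two or three passes, but it is no longer the single-pass number Schneier treats as the final workload.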

That last paragraph also assumes that the NYTimes got its information right, something I am highly skeptical about. Just because the NYTimes has an anonymous source stating that all of the tips were false doesn't mean it is true.

Obviously the data mining isn't going to come up with a direct hit on every run, nor on every 100,000 runs. But it can show where there is activity that law enforcement can watch in a more general way and relate to what is happening locally. Information of this kind can help law enforcement units take additional precautions when and where suspicious activity has been noted at that more general level.

The NSA is full of very, very intelligent people. I think assuming that they are just doing this data mining for busy work is rather shortsighted. Besides, data mining technology gets better when people investigate how to build better models for specific problems. If anything, the present exercise should help the NSA build more effective programs that mine information in a more timely and accurate fashion.

