Edward Snowden has shown that he’s still an almighty pain in GCHQ’s backside by leaking a document that describes the spy agency’s approach to data collection. The ‘Data Mining Research Problem Book’ is essentially a top-secret manual designed to help spies, well, spy.
While there’s too much online information for GCHQ to properly sift through -- meaning that the vast majority of content simply needs to be discarded -- the doc explains that all metadata can be retained. That essentially means that GCHQ is pulling in absolutely everything it can pull in, because who's going to stop it?
“There are extremely stringent legal and policy constraints on what we can do with content, but we are much freer in how we can store and use metadata. Moreover, there is obviously a much higher volume of content than metadata. For these reasons, metadata feeds will usually be unselected—we pull everything we see; on the other hand, we generally only process content that we have a good reason to target.”
The handbook also warns about ‘false positives’, advising spies to chase leads only when the odds of a genuine hit are high.
“It is important to point out that tolerance for false positives is very low: if an analyst is presented with three leads to look at, one of which is probably of interest, then they might have the time to follow that up. If they get a list of three hundred, five of which are probably of interest, then that is not much use to them.”
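The trade-off the handbook describes is what information-retrieval folk call precision: the fraction of leads shown to an analyst that are actually worth their time. A minimal sketch of the arithmetic behind the quote (the two scenarios and their numbers come from the excerpt above; the function name is ours, not GCHQ’s):

```python
def precision(interesting_leads: int, total_leads: int) -> float:
    """Fraction of presented leads that are genuinely of interest."""
    return interesting_leads / total_leads

# The handbook's two scenarios:
short_list = precision(1, 3)    # one interesting lead out of three
long_list = precision(5, 300)   # five interesting leads out of three hundred

print(f"short list: {short_list:.2f}, long list: {long_list:.3f}")
```

A three-item list with one real hit gives roughly 33% precision; a 300-item list with five hits gives under 2%, even though it contains five times as many genuine leads — which is why the handbook calls the longer list “not much use”.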
It’s worth mentioning that the document was penned back in 2011, and GCHQ’s techniques are likely to have evolved since then. The handbook was created by researchers from the Heilbronn Institute for Mathematical Research in Bristol, a partnership between GCHQ and the University of Bristol. [ArsTechnica, BoingBoing]