Could Tube Wifi Data Be Used To Identify And Track Individuals?

By James O'Malley

On the 15th September, London sadly became the target of terror again. At around 8:20am, there was an explosion on board a packed District Line train at Parsons Green station. Mercifully, no one was killed - though 30 people were taken to hospital for treatment, mostly for burns. Over the weekend that followed, there was an enormous manhunt looking for the suspected perpetrators, and the terror threat level was raised to “critical” whilst the police did their thing. At the time of writing, unsurprisingly, the case is ongoing.

A few days earlier, Transport for London (TfL) - the body which runs the Tube - happened to release an intriguing report, looking into the findings of a trial in which, for a month at the end of last year, wifi signals were used to track passenger journeys across the network. The idea is that as we travel across the Tube network, wifi beacons in stations would detect the unique ID - the MAC address - of our phones, tablets and other devices - even if we’re not connected to the Tube’s wifi network. (It’s nothing to do with Apple Macs - MAC simply stands for “media access control”, and every wifi device has one.)

As we explained in detail, the hope for transport planners is that this data can then be analysed in order to better understand the journeys people make on the tube, and this can then inform how demand is managed in the future (for example, with route-planning apps recommending passengers take less crowded routes). Judging by the results, the technology has the potential to be genuinely transformative - and could enable London to squeeze even more capacity out of its existing transport infrastructure. And anyone who has ever caught the Tube at rush hour can appreciate that.

Off the back of the report, unsurprisingly, TfL is now trying to figure out how to switch on this sort of tracking full time on the tube network.

Taken together, these two stories raise some interesting questions: What if the police, security services or other bodies wanted access to this data? Could they get it? And what would that mean for both the fight against terrorism and our privacy?

Specifically in the case of Parsons Green, it makes one wonder: What if this sort of tracking had been active on the day of the bombing? Could the police conceivably have obtained the bomber's location data from TfL's wifi tracking system - and then used this data in their investigation, as part of attempts to locate the perpetrator?

Officially, the answer is "no". When I put this hypothetical question to TfL's Chief Data Officer, Lauren Sager Weinstein, she repeatedly emphasised that for legal reasons TfL would be unable to share the data it has collected. "We had said we were collecting the data for this purpose [ie: doing all of the clever transport pattern wizardry] and we would not share it with anybody. That is the condition on which we had collected the information", she explained to me.

She also pointed out in no uncertain terms that during the trial last year, not only were no requests for data received at any point, but at the end of the trial all of the collected data was deleted.

So it seems like an open and shut case, right? Where this gets interesting is the difference between what is legally possible and what is technically possible. This is important, because of what it means for our privacy: Is the only thing protecting us the laws and rules that govern TfL’s behaviour? Or are we also protected by technical constraints? The laws of data protection, after all, are easier to change than the laws of mathematics.

Let's dig into how the data was stored - and examine whether our tracking data could conceivably be extracted from TfL's database, assuming the same setup is used if and when the system rolls out full time.

With such a vast amount of data collected, it is reassuring that, judging by the official report on the trial, TfL has tried to behave responsibly when it comes to safeguarding data and our privacy.

The report says that the only data from each user it collected was “An encrypted, depersonalised version of the device MAC address, the date and time the device broadcast its MAC address, the access point it connected to, the device manufacturer and the device association type.”

Crucially, this includes an "encrypted, depersonalised" version of the MAC address - which on the next page it describes as "pseudonymised".

[Image: all of the images in this piece are mock-ups by us doing our best impression of Minority Report.]

So does this mean that - hypothetically, if the wifi tracking had been running - if the police had phoned up asking for tracking data with which to identify or locate the bomber, TfL would have been technically (if not legally) able to hand over the data? Again according to the report, TfL seems to think not, saying that:

“As we cannot process known MAC addresses in the same manner as we did in the pilot, we are unable to complete any Subject Access Request for the data we collected.”

In other words, TfL's report is claiming that though it holds MAC address data, that data was transformed before being stored, and translating it backwards isn't possible. (For the avoidance of doubt, back in the real world TfL has long since deleted the wifi tracking data it collected for the trial, in case you're wondering.)

The reality, however, is more complicated.

Dr Lukasz Olejnik is an independent cybersecurity and privacy consultant and researcher. He's been taking a look at the official TfL report on the trial, as well as a number of documents I obtained via a Freedom of Information request earlier in the year. He thinks there is some confusion in what TfL has said about how it collected the data. And given what we know about how the data was collected, it does seem to him as though it is conceivable that TfL's data could be reverse-engineered to reveal the original MAC addresses - even if TfL themselves do not realise this yet.

For example, TfL says that it stores "hashed" MAC addresses. This is where a device's MAC address - a string of characters such as "00-14-22-01-23-45" - is put through a one-way cryptographic function, so that it is stored as, say, "x1Jx7F893lL4jO" (to use one of TfL's examples of a hashed MAC address). Crucially, hashing only works in one direction, so there's no algorithm that can transform "x1Jx7F893lL4jO" back into "00-14-22-01-23-45".
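To make that concrete, here's a minimal sketch in Python of what hashing a MAC address might look like. TfL hasn't disclosed which algorithm it actually used, so the SHA-256 function here is purely our assumption for illustration:

```python
import hashlib

def hash_mac(mac_address: str) -> str:
    """One-way hash of a MAC address. SHA-256 is an assumption here -
    TfL has not said which hashing algorithm it actually used."""
    return hashlib.sha256(mac_address.encode("utf-8")).hexdigest()

print(hash_mac("00-14-22-01-23-45"))
# Prints a long hexadecimal string. Computing it forwards is trivial,
# but there is no function that turns the hash back into the address.
```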

Dr Olejnik calls this sort of one-way encryption a “cornerstone of modern cryptography”, as it is the reason things like passwords work to keep our data safe. So hashing sounds pretty secure, right?

One very crude way of getting around hashing might be to "brute force" it. If you know which hashing algorithm was used, you can effectively work backwards: hash every possible input, store the results in what's called a "lookup table", and compare them against the hashes you want to crack.

Imagine you wanted to know every possible hash for a four-digit PIN, for example. You'd start by hashing 0000, then 0001, then 0002 and so on, all the way to 9999. Once you've got them all, you can take any hashed PIN from a database, find the matching hash in your table, and suddenly you have your target's PIN.
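As a toy illustration, here's what building and using such a lookup table might look like in Python - again with SHA-256 standing in, as an assumption, for whatever algorithm is actually used:

```python
import hashlib

def hash_pin(pin: str) -> str:
    # SHA-256 is assumed purely for illustration.
    return hashlib.sha256(pin.encode("utf-8")).hexdigest()

# Build the lookup table: hash every PIN from 0000 to 9999.
lookup = {hash_pin(f"{n:04d}"): f"{n:04d}" for n in range(10_000)}

# Given a hash lifted from some database, recovering the PIN is
# now just a dictionary lookup.
stolen_hash = hash_pin("4821")   # stand-in for a hash found in a database
print(lookup[stolen_hash])       # -> 4821
```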

[Image: a mock-up with MAC addresses overlaid - looks pretty spooky, right? Obviously TfL's real systems couldn't do this.]

With hashed MAC addresses it could work in the same way. In theory, there are 281,474,976,710,656 different possible MAC addresses (because they have 12 hexadecimal digits, rather than the four we use for PINs). But in practice - unlike password storage, where a similarly astronomical number of combinations would make a brute force attack too time consuming - there are far fewer combinations of MAC addresses to try, because of the way they are allocated.

Each device manufacturer - such as Apple and Samsung - is allocated a specific block of addresses they can use on their devices, meaning that all iPhones will contain certain Apple-owned combinations of characters, and all Galaxy Note 8s will have Samsung-owned combinations, and so on. If you’re interested in tracking down smartphones on the Tube, Dr Olejnik notes that you could conceivably “precompute” them all.
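Here's a sketch of what that precomputation might look like for a single manufacturer prefix, again assuming an unsalted SHA-256 hash. The "00-14-22" prefix just reuses the example address from earlier; a real attacker would loop over the publicly registered prefixes for each manufacturer they cared about:

```python
import hashlib
from itertools import product

def hash_mac(mac: str) -> str:
    # Unsalted SHA-256, assumed purely for illustration.
    return hashlib.sha256(mac.encode("utf-8")).hexdigest()

def precompute_oui(prefix: str) -> dict:
    """Hash every address under one manufacturer prefix (OUI): 256**3,
    or roughly 16.7 million, addresses per prefix - a lot, but feasible
    on ordinary hardware (in practice you'd write the table to disk
    rather than hold it all in memory)."""
    table = {}
    for a, b, c in product(range(256), repeat=3):
        mac = f"{prefix}-{a:02X}-{b:02X}-{c:02X}"
        table[hash_mac(mac)] = mac
    return table

table = precompute_oui("00-14-22")
```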

So imagine this system was up and running with hashed MAC addresses, and the police wanted to find someone on the Tube network. Assuming they knew the MAC address of a suspect's device, they could simply use a lookup table of precomputed MAC address hashes to extract the suspect's movements from TfL's massive dataset.
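In fact, if the suspect's MAC address is already known, the full table isn't even needed: hashing that one address and filtering the records would do. A sketch, with made-up record fields purely for illustration:

```python
import hashlib

def hash_mac(mac: str) -> str:
    return hashlib.sha256(mac.encode("utf-8")).hexdigest()   # assumed algorithm

# Hypothetical records in roughly the shape TfL describes: a hashed MAC,
# a timestamp and the access point that detected the device.
records = [
    {"hashed_mac": hash_mac("00-14-22-01-23-45"), "time": "08:19", "station": "Parsons Green"},
    {"hashed_mac": hash_mac("AA-BB-CC-DD-EE-FF"), "time": "08:19", "station": "Earl's Court"},
]

# Hash the suspect's known MAC address the same way and filter.
suspect_hash = hash_mac("00-14-22-01-23-45")
movements = [r for r in records if r["hashed_mac"] == suspect_hash]
print(movements)   # every sighting of the suspect's device
```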

To further protect the data, in the trial TfL didn't simply hash the MAC addresses - it added what is called a "salt" to the process. In essence, this is an extra, secret ingredient mixed into the hashing formula, which makes the data harder to reverse engineer with a lookup table of the sort described above.

For example, imagine that an extra string of characters is stuck on the end of each MAC address before it is hashed. So instead of hashing "00-14-22-01-23-45", the ID being hashed is instead "00-14-22-01-23-45-Bananas", where "Bananas" is the salt. This means that if someone wants to reverse-engineer the hashes, they will also need to know that "Bananas" is the salt - otherwise the hashes they generate won't match the ones in the TfL database.
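In code, the difference might look something like this - with SHA-256 still assumed, and "Bananas" standing in for a salt that nobody outside the system is supposed to know:

```python
import hashlib

def hash_mac_salted(mac: str, salt: str) -> str:
    # The salt is appended before hashing, so a lookup table built
    # without knowing it produces hashes that never match.
    return hashlib.sha256(f"{mac}-{salt}".encode("utf-8")).hexdigest()

salted   = hash_mac_salted("00-14-22-01-23-45", "Bananas")
unsalted = hashlib.sha256("00-14-22-01-23-45".encode("utf-8")).hexdigest()
print(salted == unsalted)   # False - a precomputed, unsalted table is useless here
```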

[Image: a mock-up in which we even made the crosshairs look a bit like a Tube map.]

Using a salt - which TfL notes is in line with guidance provided by the government's Information Commissioner - adds an extra layer of security. But does it make the data completely anonymised? Not really: if someone knows the salt and the hashing mechanism, they could still conceivably build a lookup table and access the wifi tracking data.

So the question becomes: who knows what the salt is? This is where Dr Olejnik has spotted some mixed messages from TfL.

In the official report, TfL says that “The salt is not known by any individual and was destroyed on the day the data collection ended.”

But in the Data Privacy Impact Assessment that we obtained via FOI, it notes that “Very few people will know the salt string”. This is important because “very few people” is different to no one - it might be the difference between anonymisation and pseudonymisation. And it could have a profound impact on whether this tracking data could be used to identify and track individuals.

For example, take Parsons Green again. If the police were to make a formal request for the tracking data associated with a given MAC address, and the salt genuinely isn't known by anyone, retrieving that information should essentially be impossible. But if it is known by even just one individual - whether an executive, a manager or the person in IT who maintains the database - the hashing could be cracked wide open, and the tracking data could be retrieved.

In the process of writing this article, I asked Lauren Sager Weinstein about the salt, and she confirmed to me that it was "blindly typed" for the trial. But obviously, in order for the salt to work with the system, it also had to be stored in a file on that system - which could conceivably have been accessed (but wasn't). At the end of the trial, it was deleted. So in essence, in this hypothetical scenario, if the same system were used, the salt could have been accessible.

So as things currently stand, it appears that what prevents TfL from accessing and sharing our data is not a technical limitation, but a legal one (albeit with several technological tricks to pseudonymise the data, as described above). This is important because it plays into a wider debate about tracking and privacy. After all, the Tube is London's central nervous system. Every day the system handles around five million passenger journeys. By logging these journeys, TfL is in effect building a database that follows our movements everywhere, on a daily basis. If you can be traced using your MAC address data, are you comfortable having your movements tracked to this extent?

Interestingly too, it isn’t simply MAC address data that could conceivably be used for this purpose. “Research suggests that individual mobility patterns exhibit unique patterns”, explains Dr Olejnik, adding, “This risk might be limited in the case of TfL due to the numbers of commuters, especially in peak hours. [...] Can we still say the same on a late evening hour, and a relatively less used line and station”?

In other words, even with scrambled MAC address data, your travel patterns alone could single you out. While a station like Euston might afford you anonymity just through strength in numbers, if you’re a regular user of Roding Valley, the least used station on the line, that might enable you to be more easily identified.

Heck, it is possible to imagine a situation where - technically, if not legally - even if the MAC address of the tracking target was unknown, a running tracking system could have been used to call up all of the hashed MAC addresses detected at Parsons Green at the time the bomb went off, and then to work all of those passenger journeys backwards to the Oyster gatelines, where everyone on the train would have been captured on CCTV or had their Oyster card logged as they touched in. That would provide a means of linking encoded data to real identities.

[Image: another mock-up. Plot twist! Call me, Hollywood graphic designers.]

I framed this piece around the Parsons Green bombing as an example of when tracking data might have a use beyond managing the Tube network - but this almost literal "ticking time bomb" scenario is very different from the other circumstances in which authorities may want to use the data.

For example, if the data could be reverse engineered, it is easy to imagine that - if the legal regime allowed it - the data could be used in police investigations to prove that people were where they said they were. And while it might be hard for non-privacy nerds to object to using the technology to track suspected terrorists, we might feel differently if the technology was used to track people suspected of lesser offences or for other reasons. If this data was accessible, would it be worth the privacy trade-off to use it to track down fare evaders, for example?

It may seem like a weird, hypothetical thing to worry about - but in my view these are important considerations, because once the tools are in place, there will inevitably be a temptation to make use of them. (You might trust the current Prime Minister not to do anything too Orwellian with such powers - but how can you be sure that you will trust whoever is Prime Minister in 10 or 20 years' time?)

This privacy debate is one that is now inevitably going to play out as TfL works to switch on the tracking full time. For his part, Dr Olejnik thinks that the organisation has handled the privacy issues with genuinely good intentions:

“They attempted to follow a data protection impact assessment [DPIA] as a process, and even conducted focus groups tests (though they did not disclose how they’d chosen their participants). There is some evidence in their DPIA of identifying risk points. Here, I applaud TfL, as many organisations are not so vigilant.”

Similarly Elizabeth Denham, the Information Commissioner, who is responsible for regulating data protection compliance, is full of praise for TfL’s wifi trial. Speaking to the London Assembly on the 14th September, she said that the trial was “a really good example of a public body coming forward with a plan, a new initiative, consulting us deeply and doing a proper privacy impact assessment.”

She went on to say that it was “a good example of privacy by design and good conversations with the regulator to try to get it right. There is a lot of effort there.”

That said, in Dr Olejnik's view the DPIA TfL carried out was unsatisfactory, and he hopes that if TfL rolls out wifi tracking full time, it makes some changes. "I hope TfL will re-do a DPIA [and] prepare an adequate risk management assessment, extensively identifying individual privacy risks and challenges and providing remediation actions. Designing an anonymisation or pseudonymisation system from scratch and documenting the process would be a good start", he says.

TfL, for its part, says that the DPIA is a "living document" - and that before rolling the technology out full time, the intention is to consult widely.

"We actually want to talk to privacy groups and security experts to work out how do you do this", explains Lauren Sager Weinstein. "The key question here is you want to have enough information about trends to answer the questions we want to answer, but the whole premise of data protection is to only keep data as long as you need it. So that's what we need to figure out".

Could the tracking system have stopped the Parsons Green attack? And if the system were rolled out full time, would the tracking enable the identification of individuals, with all of the privacy concerns that entails? If TfL uses the same setup it used for the trial, it appears that while there are legal restrictions, there are only limited technical barriers. Here's hoping that, after consulting privacy groups and as the inevitable privacy debate goes forward, TfL can come up with a system that both satisfies privacy concerns and delivers the potentially transformative transport analytics that could make the Tube better for everyone. In other words: here's hoping TfL can find a system worth its salt.

James O’Malley is Interim Editor of Gizmodo UK and tweets as @Psythor.
Dr Lukasz Olejnik is an independent cybersecurity and privacy researcher, and tweets as @lukOlejnik. He has written his own (significantly more technical) analysis here.