Page:On the Robustness of Topics API to a Re-Identification Attack.pdf/3

In our experiments, we employ the Google ML model implemented in Chrome. In its current implementation, it supports $$n_{topic} = 349$$ topics, and the model is a neural network trained by Google on a manually curated set of 10,000 domains. It uses website hostnames only and ignores every other part of the URL.

Step 2 - From Topics to Profiles: Given the topic history $$\mathcal{T}_{u,e}$$ for user $$u$$ at epoch $$e$$, the browser selects the $$z$$ most frequently visited topics and stores them in the Profile history $$\mathcal{P}_{u,e}$$, referred to in the following as the Profile of user $$u$$ at epoch $$e$$. Currently, $$z$$ is set to 5.
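Step 2 can be illustrated with a minimal sketch. It assumes that the topic history is a plain list of integer topic identifiers (one entry per visit) and that ties are broken by the smaller identifier; both choices, as well as the function name `build_profile`, are illustrative assumptions and are not specified in the text.

```python
from collections import Counter

Z = 5  # number of topics kept per epoch (value of z given in the text)

def build_profile(topic_history, z=Z):
    """Return the z most frequent topics of one epoch's topic history."""
    counts = Counter(topic_history)
    # Rank by descending frequency; ties go to the smaller topic id
    # (tie-breaking is an assumption, the text does not define it).
    ranked = sorted(counts, key=lambda topic: (-counts[topic], topic))
    return ranked[:z]

history = [1, 1, 1, 2, 2, 3, 4, 5, 6, 6]
profile = build_profile(history)
```

A user who visited fewer than $$z$$ distinct topics simply gets a shorter Profile in this sketch.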

Step 3 - Per-website topic selection: The first time the user visits the website $$w$$, the browser generates an Exposed Profile $$\mathcal{P}_{u,e,w}$$. For each past epoch $$i \in \{e - 1, \ldots, e - E\}$$, the browser selects at random one topic $$t_{i}^{*}$$ from the Profile history $$\mathcal{P}_{u,i}$$. To increase privacy guarantees, at each extraction, with probability $$p$$ the browser replaces the topic $$t_{i}^{*}$$ with a random topic $$t_{rnd}$$ uniformly selected from the global topic list. The currently suggested value of $$p$$ is 0.05. $$\mathcal{P}_{u,e,w}$$ thus contains at most $$E$$ topics (a topic picked from $$\mathcal{P}_{u,e-1}$$, a topic from $$\mathcal{P}_{u,e-2}$$, etc.). Once generated, the Exposed Profile remains the same for the whole epoch $$e$$.
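The selection in Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the browser implementation: topics are taken to be integers in `range(349)`, the function name `build_exposed_profile` is hypothetical, and epochs with an empty Profile are skipped, which the text does not specify.

```python
import random

N_TOPICS = 349   # size of the global topic list, as stated in the text
P_RANDOM = 0.05  # suggested replacement probability p

def build_exposed_profile(past_profiles, p=P_RANDOM, n_topics=N_TOPICS, rng=random):
    """Pick one topic per past epoch, replacing it with noise with probability p.

    past_profiles holds the Profiles of epochs e-1, ..., e-E (one list each).
    """
    exposed = []
    for profile in past_profiles:
        if not profile:
            continue  # assumption: nothing is exposed for an empty epoch
        topic = rng.choice(profile)          # t_i^*: uniform pick from the Profile
        if rng.random() < p:
            topic = rng.randrange(n_topics)  # t_rnd: uniform over the global list
        exposed.append(topic)
    return exposed

past = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
exposed = build_exposed_profile(past, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the sketch reproducible; the result always has at most one topic per past epoch.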

Usage by websites: From this point on, each time the user visits the website $$w$$ during the current epoch, $$w$$ may request the browser to share the current Exposed Profile $$\mathcal{P}_{u,e,w}$$ and use the returned topics to provide behavioural advertising. Notice that the Exposed Profile $$\mathcal{P}_{u,e,w}$$ is built only for websites that the user $$u$$ visits intentionally (first-party). Any third-party service (e.g., a component embedded in the webpage of site $$w$$ but hosted on a different domain) receives the topics of the first-party website $$w$$ it is embedded into. That is, all trackers embedded into the website $$w$$ always receive the Exposed Profile $$\mathcal{P}_{u,e,w}$$ of $$w$$.

Periodic Profile update: At the beginning of the epoch $$e + 1$$, the browser computes the new Profile history $$\mathcal{P}_{u,e+1}$$ and discards $$\mathcal{P}_{u,e - E}$$. Similarly, if and when the user visits the website $$w$$ again, the browser creates $$\mathcal{P}_{u,e+1,w}$$ from $$\mathcal{P}_{u,e,w}$$: it includes a new topic selected from $$\mathcal{P}_{u,e+1}$$ (Step 3) and removes the oldest topic, i.e., the one originally belonging to $$\mathcal{P}_{u,e - E +1}$$ (keeping the others). This means that a website continuously visited by a user can observe at most one new topic per epoch (and that topic may be a randomly extracted one).
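The update behaves like a sliding window over epochs. A minimal sketch, assuming the Exposed Profile is kept as a list ordered from the oldest to the newest epoch; the function name `advance_epoch` and the example value of $$E$$ are hypothetical and used for illustration only.

```python
from collections import deque

def advance_epoch(exposed, new_topic, max_epochs):
    """Add the topic of the newest epoch and drop the oldest beyond E entries."""
    window = deque(exposed, maxlen=max_epochs)  # keeps only the last E topics
    window.append(new_topic)                    # the oldest topic falls out if full
    return list(window)

# With E = 3 (an example value), the website sees exactly one new topic.
updated = advance_epoch([10, 20, 30], 40, max_epochs=3)
```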

In this paper, we consider the threat model introduced by the proponents of the Topics API themselves [8] and discussed in a technical report by Mozilla [25]. In detail, we consider the risk of re-identification, i.e., the possibility that a Reconstructed User Profile from an audience is linked to a known individual, or that two websites use the Reconstructed User Profiles to match their audiences. This possibility has already been evaluated in the literature in similar contexts [13, 17, 27]. We sketch the second attack in Figure 1.

2.2.1 The re-identification threat model. As in [8], we assume a website $$w$$ uses first-party cookies to track a user over time so that it can reconstruct the set of topics the users in its audience are interested in. Then, it matches the derived profiles with the target profile of the victim (or with all profiles in the audience of the second website). In this attack, the attacker accumulates the Exposed Profiles $$\mathcal{P}_{u,e,w}$$ over epochs, overcoming the restriction by which the Topics API limits the Exposed Profile to one topic per epoch, for at most $$E$$ epochs.

Let us assume $$w$$ observes its users $$u \in U(w)$$ for $$N$$ epochs (i.e., epochs in $$[1, N]$$). At the end of the process, for each user $$u$$, it builds the Global Reconstructed User Profile as $${G}_{u,N,w} = \bigcup_{e \in [1,N]} \mathcal{P}_{u,e,w}$$. In the long run, the set of topics could act as an identifier string (or fingerprint) for user $$u$$, enabling re-identification either against the set of topics of a known user or against users from the audience $${U}_{2}$$ of website $${w}_{2}$$.

Notice that this attack may be carried out by a third party too. In this case, we assume some websites $${w}_{1}$$ and $${w}_{2}$$ collude with a third-party service $$s$$. Both $${w}_{1}$$ and $${w}_{2}$$ embed $$s$$ and share with $$s$$ the user identifier each time a user visits them. The third party then builds $${G}_{u,N,{w}_{1}}$$ and $${G}_{u,N,{w}_{2}}$$ autonomously so that it can match the profiles of users in both audiences.
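The accumulation and matching steps can be sketched as follows. The union mirrors the definition of the Global Reconstructed User Profile given above; the matching rule (two users are linked only if their reconstructed profiles are identical and the match is unique) is an assumption made for illustration, since this section does not fix a specific rule. The function names and user identifiers are hypothetical.

```python
def reconstruct_profile(exposed_by_epoch):
    """Union of the Exposed Profiles observed over epochs 1..N."""
    profile = set()
    for exposed in exposed_by_epoch:
        profile.update(exposed)
    return frozenset(profile)

def match_audiences(profiles_w1, profiles_w2):
    """Link users of w1 and w2 whose reconstructed profiles coincide.

    Assumption: exact set equality, and only unambiguous matches are kept.
    """
    index = {}
    for user2, profile in profiles_w2.items():
        index.setdefault(profile, []).append(user2)
    matches = {}
    for user1, profile in profiles_w1.items():
        candidates = index.get(profile, [])
        if len(candidates) == 1:
            matches[user1] = candidates[0]
    return matches

w1 = {"a": reconstruct_profile([[1, 2], [2, 3]]), "b": reconstruct_profile([[7]])}
w2 = {"x": frozenset({1, 2, 3}), "y": frozenset({7}), "z": frozenset({7})}
links = match_audiences(w1, w2)
```

In this toy example user `a` is linked to `x`, while `b` stays unmatched because two users of the second audience share the same profile.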

2.2.2 Random topic replacement to prevent the attack. Being aware of such an issue, the Topics API algorithm injects random topics with probability $$p$$ in the Exposed Profile – see Step 3 in the previous section. This has the benefit of making the Global Reconstructed Profile $${G}_{u,N,w}$$ both noisy (thus preventing the exact re-identification with a known victim profile) and potentially identical for all users (i.e., for $$N \rightarrow \infty$$, all users’ Global Reconstructed Profiles would include all topics). Notice that the injection of random topics runs separately for each website so that the Exposed Profiles on the two websites would be different.
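The limiting behaviour can be made concrete with a short worked estimate. It assumes that the website observes exactly one new topic per epoch and that the random replacements are independent across epochs; both are simplifying assumptions. Under them, a given topic $$t$$ is injected as noise in a given epoch with probability $$p / n_{topic}$$, so that

$$\Pr[t \text{ never injected in } N \text{ epochs}] = \left(1 - \frac{p}{n_{topic}}\right)^{N} \rightarrow 0 \quad \text{for } N \rightarrow \infty,$$

while the expected number of random topics observed in $$N$$ epochs is $$pN$$. With $$p = 0.05$$ and $$n_{topic} = 349$$, a website that observes a user for $$N = 100$$ epochs sees on average $$pN = 5$$ random topics, and any specific topic is still absent from the noise with probability $$(1 - 0.05/349)^{100} \approx 0.986$$. The convergence towards identical profiles is therefore slow, whereas a handful of random topics is already enough to perturb an exact match.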