
 researchers must verify the robustness of such an approach as done by Mozilla and Google [8, 25].

In this paper, we provide an independent evaluation of the Topics API. Using a data-driven approach, we build realistic population models that we use to quantify the feasibility of a re-identification attack: We assume that the attacker i) exploits the Topic API to reconstruct the victim’s profile by accumulating her/his topics over epochs, and ii) tries to re-identify the victim on the audience of a second website – as studied by Epasto et al. [8]. If successful, such an attack would tamper with the abandonment of third-party cookies, allowing platforms to still track users across websites. We face the problem by mapping it to the probability that a user is 𝑘-anonymous among the website audience, i.e., that there are 𝑘 − 1 other users with the same reconstructed profile. Generalising the attack sketched by Thomson [25], we propose a robust denoising algorithm that aims to filter the random topics introduced by the Topics API.
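The 𝑘-anonymity criterion underlying our formulation can be illustrated with a short sketch (Python, with entirely hypothetical users and profiles; this is not the authors' released code): a user is 𝑘-anonymous within a website audience if at least 𝑘 − 1 other users share the same reconstructed profile, here modeled simply as a set of topics.

```python
from collections import Counter

def k_anonymity(profiles):
    """Map each user to the size of the group of users sharing the
    same reconstructed profile.

    profiles: dict mapping user id -> frozenset of topics.
    Returns: dict mapping user id -> k (group size)."""
    group_sizes = Counter(profiles.values())
    return {user: group_sizes[prof] for user, prof in profiles.items()}

# Hypothetical audience: u1 and u2 share a profile (k = 2),
# u3 has a unique profile (k = 1) and is thus re-identifiable.
audience = {
    "u1": frozenset({"News", "Sports"}),
    "u2": frozenset({"News", "Sports"}),
    "u3": frozenset({"Travel"}),
}
print(k_anonymity(audience))  # {'u1': 2, 'u2': 2, 'u3': 1}
```

A user with 𝑘 = 1 has a unique profile within the audience, which is exactly the condition the attacker needs to match that profile against a second population.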

We contribute three main results:
 * We show that the Topics API algorithm mitigates but cannot prevent re-identification. Depending on the website's audience size (e.g., 100,000 visitors) and population heterogeneity, a sizeable fraction (e.g., 40%) of users would still let the attacker reconstruct a denoised and unique profile that allows re-identification if matched on a second population.
 * We demonstrate that the replacement of actual topics with random ones is key to limiting the attack. Yet, the denoising algorithm is very efficient in removing random topics from the reconstructed profiles the attacker builds.
 * We show that in practice the probability of correctly re-identifying a user in a pool of 1,000 can top 15-17%, with false positives being negligible (less than 0.2%). However, such probabilities are a function of the attacker's observation period, and many weeks may be needed to carry out the attack in practice.

Our study highlights the need for continued research and development of privacy-preserving advertising techniques to ensure that user privacy is respected in the digital age. To foster research in this field, we release the code and data to replicate and extend our experiments.

The remainder of the paper is organized as follows: Section 2 formalizes the Topics API operation and the threat model. Sections 3 and 4 describe the dataset and the models we use to generate the synthetic populations for our simulations, respectively. Section 5 illustrates the results in terms of 𝑘-anonymity, while Section 6 explores the effectiveness of a re-identification attack. Section 7 summarizes related work, and, finally, Section 8 discusses our findings and concludes the paper.

Table 1: Main terminology to model Topics API algorithm and threat model.

In this section, we describe how the Topics API operates to create a profile from the user's browsing history. Then, we describe our threat model, i.e., the possibility that an attacker links two profiles referring to the same user because that user is uniquely identifiable within a given population.

We consider a browser that a user employs to navigate the Internet. We assume time is divided into epochs of duration $$\Delta T$$ (one week in the currently proposed Topics API operation). During each epoch $$e$$, the browser collects and counts the number of visits to each website and forms a bag of websites $$\mathcal{B}_{u,e}$$ for the user $$u$$. It keeps track only of the website hostnames the user intentionally visited, e.g., by typing the URL or by clicking on a link in a web page or another application. Formally, given a user $$u$$ and an epoch $$e$$, let $$\mathcal{B}_{u,e} = \{ (w_1, f_{1,u,e}), (w_2, f_{2,u,e}), \ldots, (w_n, f_{n,u,e}) \}$$, where $$\{ w_i \}$$ represent the visited websites and $$f_{i,u,e}$$ the number of times $$u$$ visited $$w_i$$ during epoch $$e$$.
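The per-epoch bag of websites $$\mathcal{B}_{u,e}$$ can be sketched as a multiset of hostname visits (Python, with hypothetical hostnames; an illustration of the formal definition above, not the browser's actual implementation):

```python
from collections import Counter

def build_bag(visits):
    """Build the bag of websites B_{u,e} for one user and one epoch:
    each hostname w_i is paired with its visit count f_{i,u,e}.

    visits: list of hostnames the user intentionally visited
    during the epoch."""
    return Counter(visits)

# Hypothetical browsing history for one user in one epoch.
visits = ["news.example", "shop.example", "news.example"]
print(build_bag(visits))  # Counter({'news.example': 2, 'shop.example': 1})
```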

The Topics API algorithm operates in the browser and processes the history of $$\mathcal{B}_{u,e}$$ over the past $$E$$ epochs to create a corresponding Exposed Profile $$\mathcal{P}_{u,e,w}$$ for the user $$u$$, epoch $$e$$ and each specific website $$w$$ the user visits during the current epoch. In fact, the browser builds a separate Exposed Profile for each visited website $$w$$ to mitigate re-identification attacks. We base the following description on the public documentation of the Topics API available online. The operation of the Topics API has the following steps.

Step 1 - From websites to topics: For each website $$w_i \in \mathcal{B}_{u,e}$$, the browser extracts a corresponding topic $$t_j$$. To this end, the browser uses a Machine Learning (ML) classifier model that returns the topic of a website given the characters and strings that compose the website hostname. At this step, each browsing history $$\mathcal{B}_{u,e}$$ is transformed into a topic history $$\mathcal{T}_{u,e} = \{ (t_1, f'_{1,u,e}), (t_2, f'_{2,u,e}), \ldots, (t_m, f'_{m,u,e}) \}$$, where $$t_j$$ represents the topic the model outputs and $$f'_{j,u,e}$$ counts its total occurrences: each website is mapped to a topic, and the original per-website frequencies $$f_{i,u,e}$$ are summed by topic into $$f'_{j,u,e}$$. There are $$n_{topic}$$ possible topics, which form a taxonomy of the interests users may have. Such a taxonomy will include between a few hundred and a few thousand topics (the IAB Audience Taxonomy contains about 1,500 topics).
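Step 1 can be sketched as follows (Python). The hostname-to-topic mapping is a hypothetical stand-in: the real Topics API infers a topic with an ML classifier over the hostname string, while here a fixed lookup table illustrates only the frequency aggregation from $$f_{i,u,e}$$ to $$f'_{j,u,e}$$.

```python
from collections import Counter

# Hypothetical stand-in for the browser's ML classifier, which in the
# Topics API returns a topic given the website hostname.
TOPIC_OF = {
    "news.example": "News",
    "gazette.example": "News",
    "boots.example": "Shopping",
}

def to_topic_history(bag):
    """Step 1: map each website in the bag B_{u,e} to its topic and
    sum the per-website frequencies f_{i,u,e} into per-topic
    counts f'_{j,u,e}, yielding the topic history T_{u,e}."""
    topic_history = Counter()
    for website, freq in bag.items():
        topic_history[TOPIC_OF[website]] += freq
    return topic_history

# Two websites map to the same topic, so their frequencies are summed.
bag = {"news.example": 3, "gazette.example": 1, "boots.example": 2}
print(to_topic_history(bag))  # Counter({'News': 4, 'Shopping': 2})
```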
