Page:On the Robustness of Topics API to a Re-Identification Attack.pdf/11

 and effective techniques to link the two sets [13, 27]. We leave the evaluation of these alternative approaches for future work.

From the dawn of the Web, behavioural advertising has been a pillar of the ecosystem, entailing the collection of personal information through web tracking. This phenomenon has been the subject of several studies that measured its spread [7, 16] or dug into its technical operation [1, 18, 23]. The implications of web tracking for users’ privacy have been increasingly debated by the industry [12] and by the research community [10, 15, 24]. Web tracking also fostered the birth of anti-tracking tools (i.e., ad and tracker blockers [19]) and encouraged legislators to issue privacy-related regulations, such as the US CCPA [4] or the European GDPR [11]. Federated Learning of Cohorts (FLoC) was the first public effort by Google to go beyond classical web tracking based on third-party cookies [21]. In FLoC, users were grouped into cohorts according to the interests inferred by each one’s browser. When asking for information about a user visiting a website, third parties were offered the user’s cohort, from which they could infer the user’s interests. In the intention of the proposal, FLoC provided acceptable utility for advertisers while hiding the user (and thus her identity) behind a group of peers [9]. However, criticism arose around how easily first- and third-party cookies could follow the user over time, exploiting the sequence of cohorts she belongs to in order to isolate and thus identify her [22]. Such an attack can exploit browser fingerprinting to further improve its effectiveness [3], and FLoC’s anonymity properties can be broken in several other ways [26]. In response to the criticism of FLoC, Google retired the proposal and conceived the Topics API, whose functioning we describe in Section 2.1. The Topics API exposes users’ profiles, in terms of topics of interest, to websites and advertising platforms.
In this paper, we study to what extent users’ profiles can be used by an attacker to re-identify the same individual across time or space. Past works have already demonstrated that profiling users based on their browsing activity can present severe risks to their privacy [10]: users can be identified with high probability based on the sequence of websites they visit [13, 17, 27]. Mitigations such as partitioned storage have been put in place to limit the risk, but ways to bypass them exist [20]. Specific to the Topics API, the same threat we analyze has already been identified by Epasto et al. [8] from Google. The authors carry out an information-theoretic analysis and conclude that the attack is hardly feasible. In this paper, we go a step further: our analyses are not limited to an analytical study of profiles’ uniqueness but offer a thorough evaluation using real traffic traces and different user models. While we were writing this paper, proponents from Google published a new work discussing the privacy implications of the Topics API [5]. They define a theoretical framework to determine the re-identification risk and test it on the Topics API. Differently from us, they do not consider the use of any denoising algorithm. To the best of our knowledge, Thomson [25] from Mozilla has issued the first independent study on the privacy guarantees of the Topics API, elaborating on the conclusions by Epasto et al. [8]. He again used analytical models and raised severe concerns about the offered privacy guarantees. Our strategy for filtering random topics is inspired by Thomson [25].

Summary. The Topics API is a prominent proposal to replace current web-tracking solutions based on third-party cookies with a more privacy-friendly approach. In this paper, we have considered the scenario where an attacker carries out a re-identification attack by accumulating the topics a website gets via the Topics API to build a unique user profile. Our experiments show that such an attack can be successful, provided the attacker observes the victim for enough epochs. We showed how the replacement of actual topics with random ones is fundamental for limiting the reconstruction of users’ profiles. We designed an algorithm to overcome this protection, so that the attacker is able to denoise the reconstructed profiles and remove random topics. This makes the denoised reconstructed profiles robust, and the attack possible for a large range of the probability $$p$$ of a random-topic replacement happening. All in all, we showed that the re-identification attack mounted by websites succeeds for about 15-17% of users, with a negligible probability of a false re-identification.
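The intuition behind the denoising step, namely that a uniformly random topic rarely repeats across epochs while a genuine interest recurs, can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm; the taxonomy size, replacement probability, and count threshold are illustrative assumptions.

```python
import random
from collections import Counter

# Illustrative constants (assumptions for this sketch):
TAXONOMY_SIZE = 349  # approximate size of the initial Topics taxonomy
P_RANDOM = 0.05      # probability p that a random topic replaces a real one

def observe_profile(true_topics, epochs, rng):
    """Simulate the topics a website collects over `epochs` epochs:
    each epoch it receives one of the user's true topics, replaced by
    a uniformly random topic with probability P_RANDOM."""
    observed = []
    for _ in range(epochs):
        if rng.random() < P_RANDOM:
            observed.append(rng.randrange(TAXONOMY_SIZE))
        else:
            observed.append(rng.choice(true_topics))
    return observed

def denoise(observed, min_count=2):
    """Drop topics seen fewer than `min_count` times: a random topic
    is unlikely to be drawn twice from a large taxonomy, while genuine
    topics recur, so repetition separates signal from noise."""
    counts = Counter(observed)
    return {t for t, c in counts.items() if c >= min_count}
```

For example, `denoise([1, 1, 2, 3, 3, 3])` keeps only topics 1 and 3, discarding the topic observed once as likely noise. The more epochs the attacker observes, the sharper this separation becomes, which is why the attack requires a long observation window.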

Limitations. While the attack is possible, several points need to be considered to judge its actual feasibility:
 * The time needed makes it impractical: given the suggested epoch duration of one week, the attacker needs 20 to 30 weeks (i.e., ≈ 6 months) to successfully reconstruct the victims’ profiles.
 * During such time, the victim has to visit the attacker’s website every week (or at least every $$E$$ weeks). If this does not happen, the attacker needs even more time to accumulate enough topics the victim is interested in.
 * The victim’s interests may change over time. This is not critical for the re-identification attack mounted by two websites, as they observe the victim during the same time period, but it may harm the re-identification attack against an a priori known victim profile.
 * The larger the population, the harder the attack. Yet, the attacker may leverage external information to partition the audience and thus increase the probability of a successful attack.

Improvement of Topics API. Being a draft proposal, there is still room to discuss possible improvements to the Topics API. For instance:
 * Periodically deleting first-party cookies would bring an immediate privacy-related benefit, reducing the number of epochs to build profiles that could be matched across websites. Figure 9 shows that deleting the first-party cookies every $$N = 10$$ epochs, for example, would keep the Prob(correct re-identification) below 2% with the current attack setup, with comparable Prob(incorrect re-identification).
 * The default values of $$z$$, $$E$$, and $$p$$ proposed in the draft proposal of Topics API are open to further review. We believe that the code we offer can be used as a tool to investigate