Page:On the Robustness of Topics API to a Re-Identification Attack.pdf/5

opened it for experimentation [14]. Using EasyPIMS, users can upload their personal information and fully control which data to share and for what purpose. A simple web interface allows the user to provide fine-grained consent for sharing the data with data buyers and eventually to monetise their data in a marketplace. Among various types of data, the platform allows users to share their browsing history by installing a browser plugin for Google Chrome or Microsoft Edge on their PC running any operating system. The plugin records all intentionally visited webpages and stores them in a central repository. During the test of our PIMS, we recruited 3,369 volunteers who could use the platform for four months in 2022. Of these, 928 installed the plugin. There was no geographic restriction on joining the PIMS, and users come from 35 different countries in Europe, Asia, and America. Regarding demographics, 478 users are male, 226 are female, and 224 did not declare their gender. Ages range from 18 to 72 years, with an average of 33.

In this paper, we leverage the actual browsing histories of the EasyPIMS users who explicitly consented to the use of their browsing history and any personal data we use for research purposes; 613 users gave such permission. Among those, we restrict the population to users who actively used the platform. Since the Topics API operates on a weekly basis, we consider a user to be active in a given week if they visited at least 10 webpages in that week. In total, we obtain 268 users who are active in at least one week. We use the sequence of websites visited by these users for our study.
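The activity filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `active_weeks`, the `(user, day)` input format, and the use of ISO calendar weeks to delimit a week are all assumptions, since the text only states the threshold of 10 webpages per week.

```python
from collections import Counter
from datetime import date

MIN_PAGES_PER_WEEK = 10  # activity threshold stated in the text


def active_weeks(visits, min_pages=MIN_PAGES_PER_WEEK):
    """Map each user to the sorted list of weeks in which they are active.

    `visits` is an iterable of (user, day) pairs, one per page visit.
    A week is identified by (ISO year, ISO week number); this choice of
    week boundary is an assumption made for the sketch.
    """
    counts = Counter()
    for user, day in visits:
        iso = day.isocalendar()
        counts[(user, iso[0], iso[1])] += 1

    active = {}
    for (user, year, week), n in counts.items():
        if n >= min_pages:
            active.setdefault(user, []).append((year, week))
    return {user: sorted(weeks) for user, weeks in active.items()}


# Hypothetical data: "alice" visits 12 pages in one week, "bob" only 3.
sample_visits = [("alice", date(2022, 3, 7))] * 12 + [("bob", date(2022, 3, 8))] * 3
active = active_weeks(sample_visits)
```

Under this sketch, a user is kept in the study population if they appear in the returned dictionary, that is, if they are active in at least one week.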

Ethical Aspects. Our data collection process is compliant with ethical principles and EU privacy regulations. EasyPIMS was part of a European Project involving 12 partners, and the European Commission has approved all the data collection and processing procedures. Users participated voluntarily, were informed, explicitly opted in via the PIMS web interface, and were rewarded by sweepstakes. We only use data of users who explicitly provided their consent for the specific purpose of research, which was not the default choice of the platform. Moreover, data processing has been carried out in an anonymous fashion using a secure computing infrastructure running up-to-date software, with physical access restricted to authorized personnel. During data processing, we only process data regarding browsing histories, neglecting all other attributes, such as name, gender, or geographic location.

In total, our dataset includes 2,813,283 webpage visits to 50,976 different websites. The number of visits per user per week varies significantly: some users used the platform for a few weeks, others for the whole four-month experimental period. Some users even installed the plugin on multiple browsers and devices (e.g., desktop and laptop PC), increasing the amount of data collected in their accounts. We characterize the different usage patterns in Figure 2, which shows the Empirical Cumulative Distribution Function (ECDF) of the number of page visits and website visits each user recorded each week in Figure 2a and Figure 2b, respectively. We observe a large variability. At the median, active users access 222 web pages each week; 26.1% of users visit fewer than 50 pages, while 14% of the users visit more than 1,000 pages. The most active users have accessed about 10,000 pages in a week. Similar considerations hold for the number of unique websites a user visits in a week (Figure 2b). At the median, active users access 30 different websites in a week, while the 25th and 75th percentiles of the distribution are 10 and 71 websites, respectively. The most active users access more than 500 websites in a week. Overall, we believe these figures reflect the natural variability of users. Despite being limited, our dataset includes a real population of users browsing the web, with different interests, backgrounds, nationalities, etc. Unfortunately, we cannot claim that our dataset is representative of general human behaviour, and we do not exclude that it may be biased in some direction, such as gender or education. In the following, we use it to study how effectively the Topics API algorithm prevents an attacker from mounting a re-identification attack.
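The summary statistics quoted above (ECDF, median, share of users below a threshold) can be computed along the following lines. This is a hedged sketch using only the Python standard library; the helper names `ecdf` and `fraction_below` and the sample numbers are hypothetical, and each sample is assumed to be the count recorded by one user in one week.

```python
import statistics


def ecdf(values):
    """Return the empirical CDF as a list of (x, P(X <= x)) pairs.

    Duplicate values are merged so that each distinct x appears once,
    paired with the fraction of samples less than or equal to it.
    """
    xs = sorted(values)
    n = len(xs)
    points = {}
    for i, x in enumerate(xs, start=1):
        points[x] = i / n  # later duplicates overwrite earlier ones
    return list(points.items())


def fraction_below(values, threshold):
    """Fraction of samples strictly smaller than `threshold`."""
    return sum(v < threshold for v in values) / len(values)


# Hypothetical weekly page counts, one entry per (user, week) pair.
weekly_pages = [10, 60, 200, 1500]
curve = ecdf(weekly_pages)
median_pages = statistics.median(weekly_pages)
share_under_50 = fraction_below(weekly_pages, 50)
```

Applied to the real per-user, per-week counts, `fraction_below(counts, 50)` would correspond to the share of users visiting fewer than 50 pages, and `statistics.median(counts)` to the median reported in the text.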

Using the current implementation of the Topics API ML model, which Google has made available since Chrome 101, we extract, for each of the 50,976 websites $$w$$ in our dataset, the corresponding topic $$t$$ the API returns. We obtain 250 topics visited at least once by users in our dataset. In the following, we characterize the topic visits. We first focus on the number of unique topics each user visited at least once during the entire experimentation. This is useful to understand how complex (and unique) a Profile $${P}_{u,e}$$ could be. We report the ECDF in Figure 3a. The distribution is quite spread out: at the median, users visit 36 topics, with the most diverse users visiting more than 150 topics. Conversely, a handful of users visit fewer than 5 topics. Not reported here for the sake of brevity, the median number of topics each user visits per week is 17, with a maximum of about 70. Fewer than 10% of users visit fewer than 5 topics in some weeks. Figure 3b reports the ratio of users visiting a given topic. We sort topics by their popularity in decreasing order. The ranking follows a clear power-law distribution (notice the log-log scale), as typically happens with popularity distributions in web measurements [2]. The top-5 topics are Search Engines, News, Arts & Entertainment,