Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

We are releasing the Princeton-Leuven Longitudinal Corpus of Privacy Policies, a reference dataset of over 1 million privacy policy snapshots from more than 100,000 websites, spanning over two decades.

Request access »

Read the paper »

Download our code »

Automated analysis of privacy policies has proved useful for research, but so far there has been no large-scale longitudinal dataset for studying how privacy policies have changed over time.

To address this gap, we are releasing a dataset of over 1 million privacy policies collected from the Internet Archive’s Wayback Machine. To build this dataset, we developed a custom crawler that detects and downloads privacy policies from archived web pages. We processed the downloaded policies to clean up error pages, extract the text of the privacy policies, and filter out non-policy documents using machine learning.
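As a rough illustration of the detection step described above, the following minimal sketch scans an archived page's HTML for links that look like privacy policies. It is not the crawler released with this dataset: the keyword list, the class name PolicyLinkFinder, and the helpers wayback_url and find_policy_links are assumptions made for this example, and the actual pipeline additionally cleans error pages and filters non-policy documents with machine learning.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list for this sketch; the real crawler's rules may differ.
POLICY_KEYWORDS = ("privacy",)


def wayback_url(timestamp, url):
    """Build a Wayback Machine snapshot URL (timestamp as YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


class PolicyLinkFinder(HTMLParser):
    """Collect links whose anchor text or href mentions a policy keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).lower()
            href = self._href
            if any(k in label or k in href.lower() for k in POLICY_KEYWORDS):
                # Resolve relative links against the page they came from.
                self.candidates.append(urljoin(self.base_url, href))
            self._href = None


def find_policy_links(html, base_url):
    """Return absolute URLs of candidate privacy policy links in html."""
    finder = PolicyLinkFinder(base_url)
    finder.feed(html)
    return finder.candidates
```

For example, calling find_policy_links on a homepage snapshot returns the candidate policy URLs to download next, and wayback_url("20150101000000", "http://example.com/") gives the address of one archived snapshot to fetch. A keyword heuristic like this over-matches, which is one reason a later filtering step for non-policy documents is needed.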

The dataset contains 1,071,488 English-language privacy policy snapshots from 130,620 distinct websites, chosen from the Alexa Top 100K lists from 2009 to 2019. In addition to sanitized privacy policy text and raw webpage HTML, the dataset includes metadata such as the archival time and the URL of the website that each policy belongs to. Although the dataset contains policies from as early as the late 1990s, more than 90% of the policies are from 2007 or later.

To get access to the data, please send an email to privacy-policy-data@lists.cs.princeton.edu stating your name and affiliation.

Our dataset is also available as a GitHub repository. You can use the web frontend to browse the archived policies.

If you use our dataset, please cite us:

@inproceedings{amosPrivacyPoliciesTime2021,
  title = {Privacy {{Policies}} over {{Time}}: {{Curation}} and {{Analysis}} of a {{Million}}-{{Document Dataset}}},
  booktitle = {Proceedings of {{The Web Conference}} 2021},
  author = {Amos, Ryan and Acar, Gunes and Lucherini, Eli and Kshirsagar, Mihir and Narayanan, Arvind and Mayer, Jonathan},
  date = {2021-04-19},
  pages = {22},
  publisher = {{Association for Computing Machinery}},
  location = {{Ljubljana, Slovenia}},
  doi = {10.1145/3442381.3450048},
  url = {https://doi.org/10.1145/3442381.3450048},
  series = {{{WWW}} '21}
}
