Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

We are releasing the Princeton-Leuven Longitudinal Corpus of Privacy Policies, a reference dataset of over 1 million privacy policy snapshots from more than 100,000 websites, spanning over two decades.

Automated analysis of privacy policies has proved useful for research, but until now there has been no large-scale longitudinal dataset for studying how privacy policies have changed over time.

To address this gap, we are releasing a dataset of over 1 million privacy policies collected from the Internet Archive’s Wayback Machine. To build this dataset, we developed a custom crawler that detects and downloads privacy policies from archived web pages. We processed the downloaded policies to clean up error pages, extract the text of the privacy policies, and filter out non-policy documents using machine learning.
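As a rough illustration of the "detect and download" step, the sketch below scans a page's HTML for links that look like privacy policies and builds a Wayback Machine replay URL for a given snapshot. It is not the crawler described in the paper: the keyword heuristic and the names `PolicyLinkFinder`, `find_policy_links`, and `wayback_url` are assumptions made for this example, and the real pipeline also relies on cleanup and a machine-learning filter that are not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PolicyLinkFinder(HTMLParser):
    """Collect links whose anchor text or href mentions 'privacy'.

    Hypothetical heuristic for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join(self._text).lower()
            if "privacy" in label or "privacy" in self._href.lower():
                # Resolve relative links against the page URL.
                self.candidates.append(urljoin(self.base_url, self._href))
            self._href = None


def find_policy_links(html, base_url):
    """Return absolute URLs of candidate privacy policy links in `html`."""
    finder = PolicyLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.candidates


def wayback_url(timestamp, url):
    """Build a Wayback Machine replay URL (timestamp is YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{url}"
```

For example, `find_policy_links(page_html, "https://example.com/")` returns the candidate policy URLs on a homepage, and each one can be passed to `wayback_url` along with a snapshot timestamp to locate the archived copy. A keyword match like this will produce false positives, which is one reason a later filtering step is needed.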

The dataset contains 1,071,488 English-language privacy policy snapshots from 130,620 distinct websites drawn from the Alexa Top 100K rankings between 2009 and 2019. In addition to sanitized privacy policy text and raw webpage HTML, the dataset includes metadata such as the archival time and the URL of the website each policy belongs to. Although the dataset contains policies from as early as the late 1990s, more than 90% of the policies are from 2007 or later.

To get access to the data, please send an email to privacy-policy-data@lists.cs.princeton.edu stating your name and affiliation.

If you use our dataset, please cite us:

@article{amos2020privacy,
    title={Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset},
    author={Ryan Amos and Gunes Acar and Elena Lucherini and Mihir Kshirsagar and Arvind Narayanan and Jonathan Mayer},
    year={2020},
    eprint={2008.09159},
    archivePrefix={arXiv},
    primaryClass={cs.CY}
}