AI researchers estimate that 97% of EU websites fail to comply with GDPR privacy requirements, especially those on user profiling

Researchers in the US have used machine learning techniques to study the privacy policies of more than a thousand representative websites in the EU for GDPR compliance. They found that 97% of the sites surveyed failed to meet at least one requirement of the 2018 European Union regulatory framework, and that the requirements met least often were those surrounding the practice of ‘user profiling’.

The paper states:

‘[Since] the privacy policy is the essential communication channel for users to understand and control their privacy, many companies updated their privacy policies after the GDPR came into effect. However, most privacy policies are comprehensive, full of jargon, and vaguely describe companies’ data practices and user rights. As a result, it is unclear whether they comply with the GDPR.’

It goes on:

“Our results show that even after the GDPR came into effect, 97% of websites still fail to meet at least one requirement of the GDPR.”

The study is titled Automated Detection of GDPR Disclosure Requirements in Privacy Policy Using Deep Active Learning, and comes from three researchers at the University of Virginia at Charlottesville.

Privacy Latest

The area with the least compliance, according to the survey, was the GDPR’s provisions on user profiling, with the authors stating that only 15.3% of the sites surveyed fully complied with this particular rule.

A chart of compliance among 9,761 websites surveyed for the study.

User profiling (which records a person’s interaction with websites and is often used to “target” them in other online contexts, such as advertising) has become one of the biggest controversies in technology since the Cambridge Analytica scandal.

On Tuesday, a major European Parliament committee approved the first phase of the new Digital Markets Act (DMA) legislation, which would ban the behavioral targeting of minors and impose fines of up to 20% of global annual turnover on infringing companies.

While the law has been received by the media as a direct response to the growing influence of tech giants such as Facebook and Google, the sheer scale of non-compliance represented by the new research suggests that the vast majority of EU companies (including EU offices of US companies trading in Europe) are legally exposed to GDPR fines.

In addition, Italy this week imposed the maximum allowable fine of 10 million euros ($11.2 million USD) on Apple and Google for, among other things, abuse of user profiling.


The sites examined in the new research were sampled from the top 10,000 websites listed in Quantcast, and their English-language privacy policies were extracted via Yandex searches on UK-based VPNs (to ensure that the policies were not geographically blocked).

Since the General Data Protection Regulation (GDPR) came into full effect in May 2018, EU websites have been required to publish privacy policies that cover 18 core requirements (see chart above).

The researchers limited their extraction of privacy policies to a period beginning in August 2018, to give domains a reasonable time to publish the required policies (a requirement of which they had at least a year’s advance notice during the GDPR’s two-year lead-in phase from 2016).

The filtering process yielded a privacy corpus of 9,761 policies, from which the researchers randomly selected 1,080 policies.


The team hired two legal experts to train four human annotators to label policy text against each of the 18 privacy requirements imposed by the GDPR.

Some passages in the policies covered more than one of the 18 requirements, so the researchers used a Convolutional Neural Network (CNN) to detect the language features associated with each requirement.
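Because a single passage can address several requirements at once, the labeling task is naturally a multi-label one. The following is a minimal sketch of how such labels might be encoded as a multi-hot vector for a classifier. The requirement names are an illustrative subset taken from those mentioned in this article, not the paper's full list of 18, and the function name `to_multi_hot` is hypothetical.

```python
# Illustrative subset of requirement names (the study uses 18; these are placeholders).
REQUIREMENTS = [
    "profiling",
    "portability",
    "dpo_contact",
    "right_to_object",
    "withdraw_consent",
]


def to_multi_hot(labels, requirements=REQUIREMENTS):
    """Encode the set of requirements a passage covers as a 0/1 vector."""
    unknown = set(labels) - set(requirements)
    if unknown:
        raise ValueError(f"unknown requirement(s): {sorted(unknown)}")
    return [1 if name in labels else 0 for name in requirements]


# A passage that covers two requirements at once.
segment_labels = {"portability", "right_to_object"}
vector = to_multi_hot(segment_labels)
print(vector)
```

Each position in the vector corresponds to one requirement, so a passage covering two requirements simply has two positions set to 1.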

A first attempt to train a model to identify compliance based on language yielded an accuracy of 80.5%. The researchers then applied Active Learning, which can improve a model’s performance with less labeled data. In this way it was possible to train the CNN classifier to an accuracy of 89.2%, with an F1 score of 0.88 (where 1 is a perfect score).
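For readers unfamiliar with the two metrics quoted here, the sketch below computes accuracy and F1 for a single binary label. The toy labels are invented for illustration and are not the study's data; the article does not say how the paper averaged F1 across the 18 requirements, so this shows only the per-label definition.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example (invented): 1 = passage covers the requirement, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts every correct prediction equally, while F1 ignores true negatives, which makes it more informative when most passages do not cover a given requirement.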

To ensure the word embedding was privacy policy specific, the researchers trained an unsupervised word embedding model using Facebook’s FastText Python library.

According to standard practice, the final data was split 80/20 between training data and test data (i.e., randomly selected data against which the accuracy of the algorithm will be judged). To evaluate the quality of the results, a human-in-the-loop measurement study was added to the architecture.
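A random 80/20 split of this kind can be sketched in a few lines. The function name, the fixed seed and the integer placeholders standing in for labeled policy segments are all assumptions made for illustration; the article does not describe how the researchers implemented the split.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and hold out a fraction as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data: integers standing in for labeled policy segments.
segments = list(range(1000))
train, test = train_test_split(segments)
print(len(train), len(test))
```

Holding the test portion out before training means the reported accuracy reflects data the model has never seen.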

The architecture for the classification system.

Over the course of the workflow, 11,271 human-annotated privacy policy segments were produced, each of which was reviewed by four human annotators trained by the two legal experts involved in the study. Where disagreement occurred, a 75% agreement ratio was needed to avoid rejecting the data for inclusion.
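With four annotators, a 75% threshold means that at least three of them must agree on a label. The sketch below shows one way such a rule could be applied; the function name and the exact handling of rejected segments are assumptions, since the article does not give the paper's procedure in detail.

```python
from collections import Counter


def resolve_label(votes, min_agreement=0.75):
    """Return the majority label if enough annotators agree, else None.

    A return value of None means the segment is rejected from the dataset.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Three of four annotators agree: the label is kept.
accepted = resolve_label(["profiling", "profiling", "profiling", "portability"])
# A two-two split falls below the threshold: the segment is rejected.
rejected = resolve_label(["profiling", "profiling", "portability", "portability"])
print(accepted, rejected)
```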

Humans-in-the-loop - It was not possible to fully automate the labeling of the policy data, although Active Learning enabled a pool-based workflow that made the project feasible.
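In a pool-based Active Learning loop, the current model scores a pool of unlabeled items and the most informative ones are sent to human annotators. The sketch below uses uncertainty sampling, a common query strategy that picks the items whose predicted probability is closest to 0.5. This strategy is an assumption chosen for illustration; the article does not say which selection rule the researchers used, and the segment IDs and probabilities are invented.

```python
def select_for_annotation(pool_probs, batch_size):
    """Pick the unlabeled items the model is least certain about.

    pool_probs maps an item ID to the model's predicted probability
    that the item covers a given requirement.
    """
    ranked = sorted(pool_probs, key=lambda item_id: abs(pool_probs[item_id] - 0.5))
    return ranked[:batch_size]


# Invented scores for five unlabeled policy segments.
pool_probs = {
    "seg_a": 0.97,
    "seg_b": 0.52,
    "seg_c": 0.10,
    "seg_d": 0.45,
    "seg_e": 0.80,
}
queries = select_for_annotation(pool_probs, batch_size=2)
print(queries)
```

After the annotators label the selected items, the model is retrained and the cycle repeats, so labeling effort is concentrated on the examples the model finds hardest.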

In addition to the results already mentioned, the researchers found that portability (the right under the GDPR to move or export the data a company holds about you) was served almost as badly as profiling.

The researchers conclude:

‘[Requirements] such as the right to portability of users and the provision of the contact details of the data protection officer (DPO contact) are covered by 15.5% and 16.4% websites respectively. Other primary requirements, such as users’ right to complain, withdraw consent, right to object and adequacy decision are covered by 17-20% of websites.’

…and continue:

‘It turns out that only 3% of the websites fully meet 18 requirements. These findings indicate that many websites still do not comply with the requirements of the GDPR.’
