Publicly available data under the GDPR

bilekas · on Feb 18, 2020

The GDPR, however, is partly about stored information being specific to the stated requirements, but more about consent for information to be stored and the option not to have it stored.

Also the time for which that information can be stored. If it comes from public sources, there needs to be a proven reason why it's stored there, and if it is not required, it must be removed.

I wasn't sure of any good example of 'Public Information' from the article..

jimmaswell · on Feb 18, 2020

We have a wealth of information on Beethoven's life from his conversation books - I wonder if future generations will be deprived of such things from the important people of today because of these privacy laws.

r_singh · on Feb 18, 2020

> I wasn't sure of any good example of 'Public Information' from the article..

A decent example would be online reviews. When someone leaves a review on Yelp or Tripadvisor, they're making their name, avatar, city location, interest, opinion on a subject matter—all public.

Now there are websites that scrape these reviews and display them on their own property, attributed to the original source (Google and Agoda are examples). And there are web applications like Podium, Yext, etc. which help companies manage their online reviews by scraping them from all of the business's listings.

While a user can delete their information from the original data source, assuming that the original data source complies with GDPR, they may not be aware of which other parties have scraped and stored this information in their databases without explicit consent.

Another relevant example could be publishing info on LinkedIn and having it scraped by companies like HiQ.

bilekas · on Feb 18, 2020

Okay, but that scraped information is not in line with GDPR. The user has not given consent for that information to be stored.

Yelp may be in line, storing identifiable information for particular reasons in order to utilize their services.

But scraping is never in line with GDPR. If you are scraping, which in itself is allowed, the information on users and people cannot be stored.

mstolpm · on Feb 18, 2020

Could someone downvoting bilekas's comment please eli15 why his comment is wrong?

I'm under the impression that this is exactly the GDPR position: You are not allowed to store and process PII if you don't have the consent of the person identified. I'd be happy to be proven wrong.

mikekchar · on Feb 18, 2020

Generally speaking, I don't think you are wrong, although the use of the word "consent" is potentially confusing. There are multiple lawful bases under which you can use PII. "Consent" is one of them. Others include "contractual", "legal obligation", "vital interest", "public task", "legitimate interest". For the case of a private entity scraping a website, it is unlikely that these other bases would apply, but it's not impossible. It would be interesting, for instance, if some kind of police work might be classified under "vital interest". But generally, you will need consent because none of the other bases will apply.

mytailorisrich · on Feb 18, 2020

Law enforcement is outside the scope of the GDPR and thus I would think that legitimate police work is exempt from the regulations.

mikekchar · on Feb 18, 2020

https://ico.org.uk/for-organisations/guide-to-data-protectio...

impartial-word · on Feb 18, 2020

If you expose your information you’ll be spammed, and when the spammer is challenged he/she will reason that it was done under “legitimate interest”. The only way is to close the leak and avoid having any of your personal information exposed on the internet.

Story time: a shady company that organises conferences contacted me after guessing my corporate email from information on LinkedIn. When challenged they said “legitimate interest” and removed my information at that point. I sent a formal complaint to as many people as I could (they also exposed a lot of emails online) and posted information about this company on some message boards so other affected people could more easily understand where the information had leaked.

They contacted me again three months later under a different company name (same speakers, same program, same names everywhere, same web, different domain), and I repeated the same procedure. Also, my name on LinkedIn shows only my initials.

IAmEveryone · on Feb 18, 2020

You'll be happy to hear that consent is but one of six possible legal bases for storing data.

The others are legal requirement, required to perform a contract (with the subject), legitimate interest, vital interest, and public interest.

For scraped data, public interest and legitimate interest are likely to be most salient.

BeniBoy · on Feb 18, 2020

Though keep in mind that to rely on legitimate interest you have to do a "balancing test" between your interest and those of the user.

This is not an easy legal basis to rely on, and most of the time DPAs are quite wary of processing that relies on it!

mstolpm · on Feb 18, 2020

Thanks for the explanation. I understand that there are other legal reasons allowing storage and processing of (some) PII in (some) cases without the user's consent. But does that fit for "scraped PII"?

I'd be very skeptical of giving "public interest" or "legitimate interest" as a reason for storing and processing PII in scraped data: "Public interest" (if not about some VIP or other public figure) mostly seems to be based on summarized data, so you would at least need to anonymize the data immediately. The (identifiable) name or picture attached to a random scraped comment is hardly of "legitimate" or "public" interest. And vital interest or legal requirement is hard to argue for scraped data as well in most cases (I'd be skeptical even if it were used for law enforcement). Moreover, I have a hard time seeing many cases where scraping PII by a company, organization or single person is required to perform a contract with the subject of the PII.

Let's take a concrete example: Why would someone have the right to scrape all my Amazon reviews from the Amazon website under public or legitimate interest and store/process them with my name and my identifiable PII? I can see a public interest in scraping reviews and processing them without the PII, but I don't see a reason to do so without anonymization.

So, I still see the comment holding: Scraped data falls under the GDPR, which doesn't allow storing and processing PII in most cases.

ratherbefuddled · on Feb 18, 2020

GDPR doesn't work like that: the means of obtaining the data has no bearing on the legal basis you have for processing it.

If you process personal data you need at least one legal basis, you decide what that is and you take on certain obligations.

If you rely on "legitimate interest" GDPR requires you to consider the balance of your interest versus the subject's right to privacy. As long as you do so, record your determination and take reasonable steps to mitigate the privacy impact - such as allowing opt out, aging the data out over time etc - and you make good on your obligations it is unlikely an enforcement agency will find against you. It is a subjective decision however and there haven't been many GDPR cases handled by member states yet, so interpretations might change over time and all you can go on is guidance from regulators presently.

I think a good example of where scraping and legitimate interest would be ok is if you are trying to sell a product appropriate to people with <job title>. You go to LinkedIn, pay for a facility that allows you to search for that job title, scrape that data and attempt to contact those people. There is some privacy impact but it is likely that people who make themselves available on a business networking site might reasonably expect to be contacted about things relevant to their job title. As long as they can opt out from your processing of their data and you don't keep it for longer than necessary and meet the other obligations you'll likely be fine (notwithstanding the ePrivacy directive which is a separate thing).

An example where your grounds might be a lot more tenuous would be a recruitment agent scraping your name from a question on StackOverflow and guessing your email address to contact you about a job.

In your Amazon review example, the processor would need to justify capturing your name. For most processing purposes (eg grouping the reviews by author) a hash would suffice.
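As a rough illustration of the hashing idea above, here is a minimal Python sketch that groups reviews by author without storing the author's name. It is only one possible approach under stated assumptions: the `SECRET_KEY` value, the `pseudonymize` and `group_reviews_by_author` helpers, and the sample reviews are all hypothetical, and a keyed hash (HMAC-SHA256) is swapped in for a plain hash because common names are easy to guess and re-hash. Note that a token like this is pseudonymised rather than anonymised data, so it would likely still count as personal data under the GDPR.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret key (assumption): in practice it would be random
# and kept separately from the stored review data.
SECRET_KEY = b"replace-with-a-random-secret"


def pseudonymize(author_name: str) -> str:
    """Return a keyed SHA-256 digest that stands in for the author's name."""
    # Normalize so trivial differences in case or whitespace map to one token.
    normalized = author_name.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def group_reviews_by_author(reviews):
    """Group review texts under a pseudonymous token; names are not kept."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[pseudonymize(review["author"])].append(review["text"])
    return dict(grouped)


# Made-up sample data, for illustration only.
sample_reviews = [
    {"author": "Jane Doe", "text": "Great kettle."},
    {"author": "John Roe", "text": "Stopped working after a week."},
    {"author": "jane doe ", "text": "Still going strong."},
]
grouped_reviews = group_reviews_by_author(sample_reviews)
```

Both "Jane Doe" entries end up under the same token, so per-author grouping still works even though the name itself never reaches the stored result.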

bkor · on Feb 18, 2020

Could you add references to the sections of the GDPR which explain this? The various times I read the GDPR (as well as the local law) I didn't see your explanation in it. References would allow me to determine what I missed.

BrentOzar · on Feb 18, 2020

> I wasn't sure of any good example of 'Public Information' from the article.

Questions and answers at Stack Overflow.

Questions often include code snippets, some of which even include personally identifiable information like names and addresses. That data becomes public immediately, and then it’s reused by others when they download and adapt the Stack Overflow data dumps.