Google as an example

One of the biggest web scrapers in the world is Google. It started by scraping data from the internet and built a search engine around it. Google collects public internet data, stores it, processes it, and shows it on its own website with a link back to the original source. Is this legal?

 

What is public data?

A website that requires no authentication is considered public data: someone has published the data on the internet for public use. If the data is not meant to be public, the site owner will add user registration with general terms on how the data may be used.

 

Texts and pictures are often protected by copyright. If you scrape these and copy them one-to-one to your own site, you should treat the data as a quote and include a link to the original site. However, most of our customers are interested in factual product information such as specifications, prices, developments, or SEO data. This kind of data is well suited to collection by web scraping.

 

Local legislation

We also have to comply with local legislation, such as the GDPR in the EU. Under the GDPR, we only scrape non-personal data. Scraping personal information would require approval from the person the information belongs to; without that approval, we are not allowed to collect this kind of data. Therefore, please do not ask us to scrape Facebook, as we will not do this.

 

Terms & conditions and data behind a login

General terms must be accepted on a personal level, meaning the website has to record that the user accepted the terms and document their name. Normally this happens when you create an account. Sometimes customers ask us to collect data behind a login. Technically we can do that, but we first need to check the terms and get approval from the website we are collecting the information from.

 

Protection against scraping

How do we deal with protection against scraping? Many sites detect scrapers and make their data difficult to scrape. It is, however, still public data, and you are allowed to collect it. You could even argue that such protection is unfair competition, because these websites give their data freely to Google (without protection) but not to other scrapers. Even so, we are still legally allowed to collect this data, and we have advanced technology in place to do so.

 

How we collect data legally

  • We keep our data collection in line with the GDPR.
  • We do not overload websites; in general, we send no more than one request every 5 seconds.
  • We have an ethics committee that validates each assignment against requirements that are stricter than what the law requires.
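
The request limit mentioned above (no more than one request every 5 seconds) can be illustrated with a small throttle. This is only a minimal sketch and not our production code: the names `RateLimiter`, `crawl`, and `fetch` are hypothetical, and the clock and sleep functions are passed in as parameters purely so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock        # returns the current time in seconds
        self._sleep = sleep        # blocks for the given number of seconds
        self._last_call = None     # time of the previous request, if any

    def wait(self):
        """Block until `min_interval` has passed since the previous call.

        Returns the number of seconds that were actually waited.
        """
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last_call = self._clock()
        return delay


def crawl(urls, fetch, limiter):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is supplied by the caller (an assumption of this sketch);
    it takes a URL and returns whatever the caller wants to keep.
    """
    results = []
    for url in urls:
        limiter.wait()             # never exceed one request per interval
        results.append(fetch(url))
    return results
```

With the defaults, `crawl(urls, fetch, RateLimiter())` pauses so that consecutive requests to a site start at least 5 seconds apart. Using a monotonic clock keeps the interval correct even if the system time is adjusted during a run.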