3 Different Scraping Methods

Jul 27th - 3min.

Now that you’re familiar with the definition of scraping, it’s time to go a step further. In this blog we dig deeper into three different ways to extract data from the internet. You will get to know flat scraping, Selenium scraping and browser plug-ins.

1. Flat scraping (static)

Before moving on, let’s take one step back: how do websites work? To show a website as we know it, a browser runs the HTML source code of a page and turns it into the visual site you see.

How does flat scraping work?

With flat scraping you retrieve only the HTML source code, usually with Python. You send requests to a server and get the source code back as a flat file. That HTML can then be searched for specific data, such as META tags, all without running the source code and rendering it into a site. We call this phenomenon flat scraping. In many cases you won’t find what you are looking for, because a lot of information is missing: the source code is never executed, so content that is loaded dynamically never appears. To find that data you need a browser; more on this below.
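To make this concrete, here is a minimal flat-scraping sketch in Python using the requests and BeautifulSoup libraries. The URL is a placeholder, and the META-tag lookup stands in for whatever data you are after.

```python
# A minimal flat-scraping sketch: fetch the raw HTML of a page and read its
# META tags without ever rendering the source code in a browser.
import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in the site you actually want to scrape.
response = requests.get("https://example.com", timeout=10)

# Parse the returned flat file and search it for data.
soup = BeautifulSoup(response.text, "html.parser")

# Print the attributes of every META tag found in the flat HTML.
for meta in soup.find_all("meta"):
    print(meta.attrs)
```

Note that nothing here is executed or styled: any content the page would normally load via JavaScript simply isn’t in this file.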

An attack in plain sight

The downside of this crawling method is that both large and small companies can easily detect it. Requesting HTML source code without rendering it is suspicious, because that isn’t normal browser activity: a normal user opens a browser and performs actions on a site. Requesting source code without using a browser? Then you are a flat scraper, and companies know it.

2. Selenium scraping (dynamic)

Selenium scraping, also known as browser or Chrome scraping, works like this: the bot opens a browser, navigates to a site and performs actions such as clicking buttons, scrolling and filling in forms. Finally, the browser automation tool closes the site, just like a person would. That is exactly what scraping tries to do: replace manual actions, or a person, with an automated (ro)bot. It’s simply a shortcut to retrieve data faster.

How does Chrome scraping work?

Because Selenium drives a real browser, the HTML source code, CSS and JavaScript files are actually rendered. Parts of the website that load dynamically with JavaScript are therefore available, and that is where we can find the data we’re looking for. With this form of scraping, we automatically perform actions on a website through a browser.
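A hedged sketch of that flow, using Selenium’s Python bindings: the URL and the h1 selector are placeholders for illustration, not a specific target.

```python
# A minimal Selenium sketch: open Chrome, load a page, scroll, read rendered
# content, and close the browser again, like a human visitor would.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()           # opens a real Chrome window
try:
    driver.get("https://example.com")  # placeholder URL

    # The browser has now run the HTML, CSS and JavaScript. Scrolling can
    # trigger content that only loads dynamically.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Read data from the rendered page; the h1 tag is just an example target.
    heading = driver.find_element(By.TAG_NAME, "h1").text
    print(driver.title, "-", heading)
finally:
    driver.quit()                      # close the browser when done
```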

A serious effort for a server

The problem with Chrome scraping is that crawling a million URLs is a serious effort for a server. Why? A browser such as Chrome never really closes its windows, so you may end up with a server running hundreds of Chrome instances, making it painfully slow. Why Chrome never fully stops remains unclear.
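One common mitigation, sketched below rather than a guaranteed fix, is to reuse a single headless Chrome instance for a whole batch of URLs and always call quit() when done, instead of spawning a fresh browser per page. The URL list is a placeholder.

```python
# Reuse one Chrome instance across many pages to limit the server load.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible window; lighter on the server

driver = webdriver.Chrome(options=options)
try:
    # Placeholder batch of URLs to crawl with the same browser instance.
    for url in ["https://example.com/page1", "https://example.com/page2"]:
        driver.get(url)
        print(url, len(driver.page_source))
finally:
    # Without an explicit quit(), Chrome processes can linger and pile up.
    driver.quit()
```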

Browser automation is detectable

This method of scraping has worked well for a long time and is still used by many companies. The problem nowadays is that companies can detect which version of Chrome someone is using and whether it is automated or not. “Everything” that is controlled automatically gets blocked by the business. Facebook, for example, will immediately recognize you as a bot and send you away. Moreover, we know from experience that you can only scrape about 900 pages per session on Facebook. These safety measures make it a lot harder to index giants like them.

Only manually operated Chrome versions are still allowed. Timothy Verhaeghe, CTO of ProductBuilder: “The point here is that your IP address will be blacklisted forever (and really forever!) and you will never be able to surf to Facebook again, for example. Unless you harass Telenet and request a new IP address, of course (laughs).”

The best scrapers are scrapers that flawlessly simulate human behavior.

3. Browser plugins

Scraping tools can also work through the browser you already use by default, which makes the traffic look even more human. Even this way of scraping is detectable, though, because the actions are still performed automatically. Take DuxSoup, an automation tool for LinkedIn, as an example. When a LinkedIn profile visits 500 other profiles in one day, LinkedIn will recognize this: nobody visits that many profiles manually.