Modules to build a Python Crawler

CHANG YI-MING
4 min read · Aug 9, 2020

Some people may already be familiar with this topic, since Python is a powerful language for crawling data from websites. In this post, I'm going to share my experience and list the modules at the bottom of the article. I will cover a few modules, such as Scrapy, Selenium, and a handful of others. But first, we need a brief look at web rendering.

Web Rendering

Client Side vs Server Side

There are already tons of discussions on these two ways of building a website and on their pros and cons. From a crawler's point of view, the question is how we can get at the web elements. Server-side rendering is perfect for most modules, since the web elements are already contained in the response. On the other hand, client-side rendering definitely brings some difficulties, not to mention the popular way to build websites today: single-page applications.

Server Side Rendering or Static Web Pages

Request-html & Scrapy

Let's start with a simple task: getting data from the BBC news website. Most of the pages on this site are static, which means we can get all the data we want straight from the server's response. In this case, either Requests-HTML or Scrapy will do.
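As a rough sketch of the Requests-HTML approach (the URL and the 'h3' selector are only assumptions about the BBC markup and will likely need adjusting after inspecting the page):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.bbc.com/news')  # placeholder target page

# Each element exposes .text; 'h3' is just a guess at the headline markup.
for headline in r.html.find('h3'):
    print(headline.text)
```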

After reading both modules' tutorials, Requests-HTML seems a little easier to handle. However, maintenance is the main reason I consider Scrapy the better choice; the details are discussed in the next section.
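For comparison, a minimal Scrapy spider for the same kind of static page might look like this (the spider name, URL, and selector are placeholders):

```python
import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://www.bbc.com/news']  # placeholder target page

    def parse(self, response):
        # Yield one item per headline; 'h3::text' is only an assumed selector.
        for headline in response.css('h3::text').getall():
            yield {'headline': headline}
```

It can be run with `scrapy runspider news_spider.py -o headlines.json` without creating a full project.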

Client Side Rendering

Execjs & Request-html & Scrapy-splash

Later on, I found there were problems crawling pages with dynamic content. These pages render on the client side with JavaScript, which Scrapy and Requests-HTML cannot execute, for example web pages built with React and no server-side rendering. So I started searching for ways to execute the JavaScript on the page.

Execjs

Execjs is worth a try if the target script on the page is simple. We can get the result of a function by providing the correct input parameters.
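A minimal PyExecJS sketch; the JavaScript function here is a made-up stand-in for whatever script the target page uses to compute the value we need:

```python
import execjs

# Compile a (simplified, hypothetical) script copied from the page source.
ctx = execjs.compile("""
function buildToken(seed) {
    return seed * 2 + 1;
}
""")

# Call the function with the input parameters found in the page.
print(ctx.call("buildToken", 20))  # -> 41
```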

Request-html & Scrapy-Splash

Since client-side rendering is such a common technique these days, these modules must run into missing-data situations frequently. So I looked for the solutions offered by the two modules' authors or contributors.

I then found, in the JavaScript Support chapter of its documentation, that Requests-HTML does provide a rendering function:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('http://python-requests.org/')
>>> r.html.render()

However, there are a few unresolved issues in the project's issue tracker that leave some web pages unable to render.

Next, I started looking for answers to questions like "How to execute JavaScript with Scrapy?" [1]. I then found a tool that is also mentioned in Scrapy's official documentation: Scrapy-Splash. Splash runs as a separate Docker service and needs only a few Scrapy settings. We can also test the target web page by entering its URL in the Splash web UI.
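Following the scrapy-splash README, the setup roughly boils down to running the Splash container and pointing a handful of Scrapy settings at it; the spider then issues SplashRequest instead of the normal Request. A sketch, with the URL and wait time as placeholders:

```python
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_page'
    # Settings adapted from the scrapy-splash README; Splash itself runs as a
    # separate Docker service: docker run -p 8050:8050 scrapinghub/splash
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        # Splash renders the JavaScript before Scrapy receives the response.
        yield SplashRequest('https://example.com', self.parse, args={'wait': 1.0})

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
```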

Accordingly, Scrapy is my choice, since it is very well maintained. It may seem like all the problems are solved; however, there are still some pages that Splash cannot execute properly.

Render by Browser Automation

Selenium & pyppeteer

Some web pages expose APIs that can only be requested from their own authorized domain. If we cannot make a successful request, missing content on those pages is unavoidable. While digging into the Requests-HTML module, I found that its render function is actually implemented with pyppeteer, a Chrome/Chromium browser-automation library. Still, from a reliability standpoint, Selenium is the superior choice.
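For reference, driving Chromium directly with pyppeteer (the same mechanism Requests-HTML's render() relies on) looks roughly like this; the URL is a placeholder:

```python
import asyncio
from pyppeteer import launch

async def fetch_rendered(url):
    # Launch headless Chromium, render the page, and return the final HTML.
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

html = asyncio.run(fetch_rendered('https://example.com'))  # placeholder URL
print(len(html))
```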

Selenium

Selenium can crawl all kinds of pages, since it loads every page through a real browser driver. In other words, it behaves the same way as when we browse a website manually.
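A minimal headless-Chrome sketch with Selenium; the URL and selector are placeholders, and a matching chromedriver is assumed to be on the PATH:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')  # placeholder target page
    # The DOM here is the fully rendered one, just as in a manual visit.
    print(driver.title)
    for element in driver.find_elements(By.CSS_SELECTOR, 'h3'):
        print(element.text)
finally:
    driver.quit()
```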

Moreover, Selenium has already been dockerized, and there is even support for running it in a Kubernetes cluster (Kubernetes/selenium). Its biggest advantage is that it can easily crawl multiple pages concurrently by using Selenium Grid, as sketched after the link below.

https://www.browserstack.com/guide/selenium-grid-tutorial
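Switching to a Grid only changes how the driver is constructed: a Remote driver pointed at the hub URL instead of a local Chrome. The hub address below assumes a default local Grid setup and is only illustrative:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

# The hub dispatches the session to any free Chrome node in the grid,
# so several of these drivers can fetch pages concurrently.
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',  # assumed hub address
    options=options,
)
try:
    driver.get('https://example.com')  # placeholder target page
    print(driver.title)
finally:
    driver.quit()
```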

But there are still some imperfections, for example the chance of being banned when crawling too fast and too obviously automatically. So we should always try to make our program behave like a human being. Another point is that memory usage will certainly be higher than plain API scraping, since a browser is not a small program.
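With Scrapy, the built-in throttling settings already cover much of the "behave like a human" advice; the values below are only illustrative:

```python
# settings.py: slow the crawl down and randomize request timing
DOWNLOAD_DELAY = 2                 # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True        # adapt the delay to the server's response times
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

With Selenium, adding randomized pauses between actions serves the same purpose.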

Finally, thank you for your attention. I hope we all successfully get our data!
Down below are the modules I mentioned above:

- Scrapy
- Scrapy-Splash
- Selenium
- Requests-HTML
- Pyppeteer
- Execjs

Reference

[1] How to execute JavaScript with Scrapy — Ari Bajo Rouvinen https://www.scrapingbee.com/blog/scrapy-javascript/

