TL;DR: We’ve released the Apify SDK — an open-source Node.js library for scraping and web crawling. There was one for Python, but until now there was no such library for JavaScript, THE language of the web.
In Python there is Scrapy, the de facto standard toolkit for building web scrapers and crawlers. In JavaScript, however, there was no similarly comprehensive and universal library. And that makes little sense. An ever-increasing number of websites use JavaScript to fetch and render content. To extract data from these websites, you’ll often need an actual web browser to parse the HTML and run page scripts, and then inject your data extraction code into the browser context. In other words, you need JavaScript anyway. So what’s the point of using another programming language to manage the “server-side” of the crawler?
Let’s just ignore the fact that JavaScript is the most widely used programming language in the world, according to the Stack Overflow Developer Survey. We won’t mention this fact in this blog post. There’s another good reason to use JavaScript for web scraping. In April 2017, Google launched headless Chrome, and soon after released Puppeteer — a Node.js API for headless Chrome. This was nothing short of a revolution in the web scraping world, because previously the only options to instrument a full-featured web browser were to use PhantomJS, which is badly outdated; Splash, which is based on QtWebKit and only has a Python SDK; or Selenium, whose API is limited by the need to support all browsers, including Internet Explorer, Firefox, Safari, etc. While headless Chrome made it possible to run web browsers on servers without the need for X Virtual Framebuffer (Xvfb), Puppeteer provided the simplest and most powerful high-level API for a web browser that ever existed.

The Apify SDK is an open-source library that simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, maintain queues of URLs to crawl, store crawling results to a local filesystem or in the cloud, rotate proxies and much more. The library is available as the [apify](//www.npmjs.com/package/apify)
package on NPM. It can be used either stand-alone in your own applications or in actors running on the Apify platform.
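Stripped of the browser automation, the recursive crawl that such a library manages boils down to a URL queue plus a de-duplication set. Here is a minimal, self-contained sketch of that loop — the site is faked as an in-memory link graph so the example runs anywhere, whereas a real crawler would load each page (e.g. in a Puppeteer tab) and extract links from the live DOM:

```javascript
// Fake "website": each URL maps to the links found on that page.
// A real crawler would fetch pages instead of reading this map.
const SITE = new Map([
  ['https://example.com/', ['https://example.com/a', 'https://example.com/b']],
  ['https://example.com/a', ['https://example.com/b', 'https://example.com/c']],
  ['https://example.com/b', []],
  ['https://example.com/c', ['https://example.com/']],
]);

async function crawl(startUrl) {
  const queue = [startUrl];     // URLs waiting to be processed
  const seen = new Set(queue);  // de-duplication: each URL is handled once
  const results = [];

  while (queue.length > 0) {
    const url = queue.shift();
    // Stand-in for "load the page and extract its links".
    const links = SITE.get(url) || [];
    results.push({ url, linkCount: links.length });
    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return results;
}

crawl('https://example.com/').then((results) => {
  // Visits each of the four pages exactly once, despite the cycle c -> /.
  console.log(results.map((r) => r.url));
});
```

The de-duplication set is what keeps the crawl from looping forever on cyclic links; the SDK's queue abstractions handle this (plus persistence) for you.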
A simple crawler script built with the SDK can perform a recursive deep crawl of a website using Puppeteer. The number of Puppeteer processes and tabs is automatically controlled depending on the available CPU and memory in your system. By the way, this functionality is exposed separately as the [AutoscaledPool](//www.apify.com/docs/sdk/apify-runtime-js/latest#AutoscaledPool)
class, so that you can use it to manage a pool of any other resource-intensive asynchronous tasks.
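To illustrate the idea behind such a pool (not the AutoscaledPool API itself), here is a minimal sketch that runs at most `concurrency` asynchronous tasks at a time. The real class additionally grows and shrinks that limit based on free CPU and memory; the fixed limit here keeps the example short:

```javascript
// Run the given array of async task functions with a fixed concurrency
// limit. Results are returned in task order.
async function runPool(tasks, concurrency) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  async function worker() {
    // Node's event loop is single-threaded, so `next++` safely claims
    // a task: there is no await between the check and the increment.
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start `concurrency` workers that cooperatively drain the task list.
  const workers = Array.from(
    { length: Math.min(concurrency, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

const tasks = [1, 2, 3, 4, 5].map((n) => async () => n * n);
runPool(tasks, 2).then(console.log); // logs [ 1, 4, 9, 16, 25 ]
```

The tasks could just as well open Puppeteer tabs or run any other resource-intensive work — the pool doesn't care what the task functions do.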
Recursive crawl of a website using Apify SDK — Terminal view

And when you run the crawl in headful mode, you will see how Apify SDK automatically launches and manages Chrome browsers and tabs:
Recursive crawl of a website using Apify SDK — Chrome browser view

Apify SDK provides a number of utility classes that are helpful for common web scraping and automation tasks, e.g. for managing URLs to crawl, storing data and various crawler skeletons. To get a better idea, have a look at the documentation.
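The "crawler skeleton" pattern mentioned above inverts control: the library owns the crawl loop and bookkeeping, and you only supply a callback that is invoked once per page. A hedged sketch of the shape of such a skeleton — `SimpleCrawler` and `handlePage` are illustrative names, not the Apify SDK's actual API:

```javascript
// Minimal crawler skeleton: the class runs the loop, the user-supplied
// handlePage callback does the page-specific work and may enqueue more
// URLs. A real implementation would also de-duplicate URLs, persist the
// queue and drive a headless browser.
class SimpleCrawler {
  constructor({ startUrls, handlePage }) {
    this.queue = [...startUrls];
    this.handled = [];
    this.handlePage = handlePage;
  }

  async run() {
    while (this.queue.length > 0) {
      const url = this.queue.shift();
      // This is where a real crawler would load the page (e.g. in a
      // Puppeteer tab) before invoking the user's callback.
      await this.handlePage({ url, enqueue: (u) => this.queue.push(u) });
      this.handled.push(url);
    }
  }
}

const crawler = new SimpleCrawler({
  startUrls: ['https://example.com/'],
  handlePage: async ({ url, enqueue }) => {
    if (url === 'https://example.com/') enqueue('https://example.com/about');
  },
});
crawler.run().then(() => console.log(crawler.handled));
```

The benefit of this shape is that concerns like concurrency, retries and storage can live in the skeleton and be reused across every crawler you write.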