Easy Scraping with Node.js, Summer 2020

Have you ever scraped a web page? I have. No matter how peaceful and serene your life is, once every few years the urge to scrape comes along. You find yourself going, “Oh my god, you don’t have an API? You really don’t have an API…” — that kind of thing.

Then you’d have to find an HTTP client and an HTML parser library, install them, and write the glue code. But be honest: when you actually wrote that code, didn’t you spend far more time analyzing the HTML, parsing it, and digging the values you wanted out of it than you spent installing the libraries and checking the sample code?

This article shows an easy way to scrape in Node.js that minimizes the trial and error involved in exactly that part.

Let’s start with the environment. We’ll use the latest version of Node.js, 14.5.0, as of 2020-07-20.

$ node -v
v14.5.0

Then initialize the project and install a couple of libraries:

$ npm init
$ npm install node-fetch jsdom --save-dev

[node-fetch](https://github.com/node-fetch/node-fetch) is a library that lets you use fetch in Node.js, just like in a web browser. It has 5.3k stars on GitHub.

[jsdom](https://github.com/jsdom/jsdom) is a library that builds an HTML DOM tree in memory with the same API as a web browser. It has 14.4k stars on GitHub, and it is the key to this article.

Now that you have all the necessary libraries, let’s write a script. I chose the Weekly Weather Forecast for Tokyo page from the Japan Meteorological Agency as a sample.

index.mjs

#!/usr/bin/env node

import fetch from 'node-fetch';
import jsdom from 'jsdom';

const { JSDOM } = jsdom;

(async () => {
    // Fetch the weekly forecast page and read the body as a string
    const res = await fetch('https://www.jma.go.jp/jp/week/319.html');
    const html = await res.text();

    // Parse the HTML into a DOM tree, just as a browser would
    const dom = new JSDOM(html);
    const document = dom.window.document;

    // Pick out the Tokyo row of the forecast table
    const nodes = document.querySelectorAll('#infotablefont tr:nth-child(4) td');
    const tokyoWeathers = Array.from(nodes).map(td => td.textContent.trim());
    console.log(tokyoWeathers);
})();

Just looking at this, you might think, “Oh, I see!” We’ll get into the details later, but the first point is the lines starting with const nodes: from that line onward, the code can be run as-is in a web browser.

With traditional scraping, you’d spend a lot of time hunting for a query that finds the DOM elements you need, then more time massaging the resulting nodes into the list you want. That trial-and-error process can’t be eliminated entirely, but the developer tools in a web browser shorten it drastically, because you can see the results in real time.
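For example, you can open the JMA page in a web browser, open the developer tools console, and iterate on the same two lines that appear in the script until the selector is right:

const nodes = document.querySelectorAll('#infotablefont tr:nth-child(4) td');
Array.from(nodes).map(td => td.textContent.trim());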

Once you get the results you want in the developer tools, you paste that code into the script file, and the scraping is essentially done. Running the script gives the following results:

$ ./index.mjs
[
  'Cloudy', 'Cloudy with a little rain',
  'Cloudy and partly rainy', 'Cloudy',
  'Cloudy', 'Cloudy and sometimes sunny',
  'Cloudy and sometimes sunny'
]

I hope you can see what a revolutionary improvement in ease of writing this is over how scraping used to be done.

Now, as promised, let’s add a detailed explanation.

#!/usr/bin/env node

I added this because I wanted to run the script directly from the command line this time. You don’t need it if you run the script by passing the file to the node command.
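If you do run it directly, remember to make the file executable first:

$ chmod +x index.mjs
$ ./index.mjs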

import fetch from 'node-fetch';
import jsdom from 'jsdom';

const { JSDOM } = jsdom;

It’s nice to be able to use import syntax, but note that in Node.js v14 it only works out of the box when the file has the .mjs extension. As for jsdom, it’s tempting to write import { JSDOM } from 'jsdom' directly, but jsdom doesn’t currently support ES Modules syntax, so we’re stuck with the slightly roundabout form of importing the default export and destructuring it.
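For reference, a CommonJS version of the same imports would look like this (a sketch; the rest of the script is unchanged, though this form requires a plain .js file rather than .mjs):

// CommonJS equivalents of the two ESM imports above
const fetch = require('node-fetch');
const { JSDOM } = require('jsdom');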

(async () => {
    // ...
})();

We want to use await because there is asynchronous processing involved, and await can only be used inside an async function, so we create an anonymous async function and invoke it immediately.
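One optional refinement, not in the original script: chain a catch onto the invocation so a failed fetch doesn’t end up as an unhandled promise rejection.

(async () => {
    // ... the body of the script ...
})().catch(err => {
    // Report the failure and signal it through the exit code
    console.error(err);
    process.exitCode = 1;
});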

const res = await fetch('https://www.jma.go.jp/jp/week/319.html');
const html = await res.text();

This is a familiar pattern from web programming: it takes the result of the asynchronous fetch and reads out the HTML as a string. Be careful that, unlike XHR’s responseText, the text method of a fetch response is itself asynchronous.
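Also note that fetch resolves successfully even for HTTP error statuses such as 404, so if you want to be strict you could add a guard like this (an optional addition, not in the original script):

const res = await fetch('https://www.jma.go.jp/jp/week/319.html');
if (!res.ok) {
    // res.ok is true only for 2xx statuses
    throw new Error(`Unexpected status: ${res.status}`);
}
const html = await res.text();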

const dom = new JSDOM(html);
const document = dom.window.document;

Now for the highlight of this article. Pass HTML to the JSDOM constructor as a string, and it parses it into a DOM tree. The result contains the window object familiar from web programming, with a document object inside it.
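As a minimal standalone illustration of the same idea, close to the example in the jsdom README:

import jsdom from 'jsdom';
const { JSDOM } = jsdom;

// Parse a snippet of HTML and query it like a browser document
const dom = new JSDOM('<!DOCTYPE html><p>Hello <em>world</em></p>');
console.log(dom.window.document.querySelector('em').textContent); // => 'world'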

const nodes = document.querySelectorAll('#infotablefont tr:nth-child(4) td');
const tokyoWeathers = Array.from(nodes).map(td => td.textContent.trim());
console.log(tokyoWeathers);

This is the part that was pasted in after testing in the developer tools. The row corresponding to :nth-child(4) is easy to find by inspecting the table in a web browser. Converting the resulting NodeList object into an Array with Array.from is perhaps one of the more modern touches.
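The conversion is needed because querySelectorAll returns a NodeList, which has forEach but no map. Spread syntax is an equivalent alternative if you prefer it:

// Same result as Array.from(nodes).map(...)
const tokyoWeathers = [...nodes].map(td => td.textContent.trim());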

That’s the end of the script.

Finally, don’t forget that scraping is a last resort: if the service offers an API, always use it instead, and if you do have to scrape, take care not to put unnecessary load on the server.
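If you ever fetch more than one page, even something as simple as sleeping between requests helps keep the load down. A minimal sketch, with a hypothetical URL list:

import fetch from 'node-fetch';

// Promise-based sleep helper
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
    const urls = ['https://example.com/a', 'https://example.com/b']; // hypothetical
    for (const url of urls) {
        const res = await fetch(url);
        console.log(url, res.status);
        await sleep(1000); // pause for a second between requests
    }
})();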
