Simple Web Scraping With JavaScript

May 1, 2018

Sometimes you need to scrape content from a website and a fancy scraping setup would be overkill.

Maybe you only need to extract a list of items on a single page, for example.

In these cases you can just manipulate the DOM right in the Chrome developer tools.

Extract List Items From a Wikipedia Page

Let's say you need this list of baked goods in a format that's easy to consume: https://en.wikipedia.org/wiki/List_of_baked_goods

Open Chrome DevTools and copy the following into the console:

JSON.stringify([...document.querySelectorAll('ul > li > a')].map(r => r.textContent))

Now you can select the JSON output and copy it to your clipboard.
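
If you'd rather skip the select-and-copy step, Chrome's console also provides a copy() utility (a DevTools helper, not part of the page's own JavaScript) that puts the result straight on your clipboard:

copy(JSON.stringify([...document.querySelectorAll('ul > li > a')].map(a => a.textContent)))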

A More Complicated Example

Let's try to get a list of companies from AngelList (https://angel.co/companies?company_types[]=Startup&locations[]=1688-United+States).

This case is slightly less straightforward because we need to click "more" at the bottom of the page to fetch more search results.

Open Chrome DevTools and copy:

;(function () {
  let loop = () => setTimeout(() => {
    let list = [...document.querySelectorAll('a.startup-link')]
      .map(a => a.textContent)
      .filter(text => text !== '')
    // programmatically click the "more" button
    // to fetch more companies 
    document.querySelector('.more').click()
    console.log('length: ' + list.length) 
    // save results in local storage
    window.localStorage.setItem('__companies__', JSON.stringify(list))
    // run again in 0 - 3 seconds
    loop()
  }, 3000 * Math.random())
  loop()
})()

You can access the results with:

window.localStorage.getItem('__companies__')
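
Since the list was stored as a JSON string, parse it if you want the array back:

JSON.parse(window.localStorage.getItem('__companies__'))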

Some Notes

  • Chrome natively supports ES6, so we can use things like the spread operator
    • We spread the result of document.querySelectorAll because it returns a NodeList and we want a plain old array.
  • We wrap everything in a setTimeout loop so that we don't overwhelm Angel.co with requests
  • We save our results in localStorage with window.localStorage.setItem('__companies__', JSON.stringify(list)) so that if we disconnect or the browser crashes, the results are still there when we come back to Angel.co.
  • We must serialize data before saving it to localStorage (the snippet below shows why).
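
localStorage can only store strings, so anything else is coerced to one before it's saved, which is rarely what you want:

window.localStorage.setItem('x', ['a', 'b'])                  // stored as "a,b"
window.localStorage.setItem('x', { a: 1 })                    // stored as "[object Object]"
window.localStorage.setItem('x', JSON.stringify(['a', 'b']))  // stored as '["a","b"]'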

Scraping With Node

These examples are fun but what about scraping entire websites?

We can use node-fetch and JSDOM to do something similar.
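
Both are regular npm packages:

npm i node-fetch jsdom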

const fetch = require('node-fetch')
const { JSDOM } = require('jsdom')

let selector = 'ul > li > a'
let url = 'https://en.wikipedia.org/wiki/List_of_baked_goods'

fetch(url)
  .then(resp => resp.text())
  .then(text => {
    // parse the HTML into a DOM we can query, just like in the browser
    let dom = new JSDOM(text)
    let { document } = dom.window
    let list = [...document.querySelectorAll(selector)]
      .map(a => a.textContent)
    console.log(list)
  })

Just like before, we're not using any fancy scraping API; we're "just" using the DOM API. But since this is Node, we need JSDOM to emulate a browser.

Scraping With NightmareJs

Nightmare is a browser automation library that uses Electron under the hood.

The idea is that you can spin up an Electron instance, go to a webpage, and use Nightmare methods like type and click to programmatically interact with the page.

For example, you'd do something like the following to log in to a WordPress site programmatically with Nightmare:

const Nightmare = require('nightmare')
const nightmare = Nightmare({ show: true })

let BASE_URL = ''
let WP_USER = ''
let WP_PASSWORD = ''

function wpLogin() {
  // return the chain so the caller can run it with .then() or await
  return nightmare.goto(`${BASE_URL}/wp-admin`)
    .wait(1000)
    .type("input#user_login", WP_USER)
    .type("input#user_pass", WP_PASSWORD)
    .click("p.submit input#wp-submit")
}
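
Nightmare chains are lazy; nothing happens until you call .then() on them (or await them). So, assuming wpLogin returns the chain as above, kicking off the login looks like this:

wpLogin()
  .then(() => console.log('logged in'))
  .catch(console.error)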

Nightmare is a fun library and might seem like "magic" at first.

But the Nightmare methods like wait, type, and click are just syntactic sugar on DOM manipulation.

For example, here's the source for the nightmare method refresh:

exports.refresh = function(done) {
  this.evaluate_now(function() {
    window.location.reload();
  }, done);
};

In other words, it's just window.location.reload wrapped in their evaluate_now method. So with Nightmare, we are spinning up an Electron instance (a browser window) and then manipulating the DOM with client-side JavaScript. Everything is the same as before, except that Nightmare exposes a clean and tidy API that we can work with.
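
You can use the same trick yourself through Nightmare's evaluate method, which ships a function into the page and hands back whatever it returns. As a quick sketch, here's the Wikipedia example from the top of the post rewritten with it:

const Nightmare = require('nightmare')
const nightmare = Nightmare()

nightmare
  .goto('https://en.wikipedia.org/wiki/List_of_baked_goods')
  .evaluate(() => [...document.querySelectorAll('ul > li > a')].map(a => a.textContent))
  .end()
  .then(list => console.log(list))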

Why Do We Need Electron?

Why is Nightmare built on Electron? Why not just use Chrome?

This brings us to the interesting alternative to nightmare, Chromeless.

Chromeless attempts to duplicate Nightmare's simple browser automation API using Chrome Canary instead of Electron.

This has a few interesting benefits, the most important of which is that Chromeless can be run on AWS Lambda. It turns out that the precompiled Electron binaries are just too large to work with Lambda.

Here's the same example we started with (scraping companies from Angel.co), using Chromeless:

const url = 'https://angel.co/companies'
const { Chromeless } = require('chromeless')

async function run() {
  const chromeless = new Chromeless()
  const value = await chromeless
    .goto(url)
    .evaluate(async () => {
      let data = await new Promise((resolve, reject) => {
        let cycles = 0
        let loop = () => setTimeout(() => {
          // after ~10 clicks of "more", collect the results and stop looping
          if (cycles > 10) {
            let list = [...document.querySelectorAll('a.startup-link')]
              .map(a => a.textContent)
              .filter(text => text !== '')
            return resolve(JSON.stringify(list))
          }
          cycles += 1
          document.querySelector('.more').click()
          loop()
        }, 3000 * Math.random())

        loop()
      })
      return data
    })
  console.log(value)
  await chromeless.end()
}

run().catch(console.error.bind(console))

To run the above example, you'll need to install Chrome Canary locally.

alias canary="/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary"
canary --remote-debugging-port=9222

Next, run the above two commands to start Chrome Canary with remote debugging enabled.

Finally, install the npm package chromeless.

npm i chromeless
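
With Canary running and chromeless installed, save the script to a file (the name scrape.js below is just a placeholder) and run it with node:

node scrape.js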