{"componentChunkName":"component---src-components-post-tsx","path":"/simple-web-scraping","result":{"pageContext":{"searchData":[{"id":"cjjg9ui0t303s010375m42b9w","title":"Turn a Browser Window into a Notepad with This One-Liner","slug":"/turn-a-browser-window-into-a-notepad-with-this-one-liner","avatar":"https://s3.amazonaws.com/contentkit/static/cjjg9ui0t303s010375m42b9w/OYIaJ1KK_400x400(1).png","date":"July 1, 2020"},{"id":"cjiy7xl9o17w701039gvl18a5","title":"Creating a Collaborative Editor with Draftjs for Fun","slug":"/draft-js-collaborative-editor","avatar":"https://s3.amazonaws.com/contentkit/static/cjiy7xl9o17w701039gvl18a5/draft-js.png","date":"June 28, 2020"},{"id":"cjeqgfsdsnmx90167n7j290kt","title":"How To Include SASS In Your React Project","slug":"/sass-react-webpack","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfsdsnmx90167n7j290kt/parcel-bundler.png","date":"May 1, 2020"},{"id":"cjiy4tseq0tnw0103bc5s7sxj","title":"How to Add a Loading Indicator to Material Ui's Component","slug":"/creating-a-material-ui-button-with-spinner-that-reflects-loading-state","avatar":"https://s3.amazonaws.com/contentkit/static/cjiy4tseq0tnw0103bc5s7sxj/material-ui.png","date":"January 28, 2020"},{"id":"cjkq4u7470ihr0157ad175b12","title":"Connecting to AWS Lambda via WebSockets","slug":"/connecting-to-aws-lambda-via-websockets","avatar":"https://s3.amazonaws.com/contentkit/static/cjkq4u7470ihr0157ad175b12/MQTT.js.png","date":"January 12, 2020"},{"id":"cjeqgfnhknnpe0199zbfwwkd8","title":"6 Really Cool APIs to Have Fun With","slug":"/cool-apis-to-have-fun-with","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfnhknnpe0199zbfwwkd8/scale-api.png","date":"January 1, 2020"},{"id":"cjeqgftvknnr30199dvw266qh","title":"Installing Node Canvas in AWS Lambda","slug":"/node-canvas-aws-lambda","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgftvknnr30199dvw266qh/aws-lambda.jpg","date":"December 31, 2019"},{"id":"cjn5fv8kl0pfj0124wwebufms","title":"GoLang Cheatsheet","slug":"/golang-cheatsheet","avatar":"https://s3.amazonaws.com/contentkit/static/cjn5fv8kl0pfj0124wwebufms/golang.png","date":"October 13, 2019"},{"id":"cjn15y8740liw0175u3s707eh","title":"Scheduled Jobs & Work Queues With Postgresql","slug":"/scheduled-jobs-work-queues-with-postgresql","avatar":"https://s3.amazonaws.com/contentkit/static/cjn15y8740liw0175u3s707eh/kxHkAenZ_400x400.jpg","date":"October 8, 2019"},{"id":"cjmym86fh0jy1013398nfzuqu","title":"Creating a Bot to Refill Parking Meters Using AWS Lambda","slug":"/creating-a-bot-to-refill-parking-meters-using-aws-lambda","avatar":"https://s3.amazonaws.com/contentkit/static/cjmym86fh0jy1013398nfzuqu/prwHMlRn_400x400.jpg","date":"October 7, 2019"},{"id":"cjmsaqt6l09le01046qeukpp9","title":"The USPS Tracking API: How To Track Packages","slug":"/usps-tracking-api","avatar":"https://s3.amazonaws.com/contentkit/static/cjmsaqt6l09le01046qeukpp9/usps.png","date":"October 3, 2019"},{"id":"cjeqgfuyenmy20167rycnmiql","title":"Stormpath vs Firebase - A Side-By-Side Comparison","slug":"/stormpath-vs-firebase-a-side-by-side-comparison","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfuyenmy20167rycnmiql/m3cEA33V_400x400.jpg","date":"September 2, 2019"},{"id":"cjeqgfv88nnrt01997wxd4rnl","title":"Sending emails with Firebase Cloud Functions","slug":"/firebase-functions-sending-emails","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfv88nnrt01997wxd4rnl/firebase.jpg","date":"September 2, 2019"},{"id":"cjeqgfqjknnq20199oe3zwi37","title":"Deploying To DigitalOcean From Travis","slug":"/deploying-to-digitalocean-travis","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfqjknnq20199oe3zwi37/digitalocean.jpeg","date":"August 25, 2019"},{"id":"cjkvpbr6p06z50181fg7xvsc3","title":"Circumventing AWS Lambda's Bundle Size Limit","slug":"/circumventing-aws-lambdas-bundle-size-limit","avatar":"https://s3.amazonaws.com/contentkit/static/cjkvpbr6p06z50181fg7xvsc3/b2lWK7c0_400x400.png","date":"August 15, 2019"},{"id":"cjiy45h7m0iba0111f5imt5em","title":"A Quick Way to List All Unicode Characters (Javascript)","slug":"/unicode-characters-javascript","avatar":"https://s3.amazonaws.com/contentkit/static/cjiy45h7m0iba0111f5imt5em/unicode.png","date":"July 29, 2019"},{"id":"ci1rxc41tm8yi7nd156pz9dom","title":"Kibana Rest API","slug":"/kibana-rest-api","avatar":"https://s3.amazonaws.com/contentkit/static/ci1rxc41tm8yi7nd156pz9dom/elastic.png","date":"July 24, 2019"},{"id":"cjeqgfjgennmw0199wha3j6u0","title":"Airtable As A Database For Middleman [Tutorial]","slug":"/airtable-middleman-database","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfjgennmw0199wha3j6u0/airtable-books.png","date":"June 1, 2019"},{"id":"cjeqgfj5qnnmr0199vzd85xuo","title":"Building A Multiplayer Game With Three.Js + WebSockets","slug":"/multiplayer-game-threejs","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfj5qnnmr0199vzd85xuo/three.jpg","date":"June 1, 2019"},{"id":"cjeqgfi8hnnml0199f1jyo1p5","title":"The Best Sketch Plugins","slug":"/the-best-sketch-plugins","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfi8hnnml0199f1jyo1p5/sketch.png","date":"June 1, 2019"},{"id":"d9920hsu8y7365frf2c6fxrx2","title":"Configuring Vault by Hashicorp in AWS EC2","slug":"/configuring-vault-by-hashicorp-in-aws-ec2","avatar":"https://s3.amazonaws.com/contentkit/static/d9920hsu8y7365frf2c6fxrx2/vault.png","date":"April 15, 2019"},{"id":"cjeqgfvsfnns00199mlbff48i","title":"Best Zapier Alternatives","slug":"/best-zapier-alternatives","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfvsfnns00199mlbff48i/zapier-alternative-workato.png","date":"January 31, 2019"},{"id":"cjn3l18390lyt0158i8qxs5dc","title":"Running A Simple Node Web Server On AWS EC2","slug":"/running-a-simple-node-web-server-on-aws-ec2","avatar":"https://s3.amazonaws.com/contentkit/static/cjn3l18390lyt0158i8qxs5dc/aws.png","date":"October 11, 2018"},{"id":"cjeqgfr4snmwz01673m8l4i51","title":"Graph.Cool vs Firebase","slug":"/graphcool-vs-firebase","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfr4snmwz01673m8l4i51/graphcool.png","date":"October 7, 2018"},{"id":"cjklurup00k0201739mflg9sz","title":"Accessing Redis on an Aws EC2 Instance from the Outside\t","slug":"/accessing-redis-on-an-aws-ec2-instance-from-the-outside","avatar":"https://s3.amazonaws.com/contentkit/static/cjklurup00k0201739mflg9sz/logo.jpeg","date":"August 9, 2018"},{"id":"cjkfzvvxt08up0191l38aadq4","title":"Using PDFtk in AWS Lamba","slug":"/using-pdftk-in-aws-lamba","avatar":"https://s3.amazonaws.com/contentkit/static/cjkfzvvxt08up0191l38aadq4/pdftk.png","date":"August 4, 2018"},{"id":"cjeqgfqsgnnq70199l4vc6vim","title":"Configuring WebSockets on Elastic Beanstalk/EC2","slug":"/websockets-aws-elasticbeanstalk-ec2","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfqsgnnq70199l4vc6vim/aws.png","date":"July 15, 2018"},{"id":"cjit96q2yk0is0111qpbmw66s","title":"Getting Started With Gmail API","slug":"/gmail-api-quickstart","avatar":"https://s3.amazonaws.com/contentkit/static/cjit96q2yk0is0111qpbmw66s/gmail.png","date":"June 28, 2018"},{"id":"cjeqgfo23nmwd0167ynqe7xi8","title":"5 Tips For Using NextJs","slug":"/5-tips-for-using-nextjs","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfo23nmwd0167ynqe7xi8/bHjpwZem_400x400.png","date":"June 1, 2018"},{"id":"cjhxixcyhw4ze01035butoukz","title":"React unstable_deferredUpdates","slug":"/react-unstable_deferredupdates","avatar":"https://s3.amazonaws.com/contentkit/static/cjhxixcyhw4ze01035butoukz/OYIaJ1KK_400x400(1).png","date":"June 1, 2018"},{"id":"cjeqgflpnnnoy01993y7aez8a","title":"Simple Web Scraping With Javascript","slug":"/simple-web-scraping","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgflpnnnoy01993y7aez8a/developer-tools-scraping.png","date":"May 1, 2018"},{"id":"cjeqgfk9lnnn101994ll4ursr","title":"Easy: Add Firebase Facebook Login To Your React App","slug":"/firebase-facebook-login-react","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfk9lnnn101994ll4ursr/firebase.png","date":"May 1, 2018"},{"id":"cjeqgfiqrnmtj0167dmz5g5dg","title":"The 5 Best Static Site Web Hosts","slug":"/the-5-best-static-site-web-hosts","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfiqrnmtj0167dmz5g5dg/digital-ocean.png","date":"May 1, 2018"},{"id":"cjeqgfmi2nmvs01670pgebekn","title":"Tips and Tricks For Using NightmareJs","slug":"/nightmare-js","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfmi2nmvs01670pgebekn/nightmare.png","date":"May 1, 2018"},{"id":"cjeqgfkqennn60199707yp1fb","title":"Why You Should Create Your Next React Web App With Firebase","slug":"/firebase-react-tutorial","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfkqennn60199707yp1fb/firebase.jpg","date":"May 1, 2018"},{"id":"cjeqgfpr6nnpx019978qzmaok","title":"How to Flush Data From Heroku Redis","slug":"/heroku-redis-flushall","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfpr6nnpx019978qzmaok/3wgIDj3j_400x400.png","date":"April 1, 2018"},{"id":"cjeqgfm8xnnp30199vxndhkgh","title":"5 Ways To Style React Components","slug":"/5-ways-to-style-react-components","avatar":null,"date":"April 1, 2018"},{"id":"cjeqgflz7nmvm0167yo7wsjg3","title":"Delete Spreadsheet Rows For Google Sheets","slug":"/delete-rows-google-sheets","avatar":null,"date":"April 1, 2018"},{"id":"cjeqgfp84nnps0199dvru09ll","title":"Make An Uptime Monitoring Microservice In Under 50 Lines of Code","slug":"/twilio-uptime-monitoring-node-tutorial","avatar":null,"date":"April 1, 2018"},{"id":"cjeqgfri4nnqd0199zsv6a9fl","title":"Hexo - The Best Static Site Generator? ","slug":"/deploy-a-hexo-blog","avatar":null,"date":"April 1, 2018"},{"id":"cjeqgfoygnnpn0199f31dgk9m","title":"Airtable As A Minimum Viable Database For Your ReactJs Project","slug":"/airtable-reactjs","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfoygnnpn0199f31dgk9m/airtable-screenshot.png","date":"March 2, 2018"},{"id":"cjeqgfn6pnmw0016793f81hpx","title":"The Advanced Guide To ReactJs Checkboxes","slug":"/reactjs-checkboxes","avatar":null,"date":"January 20, 2018"},{"id":"cjeqgfe8znnly01991jo5xkxx","title":"Minimum Viable GraphQL QuickStart","slug":"/minimum-viable-graphql-quickstart","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgfe8znnly01991jo5xkxx/graphcool-schema.png","date":"January 1, 2018"},{"id":"cjeqgfgegnnmb0199jp3hv2xx","title":"How To Add A Contact Form To Your Ghost Blog","slug":"/ghost-blog-contact-form","avatar":null,"date":"November 1, 2017"},{"id":"cjeqgfmvxnnp90199bdti4r8n","title":"PHP Scraping Tutorial - Scrape Reddit With Goutte","slug":"/php-scraping-tutorial-scrape-reddit-with-goutte","avatar":null,"date":"October 7, 2017"},{"id":"cjeqgfpz1nmwp0167c5j46eij","title":"Cannot read property 'loose' of undefined","slug":"/cannot-read-property-loose-of-undefined","avatar":null,"date":"August 25, 2017"},{"id":"cjeqgftlanmxg0167zipvfzp1","title":"Airtable API Example & Tutorial - Generating Charts","slug":"/airtable-api-example-tutorial","avatar":null,"date":"January 31, 2017"},{"id":"cjeqgfphbnmwk0167ldz2mv2q","title":"React onClick Example and Tutorial","slug":"/react-onclick-example-and-tutorial","avatar":null,"date":"January 2, 2017"},{"id":"cjeqgfq8ynmwu0167z6c51ynp","title":"React Tables - How To Render Tables In ReactJS","slug":"/reactjs-tables","avatar":null,"date":"January 2, 2017"},{"id":"cjeqgfuoznmxu0167ayt2zkc3","title":"How To Add A Class in ReactJS","slug":"/reactjs-add-class","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgfu6gnnrb0199nyrddra0","title":"Fetching Github Blame with the GraphQL API V4","slug":"/github-blame-graphql-api-v4","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgfsmsnnqo0199o2x7dl4x","title":"How To Add Meta Descriptions to Middleman Pages","slug":"/middleman-meta-descriptions","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgfs2wnnqi0199rco2vvz2","title":"Zapier Webhook Post Example & Tutorial","slug":"/zapier-webhook-post-example-tutorial","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgfvhunmy90167t6ouqfmb","title":"Setting Up A Job Queue For A Node App [Tutorial]","slug":"/job-queue-node","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgftbcnnqy0199nc1f92te","title":"NextJs vs Create-React-App - A Side-By-Side Comparison","slug":"/nextjs-vs-create-react-app","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgft2pnnqt01990ty8fmeu","title":"How To Create A Modal In ReactJS [Tutorial]","slug":"/reactjs-modal","avatar":null,"date":"March 14, 2018"},{"id":"cjeqgfugpnnrj0199x7anb33t","title":"How To Use Twilio With ReactJS","slug":"/reactjs-twilio-example-tutorial","avatar":null,"date":"March 14, 2018"}],"post":{"id":"cjeqgflpnnnoy01993y7aez8a","title":"Simple Web Scraping With Javascript","slug":"simple-web-scraping","published_at":"2018-05-01T16:00:00","created_at":"2018-03-14T02:16:32","excerpt":"Sometimes you need to scrape content from a website and a fancy scraping setup would be overkill.","image":{"id":"cjgclnlt9z13x0179hopx0hwr","url":"static/cjeqgflpnnnoy01993y7aez8a/developer-tools-scraping.png"},"posts_tags":[{"tag":{"id":"2spo553zykbgxgq4vkk3","name":"Javascript"}}],"date":"May 1, 2018","html":"
Sometimes you need to scrape content from a website and a fancy scraping setup would be overkill.
\nMaybe you only need to extract a list of items on a single page, for example.
\nIn these cases you can just manipulate the DOM right in the Chrome developer tools.
\n\nLet's say you need this list of baked goods in a format that's easy to consume: https://en.wikipedia.org/wiki/List_of_baked_goods
\nOpen Chrome DevTools and copy the following into the console:
\nJSON.stringify([...document.querySelectorAll('ul > li > a')].map(r => r.textContent))
\nNow you can select the JSON output and copy it to your clipboard.
\nLet's try to get a list of companies from AngelList (https://angel.co/companies?company_types[]=Startup&locations[]=1688-United+States
\nThis case is a slightly less straightforward because we need to click \"more\" at the bottom of the page to fetch more search results.
\nOpen Chrome DevTools and copy:
\n;(function () {\n let loop = () => setTimeout(() => {\n let list = [...document.querySelectorAll('a.startup-link')]\n .map(a => a.textContent)\n .filter(text => text !== '')\n // programmatically click the \"more\" button\n // to fetch more companies \n document.querySelector('.more').click()\n console.log('length: ' + list.length) \n // save results in local storage\n window.localStorage.setItem('__companies__', JSON.stringify(list))\n // run again in 0 - 3 seconds\n loop()\n }, 3000 * Math.random())\n loop()\n})()
\nYou can access the results with:
\nwindow.localStorage.getItem('__companies__')
\n[...document.querySelectorAll]
because it returns a node list and we want a plain old array.window.localStorage.setItem('__companies__', JSON.stringify(arr))
so that if we disconnect or the browser crashes, we can go back to Angel.co and our results will be saved.These examples are fun but what about scraping entire websites?
\nWe can use node-fetch and JSDOM to do something similar.
\nconst fetch = require('node-fetch')\nconst JSDOM = require('jsdom').JSDOM\nlet selector = 'ul > li > a'\nlet url = 'https://en.wikipedia.org/wiki/List_of_baked_goods'\nfetch(url)\n .then(resp => resp.text())\n .then(text => {\n let dom = new JSDOM(text)\n let { document } = dom.window; \n let list = [...document.querySelectorAll(selector)]\n .map(a => a.textContent)\n console.log(list)\n })
\nJust like before, we're not using any fancy scraping API, we're \"just\" using the DOM API. But since this is node we need JSDOM to emulate a browser.
\nNightmare is a browser automation library that uses electron under the hood.
\nThe idea is that you can spin up an electron instance, go to a webpage and use nightmare methods like type and click to programmatically interact with the page.
\nFor example, you'd do something like the following to login to a Wordpress site programmatically with nightmare:
\nlet BASE_URL = ''\nlet WP_USER = ''\nlet WP_PASSWORD = ''\nfunction wpLogin() {\n nightmare.goto(`${process.env.BASE_URL}/wp-admin`)\n .wait(1000)\n .type(\"input#user_login\", WP_USER)\n .type(\"input#user_pass\", WP_PASSWORD)\n .click(\"p.submit input#wp-submit\")\n}
\nNightmare is a fun library and might seem like \"magic\" at first.
\nBut the NightmareJs methods like wait, type, click, are just syntactic sugar on DOM (or virtual DOM) manipulation.
\nFor example, here's the source for the nightmare method refresh:
\nexports.refresh = function(done) {\n this.evaluate_now(function() {\n window.location.reload();\n }, done);\n};
\nIn other words, window.location.reload wrapped in their evaluate_now method. So with nightmare, we are spinning up an electron instance (a browser window), and then manipulating the DOM with client-side javascript. Everything is the same as before, except that nightmare exposes a clean and tidy API that we can work with.
\nWhy is Nightmare built on electron? Why not just use Chrome?
\nThis brings us to the interesting alternative to nightmare, Chromeless.
\nChromeless attempts to duplicate Nightmare's simple browser automation API using Chrome Canary instead of Electron.
\nThis has a few interesting benefits, the most important of which is that Chromeless can be run on AWS Lambda. It turns out that the precompiled electron binaries are just too large to work with Lambda.
\nHere's the same example we started with (scraping companies from Angel.co), using Chromeless:
\nconst url = 'https://angel.co/companies'\nconst { Chromeless } = require('chromeless')\nasync function run() {\n const chromeless = new Chromeless()\n const value = await chromeless\n .goto(url)\n .evaluate(async () => {\n let data = await new Promise((resolve, reject) => {\n let cycles = 0\n let loop = () => setTimeout(() => {\n if (cycles > 10) {\n let list = [...document.querySelectorAll('a.startup-link')]\n .map(a => a.textContent)\n .filter(text => text !== '')\n resolve(JSON.stringify(list))\n }\n cycles += 1 \n document.querySelector('.more').click()\n loop()\n }, 3000 * Math.random())\n \n loop()\n })\n return data\n })\n await chromeless.end()\n}\nrun().catch(console.error.bind(console))\n
\nTo run the above example, you'll need to install Chrome Canary locally. Here's the download link.
\nalias canary=\"/Applications/Google\\ Chrome\\ Canary.app/Contents/MacOS/Google\\ Chrome\\ Canary\"\ncanary --remote-debugging-port=9222\n
\nNext, run the above two commands to start Chrome canary headlessly.
\nFinally, install the npm package chromeless.
\nnpm i chromeless\n
","avatar":"https://s3.amazonaws.com/contentkit/static/cjeqgflpnnnoy01993y7aez8a/developer-tools-scraping.png"}}}}