I needed to scrape Reddit posts for a project. After experimenting with a few tools, I decided to document the process. The workflow generalizes to scraping almost anything.
First, what is a web scraper? You could manually copy and paste data from websites into a spreadsheet, but that would take far too long if you are working on a big project.
Scraping lets you programmatically copy data at a high throughput.
For this project, I chose PHP because it offers some convenient syntactic sugar.
The best PHP scraping library is Goutte.
First, you'll need to install Composer, the package manager for PHP. In your terminal, type:
# Install Composer
curl -sS https://getcomposer.org/installer | php
Now add Goutte as a dependency to your project.
composer require fabpot/goutte
But Goutte requires Guzzle, so you'll need to install that too.
Guzzle is a PHP HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services.
composer require guzzlehttp/guzzle
Now if you open composer.json it should look something like:
{
"name": "my_project",
"require": {
"fabpot/goutte": "^3.2",
"guzzlehttp/guzzle": "^6.2"
}
}
After installing Goutte and Guzzle, you need to include Composer's autoloader.
Create a file called index.php and at the top add:
require 'vendor/autoload.php';
To scrape stuff with Goutte, you only need to write a few lines of code. For example:
require 'vendor/autoload.php';

use Goutte\Client;

$url = "http://www.reddit.com";
$css_selector = "a.title.may-blank";
$thing_to_scrape = "_text";

$client = new Client();
$crawler = $client->request('GET', $url);
$output = $crawler->filter($css_selector)->extract($thing_to_scrape);

var_dump($output);
With this PHP snippet, we create a client, send a GET request to Reddit, filter the DOM with a CSS selector, and extract the text of every matching element.
Finding the CSS selector is the easy part. Using Chrome's developer tools, you can inspect any element to reveal its attributes.
On Reddit, I right-clicked a post title and noted the link element's classes: title, may-blank, and outbound.
Return to the terminal and run php index.php to execute the script.
You should see output like:
array(25) {
  [0]=>
  string(17) "from sad to happy"
  [1]=>
  string(183) "The president of the Philippines Rodrigo Duterte should be investigated for murder after boasting he "personally" killed three suspected criminals, a top United Nations official said."
  ...
}
This is just an array containing all of the Reddit post titles on the front page.
The power of this approach is that we can target the elements to scrape using a CSS selector.
This is reminiscent of JavaScript and jQuery, where you can manipulate the DOM with similar syntax, e.g.:
var title = $('a.title.may-blank').text();
In the previous example we scraped the post titles on Reddit's front page, so we passed _text to extract()
to get each element's text. extract() can also fetch any HTML attribute of the matched elements, such as href or class.
Note: Symfony's DomCrawler gives links special treatment.
So instead of extracting the href attribute yourself,
you can write:
$link = $crawler->filter('a.title.may-blank')->link();
$uri = $link->getUri();
getUri() is especially useful because it cleans the href value and transforms it into the form you actually want to process. For example, for a link with href="#foo", it returns the full URI of the current page suffixed with #foo. The return value of getUri() is always a full URI you can act on.
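To make that resolution concrete, here is a simplified sketch in plain PHP of roughly what getUri() does. resolveHref is a hypothetical helper of my own, not the actual Symfony implementation, and it ignores many edge cases:

```php
<?php
// Simplified sketch of resolving a link's href against the current
// page URI, as getUri() does. Hypothetical helper, not Symfony's code.
function resolveHref(string $href, string $currentUri): string
{
    // Already an absolute URL? Return it unchanged.
    if (preg_match('#^https?://#', $href)) {
        return $href;
    }
    // Fragment only: append it to the current page URI.
    if (strpos($href, '#') === 0) {
        return $currentUri . $href;
    }
    $parts = parse_url($currentUri);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative path: resolve against scheme + host.
    if (strpos($href, '/') === 0) {
        return $base . $href;
    }
    // Relative path: resolve against the current directory.
    $dir = rtrim(dirname($parts['path'] ?? '/'), '/');
    return $base . $dir . '/' . $href;
}

echo resolveHref('#foo', 'https://www.reddit.com/r/php'), "\n";
echo resolveHref('/new/', 'https://www.reddit.com/r/php'), "\n";
```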
See the links section of the Symfony DomCrawler documentation.
We can scrape many attribute values at once. For example, this line fetches the text, class, and URL of each matched element:
$output = $crawler->filter($css_selector)->extract(array('_text', 'class', 'href'));
For the Reddit example, the above code would return entries like:
[0]=>
array(3) {
  [0]=>
  string(17) "from sad to happy"
  [1]=>
  string(24) "title may-blank outbound"
  [2]=>
  string(30) "http://i.imgur.com/P45maQC.gif"
}
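Because extract() returns each match as a positional array in the same order as the attribute list, it can be handy to re-key the rows by attribute name. Here is a small helper of my own (labelRows is not part of Goutte; the sample row mirrors the output above):

```php
<?php
// Re-key extract() rows by the attribute names that were requested.
function labelRows(array $rows, array $keys): array
{
    return array_map(function (array $row) use ($keys) {
        return array_combine($keys, $row);
    }, $rows);
}

$rows = array(
    array('from sad to happy', 'title may-blank outbound', 'http://i.imgur.com/P45maQC.gif'),
);
$labeled = labelRows($rows, array('_text', 'class', 'href'));

echo $labeled[0]['href'], "\n"; // http://i.imgur.com/P45maQC.gif
```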
Consider once again Reddit's front page. There are a bunch of posts. What if we want only the second post, not all of them? We can access a node by its position in the list:
$output = $crawler
    ->filter('a.title.may-blank') // CSS selector
    ->eq(1)                       // node position (zero-based)
    ->extract('_text');           // DOM attribute to extract
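Under the hood, DomCrawler compiles the CSS selector to XPath and indexes into the matched nodes. The same idea can be shown with PHP's built-in DOMDocument and DOMXPath, no Goutte required; the HTML fragment below is made up for illustration:

```php
<?php
// Select the second matching link from an HTML fragment using only
// PHP's standard DOM extension. The XPath expression is roughly what
// the CSS selector "a.title.may-blank" compiles to.
$html = <<<HTML
<div>
  <a class="title may-blank" href="/post1">first post</a>
  <a class="title may-blank" href="/post2">second post</a>
  <a class="other" href="/ad">an ad</a>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = $xpath->query(
    '//a[contains(concat(" ", normalize-space(@class), " "), " title ")' .
    ' and contains(concat(" ", normalize-space(@class), " "), " may-blank ")]'
);

// eq(1) in DomCrawler uses the same zero-based indexing as item(1) here.
echo $links->item(1)->textContent, "\n"; // second post
```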
Scraping multiple pages is easy with Goutte. We just need a foreach loop.
First we create an array of links:
// links to scrape
$url = array(
    'https://www.reddit.com',
    'https://www.reddit.com/new/',
    'https://www.reddit.com/rising/'
);
And here's what the foreach loop might look like:
require 'vendor/autoload.php';

use Goutte\Client;

$url = array(
    'https://www.reddit.com',
    'https://www.reddit.com/new/',
    'https://www.reddit.com/rising/'
);
$selector = "a.title.may-blank";
$attribute = "_text";

$client = new Client();

foreach ($url as $key => $value) {
    // $value is 'https://www.reddit.com' on the first iteration,
    // 'https://www.reddit.com/new/' on the second, and so forth.
    $crawler = $client->request('GET', $value);
    $output[$key] = $crawler
        ->filter($selector)
        ->eq(1) // scrape the second post only
        ->extract($attribute);
}

var_dump($output);
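Once the loop finishes, $output holds one array of titles per URL. To persist the results, you might flatten them into a CSV with PHP's built-in fputcsv. This is a sketch of my own; the file name, the sample titles, and the array shapes are assumptions:

```php
<?php
// Flatten scraped titles into a CSV, one row per (source URL, title).
// $results stands in for the $output produced by the loop above.
$results = array(
    0 => array('from sad to happy'),
    1 => array('another headline'),
);
$urls = array('https://www.reddit.com', 'https://www.reddit.com/new/');

$fh = fopen('titles.csv', 'w');
foreach ($results as $key => $titles) {
    foreach ($titles as $title) {
        // Each row pairs the page the title came from with the title itself.
        fputcsv($fh, array($urls[$key], $title));
    }
}
fclose($fh);

echo file_get_contents('titles.csv');
```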
Clone this repository for a quick start: https://github.com/unshift/goutte-php-scraper-boilerplate/blob/master/index.php#L25
That's it! Now go forth and scrape responsibly.