I needed to scrape Reddit posts for a project. After experimenting with a few tools, I decided to document the process. The workflow generalizes to scraping almost anything.
First, what is a web scraper? You could manually copy and paste data from websites into a spreadsheet, but that would take far too long if you are working on a big project.
Scraping lets you programmatically copy data at a high throughput.
For this project, I chose PHP because it offers some convenient syntactic sugar.
The best PHP scraping library is Goutte.
First, you'll need to install Composer, the package manager for PHP. In your terminal, type:
# Install Composer
curl -sS https://getcomposer.org/installer | php
Now add Goutte as a dependency to your project.
composer require fabpot/goutte
But Goutte requires Guzzle, so you'll need to install that too.
Guzzle is a PHP HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services.
composer require guzzlehttp/guzzle
Now if you open composer.json it should look something like:
{
"name": "my_project",
"require": {
"fabpot/goutte": "^3.2",
"guzzlehttp/guzzle": "^6.2"
}
}
After installing Goutte and Guzzle, you need to include Composer's autoloader.
Create a file called index.php and at the top add:
require 'vendor/autoload.php';
To scrape stuff with Goutte, you only need to write a few lines of code. For example:
require 'vendor/autoload.php';

use Goutte\Client;

$url = "http://www.reddit.com";
$css_selector = "a.title.may-blank";
$thing_to_scrape = "_text";

$client = new Client();
$crawler = $client->request('GET', $url);
$output = $crawler->filter($css_selector)->extract($thing_to_scrape);

var_dump($output);
With this PHP snippet, we create a client, send a GET request to Reddit, filter the DOM with a CSS selector, and extract the text of every matching element.
Finding the CSS selector is the easy part. Using Chrome's developer tools, you can inspect any element to reveal its attributes.
On Reddit, I right-clicked a post title and noted the link element's classes: title, may-blank, and outbound.
Return to the terminal and run php index.php to execute the script.
You should see output like:
array(25) {
  [0]=>
  string(17) "from sad to happy"
  [1]=>
  string(183) "The president of the Philippines Rodrigo Duterte should be investigated for murder after boasting he "personally" killed three suspected criminals, a top United Nations official said."
  ...
}
This is just an array containing all of the Reddit post titles on the front page.
The power of this approach is that we can target the elements to scrape using a CSS selector.
This is reminiscent of JavaScript and jQuery, where you can manipulate the DOM with similar syntax, e.g.:
var title = $('a.title.may-blank').text();
In the previous example we scraped the post titles on Reddit's front page, so we passed _text to extract()
to get each element's text. extract() can also fetch any HTML attribute of the matched elements, such as href or class.
Note: Symfony's DomCrawler gives links special treatment.
So instead of extracting the href attribute yourself,
you can write:
$link = $crawler->filter('a.title.may-blank')->link();
$uri = $link->getUri();
getUri() is especially useful because it cleans the href value and transforms it into the form you actually want to process. For example, for a link with href="#foo", it returns the full URI of the current page suffixed with #foo. The return value of getUri() is always a full URI you can act on.
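To make that resolution concrete, here is a simplified sketch in plain PHP of roughly what getUri() does. resolveHref is a hypothetical helper of my own, not the actual Symfony implementation, and it ignores many edge cases:

```php
<?php
// Simplified sketch of resolving a link's href against the current
// page URI, as getUri() does. Hypothetical helper, not Symfony's code.
function resolveHref(string $href, string $currentUri): string
{
    // Already an absolute URL? Return it unchanged.
    if (preg_match('#^https?://#', $href)) {
        return $href;
    }
    // Fragment only: append it to the current page URI.
    if (strpos($href, '#') === 0) {
        return $currentUri . $href;
    }
    $parts = parse_url($currentUri);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative path: resolve against scheme + host.
    if (strpos($href, '/') === 0) {
        return $base . $href;
    }
    // Relative path: resolve against the current directory.
    $dir = rtrim(dirname($parts['path'] ?? '/'), '/');
    return $base . $dir . '/' . $href;
}

echo resolveHref('#foo', 'https://www.reddit.com/r/php'), "\n";
echo resolveHref('/new/', 'https://www.reddit.com/r/php'), "\n";
```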
See the links section of the Symfony DomCrawler documentation.
We can scrape many attribute values at once. For example, this line fetches the text, class, and URL of each matched element:
$output = $crawler->filter($css_selector)->extract(array('_text', 'class', 'href'));
For the Reddit example, the above code would return entries like:
[0]=>
array(3) {
  [0]=>
  string(17) "from sad to happy"
  [1]=>
  string(24) "title may-blank outbound"
  [2]=>
  string(30) "http://i.imgur.com/P45maQC.gif"
}
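Because extract() returns each match as a positional array in the same order as the attribute list, it can be handy to re-key the rows by attribute name. Here is a small helper of my own (labelRows is not part of Goutte; the sample row mirrors the output above):

```php
<?php
// Re-key extract() rows by the attribute names that were requested.
function labelRows(array $rows, array $keys): array
{
    return array_map(function (array $row) use ($keys) {
        return array_combine($keys, $row);
    }, $rows);
}

$rows = array(
    array('from sad to happy', 'title may-blank outbound', 'http://i.imgur.com/P45maQC.gif'),
);
$labeled = labelRows($rows, array('_text', 'class', 'href'));

echo $labeled[0]['href'], "\n"; // http://i.imgur.com/P45maQC.gif
```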
Consider once again Reddit's front page. There are a bunch of posts. What if we want only the second post, not all of them? We can access a node by its position in the list:
$output = $crawler
    ->filter('a.title.may-blank') // CSS selector
    ->eq(1)                       // node position (zero-based)
    ->extract('_text');           // DOM attribute to extract
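Under the hood, DomCrawler compiles the CSS selector to XPath and indexes into the matched nodes. The same idea can be shown with PHP's built-in DOMDocument and DOMXPath, no Goutte required; the HTML fragment below is made up for illustration:

```php
<?php
// Select the second matching link from an HTML fragment using only
// PHP's standard DOM extension. The XPath expression is roughly what
// the CSS selector "a.title.may-blank" compiles to.
$html = <<<HTML
<div>
  <a class="title may-blank" href="/post1">first post</a>
  <a class="title may-blank" href="/post2">second post</a>
  <a class="other" href="/ad">an ad</a>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = $xpath->query(
    '//a[contains(concat(" ", normalize-space(@class), " "), " title ")' .
    ' and contains(concat(" ", normalize-space(@class), " "), " may-blank ")]'
);

// eq(1) in DomCrawler uses the same zero-based indexing as item(1) here.
echo $links->item(1)->textContent, "\n"; // second post
```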
Scraping multiple pages is easy with Goutte. We just need a foreach loop.
First we create an array of links:
// links to scrape
$url = array(
    'https://www.reddit.com',
    'https://www.reddit.com/new/',
    'https://www.reddit.com/rising/'
);
And here's what the foreach loop might look like:
require 'vendor/autoload.php';

use Goutte\Client;

$url = array(
    'https://www.reddit.com',
    'https://www.reddit.com/new/',
    'https://www.reddit.com/rising/'
);
$selector = "a.title.may-blank";
$attribute = "_text";

$client = new Client();

foreach ($url as $key => $value) {
    // $value is 'https://www.reddit.com' on the first iteration,
    // 'https://www.reddit.com/new/' on the second, and so forth.
    $crawler = $client->request('GET', $value);
    $output[$key] = $crawler
        ->filter($selector)
        ->eq(1) // scrape the second post only
        ->extract($attribute);
}

var_dump($output);
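Once the loop finishes, $output holds one array of titles per URL. To persist the results, you might flatten them into a CSV with PHP's built-in fputcsv. This is a sketch of my own; the file name, the sample titles, and the array shapes are assumptions:

```php
<?php
// Flatten scraped titles into a CSV, one row per (source URL, title).
// $results stands in for the $output produced by the loop above.
$results = array(
    0 => array('from sad to happy'),
    1 => array('another headline'),
);
$urls = array('https://www.reddit.com', 'https://www.reddit.com/new/');

$fh = fopen('titles.csv', 'w');
foreach ($results as $key => $titles) {
    foreach ($titles as $title) {
        // Each row pairs the page the title came from with the title itself.
        fputcsv($fh, array($urls[$key], $title));
    }
}
fclose($fh);

echo file_get_contents('titles.csv');
```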
Clone this repository for a quick start: https://github.com/unshift/goutte-php-scraper-boilerplate/blob/master/index.php#L25
That's it! Now go forth and scrape responsibly.