Build a Web Scraper with jq and pup

16/03/2021

Recently I wanted to build a rudimentary webpage scraper which would run from the command line.

As I progressed I noticed myself having to extensively grep, cut and stitch bits and pieces of logic together.

One of my rules is that if the task you’re attempting seems fairly standard, someone has probably already done it, or at least built utilities to make it easier. They’ve probably put more time and effort into it than you would, so their solution is likely better than anything you’d knock together yourself. Don’t reinvent the wheel.

This led me to two command-line tools: pup and jq.

pup is a handy HTML parser for the command line and jq is the equivalent JSON parser.
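
For a rough feel of each tool, here are two throwaway one-liners (not related to the page we’ll scrape later):

echo '<p class="greeting">Hello</p>' | pup 'p.greeting'
echo '{"greeting": "Hello"}' | jq '.greeting'

The first prints the matching <p> element; the second prints "Hello".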

By combining these two utilities, it was very easy to achieve my goal without lines and lines of convoluted code.

Scenario

Say I want to extract some information from this page: https://agardner.net/serverless-voting/. Take a look at the source code of that page and you’ll see standard HTML text like this:

<!DOCTYPE html>
<html>
<head>
    <!-- Begin Jekyll SEO tag v2.7.1 -->
    <title>Serverless, Zero Database Voting System | Adam Gardner</title>
    <meta name="generator" content="Jekyll v3.9.0"/>
    <meta property="og:title" content="Serverless, Zero Database Voting System"/>
    <meta name="author" content="Adam Gardner"/>

You’ll also see <script> tags, one of which looks like this:

<script type="application/ld+json">
    {
      "@type":"BlogPosting",
      "@context":"https://schema.org"
      "headline":"Serverless, Zero Database Voting System",
      "dateModified":"2020-06-14T00:00:00+00:00",
      "datePublished":"2020-06-14T00:00:00+00:00",
      "mainEntityOfPage":{
        "@type":"WebPage",
        "@id":"https://agardner.net/serverless-voting/"
      },
      "author":{
        "@type":"Person",
        "name":"Adam Gardner"
      },
      "url":"https://agardner.net/serverless-voting/",
      "description":"I needed a voting system for this website which was compatible with serverless pages. I also wanted it to be zero-login which ruled out using a third-party plugin. The result was a serverless, zero database &amp; zero login voting system using AWS. Here is how…"
    }
</script>

In this scenario, I want to extract the content of the <title> tag and then extract the content of the datePublished field from this JSON snippet.

Install both tools

jq is available from most package managers, for example:

brew install jq
OR
sudo apt-get install jq

Depending on your platform, pup can be installed in different ways. Easiest is to use go get or brew.

go get github.com/ericchiang/pup
OR
brew install pup

Alternatively, see instructions on the releases page.

If all is successful, these two commands should each print a version number:

jq --version
pup --version

Extracting Title

Remember that I want to do this via a command line script, so create a new file in /tmp called scraper.sh.

Make it executable:

chmod +x /tmp/scraper.sh

Paste the following content:

#!/bin/bash
curl $URL

Now set the URL value and call scraper.sh:

URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh

Notice that it prints the entire HTML content, but we only want the <title> tag, so modify scraper.sh to look like this:

#!/bin/bash
title=$(curl $URL | pup 'title')
echo $title

Here we’re piping the output of curl to the pup command and asking pup to print only the <title> tag.

The output should look like this:

% URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24491  100 24491    0     0   254k      0 --:--:-- --:--:-- --:--:--  254k
<title> Serverless, Zero Database Voting System | Adam Gardner </title>

Let’s clean that up a bit. Modify /tmp/scraper.sh so that the curl command uses the silent (-s) flag. Adding this flag hides curl’s download progress output.

Then add the text{} display function to the pup selector. This tells pup that we only want the text inside the tag, not the opening and closing tags themselves.
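
To see what text{} does in isolation, compare these two throwaway commands:

echo '<title>Example</title>' | pup 'title'
echo '<title>Example</title>' | pup 'title text{}'

The first prints the whole <title> element; the second prints just Example.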

Your script should now look like this:

#!/bin/bash
title=$(curl -s $URL | pup 'title text{}')
echo $title

Re-run and you should see this output:

% URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh
Serverless, Zero Database Voting System | Adam Gardner
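
Before moving on, note that pup understands CSS-style selectors, not just bare tag names. As an aside (not part of the script we’re building), you could match on an attribute to grab one of the <meta> tags we saw in the page source:

curl -s $URL | pup 'meta[name="generator"]'

We’ll use exactly this kind of attribute selector in the next step.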

Extracting Date Published

Recall that there’s a block of JSON in the source code which contains the information we need:

<script type="application/ld+json">
    {
      "@type":"BlogPosting",
      "@context":"https://schema.org"
      "headline":"Serverless, Zero Database Voting System",
      "dateModified":"2020-06-14T00:00:00+00:00",
      "datePublished":"2020-06-14T00:00:00+00:00",
      "mainEntityOfPage":{
        "@type":"WebPage",
        "@id":"https://agardner.net/serverless-voting/"
      },
      "author":{
        "@type":"Person",
        "name":"Adam Gardner"
      },
      "url":"https://agardner.net/serverless-voting/",
      "description":"I needed a voting system for this website which was compatible with serverless pages. I also wanted it to be zero-login which ruled out using a third-party plugin. The result was a serverless, zero database &amp; zero login voting system using AWS. Here is how…"
    }
</script>

From this JSON snippet we need to extract the datePublished field:

"2020-06-14T00:00:00+00:00"

Adjust your scraper.sh file to look like this:

#!/bin/bash
page_html=$(curl -s $URL)
title=$(echo $page_html | pup 'title text{}')
echo $title

All we’ve done here is store the output of the curl command into a variable called page_html so we can manipulate and query the HTML without repeated curl calls to the website.
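
Because page_html now holds the whole page, we can run as many pup queries against it as we like without re-downloading anything. As a hypothetical aside (again, not part of our script), pup’s attr{} display function could pull just the content attribute of the author <meta> tag:

echo $page_html | pup 'meta[name="author"] attr{content}'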

Use pup to extract the <script type="application/ld+json"> tag. Modify your scraper.sh again:

#!/bin/bash
page_html=$(curl -s $URL)
title=$(echo $page_html | pup 'title text{}')
date_published=$(echo $page_html | pup 'script[type="application/ld+json"] text{}')
echo $title
echo $date_published

This sort of works, but it outputs the title and then the entire JSON object, which isn’t quite what we want:

% URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh
Serverless, Zero Database Voting System | Adam Gardner
{"@type":"BlogPosting","headline":"Serverless, Zero Database Voting System","dateModified":"2020-06-14T00:00:00+00:00","datePublished":"2020-06-14T00:00:00+00:00","mainEntityOfPage":{"@type":"WebPage","@id":"https://agardner.net/serverless-voting/"},"author":{"@type":"Person","name":"Adam Gardner"},"url":"https://agardner.net/serverless-voting/","description":"I needed a voting system for this website which was compatible with serverless pages. I also wanted it to be zero-login which ruled out using a third-party plugin. The result was a serverless, zero database &amp; zero login voting system using AWS. Here is how…","@context":"https://schema.org"}

We need to take this JSON output and pass it to jq. jq works by taking some input and a filter that describes the output you want:

some input | jq 'some_filter'

The most basic filter just asks for the entire input document back as output. The . character is the shorthand for this. Run the following in a terminal window:

echo '{"foo": "bar"}' | jq '.'

You should see a pretty-printed JSON object as output:

%  echo '{"foo": "bar"}' | jq '.'
{
  "foo": "bar"
}

After the . you can request any JSON field. So to get "bar" as the output, do this:

echo '{"foo": "bar"}' | jq '.foo'

You’ll see:

% echo '{"foo": "bar"}' | jq '.foo'
"bar"

We can use this concept in our script: after we’ve used pup to retrieve the <script> tag, we’ll use jq to pull out only the datePublished field.

Modify scraper.sh as follows:

#!/bin/bash
page_html=$(curl -s $URL)
title=$(echo $page_html | pup 'title text{}')
date_published=$(echo $page_html | pup 'script[type="application/ld+json"] text{}' | jq '.datePublished')
echo $title
echo $date_published

So we’re echoing the page_html content, using pup to extract just the <script type="application/ld+json"> block, then passing that extracted value to jq and asking for the datePublished value. Which gives us:

% URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh
Serverless, Zero Database Voting System | Adam Gardner
"2020-06-14T00:00:00+00:00"

One final (optional) cleanup step would be to remove the quotation marks. That’s as easy as adding the -r flag to jq. As jq --help suggests:

-r    output raw strings, not JSON texts;
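
Applied to the earlier toy example:

echo '{"foo": "bar"}' | jq -r '.foo'

prints bar with no surrounding quotes.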

So:

#!/bin/bash
# Download the page once and reuse the HTML for both queries
page_html=$(curl -s $URL)
# Text of the <title> tag
title=$(echo $page_html | pup 'title text{}')
# The JSON-LD block, with datePublished pulled out as a raw string
date_published=$(echo $page_html | pup 'script[type="application/ld+json"] text{}' | jq -r '.datePublished')
echo $title
echo $date_published

Prints this:

% URL=https://agardner.net/serverless-voting/ /tmp/scraper.sh
Serverless, Zero Database Voting System | Adam Gardner
2020-06-14T00:00:00+00:00
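
As an optional extra (a defensive sketch, not something the walkthrough above needs), you could harden the script slightly: quote the variables so that URLs or HTML containing spaces and special characters survive intact, fail fast if any stage of the pipeline breaks, and refuse to run if URL isn’t set:

#!/bin/bash
set -euo pipefail  # exit on errors, unset variables and failed pipeline stages

# Require the URL environment variable, with a friendly message if it is missing
: "${URL:?set URL, e.g. URL=https://example.com /tmp/scraper.sh}"

# Download the page once
page_html=$(curl -s "$URL")

# Text of the <title> tag
title=$(echo "$page_html" | pup 'title text{}')

# datePublished from the JSON-LD <script> block, as a raw string
date_published=$(echo "$page_html" | pup 'script[type="application/ld+json"] text{}' | jq -r '.datePublished')

echo "$title"
echo "$date_published"

The quoting is the important part: unquoted variables happen to work for this particular page, but they can break on content containing spaces or shell-special characters.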