Quick Start

Installation

Scrapelect is installed with cargo, so it requires the Rust toolchain. If you don't have the toolchain installed, you can get it with rustup (see the note below). With Rust and cargo installed, run

$ cargo install scrapelect

to install the scrapelect interpreter.
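
If you don't have cargo yet, rustup is the standard way to install the Rust toolchain. At the time of writing, the rustup website suggests a command along these lines (check https://rustup.rs for the current instructions for your platform):

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh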

Your first scrp

A scrapelect program is stored in a file with the .scrp extension. Let's create a file called article.scrp with the following contents:

title: .mw-page-title-main {
  content: $element | text();
};

headings: .mw-heading > * {
  content: $element | text();
}*;

This program describes the data on a web page in terms of how to find it (by CSS selector) and what we want to do with it (get the text of the title and the headings). The trailing `*` after the headings block collects every element matching the selector into a list, while the title block, with no qualifier, matches a single element. Because a scrapelect program is written against a particular page structure, this one works when the page's title is in an HTML element with the class "mw-page-title-main" and its headings are inside elements with the class "mw-heading". That is how Wikipedia structures its articles, so this program lets us scrape them.
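
For example, these selectors would match HTML roughly like the following (a simplified sketch; the real Wikipedia markup has more attributes and nesting around these elements):

<h1>
  <span class="mw-page-title-main">Cat</span>
</h1>

<div class="mw-heading mw-heading2">
  <h2>Etymology and naming</h2>
</div>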

After saving the file as article.scrp, let's run it on the Wikipedia entry for "Cat":

$ scrapelect article.scrp "https://en.wikipedia.org/wiki/Cat"

I got an error like `command not found: scrapelect`!

This means the scrapelect executable is not on your PATH. By default, cargo installs binaries to $HOME/.cargo/bin, so on Linux the executable ends up at $HOME/.cargo/bin/scrapelect. Try adding the directory ~/.cargo/bin to your PATH if it is not already present.
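
On Linux or macOS with a bash- or zsh-style shell, that could look something like this (a sketch; adjust the path and profile file for your setup):

$ export PATH="$HOME/.cargo/bin:$PATH"

This only affects the current shell session; to make the change permanent, add the same export line to your shell profile (for example, ~/.bashrc or ~/.zshrc).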

The rustup book may have more information, or try searching "add cargo binaries to PATH" for your operating system.

Let's see the output for that scrp:

{
  "headings": [
    { "content": "Etymology and naming" },
    { "content": "Taxonomy" },
    { "content": "Evolution" },
    { "content": "Domestication" },
    { "content": "Characteristics" },
    { "content": "Size" },
    { "content": "Skeleton" },
    { "content": "Skull" },
    { "content": "Claws" },
    { "content": "Ambulation" },
    { "content": "Balance" },
    { "content": "Coats" },
    { "content": "Senses" },
    { "content": "Vision" },
    { "content": "Hearing" },
    { "content": "Smell" },
    { "content": "Taste" },
    { "content": "Whiskers" },
    { "content": "Behavior" },
    { "content": "Sociability" },
    { "content": "Communication" },
    { "content": "Grooming" },
    { "content": "Fighting" },
    { "content": "Hunting and feeding" },
    { "content": "Play" },
    { "content": "Reproduction" },
    { "content": "Lifespan and health" },
    { "content": "Disease" },
    { "content": "Ecology" },
    { "content": "Habitats" },
    { "content": "Ferality" },
    { "content": "Impact on wildlife" },
    { "content": "Interaction with humans" },
    { "content": "Shows" },
    { "content": "Infection" },
    { "content": "History and mythology" },
    { "content": "Superstitions and rituals" },
    { "content": "See also" },
    { "content": "Notes" },
    { "content": "References" },
    { "content": "External links" }
  ],
  "title": {
    "content": "Cat"
  }
}

With just that short description, we've collected the text of every heading in the article, as well as its title. And because the output is JSON, it's easily parsable by other programs, too.
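
For example, assuming scrapelect writes the JSON to standard output, you could pipe the result straight into a tool like jq for further processing:

$ scrapelect article.scrp "https://en.wikipedia.org/wiki/Cat" | jq '.headings | length'
41

(41 is simply the number of headings in the output above; it will change as the article changes.)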


In the following chapters, we'll examine the language concepts and syntax so that you can create scrps like this one and obtain structured data from any web page.