The Scrapelect Book

Welcome to the scrapelect book. scrapelect is a declarative web-scraping language: you describe how to find data on a web page and how to filter and process that data, and you get output in a structured, machine-readable format.

scrapelect is currently in development, and the language and interpreter are changing. This book aims to be up to date with the latest released version (currently v0.3.2). If something is inconsistent or incorrect, please consider submitting an issue or pull request to help improve the documentation.

Helpful links:

  • GitHub repository: contains the source code for scrapelect (and this book).
  • docs.rs: lists developer documentation for contributing to or extending the scrapelect interpreter, as well as user documentation for scrapelect's built-in filters.
  • GitHub issue tracker: the place to search and file issues to report bugs, request features, and ask questions.

Quick Start

Installation

scrapelect is installed with cargo, which requires the Rust toolchain. If you don't have it installed, you can use rustup. With Rust and cargo installed, run

$ cargo install scrapelect

to install the scrapelect interpreter.

Your first scrp

A scrapelect program is stored in a name.scrp file. Let's create and edit the file article.scrp:

title: .mw-page-title-main {
  content: $element | text();
};

headings: .mw-heading > * {
  content: $element | text();
}*;

This program describes the data on a web page by how to find it (by CSS selector) and what we want to do with it (get the text of the title and headings). A scrapelect program describes a specific page structure, so this program works when the page's title is stored in an HTML element with the class "mw-page-title-main" and its headings in elements with the class "mw-heading". In this case, that lets us scrape Wikipedia articles.

After saving the file to article.scrp, let's run it on the Wikipedia entry for "cat":

$ scrapelect article.scrp "https://en.wikipedia.org/wiki/Cat"
I got an error like `command not found: scrapelect`!

This means the scrapelect executable is not in your PATH. By default, cargo installs binaries to $HOME/.cargo/bin (on Linux), so the executable lives at $HOME/.cargo/bin/scrapelect. Try adding the directory ~/.cargo/bin to your PATH if it is not already present.

The rustup book may have more information, or try searching "add cargo binaries to PATH" for your operating system.
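
For example, in a bash-like shell on Linux (a sketch; adjust the path and config file for your OS and shell), you can run

$ export PATH="$HOME/.cargo/bin:$PATH"

and add the same line to your shell configuration (such as ~/.bashrc) to make it persistent.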

Let's see the output for that scrp:

{
  "headings": [
    { "content": "Etymology and naming" },
    { "content": "Taxonomy" },
    { "content": "Evolution" },
    { "content": "Domestication" },
    { "content": "Characteristics" },
    { "content": "Size" },
    { "content": "Skeleton" },
    { "content": "Skull" },
    { "content": "Claws" },
    { "content": "Ambulation" },
    { "content": "Balance" },
    { "content": "Coats" },
    { "content": "Senses" },
    { "content": "Vision" },
    { "content": "Hearing" },
    { "content": "Smell" },
    { "content": "Taste" },
    { "content": "Whiskers" },
    { "content": "Behavior" },
    { "content": "Sociability" },
    { "content": "Communication" },
    { "content": "Grooming" },
    { "content": "Fighting" },
    { "content": "Hunting and feeding" },
    { "content": "Play" },
    { "content": "Reproduction" },
    { "content": "Lifespan and health" },
    { "content": "Disease" },
    { "content": "Ecology" },
    { "content": "Habitats" },
    { "content": "Ferality" },
    { "content": "Impact on wildlife" },
    { "content": "Interaction with humans" },
    { "content": "Shows" },
    { "content": "Infection" },
    { "content": "History and mythology" },
    { "content": "Superstitions and rituals" },
    { "content": "See also" },
    { "content": "Notes" },
    { "content": "References" },
    { "content": "External links" }
  ],
  "title": {
    "content": "Cat"
  }
}

We've collected the content of each heading in this article, as well as its title, with just that description. And the output is easily parsable by other programs, too.


In the following chapters, we'll examine the language concepts and syntax so that you can create scrps like this one and obtain structured data from any web page.

Interpreter CLI

Currently, the command-line interface of the scrapelect interpreter is very simple. The interpreter is invoked as:

$ scrapelect <scrp_path> <url>

where scrp_path is the path to the .scrp file, and url is a fully qualified, absolute URL. That is, url must start with a scheme (like https:// or file://).

Currently, scrapelect supports the following URL schemes:

  • http://, https://: a "typical" URL to a web page.
  • file://: read a file on the local device, without using the internet.
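
For example, to run the quick-start article.scrp against a locally saved copy of a page (a hypothetical path), use the file:// scheme:

$ scrapelect article.scrp "file:///home/user/cat.html"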

Language Concepts

A scrapelect program is a sequence of statements that describes how to turn a web page into structured data. This description includes where the data is located on the web page (elements and selectors) and how to process that data using filters to get the structured output that you desire.

scrapelect is in beta, so it's possible that changes to the language on the dev branch may not be reflected yet in this book. The information should be correct for the latest released version (v0.3.2). If the documentation is incorrect or could be improved, consider filing an issue or pull request, as scrapelect is an open source project.

Statements and Values

Statements

Statements are the basic building block of scrapelect programs. At its core, a statement is a binding name: value;, which means "bind value to the name name" in the current context of the program.

When the program finishes, the scrapelect interpreter will output structured data as defined by the statements in the program.

The simplest kind of statement

Take this short program, for example:

cat-says: "meow";

This program consists of one statement, which binds the name cat-says to the value "meow". When scrapelect runs this program, it will output the following JSON:

{
  "cat-says": "meow"
}

This is the core of a scrapelect program: assigning values to names, which are output as structured data. But what exactly can a value be?

Values

In scrapelect, every value has a type. These types are currently:

  • Int: an integer (such as -1, 0, or 48)
  • Float: a floating-point (decimal) number (such as 1.0, 0.6931, or -7.29)
  • String: a string of text characters, such as "hello!", "meow", or "" (the empty string)
  • Bool: a Boolean value of true or false
  • List: an ordered collection of other values, such as [1, "hello!", 1.0] or [] (the empty list)
  • Structure: a nested structure of values, where each value is bound to a String key (like { greeting: "hi there" } or {} (the empty structure))
  • Null: represents a value of nothing
  • Element: an HTML element that can be queried and scraped[1]

Constants

The simplest type of value is a constant: a fixed value determined when the program is written, not dependent on other bindings or the state of execution. There are three ways to specify a constant:

  • A string constant, wrapped in double quotes, such as my-string: "hello!";
  • An integer constant as a number literal, such as one: 1;
  • A floating-point constant as a number literal with a decimal point, such as half: 0.5;

Currently it is not possible to have a list or structure constant, but this may be added in future versions of scrapelect.
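
For example, a scrp using all three constant forms:

my-string: "hello!";
one: 1;
half: 0.5;

will output:

{
  "my-string": "hello!",
  "one": 1,
  "half": 0.5
}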

Reading bindings

A value can also be created by reading the value of a previous binding, with $name, where name is the name of the previous binding.

Example

greeting: "hello!";
message: $greeting;

will output:

{
  "greeting": "hello!",
  "message": "hello!"
}

Note that scrapelect programs are executed in sequential order, so you can only read bindings that were defined above the statement that uses them. Also, only bindings in the current context or an outer one can be accessed this way (covered in more depth in the section on element contexts).

The next (and maybe most important) type of value is the Element which will let us read data from a web page, explained in the next section.

Shadowing

It is possible for two statements to bind values to the same name in the same context. This is called shadowing. Only the last (bottommost) statement with a given name will appear in the program output, but at any point, referencing $name gives the most recently defined binding of that name.

Example

output: "Not me!";
output: "or me...";
// save $output at this point in time
snapshot: $output;
output: "I will be the final result!";

will output:

{
  "output": "I will be the final result!",
  "snapshot": "or me..."
}
[1]: Elements are only available inside an element context and are not outputted in the final result.

Elements and Selectors

An element is a special kind of Value that is used to scrape and get information about a part of a web page. An element represents an HTML element, such as <img src="https://cdn2.thecatapi.com/images/edq.jpg" /> or <h1>My title</h1> or many others. An element value is only valid within an element context, a block of statements where the special binding $element contains the element.

Selecting an element

In scrapelect, we identify an element by its CSS selector. This can be simple (e.g., the tag name: h1) or arbitrarily complex, because selectors can be combined to identify any element. See the MDN CSS selector reference for a full guide on writing CSS selectors[1].

Common selector patterns:

  • tag: Selects elements with the tag name tag: <tag></tag>
  • #id: Selects the element with ID id: <a id="id"></a>
  • .class: Selects elements with CSS class class: <x class="class" />
  • a#b.c.d: Combine selectors (without whitespace) to select an element with all of these properties: <a id="b" class="c d">...</a>
  • a#b .c.d: Combine selectors with whitespace to select an element inside a parent element: selecting the span in <a id="b"><span class="c d">...</span></a>
  • ...and many more. See the MDN reference for more.
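
For instance, combining these patterns, the selector ul#menu li.active matches the second li in this made-up fragment:

<ul id="menu">
  <li>Home</li>
  <li class="active">Docs</li>
</ul>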

Creating an element context

An element context is a selector block with a list of statements that evaluates to a nested structure when interpreted. Inside the block, the binding $element provides access to the element specified by the selector.

Example

On the following fragment:

<a>Not special</a>
<a id="special">Special</a>

the scrp:

special: #special {
  text: $element | text();
};

will output:

{
  "special": {
    "text": "Special"
  }
}

Notice how all bindings in the element context are evaluated into a nested structure stored in special. (Note: $element | text() calls the text filter on $element; filters will be explained in the filters chapter, but this one means "get the text inside the element $element".)

Nested contexts

Inside an element context, there are statements, and statements can bind names to element contexts. Thus, it is valid (and often useful) to have a nested element context. Inside a parent element context, a child element block's selector starts selecting from elements inside the parent element: thus, in the following example, calico will be selected, but not shih tzu:

<ul id="cats">
    <li>calico</li>
    <!-- ... -->
</ul>
<ul id="dogs">
    <li>shi tzu</li>
    <!-- ... -->
</ul>
cat: #cats {
  type: li {
    content: $element | text();
  }
}

will output

{
  "cat": {
    "type": {
      "content": "calico"
    }
  }
}

Scope

Inside an element context block, it is possible to read bindings from an outer context if they are declared above the current statement. However, if a binding with the same name exists in a more deeply nested context, it shadows the outer one.

Example

context: "outer";
outer: "outer";

parent: parent {
  context: "middle";
  child: child {
    context: $context;
    outer: $outer;
  };
};

outputs

{
  "context": "outer",
  "outer": "outer",
  "parent": {
    "child": {
      "context": "middle",
      "outer": "outer"
    },
    "context": "middle"
  }
}

Note that it is not directly possible to read bindings declared in a context nested more deeply than the current one, even if that block appears above the current statement, since an element context block is evaluated into a structure. However, with filters like take, it is possible to read this data, just not with the $name binding syntax.
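
For example (a sketch; the take filter appears in the next section):

inner: h1 {
  text: $element | text();
};
// $text is not accessible here by name, but take can read it from $inner:
copied: $inner | take(key: "text");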

Element lifetime

It is possible to rebind the value contained in $element. However, because an element is only valid inside its element context, such bindings will not be returned in the final output. In fact, any binding that still holds an element when the element block closes will be omitted from the returned structure.

Example

child: a {
  this: $element;
};
unexpected: $child | take(key: "this");

will output the following, where $child | take(key: "this") means "return the value with key "this" in the child structure, or null if it is not present":

{
  "child": {},
  "unexpected": null
}

Note that child is an empty structure, even though it bound this to $element.

Selecting multiple elements: qualifiers

By default, an element block selects only the first element that matches its selector, and raises an error if none is found. However, it is often useful to select all the elements that match a selector, or to select one optional element without raising an error when it does not exist. We specify how many elements to expect with qualifiers. A qualifier is placed at the end of an element context block, and can be one of:

  • `` (no qualifier): the default; selects the first element that matches the selector, and raises an error if there are none
  • ? (optional): similarly, selects the first element matching the selector, but the element context evaluates to null instead of erroring if there is no element matching that selector
  • * (all): select all elements matching this selector, evaluate the element block for each one, and place the results in a List

Examples

Take the document fragment:

<li>1</li>
<li class="even">2</li>
<li>3</li>
<li class="even">4</li>

Given the scrp:

// no qualifier (first)
first_num: li {
  text: $element | text();
};

// * qualifier (all)
numbers: li {
  text: $element | text();
}*;

// ? qualifier (optional)
optional: #not-here {
  text: $element | text();
}?;

will output:

{
  "first_num": { "text": "1" },
  "numbers": [
    { "text": "1" },
    { "text": "2" },
    { "text": "3" },
    { "text": "4" }
  ],
  "optional": null
}
[1]: Certain CSS features, like pseudo-classes and attribute selectors, are not currently supported in scrapelect.

Filters

Now that we've seen how to identify parts of a page to turn into data, let's look at how to manipulate that data. scrapelect does this using filters. We've already seen a couple, like take and text, but let's look at them more closely.

Every filter takes in a value and a list of named arguments (which can be empty), and returns the result of applying that filter.

We call a filter with value | filter_name(arg: value, ...). The simplest filter is the id filter, which takes the value and returns that same value: if we write a: 5 | id();, we get { "a": 5 }.

Another useful filter is the dbg filter, which, like id, returns the original value, but also prints it to the terminal. dbg has an optional String argument msg, which specifies a message to include. If msg is not provided, it prints debug message: ....

a: 1 | dbg();
b: 2 | dbg(msg: "from b");

will output to the console:

debug message: 1
from b: 2

and return { "a": 1, "b": 2 }.

Modifying filters

Often, though, it is useful to apply a filter that modifies the passed value in some way. One such filter is strip, which trims leading and trailing whitespace from a string; such whitespace is often found in the text inside HTML elements.

trimmed: "    hellooooo   " | strip();

outputs

{
  "trimmed": "hellooooo"
}

Note that this doesn't mutate the original value passed in; that is, if you read another binding as the input to a filter, applying the filter will not change the value in the original binding:

bind: "5";
new: $bind | int();

outputs (where int converts the value into the integer type):

{
  "bind": "5",
  "new": 5
}

where bind is still a string but new is an int.

Filter documentation

Documentation for all of the built-in filters is available at docs.rs, which lists filter signatures, descriptions, and examples.

Chaining filters

Filters are designed to be executed in a pipeline, passing the output of one filter to the input of another:

is-not-five: "5" | int() | eq(to: 5) | not();

outputs

{
  "is-not-five": false
}

where eq returns whether the value is equal to to, and not returns the boolean NOT (opposite) of the input value.

Qualifiers

Similar to element blocks, filters can have qualifiers at the end. The same qualifiers (none for one, ? for optional, and * for list) can be placed at the end of a filter call to modify its behavior:

  • value | filter(...)? applies the filter to value if it is not null, or returns null if it is.
  • value | filter(...)*, when value is a list, returns the list of every element in the list with the filter applied to it (similar to the map operation in other languages). It is an error to use the * qualifier if the input value is not a list.

Example

floats: "1 2.3 4.5" | split() | float()*;
optional: "3.4" | float()?;
optional2: null | float()?;

outputs the following, where split splits a string into a list of strings on a delimiter (by default, whitespace), and float converts a value into the float type:

{
  "floats": [1.0, 2.3, 4.5],
  "optional": 3.4,
  "optional2": null
}

Advanced Features

scrapelect also contains features that make the language more expressive for selecting and manipulating data. Note that these can also increase a program's complexity if overused, so it's recommended to use these features only as needed.

URL Recursion

Sometimes, a page contains links to another subpage, and it's necessary to follow that link to obtain the desired data. With scrapelect's URL recursion, it's possible to capture this pattern and select elements from a linked page:

Let's take the following pages as an example:

https://your-url.com/index.html

<!DOCTYPE html>
<html>
    <!-- ... -->
    <body>
        <p id="story">
            There once lived a great animal, which was great and also an animal.
        </p>
        <a id="next" href="page2.html">Continue</a>
    </body>
</html>

https://your-url.com/page2.html

<!DOCTYPE html>
<html>
    <!-- ... -->
    <body>
        <p id="story">
            This animal, which was great, was a great animal. The end.
        </p>
    </body>
</html>

Let's say we want to get both chapters of this lovely book. With URL recursion, we can!

next-page-link: #next {
  link: $element | take(key: "href");
} | take(key: "link");

page-1: #story {
  content: $element | text();
};

page-2: <$next-page-link> #story {
  content: $element | text();
};

By specifying the URL before the selector in the page-2 element block, we tell the scrapelect interpreter to read the page from the URL stored in next-page-link and select #story from that document. Thus, this will output:

{
  "next-page-link": "page2.html",
  "page-1": { "content": "There once lived a great animal, which was great and also an animal." },
  "page-2": { "content": "This animal, which was great, was a great animal. The end." }
}

Relative URLs (like page2.html and /from-page-root.html) are supported, as well as absolute URLs (like https://your-url.com/page1.html).

Note that the URL to recurse on is actually an inline value (more in the next section), so it is valid to have a filter chain, and the URL that scrapelect will use is the result of the filter pipeline. The final type of the value must be a String (recursion over lists of strings is not currently supported).
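
For example (a sketch building on the scrp above), the recursion URL can be the result of a filter chain:

page-2: <$next-page-link | strip()> #story {
  content: $element | text();
};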

Inline values

Like above, an inline value is a value and filter chain enclosed in diamond brackets: <value | filter() | filter() | ...>. It can be used in most places where a value is expected (filter arguments, URL recursion), but not as a statement's entire right-hand side (a value: <inline> expression is not supported, as the diamond brackets would be superfluous there). An inline is equivalent to writing intermediate: (inline-contents); and then using $intermediate in place of the inline, except that inline evaluations are not returned in the final output of a block.

Example

result: 5 | is_in(list: <"1 2 3 4 5" | split() | int()*>);

prints

{
  "result": true
}

and its equivalent

intermediate: "1 2 3 4 5" | split() | int()*; // [1, 2, 3, 4, 5]
result: 5 | is_in(list: $intermediate);

prints

{
  "intermediate": [1, 2, 3, 4, 5],
  "result": true
}

It is often more readable not to use inline values, and a named binding is more efficient when you need the same calculation multiple times. However, inlines are useful for hiding intermediate evaluations that are only used once.

Additionally, note that it is not valid to start an element context inside an inline value. If you need to do this, create an intermediate binding.
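
For example (a sketch), instead of starting an element block inside an inline, bind it first:

// Not valid: an element context cannot start inside an inline value:
// is-cat: "Cat" | eq(to: <h1 { c: $element | text(); } | take(key: "c")>);

// Instead, create an intermediate binding:
title: h1 {
  c: $element | text();
};
is-cat: "Cat" | eq(to: <$title | take(key: "c")>);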

Select filters

When you have a list, it is often useful to filter it so that it only contains elements that have some property. This is not possible to express with the * qualifier and filters alone, but scrapelect has a special kind of filter: the select filter.

The syntax of this filter is list | [name: value (| filters() | ...)]. Here, name is any identifier (usually item), and the scrapelect interpreter binds $name to each item of the list in turn while evaluating the value pipeline. The final result of this pipeline must be a Bool, and it determines whether to keep the item in the output list: if true, the item is kept; if false, it is discarded.

That may be a little abstract, so let's see an example:

// ["me", "my", "oh", "my"]
list: "me my oh my" | split();
m-words: "me my myself mother mom meow" | split();
// select all items that are equal to "oh"
oh: $list | [ item: $item | eq(to: "oh") ];
// select all items that are in our list of m words
only-ms: $list | [ item: $item | is_in(list: $m-words) ];
nothing: $list | [ item: $item | eq(to: "wow") ];

will output:

{
  "list": ["me", "my", "oh", "my"],
  "m-words": ["me", "my", "myself", "mother", "mom", "meow"],
  "oh": ["oh"],
  "only-ms": ["me", "my", "my"],
  "nothing": []
}

The order of the original items is preserved. Note that the result of the $item filter chain must be a Bool; it may be helpful to use the truthy filter to convert to a boolean.
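
For example, the pipeline inside a select can chain several filters; this sketch keeps every item that is not equal to "my":

list: "me my oh my" | split();
not-my: $list | [ item: $item | eq(to: "my") | not() ];

will output:

{
  "list": ["me", "my", "oh", "my"],
  "not-my": ["me", "oh"]
}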

Extending Scrapelect

Note: plugin/dylib loading is not yet implemented (see #32 for tracking). This chapter documents the process of writing a filter, which also applies to creating new built-in filters.

Writing a new filter

The easiest way to write a new filter is with the #[filter_fn] attribute macro.

/// Signature: value: ValueT | filter_name(arg: ArgT, ...): ReturnT
///
/// Description of what this filter does.
///
/// # Examples
///
/// It's helpful to include a list of examples here, with their outputs/effects.
#[filter_fn]
pub fn filter_name<'ast, 'doc, E: ElementContextView<'ast, 'doc> + ?Sized>(
    value: RustValueT,
    ctx: &mut E, // this can be omitted if you don't need it
    arg: RustArgT,
    // ...one parameter per named argument
) -> Result<PValue<'doc>> {
    todo!()
}

The Rust*T types must implement TryFromValue<Pipeline>, which allows type validation and automatic Value unwrapping. The return value must be rewrapped into the appropriate Value::Type variant.

With the #[filter_fn] proc macro, this function will be transformed into a fn() -> impl FilterDyn, where FilterDyn is the object-safe trait that represents a filter call.
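
As an illustration, here is a hypothetical uppercase filter built from the template above. This is a sketch, not a verified implementation: it assumes String values are represented as Arc<str> and rewrapped with Value::String, following the conventions described in this chapter.

#[filter_fn]
pub fn uppercase<'ast, 'doc, E: ElementContextView<'ast, 'doc> + ?Sized>(
    value: std::sync::Arc<str>,
    _ctx: &mut E,
) -> Result<PValue<'doc>> {
    // Uppercase the input string and rewrap it into a String value.
    Ok(Value::String(value.to_uppercase().into()))
}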

Registering a filter

TODO: this is not implemented because there is no dynamic loading.

To add a built-in filter in the scrapelect crate itself, add it to the build_map! macro in the interpreter::filter::builtin module.

Implementing Filter manually

Filter is the non-object-safe trait with typed Value and Args types. Its inherent function, Filter::apply, takes a Self::Value, Self::Args, and &mut impl ElementContextView<'_, '_>, and returns a Result<PValue>. Often, deriving the Args trait is sufficient to specify arguments; the derived implementation tries to deserialize Self from a BTreeMap<&str, EValue>. If you need more expressivity in arguments (e.g., for variadic functions), you may have to implement Args manually.

Implementing FilterDyn manually

All Filters implement the FilterDyn trait, the object-safe trait used for dynamic filter dispatch. You will rarely need to implement FilterDyn manually, but it may sometimes be necessary. Because FilterDyn takes &self, it is possible for a filter to carry state, but consider deeply whether this is truly necessary: filters can be called from anywhere, so you must reason carefully about the soundness of any filter state.

All FilterDyns registered with scrapelect's filter dispatch must also be Send, Sync, and 'static.

Contributing

scrapelect is an open-source project, and we're so excited that you're interested in contributing! Development happens on GitHub, where we use issues to track bugs and feature requests, discussions for questions and help, and pull requests for code and documentation contributions and review.

Reporting a bug

Please create a GitHub issue that contains the scrapelect program, relevant fragments of the input web page, and any error messages.

Contributing code changes

If you are adding a feature, consider discussing it on a GitHub feature request issue or discussion before opening a pull request, to develop the idea and see if there is community desire.

When you open a pull request, whether for a feature addition or a bug fix, make sure to lint your code with

$ cargo clippy -- --deny clippy::all --warn clippy::pedantic

as this will run in CI, and will block your PR from being merged on failure.

Additionally, make sure to format your code with cargo fmt, and make sure all tests pass with cargo test.

Adding a test

When you add a feature, it's also important to add tests for it. If it's an addition to the language, create at least one example input/scrp pair in the examples directory, and add it to the integration_test! macro in src/interpreter/mod.rs.

We use insta for snapshot testing, so run the test with cargo t; it will fail at first because there is no baseline to compare against. Run cargo insta review (you may have to cargo install cargo-insta first), and when the output looks correct, accept the snapshot, and make sure to check the examples/scrps/*.snap files into Git.

Writing a new built-in filter

See the section on writing a new filter in the extending scrapelect chapter. To add a new built-in filter, add it to src/interpreter/filter/builtin.rs, make sure to add documentation and examples in a doc comment, and add the filter name to the build_map! macro at the bottom of the file. It is very helpful to add an integration test that shows how the filter should work; see the section above for more.

Enhancing this book

This book is also developed in the scrapelect repo, and you can contribute to it without writing any code. The text of the book is in the doc/src/ folder, and you can edit each chapter's .md file to enhance the documentation and submit your changes as a pull request.

While developing, you can use mdbook serve --open to view a local copy of the book that updates with your changes (you may have to cargo install mdbook first).