The Scrapelect Book
Welcome to the `scrapelect` book. `scrapelect` is a declarative web-scraping language, where you describe how to find data on a web page and how to filter and process that data, then get output in a structured, machine-readable format.

`scrapelect` is currently in development, and the language and interpreter are changing. This book aims to be up to date with the latest released version (currently v0.3.2). If something is inconsistent or incorrect, please consider submitting an issue or pull request to help improve the documentation.
Helpful links:
- GitHub repository: contains the source code for `scrapelect` (and this book).
- docs.rs: lists developer documentation for contributing to or extending the `scrapelect` interpreter, as well as user documentation for `scrapelect`'s built-in filters.
- GitHub issue tracker: the place to search and file issues to report bugs, request features, and ask questions.
Quick Start
Installation
Scrapelect requires the Rust toolchain, as it is installed with `cargo`. If you don't have the toolchain installed, you can get it with `rustup`. With Rust and `cargo` installed, run

$ cargo install scrapelect

to install the `scrapelect` interpreter.
Your first `scrp`
A `scrapelect` program is stored in a `name.scrp` file. Let's create and edit the file `article.scrp`:
title: .mw-page-title-main {
content: $element | text();
};
headings: .mw-heading > * {
content: $element | text();
}*;
This program describes the data on a web page by how to find it (by CSS selector), and what we want to do with it (get the text of the title and headings). A `scrapelect` program describes a certain web page, so this program works when the page's title is stored in an HTML element with the class `mw-page-title-main` and headings in elements with the class `mw-heading`. In this case, this will let us scrape Wikipedia articles.

After saving the file to `article.scrp`, let's run it on the Wikipedia entry for "cat":
$ scrapelect article.scrp "https://en.wikipedia.org/wiki/Cat"
I got an error like `command not found: scrapelect`!

This means the `scrapelect` executable is not in your `PATH`. By default, `cargo` installs binaries (on Linux) to `$HOME/.cargo/bin/scrapelect`. Try adding the directory `~/.cargo/bin` to your `PATH` if it is not already present. The `rustup` book may have more information, or try searching "add cargo binaries to PATH" for your operating system.
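On a typical Unix shell, that usually means adding a line like the following to your shell startup file (a sketch; adjust the syntax for your shell):

$ export PATH="$HOME/.cargo/bin:$PATH"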
Let's see the output for that `scrp`:
{
"headings": [
{ "content": "Etymology and naming" },
{ "content": "Taxonomy" },
{ "content": "Evolution" },
{ "content": "Domestication" },
{ "content": "Characteristics" },
{ "content": "Size" },
{ "content": "Skeleton" },
{ "content": "Skull" },
{ "content": "Claws" },
{ "content": "Ambulation" },
{ "content": "Balance" },
{ "content": "Coats" },
{ "content": "Senses" },
{ "content": "Vision" },
{ "content": "Hearing" },
{ "content": "Smell" },
{ "content": "Taste" },
{ "content": "Whiskers" },
{ "content": "Behavior" },
{ "content": "Sociability" },
{ "content": "Communication" },
{ "content": "Grooming" },
{ "content": "Fighting" },
{ "content": "Hunting and feeding" },
{ "content": "Play" },
{ "content": "Reproduction" },
{ "content": "Lifespan and health" },
{ "content": "Disease" },
{ "content": "Ecology" },
{ "content": "Habitats" },
{ "content": "Ferality" },
{ "content": "Impact on wildlife" },
{ "content": "Interaction with humans" },
{ "content": "Shows" },
{ "content": "Infection" },
{ "content": "History and mythology" },
{ "content": "Superstitions and rituals" },
{ "content": "See also" },
{ "content": "Notes" },
{ "content": "References" },
{ "content": "External links" }
],
"title": {
"content": "Cat"
}
}
We've collected the content in each heading in this article, as well as its title, with just that description. And it's easily parsable by other programs, too.
In the following chapters, we'll examine the language concepts and syntax so that you can create `scrp`s like this one and obtain structured data from any web page.
Interpreter CLI
Currently, the command-line interface of the `scrapelect` interpreter is very simple. The way to invoke the interpreter is:
$ scrapelect <scrp_path> <url>
where `scrp_path` is the path to the `.scrp` file, and `url` is a qualified, absolute URL. That is, `url` must start with a scheme (like `https://` or `file://`).

Currently, `scrapelect` supports the following URL schemes:
- `http://`, `https://`: a "typical" URL to a web page.
- `file://`: read a file on the local device, without using the internet.
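For example, to run the quick-start program on a saved local copy of a page (the file path here is hypothetical):

$ scrapelect article.scrp "file:///home/user/pages/cat.html"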
Language Concepts
A `scrapelect` program is a sequence of statements that describes how to turn a web page into structured data. This description includes where the data is located on the web page (elements and selectors) and how to process that data using filters to get the structured output that you desire.

`scrapelect` is in beta, so it's possible that changes to the language on the dev branch may not be reflected yet in this book. The information should be correct for the latest released version (v0.3.2). If the documentation is incorrect or could be improved, consider filing an issue or pull request, as `scrapelect` is an open source project.
Statements and Values
Statements
Statements are the basic building block of `scrapelect` programs. At its core, a statement is a binding `name: value;`. This means "store the value `value` into the name `name`" in the current context of the program. When the program finishes, the `scrapelect` interpreter will output structured data as defined by the statements in the program.
The simplest kind of statement
Take this short program, for example:
cat-says: "meow";
This program consists of one statement, which binds the name `cat-says` to the value `"meow"`. When `scrapelect` runs this program, it will output the following JSON:
{
"cat-says": "meow"
}
This is the core of a `scrapelect` program: assigning values to names to be outputted as a structure of data. But what exactly can a value be?
Values
In `scrapelect`, every value has a type. These types are currently:
- `Int`: an integer (such as -1, 0, or 48)
- `Float`: a floating-point (decimal) number (such as 1.0, 0.6931, or -7.29)
- `String`: a string of text characters (such as "hello!", "meow", or "" (the empty string))
- `Bool`: a Boolean value of true or false
- `List`: an ordered collection of other values (such as `[1, "hello!", 1.0]` or `[]` (the empty list))
- `Structure`: a nested structure of values, where each value is bound to a String key (like `{ greeting: "hi there" }` or `{}` (the empty structure))
- `Null`: represents a value of nothing
- `Element`: an HTML element that can be queried and scraped (see the note at the end of this chapter)
Constants
The simplest type of value is a constant. A constant is a value determined when writing the `scrapelect` program, not dependent on other variables or the state of execution. There are three ways to specify a constant:
- A string constant, wrapped in double quotes, such as `my-string: "hello!";`
- An integer constant as a number literal, such as `one: 1;`
- A floating-point constant as a number literal with a decimal point, such as `half: 0.5;`

Currently it is not possible to have a list or structure constant, but this may be added in future versions of `scrapelect`.
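Putting the three constant forms together, a short program like

my-string: "hello!";
one: 1;
half: 0.5;

will output:

{
  "my-string": "hello!",
  "one": 1,
  "half": 0.5
}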
Reading bindings
A value can also be created by reading the value of a previous binding, with `$name`, where `name` is the name of the previous binding.
Example
greeting: "hello!";
message: $greeting;
will output:
{
"greeting": "hello!",
"message": "hello!"
}
Note that `scrapelect` programs are executed in sequential order, so you can only read bindings that were defined above the statement that is using them. Also, only bindings in the current or more outer contexts can be accessed like this (which will be covered more in depth in the section on element contexts).

The next (and maybe most important) type of value is the `Element`, which will let us read data from a web page, explained in the next section.
Shadowing
It is possible to have two statements bind values to the same name in the same context. This is called shadowing. Only the last (bottommost) statement with a given name will appear in the program output, but at any point, referencing `$name` will give the most recently defined binding of that name.
Example
output: "Not me!";
output: "or me...";
// save $output at this point in time
snapshot: $output;
output: "I will be the final result!";
will output:
{
"output": "I will be the final result!",
"snapshot": "or me..."
}
Note: elements are only available inside an element context and are not outputted in the final result.
Elements and Selectors
An element is a special kind of Value that is used to scrape and get information about a part of a web page. An element represents an HTML element, such as `<img src="https://cdn2.thecatapi.com/images/edq.jpg" />` or `<h1>My title</h1>` or many others. An element value is only valid within an element context, a block of statements where the special binding `$element` contains the element.
Selecting an element
In `scrapelect`, we identify an element by its CSS selector. This can be simple (e.g., the tag name: `h1`), or arbitrarily complex, because selectors can be combined to identify any element. See the MDN CSS selector reference for a full guide on writing CSS selectors (but see also the note at the end of this chapter).
Common selector patterns:
- `tag`: Selects elements with the tag name `tag`: `<tag></tag>`
- `#id`: Selects the element with ID `id`: `<a id="id"></a>`
- `.class`: Selects elements with CSS class `class`: `<x class="class" />`
- `a#b.c.d`: Combine selectors (without whitespace) to select an element with all of these properties: `<a id="b" class="c d">...</a>`
- `a#b .c.d`: Combine selectors with whitespace to select an element inside a parent element: selecting the `span` in `<a id="b"><span class="c d">...</span></a>`
- ...and many more. See the MDN reference for more.
Creating an element context
An element context is a selector block with a list of statements that evaluates to a nested structure when interpreted. Inside the block, the binding `$element` provides access to the element specified by the selector.
Example
On the following fragment:
<a>Not special</a>
<a id="special">Special</a>
special: #special {
text: $element | text();
};
will output:
{
"special": {
"text": "Special"
}
}
Notice how all bindings in the element context are evaluated into a nested structure stored in `special`. (Note: `$element | text()` is calling the `text` filter on `$element`, which will be explained in the filters chapter, but it means "get the text inside the element `$element`".)
Nested contexts
Inside an element context, there are statements. And statements can bind names to element contexts. Thus, it is valid (and often useful) to have a nested element context. Inside a parent element context, a child element block's selector starts selecting elements inside the parent element: thus, in the following example, `calico` will be selected, but not `shi tzu`:
<ul id="cats">
<li>calico</li>
<!-- ... -->
</ul>
<ul id="dogs">
<li>shi tzu</li>
<!-- ... -->
</ul>
cat: #cats {
  type: li {
    content: $element | text();
  };
};
will output
{
"cat": {
"type": {
"content": "calico"
}
}
}
Scope
Inside an element context block, it is possible to read bindings from an outer context if they are declared above the current statement. However, if an item exists in a more inner context, it will shadow the outer one.
Example
context: "outer";
outer: "outer";
parent: parent {
context: "middle";
child: child {
context: $context;
outer: $outer;
};
};
outputs
{
"context": "outer",
"outer": "outer",
"parent": {
"child": {
"context": "middle",
"outer": "outer"
},
"context": "middle"
}
}
Note that it is not directly possible to read bindings declared in a context more inner than the current one, even if the block is above the current statement, since an element context block is evaluated into a structure. However, with filters like `take`, it is possible to read this data, just not with the binding-name syntax `$name`.
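For example, a minimal sketch (the selector `.info` and the page it matches are hypothetical):

info: .info {
  name: $element | text();
};
// read a binding out of the evaluated structure
name-copy: $info | take(key: "name");

Here `name-copy` receives the value that was bound to `name` inside the `info` block, via the `take` filter described below.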
Element lifetime
It is possible to rebind the value contained in `$element`. However, because an element is only valid inside the element context, these bindings will not be returned in the final output. In fact, any bindings that contain `$element` at the close of an element block will be omitted from the returned structure.
Example
child: a {
this: $element;
};
unexpected: $child | take(key: "this");
will output (where `child | take(key: "this")` means "return the value with key `"this"` in the `child` structure, and return `null` if it is not present"):
{
"child": {},
"unexpected": null
}
Note that `child` is an empty structure, even though it bound `this` to `$element`.
Selecting multiple elements: qualifiers
By default, an element block will only select the first element that matches a selector, and raise an error if it is not found. However, it is often useful to select all the elements that match a selector, or select one optional element, not raising an error if it does not exist. We can specify how many elements to expect with qualifiers. A qualifier is placed at the end of an element context block, and can be one of:
- (no qualifier): the default; selects the first element that matches this selector, and raises an error if there are none
- `?` (optional): similarly, selects the first element matching the selector, but the element context evaluates to `null` instead of erroring if there is no element matching that selector
- `*` (all): selects all elements matching this selector, evaluates the element block for each one, and places the results in a `List`
Examples
Take the document fragment:
<li>1</li>
<li class="even">2</li>
<li>3</li>
<li class="even">4</li>
Given the `scrp`:
// no qualifier (first)
first_num: li {
text: $element | text();
};
// * qualifier (all)
numbers: li {
text: $element | text();
}*;
// ? qualifier (optional)
optional: #not-here {
text: $element | text();
}?;
will output:
{
"first_num": { "text": "1" },
"numbers": [
{ "text": "1" },
{ "text": "2" },
{ "text": "3" },
{ "text": "4" }
],
"optional": null
}
Note: certain CSS features, like pseudo-classes and attribute selectors, are not currently supported in `scrapelect`.
Filters
Now that we've seen how to identify parts of a page to turn into data, let's look at how to manipulate that data. `scrapelect` does this using filters. We've seen a couple already, like `take` and `text`, but let's look at them more closely.
Every filter takes in a value and a list of named arguments (which can be empty), and returns the result of applying that filter. We call a filter with `value | filter_name(arg: value, ...)`. The simplest filter is the `id` filter, which takes the value and returns that same value: if we let `a: 5 | id();`, we get `{ "a": 5 }`.
Another useful filter is the `dbg` filter, which, like `id`, returns the original value, but prints the value to the terminal as well. `dbg` has an optional argument `msg`, a String, which specifies a message to include, if set. If it is not provided to the filter, it prints `debug message: ...`.
a: 1 | dbg();
b: 2 | dbg(msg: "from b");
will output to the console:
debug message: 1
from b: 2
and return { "a": 1, "b": 2 }
.
Modifying filters
Often, though, it is useful to use a filter that modifies the passed value in some way. One useful filter is `strip`, which trims leading and trailing whitespace from a string, which is often found inside HTML elements.
trimmed: " hellooooo " | strip();
outputs
{
"trimmed": "hellooooo"
}
Note that this doesn't mutate the original value passed in; that is, if you read another binding as the input to a filter, applying the filter will not change the value in the original binding:
bind: "5";
new: $bind | int();
results in (where `int` converts the value into the integer type):
{
"bind": "5",
"new": 5
}
where `bind` is still a string but `new` is an int.
Filter documentation
Documentation for all of the built-in filters is available at docs.rs, which lists filter signatures, descriptions, and examples.
Chaining filters
Filters are designed to be executed in a pipeline, passing the output of one filter to the input of another:
is-not-five: "5" | int() | eq(to: 5) | not();
outputs
{
"is-not-five": false
}
where `eq` returns whether the value is equal to `to`, and `not` returns the boolean NOT (opposite) of the input value.
Qualifiers
Similar to element blocks, filters can have qualifiers at the end. The same qualifiers (none for one, `?` for optional, and `*` for list) can be placed at the end of a filter call to modify its behavior:
- `value | filter(...)?` applies the filter to `value` if it is not `null`, or returns `null` if it is.
- `value | filter(...)*`, when `value` is a list, returns the list of every element in the list with the filter applied to it (similar to the map operation in other languages). It is an error to use the `*` qualifier if the input value is not a list.
Example
floats: "1 2.3 4.5" | split() | float()*;
optional: "3.4" | float()?;
optional2: null | float()?;
returns (where `split` turns a string into an array of strings split on a delimiter (by default, whitespace), and `float` turns a value into the float type):
{
"floats": [1.0, 2.3, 4.5],
"optional": 3.4,
"optional2": null
}
Advanced Features
`scrapelect` also contains features that make the language more expressive for selecting and manipulating data. Note that these can also increase the complexity of the program if used in excess, so it's recommended to only use these features as needed.
URL Recursion
Sometimes, a page contains links to another subpage, and it's necessary to follow that link to obtain the desired data. With `scrapelect`'s URL recursion, it's possible to capture this pattern and select elements from a linked page.
Let's take the following pages as an example:
https://your-url.com/index.html
<!DOCTYPE html>
<html>
<!-- ... -->
<body>
<p id="story">
There once lived a great animal, which was great and also an animal.
</p>
<a id="next" href="page2.html">Continue</a>
</body>
</html>
https://your-url.com/page2.html
<!DOCTYPE html>
<html>
<!-- ... -->
<body>
<p id="story">
This animal, which was great, was a great animal. The end.
</p>
</body>
</html>
Let's say we want to get both chapters of this lovely book. With URL recursion, we can!
next-page-link: #next {
link: $element | take(key: "href");
} | take(key: "link");
page-1: #story {
content: $element | text();
};
page-2: <$next-page-link> #story {
content: $element | text();
};
By specifying the URL before the selector in the `page-2` element block, we tell the `scrapelect` interpreter to read the page from the URL stored in `next-page-link` and select `#story` from that document. Thus, this will output:
{
  "next-page-link": "page2.html",
  "page-1": { "content": "There once lived a great animal, which was great and also an animal." },
  "page-2": { "content": "This animal, which was great, was a great animal. The end." }
}
Both relative URLs (like `page2.html` and `/from-page-root.html`) and absolute URLs (like `https://your-url.com/page1.html`) are supported.
Note that the URL to recurse on is actually an inline value (more in the next section), so it is valid to have a filter chain, and the URL that `scrapelect` will use is the result of the filter pipeline. The final type of the value must be a String (recursion over lists of strings is not currently supported).
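For example, a sketch that trims any stray whitespace from the captured link with the `strip` filter before recursing:

page-2: <$next-page-link | strip()> #story {
  content: $element | text();
};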
Inline values
As above, an inline value is a value and filter chain enclosed in diamond brackets: `<value | filter() | filter() | ...>`, and it can be used in most places where a value is expected (filter arguments, URL recursion; it is not supported in a `value: <inline>` expression because the diamond brackets would be superfluous). It is equivalent to writing `intermediate: (inline-contents);` and then using `$intermediate` in place of the inline. The difference, though, is that inline evaluations are not returned in the final output of a block.
Example
result: 5 | is_in(list: <"1 2 3 4 5" | split() | int()*>);
prints
{
"result": true
}
and its equivalent
intermediate: "1 2 3 4 5" | split() | int()*; // [1, 2, 3, 4, 5]
result: 5 | is_in(list: $intermediate);
prints
{
"intermediate": [1, 2, 3, 4, 5],
"result": true
}
It is often more expressive to avoid inline values, and an intermediate binding is more efficient when you need to use the same calculation multiple times. However, inlines are useful to hide intermediate evaluations that are only used once.
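For instance, when the same list is needed in two different filter calls, binding it once avoids writing (and evaluating) the same pipeline twice:

valid: "1 2 3 4 5" | split() | int()*;
a-ok: 4 | is_in(list: $valid);
b-ok: 2 | is_in(list: $valid);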
Additionally, note that it is not valid to start an element context inside an inline value. If you need to do this, create an intermediate binding.
Select filters
When you have a list, it is often useful to filter it so that it only contains elements that have some property. This is not possible to express with the `*` qualifier and filters alone, but `scrapelect` has a special kind of filter: the select filter.
The syntax of this filter is `list | [name: value (| filters() | ...)]`. `name` is any identifier (usually `item`), and the `scrapelect` interpreter will provide `$name` as each item in the list while evaluating the `value` pipeline. The final result of this pipeline must be a Bool, and it determines whether to keep the item in the output list: if it is `true`, the item is kept; if `false`, it is discarded.
That may be a little abstract, so let's see an example:
// ["me", "my", "oh", "my"]
list: "me my oh my" | split();
m-words: "me my myself mother mom meow" | split();
// select all items that are equal to "oh"
oh: $list | [ item: $item | eq(to: "oh") ];
// select all items that are in our list of m words
only-ms: $list | [ item: $item | is_in(list: $m-words) ];
nothing: $list | [ item: $item | eq(to: "wow") ];
will output:
{
"list": ["me", "my", "oh", "my"],
"m-words": ["me", "my", "myself", "mother", "mom", "meow"],
"oh": ["oh"],
"only-ms": ["me", "my", "my"],
"nothing": []
}
The order of the original items is preserved. Note that the result of the `$item` filter chain must be a Bool; it may be helpful to use the `truthy` filter to convert to a boolean.
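As another sketch, built only from filters introduced earlier, the pipeline inside a select filter can be chained like any other; this keeps every number except 2:

nums: "1 2 3" | split() | int()*;
not-two: $nums | [ item: $item | eq(to: 2) | not() ];

which should output `"not-two": [1, 3]`.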
Extending Scrapelect
Note: plugin/dylib loading is not yet implemented (see #32 for tracking). This chapter documents the process of writing a filter, which is also applicable for creating new builtin filters.
Writing a new filter
The easiest way to write a new filter is with the `#[filter_fn]` attribute macro.
/// Signature: value: ValueT | filter_name(arg: ArgT, ...): ReturnT
///
/// Description of what this filter does.
///
/// # Examples
///
/// It's helpful to include a list of examples here, with their outputs/effects.
pub fn filter_name<'ast, 'doc, E: ElementContextView<'ast, 'doc> + ?Sized>(
    value: RustValueT,
    ctx: &mut E, // this can be omitted if you don't need it
    arg: RustArgT,
    // ...
) -> Result<PValue<'doc>> {
    todo!()
}
The `Rust*T` types must implement `TryFromValue<Pipeline>`, which allows type validation and automatic `Value` unwrapping. The return value must be rewrapped into `Value::Type`.
With the `#[filter_fn]` proc macro, this function will be transformed into a `fn() -> impl FilterDyn`, where `FilterDyn` is the object-safe trait that represents a filter call.
Registering a filter
TODO: this is not implemented because there is no dynamic loading.
To add a built-in filter in the `scrapelect` crate itself, add it to the `build_map!` macro in the `interpreter::filter::builtin` module.
Implementing `Filter` manually
`Filter` is the non-object-safe trait that has typed `Value` and `Args` types. Its inherent function, `Filter::apply`, takes a `Self::Value`, a `Self::Args`, and an `&mut impl ElementContextView<'_, '_>`, and returns a `Result<PValue>`. Often, deriving the `Args` trait is sufficient to specify arguments, but for finer-grained control (e.g., for variadic functions), you can implement `Args` manually, which tries to deserialize `Self` from a `BTreeMap<&str, EValue>`.
Implementing `FilterDyn` manually
All `Filter`s implement the `FilterDyn` trait, which is the object-safe trait used for dynamic filter dispatch. It is usually not necessary to implement `FilterDyn` manually, but it may sometimes be. Because `FilterDyn` takes an `&self`, it is possible to have filter state, but consider deeply whether this is truly necessary, as filters can be called from anywhere, so you must reason about the soundness of your filter state. All `FilterDyn`s registered with `scrapelect`'s filter dispatch must also be `Send`, `Sync`, and `'static`.
Contributing
`scrapelect` is an open-source project, and we're so excited that you're interested in contributing! Development happens on GitHub, where we use issues to track bugs and feature requests, discussions for help and questions, and pull requests for code and documentation contributions and review.
Reporting a bug
Please create a GitHub issue that contains the `scrapelect` program, relevant fragments of the input web page, and error messages, if they exist.
Contributing code changes
If you are adding a feature, consider discussing it on a GitHub feature request issue or discussion before opening a pull request, to develop the idea and see if there is community desire.
When you open a pull request, whether for a feature addition or a bug fix, make sure to lint your code with

$ cargo clippy -- --deny clippy::all --warn clippy::pedantic

as this will run in CI, and a failure will block your PR from being merged.
Additionally, make sure to format your code with `cargo fmt`, and make sure all tests pass with `cargo test`.
Adding a test
When you add a feature, it's also important to add tests for it. If it's an addition to the language, create at least one example input/scrp pair in the `examples` directory, and add it to the `integration_test!` macro in `src/interpreter/mod.rs`.
We use `insta` for snapshot testing, so run the test with `cargo t`; it will fail at first because there is no baseline to compare it to. Run `cargo insta review` (you may have to `cargo install cargo-insta`), and when the output looks correct, accept the snapshot, and make sure to check the `examples/scrps/*.snap` files into git.
Writing a new built-in filter
See the section on writing a new filter in the Extending `scrapelect` chapter. To add a new builtin filter, add it to `src/interpreter/filter/builtin.rs`, make sure to add documentation and examples in a doc comment, and add the filter name to the `build_map!` macro at the bottom of the file. It is very helpful to add an integration test that shows how this filter should work; see the section above for more.
Enhancing this book
This book is also developed in the `scrapelect` repo, and you can contribute to it without having to write any code. The text of the book is in the `doc/src/` folder, and you can edit each `chapter.md` file to enhance the documentation and submit your changes as a pull request. While you are developing, you can use `mdbook serve --open` to view a local copy of the book that will update with your changes (you may have to `cargo install mdbook`).