Introducing "dude" - A web scraping framework inspired by Flask syntax

Dude is an easy-to-learn web scraping framework inspired by Flask syntax. You can build a web scraper in just a few lines of code.

Dude is available on GitHub: https://github.com/roniemartinez/dude

Yesterday, I open-sourced Dude and released it on PyPI. Dude is inspired by Flask's syntax and lets Python developers write a web scraper in just a few lines of code.

Like Flask, Dude has an easy-to-learn syntax. You can build a web scraper with just a single decorated function.

Here is a minimal example that scrapes all the links (URLs) from a web page.

from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

Features

The first version, 0.1.0, comes with the following features:

  • Simple Flask-inspired design - build a scraper with decorators.
  • Uses Playwright API - run your scraper in Chrome, Firefox, and WebKit, and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
  • Data grouping - group related scraping data.
  • URL pattern matching - run functions on specific URLs.
  • Priority - reorder functions based on priority.
  • Setup function - enable setup steps (clicking dialogs or login).
  • Navigate function - enable navigation steps to move to other pages.
  • Custom storage - option to save data in other formats or to a database.
  • Async support - write async handlers.
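
To make "URL pattern matching" and "priority" more concrete, here is a rough sketch of the idea in plain Python. It is an illustration only, not Dude's internals or its API: the Rule class, the applicable_rules function, and the rule that lower numbers run first are all assumptions made for this example.

```python
# Toy sketch of URL pattern matching and priority ordering.
# NOT Dude's internals; all names and the "lower number runs first"
# convention are assumptions made for this illustration.
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    handler: Callable[..., dict]
    url_pattern: str = ""  # an empty pattern matches every URL
    priority: int = 100  # assumption: lower numbers run first


def applicable_rules(rules: List[Rule], page_url: str) -> List[Rule]:
    """Keep the rules whose pattern is found in the URL, ordered by priority."""
    matching = [rule for rule in rules if re.search(rule.url_pattern, page_url)]
    # sorted() is stable, so rules with equal priority keep registration order
    return sorted(matching, key=lambda rule: rule.priority)


def get_links(element):
    return {"kind": "link"}


def get_about_heading(element):
    return {"kind": "about-heading"}


def get_title(element):
    return {"kind": "title"}


rules = [
    Rule(get_links, priority=2),
    Rule(get_about_heading, url_pattern=r"/about/", priority=1),
    Rule(get_title, priority=2),
]

for url in ("https://ron.sh/", "https://ron.sh/about/"):
    names = [rule.handler.__name__ for rule in applicable_rules(rules, url)]
    print(url, names)
```

In this sketch the about-page rule is skipped on the home page and moves to the front on the about page. For the actual decorator arguments, refer to the repository documentation.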

Installation

Dude is available on PyPI and can be easily installed using pip.

pip install pydude

In addition, you will need to install the browser binaries for Playwright, also from the command line.

playwright install

Basic Usage

As a demo, simply copy the minimal example to a file named example.py.

# example.py
from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

To start scraping, run the following from the terminal:

dude scrape --url "<replace-with-url>" example.py --output output.json

This should create a file named output.json containing your scraped data. For example, running this command against this website (https://ron.sh/) produces the following.

[
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 0,
    "url": "https://ron.sh"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 1,
    "url": "https://ron.sh/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 2,
    "url": "https://ron.sh/about/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 3,
    "url": "https://ron.sh/privacy-policy/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 4,
    "url": "https://ron.sh/terms-of-service/"
  },
  // ...more output here
]
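
Once the data is saved, it can be post-processed with ordinary Python. The sketch below is not part of Dude: it takes a shortened copy of the records above and collects the distinct URLs. The unique_urls helper is a hypothetical name, and treating URLs that differ only by a trailing slash as duplicates is a simplification chosen for this example.

```python
# Post-processing sketch for the scraped records (not part of Dude).
import json
from typing import List

# A shortened copy of the records shown above, written as valid JSON.
sample_output = """
[
  {"page_number": 1, "page_url": "https://ron.sh/", "group_id": 4492219664,
   "group_index": 0, "element_index": 0, "url": "https://ron.sh"},
  {"page_number": 1, "page_url": "https://ron.sh/", "group_id": 4492219664,
   "group_index": 0, "element_index": 1, "url": "https://ron.sh/"},
  {"page_number": 1, "page_url": "https://ron.sh/", "group_id": 4492219664,
   "group_index": 0, "element_index": 2, "url": "https://ron.sh/about/"},
  {"page_number": 1, "page_url": "https://ron.sh/", "group_id": 4492219664,
   "group_index": 0, "element_index": 3, "url": "https://ron.sh/privacy-policy/"}
]
"""


def unique_urls(records: List[dict]) -> List[str]:
    """Return the first occurrence of each URL, ignoring a trailing slash."""
    seen = set()
    ordered = []
    for record in records:
        url = record.get("url")
        if not url:
            continue  # skip records without a usable URL
        key = url.rstrip("/")  # simplification: "x" and "x/" count as the same
        if key not in seen:
            seen.add(key)
            ordered.append(url)
    return ordered


records = json.loads(sample_output)
links = unique_urls(records)
print(links)
```

To run this against a real scrape, load the file instead of the sample string, for example with json.load on an opened output.json.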

Advanced Usage

Details on several advanced features and how to use them can be found in the GitHub repository: https://github.com/roniemartinez/dude#advanced-usage

Final Thoughts

Dude is at a very early stage. I encourage everyone interested in web scraping to contribute by opening bug reports, feature requests, pull requests and discussions. Your help is highly appreciated!

