Introducing "dude" - A web scraping framework inspired by Flask syntax

Dude is available on GitHub: https://github.com/roniemartinez/dude

Yesterday, I open-sourced Dude and released it on PyPI. Dude is inspired by Flask's syntax and lets Python developers write a web scraper in just a few lines of code.

Like Flask, Dude has an easy-to-learn syntax. You can build a web scraper with just a single decorated function.

Here is a minimal example that scrapes all the links (URLs) from a website.

from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

Features

The first version, 0.1.0, comes with the following features:

Simple Flask-inspired design - build a scraper with decorators.
Uses Playwright API - run your scraper in Chromium, Firefox and WebKit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
Data grouping - group related scraping data.
URL pattern matching - run functions on specific URLs.
Priority - reorder functions based on priority.
Setup function - enable setup steps (clicking dialogs or login).
Navigate function - enable navigation steps to move to other pages.
Custom storage - option to save data to other formats or to a database.
Async support - write async handlers.
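
To make the decorator idea concrete, here is a toy sketch of how a Flask-style handler registry can work in plain Python. This is not Dude's actual implementation: the names `register`, `HANDLERS` and `run_handlers` are hypothetical, and page elements are modeled as plain dictionaries instead of Playwright element handles.

```python
from typing import Any, Callable, Dict, List, Tuple

Element = Dict[str, Any]
Handler = Callable[[Element], Dict[str, Any]]

# Hypothetical registry: (selector, handler) pairs in registration order.
HANDLERS: List[Tuple[str, Handler]] = []


def register(selector: str) -> Callable[[Handler], Handler]:
    """Toy stand-in for a selector decorator: remember the handler for later."""

    def wrapper(func: Handler) -> Handler:
        HANDLERS.append((selector, func))
        return func  # Return the function unchanged so it stays callable.

    return wrapper


@register(selector="a")
def get_link(element: Element) -> Dict[str, Any]:
    # Assumption: elements are dicts here, not Playwright element handles.
    return {"url": element["href"]}


def run_handlers(elements_by_selector: Dict[str, List[Element]]) -> List[Dict[str, Any]]:
    """Call every registered handler on the elements matching its selector."""
    results = []
    for selector, handler in HANDLERS:
        for element in elements_by_selector.get(selector, []):
            results.append(handler(element))
    return results


# A fake "page" keyed by selector, standing in for a real browser query.
fake_page = {"a": [{"href": "https://ron.sh/"}, {"href": "https://ron.sh/about/"}]}
scraped = run_handlers(fake_page)
```

The decorator only records the function; the matching and calling happen later in a separate runner, which is what keeps each handler down to a single decorated function.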

Installation

Dude is available on PyPI and can be easily installed using pip.

pip install pydude

In addition, you will need to install the browser binaries for Playwright, also from the command line.

playwright install

Basic Usage

As a demo, simply copy the minimal example to a file named example.py.

# example.py
from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

To start scraping, run the following from the terminal:

dude scrape --url "<replace-with-url>" example.py --output output.json

This should create a file named output.json containing your scraped data. As an example, running this command against this website produces the following.

[
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 0,
    "url": "https://ron.sh"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 1,
    "url": "https://ron.sh/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 2,
    "url": "https://ron.sh/about/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 3,
    "url": "https://ron.sh/privacy-policy/"
  },
  {
    "page_number": 1,
    "page_url": "https://ron.sh/",
    "group_id": 4492219664,
    "group_index": 0,
    "element_index": 4,
    "url": "https://ron.sh/terms-of-service/"
  },
  // ...more output here
 ]
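
Once the data is saved, it can be post-processed with ordinary Python. The sketch below collects the unique URLs from records shaped like the output above. The helper name `unique_urls` is hypothetical, and the records are inlined as a trimmed sample (only two of the keys shown above) so the snippet runs without a real output.json.

```python
import json
from typing import Dict, List

# Trimmed inline sample using the "element_index" and "url" keys from the
# output above. Assumption: a real output.json holds a list of such records.
SAMPLE = """
[
  {"element_index": 0, "url": "https://ron.sh"},
  {"element_index": 1, "url": "https://ron.sh/"},
  {"element_index": 2, "url": "https://ron.sh/about/"},
  {"element_index": 3, "url": "https://ron.sh/about/"}
]
"""


def unique_urls(records: List[Dict]) -> List[str]:
    """Return each URL once, keeping the order of first appearance."""
    seen = set()
    ordered = []
    for record in records:
        url = record["url"]
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


records = json.loads(SAMPLE)
urls = unique_urls(records)
print(urls)
```

To run this against a real file, replace the `json.loads(SAMPLE)` line with `json.load` on an opened output.json. Note that "https://ron.sh" and "https://ron.sh/" are treated as different strings here; normalizing trailing slashes is left out on purpose.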

Advanced Usage

Details on several advanced features and how to use them can be found in the GitHub repository: https://github.com/roniemartinez/dude#advanced-usage

Final Thoughts

Dude is at a very early stage. I encourage everyone interested in web scraping to contribute by opening bug reports, feature requests, pull requests and discussions. Your help is highly appreciated!


Ronie Martinez
