How to write a custom URL extractor for Rasa

Rasa ships with a list of components that you can use in your pipeline. If you need to extract entities, chances are people will advise you to use DucklingHTTPExtractor. This component relies on Facebook's Duckling.

Since Duckling is a separate service, it might be overkill if you only need simple named-entity recognition (NER). For example, if you just need to extract URLs from a message, running a separate service is not practical, particularly if your chatbot runs on the cheapest, least powerful EC2 instance.

Fortunately, Rasa allows creating custom extractors. To do this, we subclass rasa.nlu.extractors.extractor.EntityExtractor. To extract URLs from text, we can use the URLExtract library. Because this custom extractor does not require prior training, we only need to implement the process method.

from typing import Any, Dict, Optional, Text

import urlextract
from rasa.nlu.extractors.extractor import EntityExtractor
from rasa.nlu.training_data import Message


class URLEntityExtractor(EntityExtractor):
    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        self.extractor = urlextract.URLExtract()

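    def process(self, message: Message, **kwargs: Any) -> None:
        # Store matches as hashable tuples so identical entries are de-duplicated.
        urls = set()
        # Resume each search where the previous match ended, so a URL that
        # appears more than once gets the offsets of its own occurrence.
        last_pos = 0
        for url in self.extractor.gen_urls(message.text):
            start = message.text.find(url, last_pos)
            end = start + len(url)
            last_pos = end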
            urls.add(
                tuple(
                    {
                        "start": start,
                        "end": end,
                        "value": url,
                        "entity": "URL",
                        "extractor": self.name,
                        "confidence": 1.0,
                    }.items()
                )
            )
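        # Append the URL entities, ordered by position, to whatever earlier
        # components in the pipeline have already extracted.
        entities = message.get("entities", []) + list(
            sorted(map(dict, urls), key=lambda x: x.get("start"))  # type: ignore
        )

        # Write the combined list back, highest confidence first.
        message.set(
            "entities",
            sorted(entities, key=lambda x: x.get("confidence", 0), reverse=True),
            add_to_output=True,
        )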
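
The offset bookkeeping in process is easier to see outside of Rasa. The following is a minimal sketch that assumes a simple regular expression as a stand-in for URLExtract (it only matches http and https URLs and is much less thorough than the real library). The function name find_url_entities and the sample sentence are made up for illustration.

```python
import re
from typing import Dict, List, Union

# Assumption: a deliberately simple pattern standing in for URLExtract.
# It only recognises http(s) URLs and stops at whitespace or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

Entity = Dict[str, Union[int, float, str]]


def find_url_entities(text: str, extractor_name: str = "URLEntityExtractor") -> List[Entity]:
    """Return one Rasa-style entity dict per URL occurrence in ``text``."""
    entities: List[Entity] = []
    last_pos = 0
    for url in URL_PATTERN.findall(text):
        # Search from the end of the previous match so that a repeated URL
        # is reported at its own position rather than the first one's.
        start = text.find(url, last_pos)
        end = start + len(url)
        last_pos = end
        entities.append(
            {
                "start": start,
                "end": end,
                "value": url,
                "entity": "URL",
                "extractor": extractor_name,
                "confidence": 1.0,
            }
        )
    return entities


sample = "Compare https://example.com and https://example.com please"
for entity in find_url_entities(sample):
    print(entity["start"], entity["end"], entity["value"])
```

Without last_pos, both occurrences in the sample would be reported at the same start offset, because str.find always returns the first match from its starting position.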

I recently open-sourced the above code in a Python library rasam, short for Rasa Improved. To install rasam, run the following command:

pip install rasam

To use rasam in your Rasa project, add this to your config.yml:

pipeline:
  - name: rasam.URLEntityExtractor
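
Once the extractor is in the pipeline, the URLs show up in the entities list of the parsed message, each labelled "URL". As a rough sketch of how you might pull them out, here is a small helper. The name urls_from_parse_result is hypothetical, and the sample dictionary is hand-written for illustration (including the share_link intent), not output captured from a real model.

```python
from typing import Any, Dict, List


def urls_from_parse_result(result: Dict[str, Any]) -> List[str]:
    """Collect the values of all entities labelled "URL" in a parse result."""
    return [
        entity["value"]
        for entity in result.get("entities", [])
        if entity.get("entity") == "URL"
    ]


# Hand-written sample mirroring the entity dicts built in process();
# the intent name and confidence are placeholders.
sample_result = {
    "text": "check https://example.com",
    "intent": {"name": "share_link", "confidence": 0.9},
    "entities": [
        {
            "start": 6,
            "end": 25,
            "value": "https://example.com",
            "entity": "URL",
            "extractor": "URLEntityExtractor",
            "confidence": 1.0,
        }
    ],
}

print(urls_from_parse_result(sample_result))
```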
