Writing a Custom Rasa Entity Extractor for Regular Expressions

I recently wrote a library that contains custom Rasa components. I named the project rasam, short for "Rasa Improved".

Because Rasa currently does not ship an accurate entity extractor based on regular expressions, I wrote one based on @naoko's code. The source code is available in the rasam repository, and this post explains how it works.

First, Rasa follows a Component Lifecycle. Each component is fed training data during training and needs to save whatever model it generates. In the case of regular expressions, we don't need to generate or save any data aside from the regular expressions themselves.

Training

During training, we only need to make a copy of the regular expressions.

    def train(self, training_data: TrainingData, config: Optional[RasaNLUModelConfig] = None, **kwargs: Any,) -> None:
        self.regex_features = training_data.regex_features

Persist

Because the regular expressions can only be accessed during training, we need to save these into the model. A JSON file will be enough for these.

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        if self.regex_features:
            file_name = file_name + ".json"
            utils.write_json_to_file(os.path.join(model_dir, file_name), self.regex_features)
            return {"file": file_name}
        return {"file": None}
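To make the round trip concrete, here is a minimal sketch of the same idea using only the standard library instead of Rasa's `utils` helpers. The function names `save_patterns` and `load_patterns`, the file name, and the sample patterns are assumptions made up for illustration; the only thing taken from the component is that each entry is a dict with `name` and `pattern` keys.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def save_patterns(model_dir: str, file_name: str, patterns: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Write the name/pattern dicts to a JSON file and return the metadata to keep."""
    if not patterns:
        # Nothing to save, mirror the {"file": None} convention from persist()
        return {"file": None}
    file_name = file_name + ".json"
    with open(os.path.join(model_dir, file_name), "w", encoding="utf-8") as handle:
        json.dump(patterns, handle, indent=2)
    return {"file": file_name}


def load_patterns(model_dir: str, meta: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Read the patterns back, or return an empty list if none were saved."""
    file_name = meta.get("file")
    if not file_name:
        return []
    with open(os.path.join(model_dir, file_name), encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical sample data in the same shape the extractor uses
patterns = [
    {"name": "zipcode", "pattern": r"\b[0-9]{5}\b"},
    {"name": "order_id", "pattern": r"\bORD-[0-9]{4}\b"},
]

with tempfile.TemporaryDirectory() as model_dir:
    meta = save_patterns(model_dir, "regex_entity_extractor", patterns)
    restored = load_patterns(model_dir, meta)
```

Only plain strings are written to disk. Compiled pattern objects are not JSON-serializable, which is one reason to compile after loading rather than before saving.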

Loading

Before we can use our model in action, our component needs to load the regular expressions that we saved. The constructor compiles each regular expression once, so that matching is faster when messages are processed.

class RegexEntityExtractor(EntityExtractor):
    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        self.regex_features = []
        # component_config may be None, so fall back to an empty dict
        for regex_feature in (component_config or {}).get("regex_features", []):
            regex_feature["compiled"] = re.compile(regex_feature["pattern"])
            self.regex_features.append(regex_feature)

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional[Metadata] = None,
        cached_component: Optional[Component] = None,
        **kwargs: Any,
    ) -> "EntityExtractor":
        file_name = meta.get("file")
        regex_features = []
        if file_name:
            regex_features = utils.read_json_file(os.path.join(model_dir, file_name))  # type: ignore
        meta["regex_features"] = regex_features
        return cls(meta)
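The compile-once step can be tried outside Rasa. The sketch below assumes the same `name` and `pattern` keys as above; the helper name `compile_features` and the sample text are hypothetical. Unlike the constructor, it copies each dict before adding the compiled pattern, so the caller's data is left untouched.

```python
import re
from typing import Any, Dict, List


def compile_features(regex_features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the features with a precompiled pattern attached."""
    compiled = []
    for feature in regex_features:
        entry = dict(feature)  # shallow copy, the input dict is not modified
        entry["compiled"] = re.compile(feature["pattern"])
        compiled.append(entry)
    return compiled


raw_features = [{"name": "zipcode", "pattern": r"\b[0-9]{5}\b"}]
features = compile_features(raw_features)

# The compiled pattern can now be reused for every incoming message
match = features[0]["compiled"].search("Ship to 90210 please")
```

A useful side effect of compiling at load time is that an invalid pattern raises `re.error` when the model is loaded, rather than on the first user message.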

Processing of message

The process method is where our component does its actual work. Rasa calls this method and passes the Message object containing the text from the user. Each match is converted to a tuple and stored in a set, so duplicate matches are dropped. The matches are then merged with any entities already extracted by preceding components in the Rasa pipeline, and the combined list is sorted in descending order of confidence.

    def process(self, message: Message, **kwargs: Any) -> None:
        matches = set()
        for regex_feature in self.regex_features:
            for match in regex_feature["compiled"].finditer(message.text):
                matches.add(
                    tuple(
                        {
                            "start": match.start(),
                            "end": match.end(),
                            "value": match.group(0),
                            "entity": regex_feature["name"],
                            "extractor": self.name,
                            "confidence": 1.0,
                        }.items()
                    )
                )
        entities = message.get("entities", []) + list(map(dict, matches))  # type: ignore

        message.set(
            "entities", sorted(entities, key=lambda x: x.get("confidence", 0), reverse=True), add_to_output=True,
        )
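The deduplicate, merge, and sort logic can be exercised without a Rasa `Message`. This is a rough sketch under a few assumptions: the function name `extract_entities`, the sample text, the `city` entity, and its confidence of 0.72 are all made up for illustration, and plain dicts stand in for the message's entity list.

```python
import re
from typing import Any, Dict, List, Optional


def extract_entities(
    text: str,
    regex_features: List[Dict[str, str]],
    existing: Optional[List[Dict[str, Any]]] = None,
    extractor: str = "RegexEntityExtractor",
) -> List[Dict[str, Any]]:
    """Find regex matches, drop duplicates, and merge with earlier entities."""
    matches = set()
    for feature in regex_features:
        for match in re.finditer(feature["pattern"], text):
            # Dicts are not hashable, so store each match as a tuple of items
            matches.add(
                tuple(
                    {
                        "start": match.start(),
                        "end": match.end(),
                        "value": match.group(0),
                        "entity": feature["name"],
                        "extractor": extractor,
                        "confidence": 1.0,
                    }.items()
                )
            )
    entities = list(existing or []) + [dict(item) for item in matches]
    # Highest confidence first; entities without a confidence sort last
    return sorted(entities, key=lambda e: e.get("confidence", 0), reverse=True)


features = [
    {"name": "zipcode", "pattern": r"\b[0-9]{5}\b"},
    {"name": "zipcode", "pattern": r"\b9[0-9]{4}\b"},  # overlaps with the first on purpose
]
# Hypothetical entity left behind by an earlier pipeline component
existing = [{"entity": "city", "value": "Beverly Hills", "start": 8, "end": 21, "confidence": 0.72}]

entities = extract_entities("Ship to Beverly Hills 90210", features, existing)
```

Both patterns match the same span with the same entity name, so the set keeps a single `zipcode` entry. Because regex matches carry a confidence of 1.0, that entry sorts ahead of the lower-confidence `city` entity.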

Convenience

rasam provides custom components that can be easily used in your Rasa projects.

To install rasam, just use pip.

pip install rasam

And in your Rasa config.yml, use the RegexEntityExtractor like below:

pipeline:
  - name: rasam.RegexEntityExtractor

You can check out the GitHub repository and give it a star!
