Python Dictcomp Pipelines in Examples


PYTHON PROGRAMMING

Pipelines process tasks one after another. Photo by Daniel Schludi on Unsplash

This article is motivated by a task I contributed to in a real-life project a couple of years ago. After proposing the concept of comprehension pipelines, I noticed that the solution could be nicely implemented using a dictcomp pipeline, with the additional help of the OptionalBool data structure I proposed in yet another article.

This article aims to show you how we can implement such a pipeline. I will go into some of the details, so that the code becomes clear and convincing. You can consider it a case study showing the implementation of a dictcomp pipeline.


We have already discussed the power of generator pipelines in Python:

Building Generator Pipelines in Python

Later on, I proposed a Python-specific concept of comprehension pipelines:

Building Comprehension Pipelines in Python

Comprehension pipelines are a generalization of generator pipelines. The two work similarly, but while a generator pipeline yields its results as a generator, a comprehension pipeline can output its results as any type of comprehension (a minimal sketch of all four flavors follows the list below):

  • a generator, from a generator pipeline
  • a list, from a listcomp pipeline
  • a dictionary, from a dictcomp pipeline
  • a set, from a setcomp pipeline
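
To make the distinction concrete, here is a minimal sketch of the same two-step pipeline expressed as each of these four comprehension types. The read() and parse() stages are toy stand-ins of my own, not the functions we will build below:

documents = ["doc_a.txt", "doc_b.txt"]

def read(doc):
    # Toy reader stage: pretend to read the document's text.
    return f"text of {doc}"

def parse(text):
    # Toy parser stage: pretend to apply some filtering logic.
    return "a" in text

gen = (parse(read(doc)) for doc in documents)       # generator pipeline
lst = [parse(read(doc)) for doc in documents]       # listcomp pipeline
dct = {doc: parse(read(doc)) for doc in documents}  # dictcomp pipeline
st = {parse(read(doc)) for doc in documents}        # setcomp pipeline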

In the above article, I focused on presenting how such pipelines work and how to construct them; when doing so, I used somewhat simplistic examples. Here, we will work through a more sophisticated example, built around a dictcomp pipeline.

The task: Document filtering

Imagine your company has a large number of Standard Operating Procedures (SOPs). They are in a mess, organized using an old and out-of-date system. You need to filter them using a particular keyword; for instance, you need to find which of them include the word "Python".

Surely, such filtering can be far more complex, so you're expected to write a program that will make it possible to change the filtering logic in the near future. What's more, for the moment the standards are stored as local files on a shared drive that you can access from your local machine. However, this should change soon, so you need to make the reading logic easy to change, too.

As mentioned above, our filtering task is also simple. The next task, however, may require more advanced parsing logic, such as extracting particular fields or tables. Again, the general mechanism will be similar; we would simply need to rewrite the parser. You should take this into account in the implementation.

Implementation

We can describe the algorithm for the task as follows:

  • We have a list (or another sequence) of documents; each document from documents can be read into a string (how exactly depends on the type of the documents). In our first case, this will be a list of paths to the files, but the documents could be accessed in another way, for example, from a database.
  • After reading a document, parse the text and apply the filtering logic. In our example, this boils down to checking if the text contains the word "Python". If it does, return True; otherwise, return False.
  • Do the above step for all the documents from documents.
  • As output, return a dictionary with string representations of the documents as keys (paths as strings, in our example) and these Boolean values as the corresponding values.

The code block below shows the contents of a dictcomp_pipeline module, implementing the above logic in a general way.

# dictcomp_pipeline.py

from pathlib import Path
from collections.abc import Sequence

from typing import Any, Optional

# Type aliases
Paths = Sequence[Path]
KeywordArgs = Optional[dict[str, Any]]

def read_text(path: Path) -> str:
    """Read text from path and return a string."""
    return path.read_text()

def parse_text(text: str, word: str) -> bool:
    """Parse text from string and return a bool value."""
    return word.lower() in text.lower()

def run_dictcomp_pipeline(
    documents: Any,
    read_text_kwargs: KeywordArgs = None,
    parse_text_kwargs: KeywordArgs = None) -> dict[str, bool]:
    read_text_kwargs = read_text_kwargs or {}
    parse_text_kwargs = parse_text_kwargs or {}

    texts = {
        doc: read_text(doc, **read_text_kwargs)
        for doc in documents
    }
    return {
        str(doc): parse_text(text, **parse_text_kwargs)
        for doc, text in texts.items()
    }

In Appendix 1, you will find this code with extended docstrings that explain some critical details, which we also cover below.

The generalization lies in two aspects: the way the data reader is implemented and the way the data parser is implemented. We will cover these aspects in the next section, discussing each of the three functions in turn.

Functions

Reading data

This particular implementation takes one argument, path, a pathlib.Path instance. Although we have implemented the read_text() function in a particular way, the pipeline function (run_dictcomp_pipeline()) is not fixed to this very implementation. You can re-implement read_text() to meet your needs; it can, for instance, read documents from PDF files, web pages, or a database (a sketch of such a variant follows the list below).

You are free to change the function, but you have to keep several things unchanged:

  • The function must take an element of documents from run_dictcomp_pipeline() as its first argument. It's passed as a positional argument, so you can name it however suits the task the function is to accomplish. In this particular implementation, it's path.
  • If the function takes more arguments, they must work as keyword arguments and should be the same for all runs of read_text().
  • The function should return the text as a string.
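
As an illustration, here is a hypothetical drop-in replacement for read_text() that reads documents from the web rather than from local files. It is a sketch, not part of the dictcomp_pipeline module; it assumes the documents are URLs, and the timeout keyword argument is my own addition:

from urllib.request import urlopen

def read_text(url: str, timeout: float = 10.0) -> str:
    """Read text from a web page and return it as a string."""
    # The first (positional) argument is still the document identifier;
    # timeout works as a keyword argument, the same for all documents.
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

Since the first argument remains positional and everything else works as a keyword argument, run_dictcomp_pipeline() could use this version without any changes, for example with read_text_kwargs={"timeout": 5.0}.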

Parsing data

Although in our task we look for the word "Python" in the documents, the pipeline enables the user to look for any word, thanks to the signature of the parse_text() function. It takes two arguments, text and word, both strings. If word is found in text, the function returns True; otherwise, it returns False.

This parsing logic ignores the case of words, which is done in a simple way via the .lower() string method for both the text and the word searched for.

As was the case with read_text(), you can change the logic of the function as well as its signature, again keeping in mind the following conditions (a sketch follows the list):

  • The first argument, used as a positional argument, is the text returned by read_text(), as str.
  • If the function takes more arguments, they should be the same for all runs of parse_text(). These additional arguments must work as keyword arguments.
  • The function returns a Boolean value.
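
For instance, a hypothetical variant of parse_text() could look for any of several words, matched as whole words via a regular expression. Again, this is a sketch of my own, not part of the dictcomp_pipeline module:

import re
from collections.abc import Sequence

def parse_text(text: str, words: Sequence[str]) -> bool:
    """Return True if text contains any of words, as whole words."""
    # The first (positional) argument is still the text from read_text();
    # words works as a keyword argument, the same for all documents.
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in words
    )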

The pipeline

The run_dictcomp_pipeline() function is implemented in such a way that the two functions discussed above can have varying implementations, depending on the data source and the parsing logic. This works because their arguments are provided as keyword arguments, by unpacking the corresponding dictionaries.

One limitation, as mentioned in the two subsections above, is that any additional arguments must have the same value for all documents. If you find this too restrictive, you would need to reimplement the run_dictcomp_pipeline() function, at the cost of additional complexity.
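
Here is how a call to the pipeline could look for our task; the sops/ directory is a hypothetical location of the SOP files:

from pathlib import Path
from dictcomp_pipeline import run_dictcomp_pipeline

# Collect the documents: here, paths to local text files.
paths = sorted(Path("sops").glob("*.txt"))

# word is unpacked into each call to parse_text() as a keyword argument.
has_python = run_dictcomp_pipeline(
    paths,
    parse_text_kwargs={"word": "Python"},
)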

Unit tests: The pipeline in action

To check whether the pipeline works, let's implement some unit tests, using pytest. I will skip unit tests for the read_text() and parse_text() functions here, but feel free to add them as an exercise.

And here we have a test that checks whether the pipeline works as expected.

# test_dictcomp_pipeline.py

import pathlib
import pytest

from dictcomp_pipeline import run_dictcomp_pipeline

@pytest.fixture
def files():
    n_files = 11
    paths = [pathlib.Path(".") / f"txt_file_{i}.txt"
             for i in range(n_files)]
    for i, path in enumerate(paths):
        text = "Shout Bamalama!nI'm an elephant, and so what?nn"
        if i % 2 == 0:
            text = f"{text}Python"
        path.write_text(text)
    yield paths
    for path in paths:
        path.unlink()

def test_run_dictcomp_pipeline(files):
    isPython = run_dictcomp_pipeline(
        files,
        parse_text_kwargs={"word": "Python"}
    )
    assert isPython == {
        'txt_file_0.txt': True,
        'txt_file_1.txt': False,
        'txt_file_2.txt': True,
        'txt_file_3.txt': False,
        'txt_file_4.txt': True,
        'txt_file_5.txt': False,
        'txt_file_6.txt': True,
        'txt_file_7.txt': False,
        'txt_file_8.txt': True,
        'txt_file_9.txt': False,
        'txt_file_10.txt': True
    }

Let's see what the test does. The files fixture creates 11 text files, the contents of six of which include the word "Python". These files are created when test_run_dictcomp_pipeline() is invoked. Then the test runs the pipeline function and asserts that the output is as expected. Afterwards, the test files are removed, which you can see in the last two lines of the fixture's code.

After running the test (for instance, with pytest test_dictcomp_pipeline.py), you should see a thumbs-up from pytest.

A pipeline integrates several functionalities into one, so the above test for our pipeline function can be considered, at least to some extent, an integration test.

Even with very many documents, a dictionary should be fine for keeping the output: its type is dict[str, bool], and such a dictionary does not require too much memory. Surely, you may wish to process the results lazily anyway; in that case, you should revise the pipeline function into a generator pipeline. One solution would be to make the generator yield values of type tuple[str, bool], that is, tuples with a string representing a document and the corresponding Boolean value informing whether the parsing function found the word searched for. A sketch of this variant follows.
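
Here is a sketch of such a generator-pipeline variant, reusing read_text(), parse_text(), and the KeywordArgs alias from the dictcomp_pipeline module:

from collections.abc import Iterator
from typing import Any

from dictcomp_pipeline import KeywordArgs, parse_text, read_text

def run_generator_pipeline(
    documents: Any,
    read_text_kwargs: KeywordArgs = None,
    parse_text_kwargs: KeywordArgs = None
    ) -> Iterator[tuple[str, bool]]:
    """Like run_dictcomp_pipeline(), but yield (document, result) tuples."""
    read_text_kwargs = read_text_kwargs or {}
    parse_text_kwargs = parse_text_kwargs or {}

    # Both stages are generator expressions, so nothing is read
    # or parsed until the resulting generator is consumed.
    texts = (
        (doc, read_text(doc, **read_text_kwargs))
        for doc in documents
    )
    return (
        (str(doc), parse_text(text, **parse_text_kwargs))
        for doc, text in texts
    )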

Extending the example

In the example above, we used a bool value, but in some tasks this could be too limiting. For instance, you may want to take into account that not all documents are SOPs. In other words, you have a number of documents, some of which are SOPs and some are not; the task is to check whether a document is an SOP, and if it is, to check whether it contains a particular word, phrase, or a number of words/phrases. In such a case, you may use a richer data structure, such as the OptionalBool data structure and type, proposed in this article:

An OptionalBool Type for Python: None, False or True

In our task, an OptionalBool value of None would mean the corresponding document is not an SOP; False, that it is an SOP but does not contain the searched phrase(s); and True, that it is an SOP and contains the searched phrase(s).

This article was motivated by a real-life example in which I faced a situation similar to this one. Back then, I implemented the solution completely differently, but today I'd definitely consider doing it using OptionalBool and a dictcomp pipeline. In order to use OptionalBool, however, the signature of run_dictcomp_pipeline() needs to change a little, as instead of bool we work with OptionalBool.

You will find the revised version of the code, including the test file, in Appendix 2.

Conclusion

We have discussed a real-life example of using a dictcomp pipeline. While the task we implemented was rather simple, the solution is more general: it enables the user to re-implement the two functions constituting the pipeline, without any need to change the pipeline function itself.

You may have noticed that the dictcomp pipeline we have implemented simply looks like a dictionary comprehension. That's because it simply is a dictionary comprehension, just like a generator pipeline is a generator. The pipeline is hidden in what is being done: input → function → function → … → function → output. In our case, the pipeline was very short, but in many other scenarios it can contain many more steps; a short sketch of a longer pipeline follows.
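
For illustration, here is a toy dictcomp pipeline with three stages; all the stage functions are hypothetical stand-ins:

def read(doc):
    # Toy stage 1: read the document's raw text.
    return f" raw text of {doc} "

def clean(text):
    # Toy stage 2: normalize the text.
    return text.strip().lower()

def parse(text):
    # Toy stage 3: apply the filtering logic.
    return "python" in text

documents = ["Python doc", "other doc"]
results = {doc: parse(clean(read(doc))) for doc in documents}
# {'Python doc': True, 'other doc': False}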

I wanted to show you in what sort of scenarios such a comprehension pipeline could work. This particular task, however, could be implemented in various ways. Which to choose should largely depend on what sort of code you want to produce. If you need a mere implementation of a particular task, there is no need to generalize functions the way we did. If you aim to write a framework to be used by others, you would likely generalize it even more.

Appendix 1

The code of run_dictcomp_pipeline() from the first example, with full docstrings:

# dictcomp_pipeline.py

from pathlib import Path
from collections.abc import Sequence

from typing import Any, Optional

# Type aliases
Paths = Sequence[Path]
KeywordArgs = Optional[dict[str, Any]]

def read_text(path: Path) -> str:
    """Read text from path and return a string.

    You can rewrite this function to read from another source.
    The function must return a string, but it can take any
    number of keyword arguments. The first argument must
    work as positional, and it must represent a document
    from the `documents` sequence passed to `run_dictcomp_pipeline()`.
    """
    return path.read_text()

def parse_text(text: str, word: str) -> bool:
    """Parse text from string and return a bool value.

    You can rewrite this function to use different
    parsing logic. The function must return a bool,
    but it can take any number of keyword arguments.
    The first argument must remain unchanged, and must
    work as a positional argument.
    """
    return word.lower() in text.lower()

def run_dictcomp_pipeline(
    documents: Any,
    read_text_kwargs: KeywordArgs = None,
    parse_text_kwargs: KeywordArgs = None) -> dict[str, bool]:
    """Run dictcomp pipeline.

    The function does not handle exceptions: if anything goes
    wrong, the pipeline breaks and the corresponding exception
    is raised.

    Args:
        documents (Any): sequence of documents to read; in our
            case, a sequence of paths (`Paths`) to files with
            the documents
        read_text_kwargs (KeywordArgs, optional): dictionary with
            keyword arguments to be used in each call to `read_text()`,
            if needed. Defaults to None, meaning that no additional
            arguments are passed.
        parse_text_kwargs (KeywordArgs, optional): dictionary with
            keyword arguments to be used in each call to `parse_text()`,
            if needed. Defaults to None, meaning that no additional
            arguments are passed.

    Returns:
        dict[str, bool]: dictionary with the output of
            the pipeline; its values represent the result of
            the parsing logic applied to the documents
    """
    read_text_kwargs = read_text_kwargs or {}
    parse_text_kwargs = parse_text_kwargs or {}

    texts = {
        doc: read_text(doc, **read_text_kwargs)
        for doc in documents
    }
    return {
        str(doc): parse_text(text, **parse_text_kwargs)
        for doc, text in texts.items()
    }

Appendix 2

In this Appendix, you will find the extended code of the solution that works with OptionalBool values. You will also need the code for the OptionalBool class, located in the optionalbool module, which you can copy-paste from the following article:

An OptionalBool Type for Python: None, False or True

Here is the code:

Python"># optionalbool_dictcomp_pipeline.py

from pathlib import Path
from collections.abc import Sequence

from typing import Any, Optional

from optionalbool import OptionalBool

# Type aliases
Paths = Sequence[Path]
KeywordArgs = Optional[dict[str, Any]]

def read_text(path: Path) -> str:
    """Read text from path and return a string."""
    return path.read_text()

def parse_text(text: str,
               word: str,
               standards_phrases: Sequence[str]
               ) -> OptionalBool:
    """Parse text from string and return a bool value."""
    if not any(phrase.lower() in text.lower() for phrase in standards_phrases):
        return OptionalBool(None)
    return OptionalBool(word.lower() in text.lower())

def run_dictcomp_pipeline(
    documents: Any,
    read_text_kwargs: KeywordArgs = None,
    parse_text_kwargs: KeywordArgs = None
    ) -> dict[str, OptionalBool]:
    read_text_kwargs = read_text_kwargs or {}
    parse_text_kwargs = parse_text_kwargs or {}

    texts = {
        doc: read_text(doc, **read_text_kwargs)
        for doc in documents
    }
    return {
        str(doc): parse_text(text, **parse_text_kwargs)
        for doc, text in texts.items()
    }

And the test file:

# test_optionalbool_dictcomp_pipeline.py

import pathlib
import pytest

from optionalbool_dictcomp_pipeline import run_dictcomp_pipeline

from optionalbool import OptionalBool

@pytest.fixture
def files():
    n_files = 11
    paths = [pathlib.Path(".") / f"doc_file_{i}.txt"
             for i in range(n_files)]
    for i, path in enumerate(paths):
        text = "Shout Bamalama!nI'm an elephant, and so what?nn"
        if i % 2 == 0:
            text = f"{text}Python"
        if i % 3 != 0:
            text = (
                "This is a Standard Operating Proceduren"
                f"{text}"
            )
        path.write_text(text)
    yield paths
    for path in paths:
        path.unlink()

def test_run_dictcomp_pipeline(files):
    standards_phrases = ["Standard Operating Procedure", "SOP",]
    isPython = run_dictcomp_pipeline(
        files,
        parse_text_kwargs={"word": "Python",
                           "standards_phrases": standards_phrases}
    )
    for v in isPython.values():
        assert isinstance(v, OptionalBool)
    assert isPython == {
        'doc_file_0.txt': None,
        'doc_file_1.txt': False,
        'doc_file_2.txt': True,
        'doc_file_3.txt': None,
        'doc_file_4.txt': True,
        'doc_file_5.txt': False,
        'doc_file_6.txt': None,
        'doc_file_7.txt': False,
        'doc_file_8.txt': True,
        'doc_file_9.txt': None,
        'doc_file_10.txt': True
    }

Thanks for reading. If you enjoyed this article, you may also enjoy other articles I wrote.