Simple loop parallelization in Python

Sometimes you write a loop over tasks that could easily be parallelized. The usual suspects are workloads that spend their time waiting on I/O, such as calls to third-party API services.

Since Python 3.2, the standard library has included an easy tool for this kind of job. The concurrent.futures module provides thread and process pools for executing tasks in parallel. For older Python versions, a backport library exists.
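
As a minimal sketch of the module in action, the snippet below runs a function over a list on a thread pool with executor.map(). The fetch_length function and the names list are hypothetical stand-ins for a real I/O-bound call and its inputs.

```python
import concurrent.futures
import time


def fetch_length(name: str) -> int:
    """Hypothetical stand-in for an I/O-bound call such as an API request."""
    time.sleep(0.05)  # Simulate waiting on the network
    return len(name)


names = ["alice", "bob", "carol"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() runs the calls on the pool and yields results in input order
    lengths = list(executor.map(fetch_length, names))

print(lengths)  # prints [5, 3, 5]
```

map() is handy when you only need the results in input order. submit() gives you one future per task, which allows finer control over each job.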

Consider a loop that waits on RPC traffic, where the RPC endpoint has a wide enough pipe to handle multiple calls simultaneously:

def import_all(contract: Contract, fname: str):
    """Import all entries from a given CSV file."""

    for row in read_csv(fname):
        # This function performs multiple RPC calls
        # and waits between the calls
        import_invoicing_address(contract, row)

You can create a thread pool that runs tasks on N worker threads. Each task submitted to the pool is wrapped in a future that represents the pending call to the worker function. Each thread keeps consuming tasks from the queue until all of the work is done.

import concurrent.futures

def import_all_pooled(contract: Contract, fname: str, workers=32):
    """Parallelized CSV import."""

    # Run the futures within this thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:

        # Stream incoming data and build futures.
        # The execution of futures begins right away and the executor
        # does not wait for the loop to be completed.
        futures = [executor.submit(import_invoicing_address, contract, row) for row in read_csv(fname)]

        # This print may be slightly delayed, as futures start executing as soon as the pool begins to fill,
        # eating your CPU time
        print("Executing total", len(futures), "jobs")

        # Wait for the executor to complete each future. Note that the timeout covers the
        # whole as_completed() iteration (180 seconds in total), not each individual job.
        for idx, future in enumerate(concurrent.futures.as_completed(futures, timeout=180.0)):
            res = future.result()  # This will also raise any exceptions
            print("Processed job", idx, "result", res)
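
In the loop above, the first exception raised by future.result() aborts the whole import. If you would rather record failures and keep going, one possible variation is sketched below. The import_row worker and the sample rows are hypothetical and only stand in for import_invoicing_address and the CSV data.

```python
import concurrent.futures


def import_row(row: dict) -> str:
    """Hypothetical worker: fails for rows without an address."""
    if not row.get("address"):
        raise ValueError("missing address")
    return row["address"].upper()


def import_rows_collecting_errors(rows, workers=4):
    """Run import_row() for every row; return (results, errors) keyed by row index."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Remember which future belongs to which row
        future_to_idx = {executor.submit(import_row, row): idx for idx, row in enumerate(rows)}
        for future in concurrent.futures.as_completed(future_to_idx):
            idx = future_to_idx[future]
            try:
                results[idx] = future.result()
            except Exception as exc:  # Record the failure and keep going
                errors[idx] = exc
    return results, errors


rows = [{"address": "Main St 1"}, {"address": ""}, {"address": "Side Rd 2"}]
results, errors = import_rows_collecting_errors(rows)
print("ok:", sorted(results), "failed:", sorted(errors))  # prints ok: [0, 2] failed: [1]
```

Keeping a dictionary from future to row index makes it possible to tell which input failed, since as_completed() yields futures in completion order rather than submission order.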

If the work is not CPU intensive, Python’s infamous Global Interpreter Lock (GIL) will not become an issue either, because a thread that is blocked waiting on I/O releases the lock and lets the other threads run.
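
A rough way to see this for yourself is to time a batch of waiting jobs on a thread pool. In the sketch below, wait_for_reply is a hypothetical job that only sleeps, which is assumed here to behave like a blocking network read as far as the interpreter lock is concerned.

```python
import concurrent.futures
import time


def wait_for_reply(seconds: float) -> float:
    """Hypothetical I/O-bound job; time.sleep() releases the interpreter lock while waiting."""
    time.sleep(seconds)
    return seconds


jobs = [0.2] * 8  # Eight jobs that each wait 0.2 seconds

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    waited = sum(executor.map(wait_for_reply, jobs))
elapsed = time.perf_counter() - start

# The jobs wait 1.6 seconds in total, but the waits overlap across threads
print(f"Total waiting {waited:.1f} s, wall clock {elapsed:.2f} s")
```

With eight workers the wall clock time should land close to the duration of a single job rather than the sum of all of them. A CPU-bound function would not see the same gain from threads.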
