Speed up your code with parallel computing in Python (code included)

Utpal Kumar   8 minute read      


Introduction

With the increasing amount of data, parallel computing is quickly becoming a necessity. Modern computers come with more than one core, but we most often use only a single one for our tasks. Parallel computing is a way of dividing the work among all (or a selected number) of the system's cores to achieve speedups.

Python by itself is slow. Fortunately, there are several libraries in Python that can help in performing parallel computations, and some that can speed up even a single-threaded job. This post will discuss the basics of the parallel computing libraries, such as multiprocessing (and threading) and joblib. After reading this article, I hope you will feel more confident about this topic.


Threading Module

Python comes with the threading API, which allows us to run different parts of a program concurrently. It spawns one or more new threads within the existing process, and these threads can share memory. Many people confuse threading with multiprocessing. To keep things simple, remember that threading is not strictly parallel computation, even though it appears to be (and it can definitely speed up your program): the threads may be scheduled across multiple cores, but only one of them executes Python code at a time, because the Global Interpreter Lock (GIL) limits the interpreter to running a single thread at any moment. One may think that multiprocessing therefore sounds better, but multiprocessing always comes with some extra overhead, as the system needs to spawn separate processes and set them up for the task. Processes also do not share memory by default (though shared memory can be set up with some effort).

Now, let us see an example that uses threading to speed up the task.
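A minimal sketch along those lines (the exact script from the original post is not reproduced here, so the variable names are my own) could look like this:

import time
import threading

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")

sleepTimes = [0.1, 1, 0.3]

Stime = time.perf_counter()

## spawn one thread per task
threads = [threading.Thread(target=my_awesome_function, args=(sleep,)) for sleep in sleepTimes]

for t in threads:
    t.start()

## wait for all the threads to finish
for t in threads:
    t.join()

print(f"Finished in {time.perf_counter()-Stime:.2f}")

Because time.sleep releases the GIL, the threads wait concurrently, and the total run time is close to the longest single task (~1 second) instead of the ~1.4 seconds of a sequential run.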

Using concurrent.futures
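The concurrent.futures module provides a higher-level interface on top of threading. A minimal sketch using its ThreadPoolExecutor (again assuming the same my_awesome_function and sleepTimes as above, with the function returning its message):

import time
import concurrent.futures

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)
    return f"function with {sleepTime} time finished"

sleepTimes = [0.1, 1, 0.3]

Stime = time.perf_counter()

## the context manager waits for all submitted tasks to finish
with concurrent.futures.ThreadPoolExecutor() as executor:
    tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
    for ff in concurrent.futures.as_completed(tasks):
        print(ff.result())

print(f"Finished in {time.perf_counter()-Stime:.2f}")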

Multiprocessing Module

Python has a built-in parallel computing module, multiprocessing, that allows us to write parallel code by spawning processes on our system. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. It offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes. For details, visit the Python docs. The notable things to remember are how to start a process and how to exchange objects between processes (via queues and pipes for inter-process communication; a small queue example is sketched below).
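For example, a Queue is one simple way to get data back from a process; a small sketch (not from the original post):

import multiprocessing

def worker(q):
    ## send a result back to the parent through the queue
    q.put("result from worker")

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    print(q.get())  ## blocks until the worker puts a result
    p.join()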

In this post, we will have a look at the essentials of this module. Before getting into the nitty-gritty of this module, let us import it and check the number of cores on your computer.

import multiprocessing

print("Number of CPUs : ", multiprocessing.cpu_count())

This outputs Number of CPUs : 12.

import time

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)


## run the function sequentially, one call after another
Stime = time.perf_counter()
my_awesome_function(0.1)
my_awesome_function(1)
my_awesome_function(0.3)
print(f"Finished in {time.perf_counter()-Stime:.2f}")

The above code shows a sequential run of a dummy function named my_awesome_function. All this function does is sleep for a specified number of seconds; it simply stands in for any function that takes sleepTime seconds of computation. The run time for the above code is 1.40 seconds (0.1 + 1 + 0.3), as expected.

The problem with this sequential style of coding (also called running the function synchronously) is that the total run time grows linearly with the number of computations or tasks. We can mitigate this by employing the other CPUs in our computer.

For parallel computing, we can break these three tasks across three processes. Unlike the sequential run in the previous case, multiprocessing runs all the tasks at the same time.

import time
import multiprocessing

## NOTE: on platforms that spawn new processes (Windows, macOS), the
## process-creation code below should be guarded by `if __name__ == "__main__":`

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")


Stime = time.perf_counter()
p1 = multiprocessing.Process(target=my_awesome_function)
p2 = multiprocessing.Process(target=my_awesome_function)
p3 = multiprocessing.Process(target=my_awesome_function)

## start the process
p1.start()
p2.start()
p3.start()

## script will continue while the process starts...

print(f"Finished in {time.perf_counter()-Stime:.2f}")

This will output:

Finished in 0.01
function with 0.1 time finished
function with 0.1 time finished
function with 0.1 time finished

The output may look strange, but it is expected: the start call does not block, so the script printed the elapsed time and finished while the three processes were still running. We can wait for the processes by using the join method.

import time
import multiprocessing


def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")


Stime = time.perf_counter()
p1 = multiprocessing.Process(target=my_awesome_function)
p2 = multiprocessing.Process(target=my_awesome_function)
p3 = multiprocessing.Process(target=my_awesome_function)

## start the process
p1.start()
p2.start()
p3.start()

p1.join()
p2.join()
p3.join()

print(f"Finished in {time.perf_counter()-Stime:.2f}")

This will output:

function with 0.1 time finished
function with 0.1 time finished
function with 0.1 time finished
Finished in 0.11

Great!! Now we are able to make our script run in parallel and achieve speedups. But what if the function takes one or more arguments, or produces some output that we want to store or use? How can we deal with such cases?

Using Process
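The exact script is not reproduced in this excerpt, but a sketch of it, passing each sleep time to the target function through the args keyword of Process, could look like this:

import time
import multiprocessing

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")

sleepTimes = [0.1, 1, 0.3]

if __name__ == "__main__":
    Stime = time.perf_counter()

    ## spawn one process per task, passing the argument via `args`
    processes = [multiprocessing.Process(target=my_awesome_function, args=(sleep,))
                 for sleep in sleepTimes]

    for p in processes:
        p.start()

    ## wait for all the processes to finish
    for p in processes:
        p.join()

    print(f"Finished in {time.perf_counter()-Stime:.2f}")

Collecting return values from a bare Process is less direct (it requires a Queue, a Pipe, or a Pool); the concurrent.futures interface below handles both arguments and return values much more conveniently.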

If you compare the time taken by this script with that of the threading version, you will notice that both take almost the same time; because these dummy tasks merely sleep (an I/O-like wait), threading does just as well here.

Using concurrent.futures with submit
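A minimal sketch of such a call (assuming my_awesome_function now returns its message instead of printing it):

import time
import concurrent.futures

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)
    return f"function with {sleepTime} time finished"

sleepTimes = [0.1, 1, 0.3]

if __name__ == "__main__":
    Stime = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        ## submit returns a Future for each task
        tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
        for ff in concurrent.futures.as_completed(tasks):
            print(ff.result())
    print(f"Finished in {time.perf_counter()-Stime:.2f}")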

The above call of the ProcessPoolExecutor with the context manager will, by default, use as many worker processes as there are CPUs on the system. This can leave the system completely occupied until the assigned tasks are finished. I found that using only about 80% of the cores works well for performance while leaving some capacity for other tasks. We can specify this with the max_workers argument of the ProcessPoolExecutor.

...
max_workers = int(0.8*multiprocessing.cpu_count())
all_results = []
with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
    tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]

    for ff in concurrent.futures.as_completed(tasks):
        all_results.append(ff.result())

print(all_results)
...

Using concurrent.futures with map

The executor.map function works just like the built-in Python map function: it takes a function as its first argument and an iterable as its second argument.
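A sketch of the previous example rewritten with executor.map (reusing my_awesome_function, sleepTimes, and max_workers from above):

...
with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
    ## apply the function to every element of sleepTimes
    results = executor.map(my_awesome_function, sleepTimes)

    for res in results:
        print(res)
...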

Unlike as_completed, which yields results as the tasks finish, executor.map returns the results in the same order as the inputs.

Threading vs Multiprocessing

We now have a basic understanding of speeding up our programs using multiprocessing and threading, and of some differences between them. The main difference is that threads execute in a shared memory space, while processes have independent memory spaces. Threading runs tasks concurrently (the control flow is usually non-deterministic, i.e., the tasks may not execute in the same order every time), while multiprocessing runs tasks in parallel, truly at the same time (subject to the number of available cores). Both can achieve significant speedups, and one is preferable over the other depending on the nature of the tasks.

Another important difference between threading and multiprocessing comes from the Global Interpreter Lock (GIL). For threading, Python gives us one GIL shared by all threads; for multiprocessing, each process gets its own GIL. To see this clearly, I suggest you execute the threading and multiprocessing code above and check the CPU usage on your computer. You will see that threading (even with hundreds of tasks) cannot push CPU usage beyond 100% of a single core; that is because of the GIL.

However, the question arises: which is the better way? There is no single answer; it depends strictly on your program. Programs that spend much of their time waiting for external events (I/O-bound tasks), such as web crawling or reading and writing data, are generally good candidates for threading. Multiprocessing is a good alternative for tasks that require heavy CPU computation (CPU-bound tasks) and spend little time waiting for external events; such tasks, for example intensive mathematical computations, may not run any faster with threading.

Joblib Module

Joblib provides a set of tools for lightweight pipelining in Python. It is fast and optimized for numpy arrays, so we can employ it for several data science applications. The main focus of joblib is to avoid repetitive computation and to persist results to disk; hence it has significant advantages for large data computations.

Here, we look at the parallel computation functionality of the joblib library. We use a simple function, function_to_test, and make it sleep for 0.1 sec to imitate functions that take longer to compute. We make this dummy function slightly more complex by giving it two required arguments. We use the number of CPUs on the system as the number of parallel processes; in my case, cpu_count is 12.

We use time.perf_counter() to measure the time taken by the parallel computation and by the sequential computation.
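The original script is not shown in this excerpt, but a minimal sketch of such a comparison, under the assumptions above (a two-argument dummy function named function_to_test, 12 parallel processes, and a 30x10 grid of inputs giving the 300 computations discussed below; the "Numpy" label in the output suggests the original sequential version stored its results in a numpy array, which is assumed here), might look like this:

import time
import numpy as np
from joblib import Parallel, delayed

def function_to_test(arg1, arg2):
    time.sleep(0.1)  ## imitate a function that takes ~0.1 sec to compute
    return arg1 * arg2

num_jobs = 12  ## number of CPUs on the system

## parallel computation with joblib
Stime = time.perf_counter()
results = Parallel(n_jobs=num_jobs)(
    delayed(function_to_test)(i, j) for i in range(30) for j in range(10)
)
print(f"Joblib finished in {time.perf_counter()-Stime:.2f} secs")

## sequential computation for comparison, collected into a numpy array
Stime = time.perf_counter()
results_seq = np.array([function_to_test(i, j) for i in range(30) for j in range(10)])
print(f"Numpy finished in {time.perf_counter()-Stime:.2f} secs")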

Joblib finished in 2.98 secs
Numpy finished in 30.95 secs

The total number of computations in this example is 30*10 = 300, and each function call takes ~0.1 seconds, so the sequential computation is expected to take around 300*0.1 = 30 seconds. The parallel algorithm can do much better: since I have 12 CPUs, joblib divides the task among 12 processes, giving an expected run time of about 30/12 = 2.5 seconds. The difference between this expected time (2.5 sec) and the actual time taken (2.98 sec) is the overhead associated with parallel computation. This overhead is the reason parallel computation is not recommended for small amounts of work.

Multiprocessing vs mpi4py

The multiprocessing module is useful for parallelizing Python scripts on shared memory systems, but it does not support parallelization over multiple compute nodes (distributed memory systems). To distribute tasks across multiple nodes, we can use MPI in Python via the mpi4py module. I will cover the advantages and disadvantages of using the mpi4py module in future posts.

