SPEED-UP YOUR CODES BY PARALLEL COMPUTING IN PYTHON

Utpal Kumar   6 minute read   
   visitor badge  

Parallel computing is quickly becoming a necessity. Modern computers comes with more than one process and we most often only use single process to do most of our tasks. Parallel computing is a way of diving the tasks into all the processes available on the system and achieve speed ups.

Introduction

With the increasing amount of data, parallel computing is quickly becoming a necessity. Modern computers come with more than one cores, but we most often only use a single process to do most of our tasks. Parallel computing is a way of dividing the tasks into all or selected number of the system cores and achieving speedups.

Python by itself is slow. But fortunately there are several libraries in Python that can help in performing parallel computations and some to just speed up the single thread job. This post will discuss the basics of the parallel computing libraries, such as multiprocessing (and Threading), and joblib. After reading this article, I hope that you would be able to feel more confident on this topic.

The right way to loop in Python

What is the fastest and most efficient way to loop in Python. We found that the numpy is fastest and python builtin is the most memory efficient

Threading Module

Python comes with the threading API, that allows us to have different parts of the program run concurrently. Many people confuse threading with multiprocessing. To make things simples, we can remember that threading is not strictly parallel computation as it appears to (though it will definitely give speed up to your program). The threading maybe running on your multiple processes but only one task will be done at a time. One may think that in that case, multiprocessing sounds better. But multiprocessing always comes with some extra overhead as your system needs to spawn different processes and make it ready for the task. But threading also has limitation because of the GIL in Python that limits it to run only one thread at a time.

Now, let us see an example that uses threading to speedup the task.

Using concurrent.futures

Multiprocessing Module

Python has an in-built parallel computing module multiprocessing to allow us to write parallel codes by spawning processes on our system. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. It offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes. For details, visit python docs. The notable things to remember are: how to start a process, how to exchange objects between processes (includes queuing and piping for communication between processes),

There is a difference between multiprocessing and threading. threading run the tasks concurrently, however, multiprocessing run the tasks parallely at the same time. Both of them can achieve significant improvement in the speed and one is preferable over other depending on the nature of the tasks.

In this post, we will have a look at the essentials of this module. Before getting into the nitty-gritty of this module, let us import it and check the number of cores on your computer.

import multiprocessing

print("Number of CPUs : ", multiprocessing.cpu_count())

This outputs Number of CPUs : 12.

import time
import multiprocessing

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)


Stime = time.perf_counter()
my_awesome_function(0.1)
my_awesome_function(1)
my_awesome_function(0.3)
print(f"Finished in {time.perf_counter()-Stime:.2f}")

The above code shows the sequential run of a dummy function named my_awesome_function. All this function does is sleep for a specified number of seconds. This kind of function is simply a representation of a function that takes sleepTime amount of time to achieve the computation. The run time for the above code is 1.40 seconds as expected.

The problem with sequential kind of coding (also called running the function synchronously) is that with more number of computations or tasks, the amount of time required to finish it is linearly increasing. We can mitigate such issues by employing the other CPUs in our computer.

For parallel computing, we can break these three tasks into 3 processes. And unlike running sequentially in the previous case, multiprocessing run all the tasks at the same time.

import time
import multiprocessing

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")


Stime = time.perf_counter()
p1 = multiprocessing.Process(target=my_awesome_function)
p2 = multiprocessing.Process(target=my_awesome_function)
p3 = multiprocessing.Process(target=my_awesome_function)

## start the process
p1.start()
p2.start()
p3.start()

## script will continue while the process starts...

print(f"Finished in {time.perf_counter()-Stime:.2f}")

This will output:

Finished in 0.01
function with 0.1 time finished
function with 0.1 time finished
function with 0.1 time finished

In the above script, we may think the output is strange. But the script finished first while the three processes were starting. We can avoid this by using the join method.

import time
import multiprocessing


def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"function with {sleepTime} time finished")


Stime = time.perf_counter()
p1 = multiprocessing.Process(target=my_awesome_function)
p2 = multiprocessing.Process(target=my_awesome_function)
p3 = multiprocessing.Process(target=my_awesome_function)

## start the process
p1.start()
p2.start()
p3.start()

p1.join()
p2.join()
p3.join()

print(f"Finished in {time.perf_counter()-Stime:.2f}")

This will output:

function with 0.1 time finished
function with 0.1 time finished
function with 0.1 time finished
Finished in 0.11

Great!! Now we are able to make our script run in parallel and achieve speedups. But what if our script gives some output that we want to store or use, or what if the script takes one or more arguments. How can we deal with such cases?

Using process

If you compare the time taken by this script with that of the threading, you can notice that the time taken by both are almost same.

Using concurrent.futures with submit

The above call of the ProcessPoolExecutor with the context manager will use all the threads on the system. This leaves the system completely occupied until the assigned tasks are finished. I found that using only 80% of the threads works great for performance as well as it leaves some memory for other tasks. We can specify 80% of the thread using max_workers=int(0.8*multiprocessing.cpu_count()) as argument for the ProcessPoolExecutor .

Using concurrent.futures with map

The executor.map function works just like the Python map function. The takes in a function as first argument and an iterable as second argument.

The results obtained using executor.map function is synchronous.

Threading vs Multiprocessing

We now have a basic understanding of the two ways to speedup our programs. However, the question arises, which is the better way. There is no one answer to that. It depends strictly on your program. If your program spend much of its time waiting for external events are generally good candidates for threading (I/O bound tasks). Programs that require heavy CPU computation (CPU bound tasks) and spend little time waiting for external events might not run faster at all. In such cases, multiprocessing is a good alternative.

Joblib Module

Joblib provides a set of tools for lightweight pipelining in Python. It is fast and is optimized for the numpy arrays. So, we can employ it for several applications of data science. The main focus of joblib is to reduce repetitive computation and persist to disk; hence it has significant advantages for large data computation.

Here, we can have a look at the parallel computation functionality of the joblib library. We use a simple function, function_to_test, and made it to sleep for 0.1 sec to imitate functions that take longer to compute. We made this dummy function more complex by adding two required arguments. We use the number of CPUs in the system for the number of parallel processes. In my case, the cpu_count is 12.

We used the time.perf_counter() to compute the time taken by the parallel computation and by the sequential computation.

Joblib finished in 2.98 secs
Numpy finished in 30.95 secs

The total number of computations for this example is 30*10=300, and each function takes ~0.1 seconds to run. So, it is expected for the sequential computation to take around 300*0.1=30 seconds. But the parallel algorithm can achieve this much faster. Since I have 12 CPUs, joblib divided the task into 12 processes, hence the speed jump of 30/12=2.5. The difference in the expected time of 2.5 sec and the actual time taken (2.98 sec) comes because of the overhead associated with the parallel computation. This overhead is the reason parallel computation is not recommended for smaller sets of data.

References

  1. Python Threading Tutorial: Run Code Concurrently Using the Threading Module

The information provided by the Earth Inversion is made available for educational purposes only.

Whilst we endeavor to keep the information up-to-date and correct. Earth Inversion makes no representations or warranties of any kind, express or implied about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services or related graphics content on the website for any purpose.

UNDER NO CIRCUMSTANCE SHALL WE HAVE ANY LIABILITY TO YOU FOR ANY LOSS OR DAMAGE OF ANY KIND INCURRED AS A RESULT OF THE USE OF THE SITE OR RELIANCE ON ANY INFORMATION PROVIDED ON THE SITE. ANY RELIANCE YOU PLACED ON SUCH MATERIAL IS THEREFORE STRICTLY AT YOUR OWN RISK.

Leave a comment