Comparing threading and multiprocessing

[1]:
import time

import multiprocessing as mp
from multiprocessing import Pool as ProcessPool
from multiprocessing.pool import ThreadPool

import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
[2]:
sns.set_context('talk')

Introduction

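Python offers several ways to run code concurrently. Here we compare two of them.

The threading module runs all threads inside a single process, which keeps overhead low and makes sharing memory easy. However, in standard CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. Threads are therefore not truly parallel for pure-Python work: a thread mainly gets to run while the others are idle, for example while they wait for I/O.

The multiprocessing module starts separate processes, each with its own interpreter, which the operating system can schedule on different CPU cores. They can thus execute code at the same time, at the cost of higher startup and communication overhead.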
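The shared-memory point can be illustrated with a minimal sketch. The helper `record` and the list `shared` are hypothetical names introduced only for this example. Every thread of a ThreadPool runs inside the parent process, so all of them see the same list and report the same process ID.

```python
import os
from multiprocessing.pool import ThreadPool

shared = []  # hypothetical list living in the parent process's memory


def record(i):
    # Appending to a list from several threads is safe in CPython.
    shared.append(i)
    return os.getpid()


with ThreadPool(4) as pool:
    pids = pool.map(record, range(8))

print(sorted(shared))              # every item written by a thread is visible here
print(set(pids) == {os.getpid()})  # all threads ran in this very process
```

With a process pool, each worker would instead append to its own copy of the list, and the results would have to be sent back to the parent explicitly.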

Preparations

To investigate the differences between threading and multiprocessing, we will simulate work for each data point and measure when it was executed.

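A process pool sends the worker function to its child processes by reference (module name plus function name), so the children must be able to import it. Depending on the platform's process start method, a function defined interactively in a notebook may not be importable, so we write the worker to a separate module and import it from there.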
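The worker has to be importable because pickle, which multiprocessing uses to ship tasks to child processes, stores functions by name rather than by code. The following sketch uses `math.sqrt` purely as a stand-in for any module-level function. It is an illustration of the mechanism, not part of the benchmark.

```python
import math
import pickle

# A module-level function is pickled as a reference: module name + function name.
payload = pickle.dumps(math.sqrt)
print(b'sqrt' in payload)         # the name is stored, not the function's code
restored = pickle.loads(payload)  # unpickling looks the function up by name

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
print(lambda_picklable)
```

A function that lives in `worker.py` can be looked up by any child process that can import `worker`, which is what the next cell sets up.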

[3]:
%%writefile worker.py

import time


def worker(data):
    tmp = []
    for i in data:
        # simulate CPU load
        for _ in range(1_000_000):
            pass

        # store execution time
        tmp.append(time.time())
    return tmp
Overwriting worker.py
[4]:
from worker import worker
[5]:
data = list(range(20))
num = mp.cpu_count()
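Note that `mp.cpu_count()` reports the number of logical CPUs in the machine, which can exceed the number the current process is allowed to use when CPU affinity is restricted. A hedged sketch of a more conservative count follows. The variable names `logical` and `usable` are illustrative, and `os.sched_getaffinity` is only available on some Unix platforms.

```python
import multiprocessing as mp
import os

logical = mp.cpu_count()  # logical CPUs in the machine

# os.sched_getaffinity is not available everywhere, hence the guard.
if hasattr(os, 'sched_getaffinity'):
    usable = len(os.sched_getaffinity(0))  # CPUs this process may run on
else:
    usable = logical  # fall back to the machine-wide count

print(logical, usable)
```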

Computations

We run the worker on num copies of the dataset, one per pool worker, first in a thread pool and then in a process pool.

[6]:
%%time
with ThreadPool(num) as p:
    thread_result = p.map(worker, [data] * num)
CPU times: user 3.79 s, sys: 131 ms, total: 3.92 s
Wall time: 3.86 s
[7]:
%%time
with ProcessPool(num) as p:
    process_result = p.map(worker, [data] * num)
CPU times: user 26 ms, sys: 56.7 ms, total: 82.7 ms
Wall time: 1.4 s
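The process-pool cell reports far less CPU time than wall time. A likely explanation is that `%%time` accounts CPU time for the notebook process only, while the loops ran in the child processes and the parent merely waited. The sketch below reproduces this pattern with `time.sleep` as a stand-in for the waiting parent. It is an illustration under that assumption, not a measurement of the pools above.

```python
import time

wall_start = time.perf_counter()  # wall-clock timer
cpu_start = time.process_time()   # CPU time of this process only

time.sleep(0.2)  # stand-in for the parent waiting on its child processes

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f'wall: {wall:.3f} s, cpu: {cpu:.3f} s')
```

For comparing the two pools, the wall time is therefore the relevant number.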

Next, we collect the timestamps of both runs in one long-format dataframe and normalize them per pool type, so that each run starts at zero.

[8]:
df_thread = pd.melt(
    pd.DataFrame(thread_result, index=[f'job {i:2}' for i in range(num)]).T
)
df_thread['type'] = 'thread'

df_process = pd.melt(
    pd.DataFrame(process_result, index=[f'job {i:2}' for i in range(num)]).T
)
df_process['type'] = 'process'

df = pd.concat([df_thread, df_process], ignore_index=True)
# normalize time per pool type; transform keeps the original row index
df['value'] = df.groupby('type')['value'].transform(lambda x: x - x.min())
df.head()
[8]:
variable value type
0 job 0 0.105189 thread
1 job 0 0.316431 thread
2 job 0 0.592710 thread
3 job 0 0.704985 thread
4 job 0 0.822085 thread
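To make the reshaping step easier to follow, here is a small sketch on made-up data. The numbers in `toy_result` are hypothetical and not taken from the runs above. Each inner list plays the role of one job's timestamps.

```python
import pandas as pd

# Two hypothetical jobs with three timestamps each (made-up numbers).
toy_result = [[10.0, 10.5, 11.0], [10.2, 10.7, 11.2]]

# One column per job, then melt into long format: one row per timestamp.
wide = pd.DataFrame(toy_result, index=['job 0', 'job 1']).T
tidy = pd.melt(wide)
tidy['type'] = 'toy'

# Shift the timestamps so that the earliest one in each group is zero.
tidy['value'] = tidy.groupby('type')['value'].transform(lambda x: x - x.min())
print(tidy)
```

The `variable` column holds the job label and `value` the time elapsed since the first timestamp of that group.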

Investigation

There are two main observations:

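  • multiprocessing takes less wall time than threading (1.4 s versus 3.86 s in the runs above)

  • with multiprocessing, the jobs record their timestamps at the same time, while with threading only one job makes progress at any moment, so the work is effectively serialized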

[9]:
g = sns.FacetGrid(df, row='type', aspect=2)
g.map_dataframe(sns.scatterplot, x='value', y='variable')
g.set_axis_labels('Time [s]', 'Pool')
[9]:
<seaborn.axisgrid.FacetGrid at 0x140feb100>
[Figure: scatter plot of the normalized execution timestamps per job, one panel per pool type (thread, process); x-axis: Time [s], y-axis: Pool]

The decision whether to use threading or multiprocessing depends on the use case.

As a general rule of thumb, use threading if the problem is I/O-bound, because a thread that waits for I/O releases the GIL and lets the others run, and use multiprocessing if the problem is CPU-bound.
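The I/O-bound half of this rule can be sketched as follows. The function `fake_io` is a hypothetical stand-in that sleeps instead of performing real I/O. Sleeping releases the GIL just like a blocking read would, so four such tasks in four threads should take roughly as long as one.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_io(i):
    time.sleep(0.2)  # releases the GIL, like blocking I/O does
    return i


start = time.perf_counter()
with ThreadPool(4) as pool:
    results = pool.map(fake_io, range(4))
elapsed = time.perf_counter() - start

# Run one after another, the four tasks would need about 0.8 s.
print(f'{elapsed:.2f} s for {len(results)} tasks')
```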