Performance

Parallelization for GPU and CPU

AlphaPept deals with high-throughput data. As this can be computationally intensive, we try to make all functions as performant as possible. To do so, we rely on two principles:

* Compilation
* Parallelization

A first step towards compilation can be achieved by using NumPy arrays, which are already heavily C-optimized. Next, we consider three kinds of compilation:

* Python: This uses no compilation.
* Numba: This uses just-in-time (JIT) compilation.
* CUDA: This compiles for the GPU.

All of these compilation approaches can be combined with parallelization approaches. We consider the following possibilities:

* No parallelization: Not all functionality can be parallelized.
* Multithreading: This is only performant when Python’s global interpreter lock (GIL) is released or when the workload mostly consists of input/output (IO) operations.
* GPU: This is only available if an NVIDIA GPU is present and properly configured.

Note that not all compilation approaches can sensibly be combined with all parallelization approaches.
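
For concreteness, the mode strings used throughout this notebook combine both choices. The following summary is illustrative only (it mirrors the modes exercised in the examples below and is not the module’s authoritative option list):

# Illustrative only: compilation/parallelization combinations as mode strings.
SENSIBLE_COMPILATION_MODES = [
    "python",             # no compilation, single-threaded
    "numba",              # JIT compilation, single-threaded
    "numba-multithread",  # JIT compilation with nogil, multithreaded on the CPU
    "cuda",               # GPU compilation, parallelized over GPU threads
]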

Next, we import all libraries, taking into account that not every machine has a GPU (with NVIDIA CUDA cores) available.
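
A minimal sketch of how such a guarded import could look (the __GPU_AVAILABLE flag name matches the test code further below; the actual import cell may differ):

import numpy as np
import numba

try:
    import cupy
    cupy.cuda.runtime.getDeviceCount()  # raises if no CUDA device is found
    __GPU_AVAILABLE = True
except Exception:
    __GPU_AVAILABLE = False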


source

is_valid_compilation_mode

 is_valid_compilation_mode (compilation_mode:str)

Check if the provided string is a valid compilation mode.

Args:
    compilation_mode (str): The compilation mode to verify.

Raises:
    ModuleNotFoundError: When trying to use an unavailable GPU.
    NotImplementedError: When the compilation mode is not valid.
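
For example (a sketch of the documented behavior; the exact error messages are implementation details):

is_valid_compilation_mode("numba-multithread")  # valid, passes silently

try:
    is_valid_compilation_mode("fortran")  # not a valid mode
except NotImplementedError as e:
    print(e)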

By default, we will use cuda if it is available. If not, numba-multithread is used as the default.

To consistently use multiple threads or processes, we can set a global MAX_WORKER_COUNT parameter.


source

set_worker_count

 set_worker_count (worker_count:int=1, set_global:bool=True)

Parse and set the (global) number of threads.

Args:
    worker_count (int): The number of workers. If larger than the number of available cores, it is trimmed to the available maximum. If 0, it is set to the maximum number of cores available. If negative, it indicates how many cores NOT to use. Default is 1.
    set_global (bool): If False, the number of workers is only parsed to a valid value. If True, the number of workers is saved as a global variable. Default is True.

Returns:
    int: The parsed worker_count.
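
The documented parsing rules translate to calls such as the following (an illustrative sketch):

import multiprocessing

max_cores = multiprocessing.cpu_count()

set_worker_count(0)              # use all max_cores cores
set_worker_count(-1)             # use all cores except one
set_worker_count(max_cores + 1)  # trimmed to max_cores
parsed = set_worker_count(4, set_global=False)  # parse only; global unchanged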

Compiled functions are intended to be very fast. However, they do not have the same flexibility as pure Python functions. In general, we recommend statically compiled functions for optimal performance. We provide the option to define a default compilation mode for decorated functions, while also allowing the compilation mode to be set for each individual function.

NOTE: Compiled functions are by default expected to be performed on a single thread. Thus, ‘cuda’ functions are always assumed to be device functions, which makes them callable from within the GPU, unless explicitly stated otherwise. Similarly, ‘numba’ functions are always assumed to be ‘nopython’ and ‘nogil’.

NOTE: If the global compilation mode is set to python, all decorators default to python, even if a specific compilation_mode is provided.

In addition, we allow dynamic compilation to be enabled, meaning the compilation mode of functions can be changed at runtime. Do note that this comes at the cost of some performance, as compilation needs to be done at runtime as well. Moreover, functions that are defined with dynamic compilation cannot be called from within other compiled functions (with the exception of ‘python’ compilation, which means no compilation is actually performed).

NOTE: Dynamic compilation must be enabled before functions are decorated to take effect at runtime; otherwise they are statically compiled with the settings that are active at the time they are defined! Alternatively, statically compiled functions of an imported module can be reloaded (and thus statically recompiled) with the commands:

import importlib
importlib.reload(imported_module)

source

compile_function

 compile_function (_func:callable=None, compilation_mode:str=None,
                   **decorator_kwargs)

A decorator to compile a given function.

Numba functions are by default set to use nogil=True and nopython=True, unless explicitly defined otherwise. CUDA functions are by default set to use device=True, unless explicitly defined otherwise.

Args:
    compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. If None, the global COMPILATION_MODE will be used as soon as the function is decorated for static compilation. If DYNAMIC_COMPILATION_ENABLED, the function will always be compiled at runtime and thus by default returns a Python function. Static recompilation can be enforced by reimporting a module containing the function with importlib.reload(imported_module). If COMPILATION_MODE is python and not DYNAMIC_COMPILATION_ENABLED, no compilation will be used. Default is None.
    **decorator_kwargs: Keyword arguments that will be passed to the numba.jit or cuda.jit compilation decorators.

Returns:
    callable: A decorated function that is compiled.


source

set_compilation_mode

 set_compilation_mode (compilation_mode:str=None,
                       enable_dynamic_compilation:bool=None)

Set the global compilation mode to use.

Args:
    compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. Default is None.
    enable_dynamic_compilation (bool): Enable dynamic compilation. If enabled, code will generally be slower and no other functions can be called from within a compiled function anymore, as they are compiled at runtime. WARNING: Enabling this is strongly discouraged in almost all cases! Default is None.

Testing yields the expected results:

import types

import numba
import numpy as np

set_compilation_mode(compilation_mode="numba-multithread")

@compile_function(compilation_mode="python")
def test_func_python(x):
    """Docstring test"""
    x[0] += 1
    
@compile_function(compilation_mode="numba")
def test_func_numba(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=True)

@compile_function
def test_func_dynamic_runtime(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=False, compilation_mode="numba-multithread")

@compile_function
def test_func_static_runtime_numba(x):
    """Docstring test"""
    x[0] += 1

a = np.zeros(1, dtype=np.int64)
assert(isinstance(test_func_python, types.FunctionType))
test_func_python(a)
assert(np.all(a == np.ones(1)))

a = np.zeros(1)
assert(isinstance(test_func_numba, numba.core.registry.CPUDispatcher))
test_func_numba(a)
assert(np.all(a == np.ones(1)))

if __GPU_AVAILABLE:
    @compile_function(compilation_mode="cuda", device=None)
    def test_func_cuda(x):
        """Docstring test"""
        x[0] += 1

    # With device=None, this is compiled as a CUDA kernel rather than a
    # device function, so it can be launched from the host with forall:
    a = np.zeros(1)
    assert(isinstance(test_func_cuda, numba.cuda.compiler.Dispatcher))
    test_func_cuda.forall(1,1)(a)
    assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_static_runtime_numba, numba.core.registry.CPUDispatcher))
test_func_static_runtime_numba(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="numba")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

# # Cuda function cannot be tested from outside the GPU
# set_compilation_mode(compilation_mode="cuda")
# a = np.zeros(1)
# assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
# test_func_dynamic_runtime.forall(1,1)(a)
# assert(np.all(a == np.ones(1)))
C:\ProgramData\Miniconda3\envs\alphapept\lib\site-packages\numba\cuda\compiler.py:726: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (136) will likely result in GPU under utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
C:\ProgramData\Miniconda3\envs\alphapept\lib\site-packages\numba\cuda\cudadrv\devicearray.py:885: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))

Next, we define the ‘performance_function’ decorator to take full advantage of both compilation and parallelization for maximal performance. Note that a ‘performance_function’ cannot return values. Instead, it should store results in provided buffer arrays.


source

performance_function

 performance_function (_func:callable=None, worker_count:int=None,
                       compilation_mode:str=None, **decorator_kwargs)

A decorator to compile a given function and allow multithreading over multiple indices.

NOTE: This should only be used on functions that are compilable. Functions that need to be decorated must have an index argument as their first argument. If an iterable is provided to the decorated function, the original (compiled) function will be applied to all elements of this iterable. The most efficient way to provide iterables is with ranges, but NumPy arrays work as well. Functions cannot return values; results should be stored in buffer arrays inside the function instead.

Args:
    worker_count (int): The number of workers to use for multithreading. If None, the global MAX_WORKER_COUNT is used at runtime. Default is None.
    compilation_mode (str): The compilation mode to use. Will be forwarded to the compile_function decorator.
    **decorator_kwargs: Keyword arguments that will be passed to the numba.jit or cuda.jit compilation decorators.

Returns:
    callable: A decorated function that is compiled and parallelized.

We test this function with a simple smoothing algorithm.

def smooth_func(index, in_array, out_array, window_size):
    min_index = max(index - window_size, 0)
    max_index = min(index + window_size + 1, len(in_array))
    smooth_value = 0
    for i in range(min_index, max_index):
        smooth_value += in_array[i]
    out_array[index] += smooth_value / (max_index - min_index)


set_compilation_mode(compilation_mode="numba-multithread")
set_worker_count(0)
array_size = 10**6
smooth_factor = 10**4

# python test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="python")(smooth_func)


# numba test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="numba")(smooth_func)


# numba-multithread test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="numba-multithread")(smooth_func)


# cuda test
if __GPU_AVAILABLE:
    in_array = cupy.arange(array_size)
    out_array = cupy.zeros_like(in_array)

    func = performance_function(compilation_mode="cuda")(smooth_func)
CPU times: total: 2.55 s
Wall time: 2.54 s
CPU times: total: 7.47 s
Wall time: 7.49 s
CPU times: total: 11 s
Wall time: 887 ms

Finally, we also provide functionality to use multiprocessing instead of multithreading.

NOTE: There are some inherent limitations to the number of processes that Python can spawn. As such, no process Pool should use more than 50 processes.


source

AlphaPool

 AlphaPool (process_count:int)

Create a multiprocessing.Pool object.

Args:
    process_count (int): The number of processes. If larger than the number of available cores, it is trimmed to the available maximum.

Returns:
    multiprocessing.Pool: A Pool object to parallelize functions with multiple processes.
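
A minimal usage sketch (since AlphaPool returns a standard multiprocessing.Pool, it can be used as a context manager; square is a hypothetical helper):

def square(x):
    return x * x

if __name__ == "__main__":
    # process_count is trimmed to the available cores if needed
    with AlphaPool(4) as pool:
        results = pool.map(square, range(10))
    print(results)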