AlphaPept deals with high-throughput data. As this can be computationally intensive, we try to make all functions as performant as possible. To do so, we rely on two principles:

* Compilation
* Parallelization
A first step towards compilation is achieved by using NumPy arrays, which are already heavily C-optimized. Next, we consider three kinds of compilation:

* Python: no compilation is used.
* Numba: just-in-time (JIT) compilation.
* Cuda: compilation on the GPU.
All of these compilation approaches can be combined with parallelization approaches. We consider the following possibilities:

* No parallelization: not all functionality can be parallelized.
* Multithreading: this is only performant when Python’s global interpreter lock (GIL) is released or when mostly using input-/output (IO) functions.
* GPU: this is only available if an NVIDIA GPU is present and properly configured.
Note that not all compilation approaches can sensibly be combined with all parallelization approaches.
Next, we import all libraries, taking into account that not every machine has a GPU (with NVIDIA CUDA cores) available:
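A minimal sketch of what this GPU-aware import logic can look like is shown below; the exact imports in alphapept.performance may differ, and `__GPU_AVAILABLE` is the flag used in the tests later in this section:

```python
import numpy as np
import numba

# Illustrative sketch: probe for a usable CUDA GPU without failing on
# machines that have no NVIDIA driver or toolkit installed.
try:
    from numba import cuda
    __GPU_AVAILABLE = cuda.is_available()
except Exception:
    cuda = None
    __GPU_AVAILABLE = False

if not __GPU_AVAILABLE:
    print("No CUDA-capable GPU detected, GPU compilation modes are disabled.")
```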
Args:
    worker_count (int): The number of workers. If larger than the number of available cores, it is trimmed to the available maximum. If 0, it is set to the maximum number of cores available. If negative, it indicates how many cores NOT to use. Default is 1.
    set_global (bool): If False, the number of workers is only parsed to a valid value. If True, the number of workers is saved as a global variable. Default is True.
Returns: int: The parsed worker_count.
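A rough sketch of these parsing rules follows; the function and global variable names (`set_worker_count`, `MAX_WORKER_COUNT`) are taken from the docstrings in this section, but the exact implementation in AlphaPept may differ:

```python
import multiprocessing

MAX_WORKER_COUNT = 1  # global default, updated when set_global is True

def set_worker_count(worker_count: int = 1, set_global: bool = True) -> int:
    # Illustrative sketch of the rules described in the docstring above.
    global MAX_WORKER_COUNT
    max_available = multiprocessing.cpu_count()
    if worker_count == 0:
        worker_count = max_available                    # 0 means "use all cores"
    elif worker_count < 0:
        worker_count = max_available + worker_count     # negative: cores NOT to use
    worker_count = max(1, min(worker_count, max_available))  # trim to a valid value
    if set_global:
        MAX_WORKER_COUNT = worker_count
    return worker_count
```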
Compiled functions are intended to be very fast. However, they do not have the same flexibility as pure Python functions. In general, we recommend using statically defined compilation for optimal performance. We provide the option to define a default compilation mode for decorated functions, while also allowing the compilation mode to be defined for each individual function.
NOTE: Compiled functions are by default expected to be run on a single thread. Thus, ‘cuda’ functions are always assumed to be device functions, which makes them callable from within the GPU, unless explicitly stated otherwise. Similarly, ‘numba’ functions are always assumed to be ‘nopython’ and ‘nogil’.
NOTE: If the global compilation mode is set to Python, all decorators default to Python, even if a specific compilation_mode is provided.
In addition, we allow dynamic compilation to be enabled, meaning the compilation mode of functions can be changed at runtime. Do note that this comes at the cost of some performance, as compilation then needs to be done at runtime as well. Moreover, functions that are defined with dynamic compilation cannot be called from within other compiled functions (with the exception of ‘python’ compilation, which means no compilation is actually performed).
NOTE: Dynamic compilation must be enabled before functions are decorated to take effect at runtime; otherwise they are statically compiled with the settings that are active at the time they are defined! Alternatively, statically compiled functions of an ‘imported_module’ can be reloaded (and thus statically recompiled) with the commands:
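The reload itself uses the standard library, e.g. for some previously imported module `imported_module`:

```python
import importlib

# Re-import the module so its decorated functions are statically recompiled
# with the compilation settings that are active now.
importlib.reload(imported_module)
```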
Numba functions are by default set to use nogil=True and nopython=True, unless explicitly defined otherwise. Cuda functions are by default set to use device=True, unless explicitly defined otherwise.
Args:
    compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. If None, the global COMPILATION_MODE will be used as soon as the function is decorated for static compilation. If DYNAMIC_COMPILATION_ENABLED, the function will always be compiled at runtime and thus by default returns a Python function. Static recompilation can be enforced by reimporting a module containing the function with importlib.reload(imported_module). If COMPILATION_MODE is Python and not DYNAMIC_COMPILATION_ENABLED, no compilation will be used. Default is None.
    **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.
Returns: callable: A decorated function that is compiled.
Args:
    compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. Default is None.
    enable_dynamic_compilation (bool): Enable dynamic compilation. If enabled, code will generally be slower and no other functions can be called from within a compiled function anymore, as they are compiled at runtime. WARNING: Enabling this is strongly discouraged in almost all cases! Default is None.
Testing yields the expected results:
```python
import types

set_compilation_mode(compilation_mode="numba-multithread")

@compile_function(compilation_mode="python")
def test_func_python(x):
    """Docstring test"""
    x[0] += 1

@compile_function(compilation_mode="numba")
def test_func_numba(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=True)

@compile_function
def test_func_dynamic_runtime(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=False, compilation_mode="numba-multithread")

@compile_function
def test_func_static_runtime_numba(x):
    """Docstring test"""
    x[0] += 1

a = np.zeros(1, dtype=np.int64)
assert isinstance(test_func_python, types.FunctionType)
test_func_python(a)
assert np.all(a == np.ones(1))

a = np.zeros(1)
assert isinstance(test_func_numba, numba.core.registry.CPUDispatcher)
test_func_numba(a)
assert np.all(a == np.ones(1))

if __GPU_AVAILABLE:
    @compile_function(compilation_mode="cuda", device=None)
    def test_func_cuda(x):
        """Docstring test"""
        x[0] += 1

    # Cuda function cannot be tested from outside the GPU
    a = np.zeros(1)
    assert isinstance(test_func_cuda, numba.cuda.compiler.Dispatcher)
    test_func_cuda.forall(1, 1)(a)
    assert np.all(a == np.ones(1))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert isinstance(test_func_static_runtime_numba, numba.core.registry.CPUDispatcher)
test_func_static_runtime_numba(a)
assert np.all(a == np.ones(1))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert isinstance(test_func_dynamic_runtime, types.FunctionType)
test_func_dynamic_runtime(a)
assert np.all(a == np.ones(1))

set_compilation_mode(compilation_mode="numba")
a = np.zeros(1)
assert isinstance(test_func_dynamic_runtime, types.FunctionType)
test_func_dynamic_runtime(a)
assert np.all(a == np.ones(1))

# # Cuda function cannot be tested from outside the GPU
# set_compilation_mode(compilation_mode="cuda")
# a = np.zeros(1)
# assert isinstance(test_func_dynamic_runtime, types.FunctionType)
# test_func_dynamic_runtime.forall(1, 1)(a)
# assert np.all(a == np.ones(1))
```
C:\ProgramData\Miniconda3\envs\alphapept\lib\site-packages\numba\cuda\compiler.py:726: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (136) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
C:\ProgramData\Miniconda3\envs\alphapept\lib\site-packages\numba\cuda\cudadrv\devicearray.py:885: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
warn(NumbaPerformanceWarning(msg))
Next, we define the ‘performance_function’ decorator to take full advantage of both compilation and parallelization for maximal performance. Note that a ‘performance_function’ cannot return values. Instead, it should store results in provided buffer arrays.
A decorator to compile a given function and allow multithreading over multiple indices.
NOTE: This should only be used on functions that are compilable. Functions that are decorated need to have an index argument as their first argument. If an iterable is provided to the decorated function, the original (compiled) function will be applied to all elements of this iterable. The most efficient way to provide iterables is with ranges, but NumPy arrays work as well. Functions cannot return values; results should be stored in buffer arrays inside the function instead.
Args:
    worker_count (int): The number of workers to use for multithreading. If None, the global MAX_WORKER_COUNT is used at runtime. Default is None.
    compilation_mode (str): The compilation mode to use. Will be forwarded to the compile_function decorator.
    **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.
Returns: callable: A decorated function that is compiled and parallelized.
We test this function with a simple smoothing algorithm.
CPU times: total: 2.55 s
Wall time: 2.54 s
CPU times: total: 7.47 s
Wall time: 7.49 s
CPU times: total: 11 s
Wall time: 887 ms
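The benchmark code itself is not reproduced here; the following is a minimal, purely illustrative sketch of how such a smoother could be written with the `performance_function` decorator (the actual function used for the timings above may differ):

```python
@performance_function(compilation_mode="numba-multithread")
def smooth(index, in_array, out_array, window):
    # Rolling mean: average all points within `window` positions of `index`
    # and write the result into the pre-allocated output buffer.
    start = max(0, index - window)
    end = min(len(in_array), index + window + 1)
    total = 0.0
    for i in range(start, end):
        total += in_array[i]
    out_array[index] = total / (end - start)

in_array = np.random.rand(1_000_000)
out_array = np.empty_like(in_array)
# Passing a range as the index argument applies the compiled function to every
# element, distributed over the available worker threads.
smooth(range(len(in_array)), in_array, out_array, 100)
```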
Finally, we also provide functionality to use multiprocessing instead of multithreading.
NOTE: There are some inherent limitations on the number of processes that Python can spawn. As such, no process Pool should use more than 50 processes.
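As an illustration only (the helper below is hypothetical and not part of the AlphaPept API), a capped process pool could look like:

```python
import multiprocessing

def run_in_process_pool(function, iterable, worker_count=50):
    # Cap the pool at 50 processes and at the number of available cores.
    worker_count = min(worker_count, 50, multiprocessing.cpu_count())
    with multiprocessing.Pool(worker_count) as pool:
        return pool.map(function, iterable)
```

Note that `function` must be picklable (i.e. defined at module level) and that on Windows the call should be guarded by `if __name__ == "__main__":`.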