Characterizing vectorize and guvectorize for different amounts of data and compiler targets
On Numba's JIT Vectorization Capabilities
The Python package Numba is a JIT compiler that translates a subset of Python and NumPy code into machine code. Among other features, it can generate NumPy ufuncs via numba.vectorize and generalized ufuncs via numba.guvectorize. Both types of functions can be compiled for different targets: CPU, both single- and multi-threaded (parallel), as well as CUDA on Nvidia GPUs. This post analyzes the performance of a simple compiled demo workload across different sizes of input data and different compiler targets. TL;DR: numba.vectorize and numba.guvectorize show near-identical scaling behavior. Arrays of more than 10^3 elements are required to saturate 24 CPU cores. CUDA shows its strengths north of 10^4 elements.
Background
This analysis is part of a project to introduce array types into the Python package Poliastro. It is funded via a NumFOCUS Small Development Grant.
Poliastro already relies heavily on Numba, so a deeper analysis of Numba's features, capabilities and performance was required before making any specific design decisions around how to do array computations. numba.vectorize and numba.guvectorize proved to be interesting early on: they not only offer broadcasting semantics, but also allow specifying compiler targets, namely cpu (single-threaded), parallel (multiple threads on CPU) and cuda (for Nvidia GPUs). This is where I became interested in the scaling behavior of code compiled for different targets via both decorators.
The following tests were performed on an AMD Epyc 7443P in performance mode at basically full boost clock speed and an Nvidia RTX A5000. On the software side, CPython 3.10.5, NumPy 1.23.1, Numba 0.56, llvmlite 0.39.0 and CUDA 11.3 were used on top of Ubuntu 20.04 LTS.
Relevant imports
The following imports are relevant for running the benchmark.
Note that the constant COMPLEXITY allows making the artificial workload run longer. A value of 2^11 is roughly equivalent to the complexity encountered in a variety of algorithms found in Poliastro.
Base implementation
The following piece of code serves to verify the results of compiled code later on. It is a mix of pure Python and numpy, intentionally using an iterative approach. The "dummy" function performs the actual work on one single number at a time; the "base" function serves as a dispatcher for an array.
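The base implementation itself is missing from this excerpt. A sketch of the described structure could look as follows; the loop body of dummy is an illustrative stand-in, not the post's actual workload:

```python
import math
import numpy as np

COMPLEXITY = 2 ** 11  # assumed constant, see "Relevant imports"

def dummy(x):
    # Performs the actual work on one single number at a time.
    # The loop body is an illustrative stand-in for the real workload;
    # COMPLEXITY iterations make the work artificially longer.
    res = x
    for _ in range(COMPLEXITY):
        res = math.sin(res) * math.cos(res) + x
    return res

def base(data):
    # Dispatches the element-wise dummy function over an array,
    # intentionally using an iterative (non-vectorized) approach.
    out = np.empty_like(data)
    for idx in range(data.shape[0]):
        out[idx] = dummy(data[idx])
    return out
```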
Test implementations compiled with Numba
The following code snippets use numba.vectorize and numba.guvectorize, each for the targets cpu, parallel and cuda. They are expected to yield results identical to those of the base implementation above.
Verification of results of all functions against base implementation
Just to make sure, the results of all functions are verified against the base implementation.
Benchmark
The actual benchmark looks as follows. It steps through arrays of various sizes and repeats the measurement for each array size a certain number of times. Notice that the garbage collector is deactivated for this benchmark so it cannot interfere.
Results and Analysis
For input arrays longer than 10^3 elements, all CPU cores can basically be saturated on target parallel, and performance scales accordingly. For input arrays longer than 10^4 elements, cuda is a little faster at the upper end than all 24 CPU cores combined. While cuda suffers heavily on the low end when used with small arrays, which is to be expected, code compiled for target parallel does not suffer as much: for arrays with fewer than 8 elements, it is usually only a factor of 2 slower than a single-threaded solution compiled for target cpu.
On the CPU side, an almost complete and stable saturation of the available 24 cores can be observed. CUDA, in contrast, appears to suffer from the relatively simple workload combined with constant transfers of data across the PCIe bus. It still manages to overtake the CPU at large array sizes, though.