MAINT: Avx512 intrinsics implementation for float64 input np.log#16605
mattip merged 6 commits into numpy:master
Conversation
How about using GNU vector extensions or autovectorization pragmas instead, and leaving the specifics of the SIMD dialect to the compiler? This code is very specific to AVX-512 and will not work on CPUs that do not have it. It will not work with new and better SIMD extensions in the future. The syntax of AVX-512 intrinsics, like the rest of immintrin.h, is prefix-based and basically unreadable. GNU vector extensions work everywhere, except with a certain compiler from Redmond. They are not limited to the AVX-512 instruction set. They also provide a nice, readable infix syntax with normal C operators, so we can actually understand what the code is doing. That improves maintainability manifold. We have used GNU extensions for vectorization in e.g. scipy.spatial, with the logic that those who prefer to use Microsoft's compiler will just have to live with the consequences.
These kinds of PRs go beyond loop unrolling and autovectorization. NEP 38 goes into more detail about the approach we will be taking. Universal intrinsics work across platforms, and the approach has proven itself useful in end-user projects like OpenCV and GNU Radio, as well as in many libraries that provide high-level operations but use intrinsics underneath, like SLEEF (used by PyTorch).
That is exactly my point. There is no use of universal intrinsics in this PR. |
@sturlamolden Although Universal SIMD is about to merge in a few weeks, its value still needs to be proven by new implementations, and we have to consider the migration cost, so for now we choose to have a solid reference implementation in place, just like the other SIMD PRs.
@r-devulap Could you review this code, please?
For completeness, could you rerun the benchmarks and also compare the size of the shared library before and after this PR?
The performance is still about 6x. The shared library has grown by about 23 kB.
Will give a little more time for other reviews. @r-devulap, @seiko2plus: any thoughts? |
How does this compare to just using vector math libraries, e.g. when compiling with MKL, Accelerate, ACML, etc.? It seems to duplicate a tiny portion of what Intel, Apple, and AMD engineers have already implemented (not just for AVX). A vector math solution would also work for any transcendental function, not just the logarithm. IIRC, Numexpr already does this with MKL.
For good and for bad, we chose to implement the ufunc SIMD routines ourselves rather than to re-use libraries listed in NEP 38. A future NEP may propose to rethink this decision, but would need to take into account the functionality needed for the various ufuncs (including scatter-gather operations for strided arrays) and the variety of platforms NumPy supports. |
To use vector math libraries efficiently with strided or badly aligned data, one would have to copy the data into a contiguous, aligned work array. For efficient use of cache, that work array needs to be rather small.
@mattip I recommended some changes which should improve performance further, but it looks good otherwise.
Thanks @xiegengxin |
This patch implements np.log using AVX-512 intrinsics when the input is float64. The algorithm is based on: [1] Tang, Ping Tak Peter. "Table-lookup algorithms for elementary functions and their error analysis." No. CONF-9106103-1, Argonne National Lab., IL (USA), 1991; and [2] Muller, Jean-Michel. "Elementary Functions: Algorithms and Implementation." (2016).
Performance
The performance is tested by some cases in numpy/benchmarks/benchmarks/bench_avx.py (line 25 and line 154 at commit f3ff68e).
Compared with the master branch (the scalar version calling glibc's function), the new implementation is about 6x faster. Training a LogisticRegression is about 1.2x faster than before.
Accuracy
Tang's paper has analysed the error of his algorithm; it is less than 1 ulp.
Module size
The _multiarray_umath.so module has grown by about 24 kB with this PR.