
MAINT: Avx512 intrinsics implementation for float64 input np.log #16605

Merged
mattip merged 6 commits into numpy:master from xiegengxin:avx512-log-float64
Jul 14, 2020

Conversation

@xiegengxin
Contributor

This patch implements np.log using AVX-512 intrinsics when the input is float64. The algorithm is based on:

[1] Tang, Ping Tak Peter. "Table-lookup algorithms for elementary functions and their error analysis." No. CONF-9106103-1. Argonne National Lab., IL (USA), 1991.
[2] Muller, Jean-Michel. "Elementary functions: algorithms and implementation." (2016).
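For orientation, here is a rough scalar sketch of the table-lookup reduction from [1]; the table size (64 entries), polynomial degree, and function name are illustrative assumptions, not the vectorized AVX-512 code in this PR:

```c
/*
 * Rough scalar sketch of the table-lookup reduction from [1]. The table
 * size, the polynomial, and the function name are illustrative only; the
 * PR implements a vectorized variant with AVX-512 intrinsics and a
 * precomputed log(F) table. Assumes x is a positive, finite, normal double.
 */
#include <math.h>

static double log_tang_sketch(double x)
{
    /* Split x = 2^k * m with m in [1, 2). */
    int k;
    double m = frexp(x, &k);          /* m in [0.5, 1), x = m * 2^k */
    m *= 2.0;
    k -= 1;

    /* Nearest table point F = 1 + j/64, so that r = (m - F)/F is tiny. */
    int j = (int)((m - 1.0) * 64.0 + 0.5);
    double F = 1.0 + (double)j / 64.0;
    double r = (m - F) / F;           /* |r| <= 1/128 */

    /* log(x) = k*log(2) + log(F) + log(1 + r): log(F) comes from a table,
     * log(1 + r) from a short polynomial since r is small. */
    double log_F = log1p((double)j / 64.0);          /* stands in for the table */
    double poly  = r - r * r / 2.0 + r * r * r / 3.0;
    return (double)k * 0.6931471805599453 + log_F + poly;
}
```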

Performance
Performance was measured with the benchmark cases under class AVX_UFunc(Benchmark) and class LogisticRegression(Benchmark).

Compared with the master branch (the scalar version, which calls glibc's log), the new implementation is about 6x faster. Training the LogisticRegression benchmark is about 1.2x faster than before.

| item | master | avx512-log-float64 | ratio |
|------|--------|--------------------|-------|
| bench_avx.LogisticRegression.time_train(<class 'numpy.float64'>) | 4.66±0.01s | 3.87±0.01s | 0.83 |
| bench_avx.AVX_UFunc_log.time_log(2, 'd') | 211±1μs | 36.7±0.5μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(4, 'd') | 212±0.9μs | 36.2±0.3μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(1, 'd') | 212±1μs | 35.1±1μs | 0.17 |

Accuracy
Tang's paper analyses the error of the algorithm; it is less than 1 ulp.

Module size
The _multiarray_umath.so module grows by about 24 kB with this PR.

@Qiyu8 added the component: SIMD and 01 - Enhancement labels on Jun 15, 2020
@sturlamolden
Contributor

sturlamolden commented Jun 16, 2020

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

This code is very specific to AVX-512 and will not work on CPUs that do not have it. It will not work with newer and better SIMD extensions in the future. The syntax of AVX-512, as with the rest of immintrin.h, is prefix notation and basically unreadable.

GNU vector extensions work everywhere, except with a certain compiler from Redmond. They are not limited to the Avx512 instruction set. They also provide a nice and readable infix syntax with normal C operators, so we can actually understand what the code is doing. That improves maintainability manifold.
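For illustration, a minimal sketch of the infix style that GNU vector extensions allow; the type name and vector width here are arbitrary:

```c
/*
 * Minimal sketch of the GNU vector-extension style being suggested; the
 * type name and vector width are arbitrary. Works with GCC and Clang,
 * not with MSVC.
 */
typedef double vec4d __attribute__((vector_size(4 * sizeof(double))));

static vec4d fma_vec(vec4d a, vec4d x, vec4d y)
{
    /* Plain infix operators; the compiler lowers this to whatever SIMD
     * instructions the target supports (SSE2, AVX, AVX-512, NEON, ...). */
    return a * x + y;
}
```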

We have used GNU extensions for vectorization in e.g. scipy.spatial, with the logic that those who prefer to use Microsoft’s compiler will just have to live with the consequences.

@mattip
Member

mattip commented Jun 16, 2020

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

These kinds of PRs go beyond loop-unrolling and autovectorization. NEP 38 goes into more detail about the approach we will be taking. Universal intrinsics work across platforms, and the approach has proven itself useful in end-user projects like OpenCV and GNU Radio, as well as many libraries that provide high-level operations but use intrinsics underneath like sleef (used by pytorch).
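For context, a rough sketch of what an inner loop looks like under the universal-intrinsics (npyv_*) layer that NEP 38 describes; the header path, macro, and function names below are assumptions based on that design, not code from this PR:

```c
/*
 * Rough sketch of a ufunc inner loop written against the universal
 * intrinsics (npyv_*) layer described in NEP 38. The include path, macro,
 * and function names are assumptions based on that design; it only shows
 * the shape such a loop takes.
 */
#include "simd/simd.h"   /* NumPy's universal intrinsics header (assumed path) */

#if NPY_SIMD_F64
static void
add_contig_f64(const double *src1, const double *src2, double *dst, npy_intp len)
{
    const int vstep = npyv_nlanes_f64;            /* lanes per vector register */
    npy_intp i = 0;
    for (; i + vstep <= len; i += vstep) {
        npyv_f64 a = npyv_load_f64(src1 + i);
        npyv_f64 b = npyv_load_f64(src2 + i);
        npyv_store_f64(dst + i, npyv_add_f64(a, b));
    }
    for (; i < len; i++) {                        /* scalar tail */
        dst[i] = src1[i] + src2[i];
    }
}
#endif
```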

@sturlamolden
Contributor

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

These kinds of PRs go beyond loop-unrolling and autovectorization. NEP 38 goes into more detail about the approach we will be taking. Universal intrinsics work across platforms, and the approach has proven itself useful in end-user projects like OpenCV and GNU Radio, as well as many libraries that provide high-level operations but use intrinsics underneath like sleef (used by pytorch).

That is exactly my point. There is no use of universal intrinsics in this PR.

@Qiyu8
Member

Qiyu8 commented Jun 16, 2020

@sturlamolden Although universal SIMD is about to be merged in a few weeks, its value still needs to be proven by new implementations, and we have to consider the migration cost, so for now we choose to have a solid reference implementation in place, just like the other SIMD PRs.

@xiegengxin
Contributor Author

@r-devulap Could you review this code, please?

@mattip
Member

mattip commented Jul 3, 2020

For completeness, could you rerun the benchmarks and also compare the size of the shared library before and after this PR?

@xiegengxin
Contributor Author

For completeness, could you rerun the benchmarks and also compare the size of the shared library before and after this PR?

The performance:

| item | master | avx512-log-float64 | ratio |
|------|--------|--------------------|-------|
| bench_avx.LogisticRegression.time_train(<class 'numpy.float64'>) | 4.63±0s | 3.84±0.01s | 0.83 |
| bench_avx.AVX_UFunc_log.time_log(4, 'd') | 212±1μs | 37.2±0.2μs | 0.18 |
| bench_avx.AVX_UFunc_log.time_log(2, 'd') | 212±0.4μs | 36.4±0.1μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(1, 'd') | 212±0.9μs | 36.1±1μs | 0.17 |

Still about 6x.

The size of the shared library has grown by about 23 kB.

@mattip
Member

mattip commented Jul 8, 2020

Will give a little more time for other reviews. @r-devulap, @seiko2plus: any thoughts?

@sturlamolden
Contributor

How does this compare to just using vector math libraries, e.g. when compiling with MKL, Accelerate, ACML, etc.? It seems it duplicates a tiny portion of what Intel, Apple, and AMD engineers have already implemented (not just for AVX). A vector math solution would also work for any transcendental function, not just the logarithm. IIRC, Numexpr already does this with MKL.

@mattip
Member

mattip commented Jul 8, 2020

For good or bad, we chose to implement the ufunc SIMD routines ourselves rather than re-use the libraries listed in NEP 38. A future NEP may propose rethinking this decision, but it would need to take into account the functionality needed for the various ufuncs (including scatter-gather operations for strided arrays) and the variety of platforms NumPy supports.

@sturlamolden
Contributor

To use vector math libraries efficiently with strided or badly aligned data, one would have to copy the data into a contiguous, aligned work array. For efficient use of the cache, it needs to be rather small.
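A minimal sketch of that blocking approach, with a hypothetical vm_log() standing in for the library call (here just a scalar loop so the sketch is self-contained):

```c
/*
 * Minimal sketch of the blocking scheme described above: gather a strided
 * input into a small, aligned, cache-sized scratch buffer, call a vector
 * math routine on contiguous data, then scatter the results back.
 * vm_log() is a hypothetical stand-in for a library call (MKL, Accelerate,
 * ...); here it is a scalar loop so the example compiles on its own.
 */
#include <math.h>
#include <stddef.h>

#define BLOCK 1024   /* small enough to stay in L1/L2 cache */

static void vm_log(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = log(in[i]);
}

static void strided_log(const char *src, ptrdiff_t src_stride,
                        char *dst, ptrdiff_t dst_stride, size_t n)
{
    double in[BLOCK]  __attribute__((aligned(64)));
    double out[BLOCK] __attribute__((aligned(64)));

    for (size_t start = 0; start < n; start += BLOCK) {
        size_t m = (n - start < BLOCK) ? n - start : BLOCK;
        for (size_t i = 0; i < m; i++)            /* gather strided input */
            in[i] = *(const double *)(src + (ptrdiff_t)(start + i) * src_stride);
        vm_log(in, out, m);                       /* contiguous, aligned call */
        for (size_t i = 0; i < m; i++)            /* scatter results back */
            *(double *)(dst + (ptrdiff_t)(start + i) * dst_stride) = out[i];
    }
}
```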

@r-devulap
Member

@mattip I recommended some changes which should improve performance further, but it looks good otherwise.

@mattip mattip merged commit c02d020 into numpy:master Jul 14, 2020
@mattip
Member

mattip commented Jul 14, 2020

Thanks @xiegengxin
