
MAINT: Avx512 intrinsics implementation for float64 input np.log #16605

Merged
mattip merged 6 commits into numpy:master from xiegengxin:avx512-log-float64
Jul 14, 2020

Conversation

@xiegengxin
Contributor

This patch implements np.log using AVX-512 intrinsics when the input is float64. The algorithm is based on:

[1] Tang, Ping Tak Peter. "Table-lookup algorithms for elementary functions and their error analysis." No. CONF-9106103-1. Argonne National Lab., IL (USA), 1991.
[2] Muller, Jean-Michel. "Elementary functions: algorithms and implementation." (2016).
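For orientation, here is a rough scalar sketch of the table-lookup reduction from [1]; the table size (64 entries), polynomial degree, and function name are illustrative assumptions, not the vectorized AVX-512 code in this PR:

```c
/*
 * Rough scalar sketch of the table-lookup reduction from [1]. The table
 * size, the polynomial, and the function name are illustrative only; the
 * PR implements a vectorized variant with AVX-512 intrinsics and a
 * precomputed log(F) table. Assumes x is a positive, finite, normal double.
 */
#include <math.h>

static double log_tang_sketch(double x)
{
    /* Split x = 2^k * m with m in [1, 2). */
    int k;
    double m = frexp(x, &k);          /* m in [0.5, 1), x = m * 2^k */
    m *= 2.0;
    k -= 1;

    /* Nearest table point F = 1 + j/64, so that r = (m - F)/F is tiny. */
    int j = (int)((m - 1.0) * 64.0 + 0.5);
    double F = 1.0 + (double)j / 64.0;
    double r = (m - F) / F;           /* |r| <= 1/128 */

    /* log(x) = k*log(2) + log(F) + log(1 + r): log(F) comes from a table,
     * log(1 + r) from a short polynomial since r is small. */
    double log_F = log1p((double)j / 64.0);          /* stands in for the table */
    double poly  = r - r * r / 2.0 + r * r * r / 3.0;
    return (double)k * 0.6931471805599453 + log_F + poly;
}
```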

Performance
Performance was measured with the benchmark cases under class AVX_UFunc(Benchmark) and class LogisticRegression(Benchmark).

Compared with the master branch (the scalar version, which calls glibc's log), the new implementation is about 6x faster. Training the LogisticRegression benchmark is about 1.2x faster than before.

| item | master | avx512-log-float64 | ratio |
|------|--------|--------------------|-------|
| bench_avx.LogisticRegression.time_train(<class 'numpy.float64'>) | 4.66±0.01s | 3.87±0.01s | 0.83 |
| bench_avx.AVX_UFunc_log.time_log(2, 'd') | 211±1μs | 36.7±0.5μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(4, 'd') | 212±0.9μs | 36.2±0.3μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(1, 'd') | 212±1μs | 35.1±1μs | 0.17 |

Accuracy
Tang's paper analyses the error of the algorithm; it is less than 1 ulp.

Module size
The _multiarray_umath.so module grows by about 24 kB with this PR.

@Qiyu8 added the component: SIMD and 01 - Enhancement labels on Jun 15, 2020
@sturlamolden
Contributor

sturlamolden commented Jun 16, 2020

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

This code is very specific to AVX-512 and will not work on CPUs that do not have it. It will not work with newer and better SIMD extensions in the future. The syntax of AVX-512, as with the rest of immintrin.h, is prefix notation and basically unreadable.

GNU vector extensions work everywhere, except with a certain compiler from Redmond. They are not limited to the Avx512 instruction set. They also provide a nice and readable infix syntax with normal C operators, so we can actually understand what the code is doing. That improves maintainability manifold.
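For illustration, a minimal sketch of the infix style that GNU vector extensions allow; the type name and vector width here are arbitrary:

```c
/*
 * Minimal sketch of the GNU vector-extension style being suggested; the
 * type name and vector width are arbitrary. Works with GCC and Clang,
 * not with MSVC.
 */
typedef double vec4d __attribute__((vector_size(4 * sizeof(double))));

static vec4d fma_vec(vec4d a, vec4d x, vec4d y)
{
    /* Plain infix operators; the compiler lowers this to whatever SIMD
     * instructions the target supports (SSE2, AVX, AVX-512, NEON, ...). */
    return a * x + y;
}
```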

We have used GNU extensions for vectorization in e.g. scipy.spatial, with the logic that those who prefer to use Microsoft’s compiler will just have to live with the consequences.

@mattip
Member

mattip commented Jun 16, 2020

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

These kinds of PRs go beyond loop-unrolling and autovectorization. NEP 38 goes into more detail about the approach we will be taking. Universal intrinsics work across platforms, and the approach has proven itself useful in end-user projects like OpenCV and GNU Radio, as well as many libraries that provide high-level operations but use intrinsics underneath like sleef (used by pytorch).
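For context, a rough sketch of what an inner loop looks like under the universal-intrinsics (npyv_*) layer that NEP 38 describes; the header path, macro, and function names below are assumptions based on that design, not code from this PR:

```c
/*
 * Rough sketch of a ufunc inner loop written against the universal
 * intrinsics (npyv_*) layer described in NEP 38. The include path, macro,
 * and function names are assumptions based on that design; it only shows
 * the shape such a loop takes.
 */
#include "simd/simd.h"   /* NumPy's universal intrinsics header (assumed path) */

#if NPY_SIMD_F64
static void
add_contig_f64(const double *src1, const double *src2, double *dst, npy_intp len)
{
    const int vstep = npyv_nlanes_f64;            /* lanes per vector register */
    npy_intp i = 0;
    for (; i + vstep <= len; i += vstep) {
        npyv_f64 a = npyv_load_f64(src1 + i);
        npyv_f64 b = npyv_load_f64(src2 + i);
        npyv_store_f64(dst + i, npyv_add_f64(a, b));
    }
    for (; i < len; i++) {                        /* scalar tail */
        dst[i] = src1[i] + src2[i];
    }
}
#endif
```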

@sturlamolden
Contributor

How about using GNU vector extensions or autovectorization pragmas instead and leaving the specifics of the SIMD dialect to the compiler?

These kinds of PRs go beyond loop-unrolling and autovectorization. NEP 38 goes into more detail about the approach we will be taking. Universal intrinsics work across platforms, and the approach has proven itself useful in end-user projects like OpenCV and GNU Radio, as well as many libraries that provide high-level operations but use intrinsics underneath like sleef (used by pytorch).

That is exactly my point. There is no use of universal intrinsics in this PR.

@Qiyu8
Member

Qiyu8 commented Jun 16, 2020

@sturlamolden Although universal SIMD is about to be merged in a few weeks, its value still needs to be proven by new implementations, and we have to consider the migration cost, so for now we choose to have a solid reference implementation in place, just like the other SIMD PRs.

@xiegengxin
Contributor Author

@r-devulap Could you review this code, please?

@mattip
Member

mattip commented Jul 3, 2020

For completeness, could you rerun the benchmarks and also compare the size of the shared library before and after this PR?

@xiegengxin
Contributor Author

For completeness, could you rerun the benchmarks and also compare the size of the shared library before and after this PR?

The performance:

| item | master | avx512-log-float64 | ratio |
|------|--------|--------------------|-------|
| bench_avx.LogisticRegression.time_train(<class 'numpy.float64'>) | 4.63±0s | 3.84±0.01s | 0.83 |
| bench_avx.AVX_UFunc_log.time_log(4, 'd') | 212±1μs | 37.2±0.2μs | 0.18 |
| bench_avx.AVX_UFunc_log.time_log(2, 'd') | 212±0.4μs | 36.4±0.1μs | 0.17 |
| bench_avx.AVX_UFunc_log.time_log(1, 'd') | 212±0.9μs | 36.1±1μs | 0.17 |

Still about 6x.

The size of the shared library has grown by about 23 kB.

@mattip
Member

mattip commented Jul 8, 2020

Will give a little more time for other reviews. @r-devulap, @seiko2plus: any thoughts?

@sturlamolden
Contributor

How does this compare to just using vector math libraries, e.g. when compiling with MKL, Accelerate, ACML, etc.? It seems it duplicates a tiny portion of what Intel, Apple, and AMD engineers have already implemented (not just for AVX). A vector math solution would also work for any transcendental function, not just the logarithm. IIRC, Numexpr already does this with MKL.

@mattip
Member

mattip commented Jul 8, 2020

For good or bad, we chose to implement the ufunc SIMD routines ourselves rather than re-use the libraries listed in NEP 38. A future NEP may propose rethinking this decision, but it would need to take into account the functionality needed for the various ufuncs (including scatter-gather operations for strided arrays) and the variety of platforms NumPy supports.

@sturlamolden
Contributor

To use vector math libraries efficiently with strided or badly aligned data, one would have to copy the data into a contiguous, aligned work array. For efficient use of the cache, it needs to be rather small.
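A minimal sketch of that blocking approach, with a hypothetical vm_log() standing in for the library call (here just a scalar loop so the sketch is self-contained):

```c
/*
 * Minimal sketch of the blocking scheme described above: gather a strided
 * input into a small, aligned, cache-sized scratch buffer, call a vector
 * math routine on contiguous data, then scatter the results back.
 * vm_log() is a hypothetical stand-in for a library call (MKL, Accelerate,
 * ...); here it is a scalar loop so the example compiles on its own.
 */
#include <math.h>
#include <stddef.h>

#define BLOCK 1024   /* small enough to stay in L1/L2 cache */

static void vm_log(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = log(in[i]);
}

static void strided_log(const char *src, ptrdiff_t src_stride,
                        char *dst, ptrdiff_t dst_stride, size_t n)
{
    double in[BLOCK]  __attribute__((aligned(64)));
    double out[BLOCK] __attribute__((aligned(64)));

    for (size_t start = 0; start < n; start += BLOCK) {
        size_t m = (n - start < BLOCK) ? n - start : BLOCK;
        for (size_t i = 0; i < m; i++)            /* gather strided input */
            in[i] = *(const double *)(src + (ptrdiff_t)(start + i) * src_stride);
        vm_log(in, out, m);                       /* contiguous, aligned call */
        for (size_t i = 0; i < m; i++)            /* scatter results back */
            *(double *)(dst + (ptrdiff_t)(start + i) * dst_stride) = out[i];
    }
}
```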

@r-devulap
Member

@mattip I recommended some changes which should improve performance further, but it looks good otherwise.

@mattip mattip merged commit c02d020 into numpy:master Jul 14, 2020
@mattip
Member

mattip commented Jul 14, 2020

Thanks @xiegengxin
