
ENH: Improve the performance of einsum by using universal simd#17049

Closed
Qiyu8 wants to merge 40 commits into numpy:master from Qiyu8:einsum-usimd

Conversation

@Qiyu8
Member

@Qiyu8 Qiyu8 commented Aug 11, 2020

Rebased from #16641 in order to start a clean-up review. The optimization yields a performance improvement of 10%~77%. Here are the benchmark results:

  • X86 machine

  • AVX512F Enabled

       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         149±3μs          104±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         238±6μs          156±5μs     0.65  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         247±8μs          149±6μs     0.60  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-        440±10μs         245±10μs     0.56  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • AVX2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         154±8μs          106±5μs     0.69  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         239±6μs          160±4μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        256±10μs          138±4μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         450±5μs          225±4μs     0.50  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • SSE2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         145±4μs          107±2μs     0.74  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         240±2μs          161±6μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        448±10μs          247±5μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-        253±10μs          137±6μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
  • ARM machine

  • NEON Enabled

       before           after         ratio
     [7aced6e5]       [ad0b3b42]
     <master>         <einsum-usimd>
-      22.0±0.7ms       20.0±0.3ms     0.91  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
-       109±0.9μs       99.1±0.8μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
-         111±1μs          100±1μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
-       159±0.8μs          142±2μs     0.90  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
-       112±0.6μs         96.3±2μs     0.86  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
-       112±0.6μs       95.2±0.3μs     0.85  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
-       136±0.4μs        115±0.5μs     0.85  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
-         518±5μs          363±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        773±20μs         479±10μs     0.62  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
-      17.0±0.4ms       9.77±0.2ms     0.57  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
-         873±4μs          350±4μs     0.40  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-         554±2μs          200±2μs     0.36  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-        1.02±0ms          233±4μs     0.23  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Co-Author: @seiko2plus

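For reference, the benchmarked cases above can be reproduced approximately with plain `einsum` calls. The exact setup lives in `bench_linalg.py`, so the shapes below are illustrative assumptions, not the benchmark's actual arrays:

```python
import numpy as np

n = 1000
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)

# "contig_contig": two contiguous operands, contiguous output
# (an elementwise product over the shared index)
prod = np.einsum('i,i->i', a, b)

# "contig_outstride0": two contiguous operands reduced into a single
# scalar, i.e. the output is traversed with stride 0
total = np.einsum('i,i->', a, b)
```

These are exactly the inner loops (`sum_of_products_*`) that the manual vectorization in this PR targets.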
Qiyu8 added 4 commits August 11, 2020 12:10
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@Qiyu8
Member Author

Qiyu8 commented Aug 11, 2020

@seiko2plus Can you add Power8/9 benchmark result here? Feel free to add more benchmark test cases.

Qiyu8 added 3 commits August 12, 2020 10:48
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@eric-wieser
Member

eric-wieser commented Aug 12, 2020

Here is the benchmark result:

What compiler version are you using? I'm a little worried that we're manually optimizing things that a clever compiler can do for us, but we're benchmarking on a compiler that doesn't. At a glance, GCC 10 emits very similar code to this manual vectorization for a simple for loop.

I think all SIMD benchmarks at a minimum need to state the compiler version for the "before", and should probably include a column with the latest-and-greatest compiler for comparison.

@Qiyu8
Member Author

Qiyu8 commented Aug 12, 2020

On the x86 platform, I'm using the MSVC compiler (version 14.26.28801) with the args /arch o2.
On the ARM platform, I'm using GCC 7.3.0 with the -O3 arg.
I think these compilers already perform auto-vectorization, but it is less efficient than manual optimization.

@eric-wieser
Member

eric-wieser commented Aug 12, 2020

It looks like MSVC 16.3 adds auto-vectorized AVX-512 support, so it might be worth trying that for comparison.
For vectorization to kick in, /fp:fast and/or -ffast-math are required.

@Qiyu8 Qiyu8 added 01 - Enhancement component: numpy.einsum component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Aug 13, 2020
@Qiyu8
Member Author

Qiyu8 commented Aug 13, 2020

I think you mean Visual Studio version 16.3; my Visual Studio version is 16.4, so AVX-512 auto-vectorizer support is available. Here is the benchmark result with /arch:AVX512 enabled:

  • X86 Platform AVX512F enabled
       before           after         ratio
     [00a45b4d]       [ae53e350]
     <master>         <einsum-usimd>
-         144±3μs          101±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         234±6μs          155±2μs     0.66  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         238±1μs          139±3μs     0.59  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         430±4μs          235±8μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
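As an aside, whether a given NumPy build actually detects AVX512F (or any other extension) on the benchmarking machine can be checked from Python. This is a sketch using an internal introspection dict; `__cpu_features__` is an assumption about NumPy internals (present since roughly NumPy 1.20, moved from `numpy.core` to `numpy._core` in NumPy 2.x):

```python
# Sketch: list the SIMD features NumPy detected on this CPU at runtime.
# __cpu_features__ is an internal dict mapping feature name -> bool;
# its home module moved from numpy.core to numpy._core in NumPy 2.x.
try:
    from numpy._core._multiarray_umath import __cpu_features__
except ImportError:
    from numpy.core._multiarray_umath import __cpu_features__

def enabled_simd_features():
    return sorted(name for name, on in __cpu_features__.items() if on)

print(enabled_simd_features())
```

On an x86 machine this would typically include entries such as 'SSE2', 'AVX2', or 'AVX512F', which is what distinguishes the three benchmark runs above.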

@Qiyu8 Qiyu8 requested a review from mattip August 14, 2020 06:47
Member

@mattip mattip left a comment


LGTM, a few last comments.

Are the benchmark results at the top of the PR current for the last changeset? I think 2e713b0 is the last commit. Does the size of _multiarray_umath.so change?

Member

@seiko2plus seiko2plus left a comment


rev[1/4]: improve the sum reduction on X86 (SSE, AVX2)

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Aug 19, 2020
  This patch doesn't cause any performance changes,
  it just aims to simplify the review process for numpy#17049,
  according to numpy#17049 (comment)
@eric-wieser
Member

eric-wieser commented Aug 21, 2020

This should be good to rebase now. You should be able to delete einsum_sumprod.c.src and replace it with your current einsum.dispatch.c.src (with some tweaks to the includes).

@r-devulap
Member

At first glance at the benchmark numbers, on the x86 platform it seems to me that AVX2 and AVX-512 aren't providing any speedup relative to SSE. Is it worth adding extra code to the library for no benefit?

@Qiyu8
Member Author

Qiyu8 commented Sep 19, 2020

@mattip some small arrays got a ratio of 1.05 after running multiple times, which I think is caused by normal performance jitter.

@mattip mattip requested a review from eric-wieser September 20, 2020 19:50
@Qiyu8
Member Author

Qiyu8 commented Sep 28, 2020

@eric-wieser Any other suggestions? Thanks.

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
@Qiyu8
Member Author

Qiyu8 commented Oct 13, 2020

@mattip @eric-wieser The last commits only modify comments and fix typos; they have no impact on performance.

@mattip mattip requested a review from eric-wieser October 13, 2020 06:38
@Qiyu8 Qiyu8 added the triage review Issue/PR to be discussed at the next triage meeting label Oct 22, 2020
@Qiyu8
Member Author

Qiyu8 commented Oct 26, 2020

I can split this PR into the following 9 PRs if it's too large to merge:

  1. Add sum intrinsics for all platforms.
  2. Add einsum benchmark cases.
  3. Optimizing the sum_of_products_contig_two.
  4. Optimizing the sum_of_products_stride0_contig_outcontig_two.
  5. Optimizing the sum_of_products_contig_stride0_outcontig_two.
  6. Optimizing the sum_of_products_contig_contig_outstride0_two.
  7. Optimizing the sum_of_products_stride0_contig_outstride0_two.
  8. Optimizing the sum_of_products_contig_stride0_outstride0_two.
  9. Optimizing the sum_of_products_contig_outstride0_one.
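For context, the `einsum` call shapes that exercise these kernels look roughly like the sketch below. The mapping from call to specialized loop is an assumption based on the kernel names ("stride0" meaning an operand or output traversed with stride 0); the actual loop selection happens inside einsum's setup based on dtypes and strides:

```python
import numpy as np

a = np.linspace(0.0, 1.0, 256, dtype=np.float64)
b = np.linspace(1.0, 2.0, 256, dtype=np.float64)

# two contiguous operands, contiguous output (a contig_two-style case)
c1 = np.einsum('i,i->i', a, b)

# first operand is a 0-d scalar, traversed with stride 0
# (a stride0_contig_outcontig_two-style case)
c2 = np.einsum(',i->i', 3.0, b)

# reduction to a scalar: the output has stride 0
# (a contig_contig_outstride0_two-style case)
c3 = np.einsum('i,i->', a, b)

# single contiguous operand reduced to a scalar
# (a contig_outstride0_one-style case: a plain sum)
c4 = np.einsum('i->', a)
```

Splitting along these kernel boundaries keeps each PR reviewable, since every specialized loop has an independently testable einsum signature.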

@Qiyu8 Qiyu8 marked this pull request as draft October 30, 2020 03:16
@mattip
Member

mattip commented Nov 4, 2020

PRs to do (1) and (2) have been merged.

@eric-wieser
Member

What's the status of this PR? Did all the components end up as separate PRs? If so, can we link them here then close this?

@Qiyu8
Member Author

Qiyu8 commented Jan 20, 2021

@eric-wieser The final part is about to come.

@Qiyu8
Member Author

Qiyu8 commented Jan 21, 2021

Mission Accomplished, closing now.

@Qiyu8 Qiyu8 closed this Jan 21, 2021