
ENH: Improve the performance of einsum by using universal simd#17049

Closed
Qiyu8 wants to merge 40 commits into numpy:master from Qiyu8:einsum-usimd

Conversation

@Qiyu8
Member

@Qiyu8 Qiyu8 commented Aug 11, 2020

Rebased from #16641 in order to start a clean-up review. The optimization yields a performance improvement of 10%~77%. Here are the benchmark results:

  • X86 machine

  • AVX512F Enabled

       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         149±3μs          104±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         238±6μs          156±5μs     0.65  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         247±8μs          149±6μs     0.60  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-        440±10μs         245±10μs     0.56  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • AVX2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         154±8μs          106±5μs     0.69  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         239±6μs          160±4μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        256±10μs          138±4μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         450±5μs          225±4μs     0.50  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
  • SSE2 Enabled
       before           after         ratio
     [00a45b4d]       [47118fb6]
     <master>         <einsum-usimd>
-         145±4μs          107±2μs     0.74  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         240±2μs          161±6μs     0.67  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        448±10μs          247±5μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-        253±10μs          137±6μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
  • ARM machine

  • NEON Enabled

       before           after         ratio
     [7aced6e5]       [ad0b3b42]
     <master>         <einsum-usimd>
-      22.0±0.7ms       20.0±0.3ms     0.91  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
-       109±0.9μs       99.1±0.8μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
-         111±1μs          100±1μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
-       159±0.8μs          142±2μs     0.90  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
-       112±0.6μs         96.3±2μs     0.86  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
-       112±0.6μs       95.2±0.3μs     0.85  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
-       136±0.4μs        115±0.5μs     0.85  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
-         518±5μs          363±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-        773±20μs         479±10μs     0.62  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
-      17.0±0.4ms       9.77±0.2ms     0.57  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
-         873±4μs          350±4μs     0.40  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
-         554±2μs          200±2μs     0.36  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-        1.02±0ms          233±4μs     0.23  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Co-Author: @seiko2plus

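For reference, the benchmarked cases above can be reproduced approximately with plain `einsum` calls. The exact setup lives in `bench_linalg.py`, so the shapes below are illustrative assumptions, not the benchmark's actual arrays:

```python
import numpy as np

n = 1000
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)

# "contig_contig": two contiguous operands, contiguous output
# (an elementwise product over the shared index)
prod = np.einsum('i,i->i', a, b)

# "contig_outstride0": two contiguous operands reduced into a single
# scalar, i.e. the output is traversed with stride 0
total = np.einsum('i,i->', a, b)
```

These are exactly the inner loops (`sum_of_products_*`) that the manual vectorization in this PR targets.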
Qiyu8 added 4 commits August 11, 2020 12:10
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@Qiyu8
Member Author

Qiyu8 commented Aug 11, 2020

@seiko2plus Can you add Power8/9 benchmark result here? Feel free to add more benchmark test cases.

Qiyu8 added 3 commits August 12, 2020 10:48
Merge branch 'einsum-usimd' of github.com:Qiyu8/numpy into einsum-usimd
@eric-wieser
Member

eric-wieser commented Aug 12, 2020

Here is the benchmark result:

What compiler version are you using? I'm a little worried that we're manually optimizing things that a clever compiler can do for us, but we're benchmarking on a compiler that doesn't. At a glance, GCC 10 emits very similar code to this manual vectorization for a simple for loop.

I think all SIMD benchmarks at a minimum need to state the compiler version for the "before", and should probably include a column with the latest-and-greatest compiler for comparison.

@Qiyu8
Member Author

Qiyu8 commented Aug 12, 2020

On the x86 platform, I'm using the MSVC compiler (version 14.26.28801) with the args /arch o2.
On the ARM platform, I'm using GCC 7.3.0 with the -O3 arg.
I think these compilers already perform auto-vectorization, but it is less efficient than manual optimization.

@eric-wieser
Member

eric-wieser commented Aug 12, 2020

It looks like MSVC 16.3 adds auto-vectorized AVX-512 support, so it might be worth trying that for comparison.
For vectorization to kick in, /fp:fast and/or -ffast-math are required.

@Qiyu8 Qiyu8 added 01 - Enhancement component: numpy.einsum component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Aug 13, 2020
@Qiyu8
Member Author

Qiyu8 commented Aug 13, 2020

I think you mean Visual Studio version 16.3; my Visual Studio version is 16.4, so AVX-512 auto-vectorizer support is available. Here is the benchmark result with /arch:AVX512 enabled:

  • X86 Platform AVX512F enabled
       before           after         ratio
     [00a45b4d]       [ae53e350]
     <master>         <einsum-usimd>
-         144±3μs          101±2μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-         234±6μs          155±2μs     0.66  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-         238±1μs          139±3μs     0.59  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
-         430±4μs          235±8μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
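As an aside, whether a given NumPy build actually detects AVX512F (or any other extension) on the benchmarking machine can be checked from Python. This is a sketch using an internal introspection dict; `__cpu_features__` is an assumption about NumPy internals (present since roughly NumPy 1.20, moved from `numpy.core` to `numpy._core` in NumPy 2.x):

```python
# Sketch: list the SIMD features NumPy detected on this CPU at runtime.
# __cpu_features__ is an internal dict mapping feature name -> bool;
# its home module moved from numpy.core to numpy._core in NumPy 2.x.
try:
    from numpy._core._multiarray_umath import __cpu_features__
except ImportError:
    from numpy.core._multiarray_umath import __cpu_features__

def enabled_simd_features():
    return sorted(name for name, on in __cpu_features__.items() if on)

print(enabled_simd_features())
```

On an x86 machine this would typically include entries such as 'SSE2', 'AVX2', or 'AVX512F', which is what distinguishes the three benchmark runs above.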

@Qiyu8 Qiyu8 requested a review from mattip August 14, 2020 06:47
Member

@mattip mattip left a comment


LGTM, a few last comments.

Are the benchmark results at the top of the PR current for the last changeset? I think 2e713b0 is the last commit. Does the size of _multiarray_umath.so change?

Member

@seiko2plus seiko2plus left a comment


rev[1/4]: improve the sum reduction on X86 (SSE, AVX2)

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Aug 19, 2020
  This patch doesn't cause any performance changes,
  it just aims to simplify the review process for numpy#17049,
  according to numpy#17049 (comment)
@eric-wieser
Member

eric-wieser commented Aug 21, 2020

This should be good to rebase now. You should be able to delete einsum_sumprod.c.src and replace it with your current einsum.dispatch.c.src (with some tweaks to the includes).

@r-devulap
Member

At first glance at the benchmark numbers, on the x86 platform it seems to me that AVX2 and AVX-512 aren't providing any speedup relative to SSE. Is it worth adding extra code to the library for no benefit?

@Qiyu8
Member Author

Qiyu8 commented Sep 19, 2020

@mattip some small arrays got a ratio of 1.05 after running multiple times, which I think is caused by normal performance jitter.

@mattip mattip requested a review from eric-wieser September 20, 2020 19:50
@Qiyu8
Member Author

Qiyu8 commented Sep 28, 2020

@eric-wieser Any other suggestions? Thanks.

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
@Qiyu8
Member Author

Qiyu8 commented Oct 13, 2020

@mattip @eric-wieser The last commits only modify comments and fix typos; they have no impact on performance.

@mattip mattip requested a review from eric-wieser October 13, 2020 06:38
@Qiyu8 Qiyu8 added the triage review Issue/PR to be discussed at the next triage meeting label Oct 22, 2020
@Qiyu8
Member Author

Qiyu8 commented Oct 26, 2020

I can split this PR into the following 9 PRs if it's too large to merge:

  1. Add sum intrinsics for all platforms.
  2. Add einsum benchmark cases.
  3. Optimizing the sum_of_products_contig_two.
  4. Optimizing the sum_of_products_stride0_contig_outcontig_two.
  5. Optimizing the sum_of_products_contig_stride0_outcontig_two.
  6. Optimizing the sum_of_products_contig_contig_outstride0_two.
  7. Optimizing the sum_of_products_stride0_contig_outstride0_two.
  8. Optimizing the sum_of_products_contig_stride0_outstride0_two.
  9. Optimizing the sum_of_products_contig_outstride0_one.
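For context, the `einsum` call shapes that exercise these kernels look roughly like the sketch below. The mapping from call to specialized loop is an assumption based on the kernel names ("stride0" meaning an operand or output traversed with stride 0); the actual loop selection happens inside einsum's setup based on dtypes and strides:

```python
import numpy as np

a = np.linspace(0.0, 1.0, 256, dtype=np.float64)
b = np.linspace(1.0, 2.0, 256, dtype=np.float64)

# two contiguous operands, contiguous output (a contig_two-style case)
c1 = np.einsum('i,i->i', a, b)

# first operand is a 0-d scalar, traversed with stride 0
# (a stride0_contig_outcontig_two-style case)
c2 = np.einsum(',i->i', 3.0, b)

# reduction to a scalar: the output has stride 0
# (a contig_contig_outstride0_two-style case)
c3 = np.einsum('i,i->', a, b)

# single contiguous operand reduced to a scalar
# (a contig_outstride0_one-style case: a plain sum)
c4 = np.einsum('i->', a)
```

Splitting along these kernel boundaries keeps each PR reviewable, since every specialized loop has an independently testable einsum signature.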

@Qiyu8 Qiyu8 marked this pull request as draft October 30, 2020 03:16
@mattip
Member

mattip commented Nov 4, 2020

PRs to do (1) and (2) have been merged.

@eric-wieser
Member

What's the status of this PR? Did all the components end up as separate PRs? If so, can we link them here then close this?

@Qiyu8
Member Author

Qiyu8 commented Jan 20, 2021

@eric-wieser The final part is about to come.

@Qiyu8
Member Author

Qiyu8 commented Jan 21, 2021

Mission Accomplished, closing now.

@Qiyu8 Qiyu8 closed this Jan 21, 2021