ENH: Improve the performance of einsum by using universal simd#17049
ENH: Improve the performance of einsum by using universal simd#17049Qiyu8 wants to merge 40 commits intonumpy:masterfrom
Conversation
|
@seiko2plus Can you add Power8/9 benchmark result here? Feel free to add more benchmark test cases. |
What compiler version are you using? I'm a little worried that we're manually optimizing things that a clever compiler can do for us, but we're benchmarking on a compiler that doesn't. At a glance, GCC 10 emits very similar code to this manual vectorization for a simple for loop. I think all SIMD benchmarks at a minimum need to state the compiler version for the "before", and should probably include a column with the latest-and-greated compiler for comparison. |
|
in X86 platform, I'm using MSVC Compiler (version 14.26.28801), with the args |
|
It looks like MSVC 16.3 adds the autovectorized AVX-512 support, so it might be worth trying with that for comparison. |
|
I think you mean Visual Studio version 16.3, my Visual studio version is 16.4, so AVX-512 auto vectorizer support is available, Here is the Benchmark result with
|
seiko2plus
left a comment
There was a problem hiding this comment.
rev[1/4], improve reduce the sum on X86(SSE, AVX2)
This patch doesn't cause any performance changes, it just aims to simplify the review process for numpy#17049, according to numpy#17049 (comment)
|
This should be good to rebase now. You should be able to delete |
|
From the first glance on the benchmark numbers, on x86 platform it seems to me that AVX2 and AVX-512 isn't providing any speed up relative to SSE. Is it worth adding extra code in the library for no benefit? |
|
@mattip some small arrays got radio of 1.05 after running multiple times, which I think is caused by normal performance jitter. |
|
@eric-wieser Any other Suggestions. Thanks |
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
|
@mattip @eric-wieser The last commits are comments and typos modification, no impact on performance. |
|
I can split this PR to 9 following PRs if It's too large to merge.
|
|
PRs to do (1) and (2) have been merged. |
|
What's the status of this PR? Did all the components end up as separate PRs? If so, can we link them here then close this? |
|
@eric-wieser The final part is about to come. |
|
Mission Accomplished, closing now. |
Rebased from #16641 in order to start a cleanup review , The optimization resulted in a performance improvement of 10%~77% Here is the benchmark result :
X86 machine
AVX512F Enabled
ARM machine
NEON Enabled
Co-Author: @seiko2plus