USIMD: Optimize the performance of np.einsum for all platforms #16641
Qiyu8 wants to merge 256 commits into numpy:master from einsum-neon
Conversation
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
These never worked to the best of my knowledge, so disable them explicitly to hopefully simplify cleanup in other parts.
We discussed this briefly in the sprint. Correct me if I am wrong: this is a reimplementation of the SSE2 intrinsics in NEON. I wonder if we could already use the universal intrinsics to make this generic?
Floats besides float64 were being coerced to integers and complex step sizes for the index trick classes would fail for complex64 input. Fixes numpy#16466
I'd be happy to change it to universal intrinsics. Is there any guidance document or example that I should read first? @mattip @seiko2plus

not yet, but sure I can guide you
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
/**begin repeat2
 * #i = 0, 2, 4, 6#
 */
a = vmulq_f64(vld1q_f64(data0+@i@), vld1q_f64(data1+@i@));
replace these intrinsics with:
vld1q_f64 -> npyv_load_f64
vmulq_f64 -> npyv_mul_f64
vaddq_f64 -> npyv_add_f64
vst1q_f64 -> npyv_store_f64
for more intrinsics, you have to search or dive in the NPYV
BTW, why didn't you use FMA3 here? Well, currently there's no mul-add intrinsic in NPYV, but I guess it will not be that hard to implement it.
I will make another PR for a mul-add intrinsic in NPYV.
… of github.com:numpy/numpy into einsum-neon
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
@seiko2plus Thanks for your advice; I have reconstructed the code accordingly. PS: USIMD is really powerful, it saved so much coding time. 😃
mattip left a comment
nice to see use of the universal intrinsics.
.gitignore (Outdated)
numpy/core/src/umath/_umath_tests.dispatch.h
numpy/core/src/umath/_umath_tests.dispatch.sse41.c
numpy/core/src/*/*.dispatch.*.c
numpy/core/src/*/*.dispatch.h
@seiko2plus shouldn't we be generating files in build/* rather than in the source directories, or is this similar to cython-generated files?
yes, they should be generated into build/src.*, but maybe @Qiyu8 used --inplace.
yes, I used -i option during build.
I guess you used the inplace option -i because you received an error like:
fatal error: einsum.dispatch.h: No such file or directory
Well, it's my bad: CCompilerOpt should include build/src.*/numpy/core/src/multiarray in the search dirs even if it wasn't added by 'core/setup.py'.
This issue should be fixed by #16858.
I think we should keep this ignore part; I build the source with the -i option, and the *.dispatch.*.c files will be generated.
By default people should not be using -i since it messes up the source directory, as you discovered. Why don't you want to use the build directory? Is there another problem that using -i fixes?
The build directory is good; there's no problem if I don't use the -i option. I will remove it from the ignore list.
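For readers following the -i discussion, these are the two build styles being contrasted (typical distutils-era NumPy invocations; the exact commands are my assumption, not quoted from this PR):

```shell
# Out-of-tree build: generated *.dispatch.* sources land under build/src.*
# and the source tree stays clean.
python setup.py build

# In-place build (-i / --inplace): generated files are written into the
# source directories, which is why they showed up for .gitignore above.
python setup.py build_ext --inplace
```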
self.c1 = np.arange(1200, dtype=dtype).reshape(30, 40)
self.c2 = np.arange(480000, dtype=dtype)
self.c3 = np.arange(600, dtype=dtype)
self.d = np.arange(10000, dtype=dtype).reshape(10, 100, 10)
It would be good at some point to rename these with meaningful names. Is there a reason to have the different size variants? Perhaps a very small variant would be good as well.
@mattip I found that meaningless naming is pretty common in every bench module; should we open a separate rename PR? I want to focus on the universal-intrinsics transformation in this PR.
ok. What about adding a very small variant as well?
I suspect a small variant alone is not enough to evaluate the degree of optimization, but I will test with small sizes and provide a report anyway.
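A "very small" variant in the style of the benchmark setup quoted above might look like the following ASV-style sketch (class and method names are illustrative, not the actual layout of NumPy's benchmarks module); at these sizes the per-call setup overhead dominates rather than the vectorized inner kernel:

```python
import numpy as np

class EinsumSmall:
    """Hypothetical small-size einsum benchmark case (names illustrative)."""

    def setup(self):
        # Sizes well below one SIMD-friendly block, so dispatch and
        # loop-setup overhead dominate the measurement.
        self.a = np.arange(6, dtype=np.float64).reshape(2, 3)
        self.b = np.arange(12, dtype=np.float64).reshape(3, 4)

    def time_einsum_matmul_small(self):
        np.einsum('ij,jk->ik', self.a, self.b)
```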
@Qiyu8, you should dispatch the tables that hold the function pointers within
Co-authored-by: Sayed Adel <seiko@imavr.com>
- paves the way for integer optimization
- fix memory overflow/bus errors
- improve the unrolling
- activate the unrolling when NPYV isn't available
- add support for fma3
- robust/cleanup
- other minor fixes
Seven sub-modules of np.einsum are optimized using NEON intrinsics. The optimization resulted in a performance improvement of 10~42%.
System info:
The optimization effect: