ENH: Vectorize FP16 umath functions using AVX512 #21955
Conversation
Might be useful to add a new CI test to run this new content on Intel SDE.

I have one question before I go further in reviewing this PR: does the SVML implementation use any of …
Reworked the patch to work on AVX512. Perf numbers for the FP16 functions look great, with a 33x–65x speed-up (on SkylakeX) depending on the function.
PR #21954 adds comprehensive test coverage for these math functions. I will rebase this PR once that is merged. |
seiko2plus left a comment:
LGTM, just requires moving the new intrinsics to the source file `loops_umath_fp.dispatch.c.src` instead.

ping ..
@r-devulap, would you please respond to #21955 (comment)? If you disagree, there are a few changes that will need to be made.
```c
const npy_intp ssrc = steps[0] / lsize;
const npy_intp sdst = steps[1] / lsize;
const npy_intp len = dimensions[0];
if ((ssrc == 1) && (sdst == 1)) {
```
Checking memory overlap is still required even with contiguous strides. Also, are there any reasons for not supporting non-contiguous memory access?
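For illustration, here is a minimal sketch of the kind of overlap predicate the reviewer is asking for (hypothetical helper, not NumPy's actual implementation): two byte ranges overlap iff each one starts before the other ends, and a vectorized loop that reads `src` and writes `dst` in blocks must take a safe path when that holds, even if both strides are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not NumPy's actual helper): two byte ranges
 * [a, a + a_len) and [b, b + b_len) overlap iff each starts before
 * the other ends. A SIMD loop writing dst while reading src must
 * fall back to a scalar/safe path when this returns nonzero. */
static int ranges_overlap(const char *a, size_t a_len,
                          const char *b, size_t b_len)
{
    return (a < b + b_len) && (b < a + a_len);
}
```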
Makes sense. Added it. AFAIK there is no gather/scatter instruction for 16-bit dtype.
x86 gather/scatter supports 16-bit offsets. Theoretically, it can be emulated via two gather/scatter calls for each full memory access.
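A scalar model of that emulation idea (hypothetical, not the PR's code): since AVX512 has no 16-bit-element gather, one 16-lane gather can be split into two passes whose results are interleaved back into a single destination, here modeled as two plain loops over even and odd lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of emulating a 16-bit gather with two
 * passes: the first pass fetches the even lanes, the second the odd
 * lanes, and the results interleave into one destination vector. */
static void gather16_emulated(uint16_t *dst, const uint16_t *src,
                              const int32_t *idx, int n)
{
    for (int i = 0; i < n; i += 2) {  /* first pass: even lanes */
        dst[i] = src[idx[i]];
    }
    for (int i = 1; i < n; i += 2) {  /* second pass: odd lanes */
        dst[i] = src[idx[i]];
    }
}
```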
Force-pushed from 67212ee to 7dfcd39.
Maybe this should get a release note? Or should we try to summarize all the SIMD changes in one note for the release?
Thanks @r-devulap

I just note these as "continuing SIMD improvements" :) A release note wouldn't hurt, as the FP16 improvements are new.
This patch leverages the `vcvtph2ps`/`vcvtps2ph` instructions and the float32 SVML functions to accelerate float16 umath functions. Max ULP error < 1 for all the math functions.