ENH: Vectorize FP16 umath functions using AVX512 #21955
Conversation
Might be useful to add a new CI test to run this new content on Intel SDE.

I have one question before I go further in reviewing this PR: does the SVML implementation use any of …
Reworked the patch to work on AVX512. Perf numbers for the FP16 functions look great, with a 33x–65x speed-up (on SkylakeX) depending on the function.
PR #21954 adds comprehensive test coverage for these math functions. I will rebase this PR once that is merged. |
seiko2plus left a comment:
LGTM, just requires moving the new intrinsics to the source file `loops_umath_fp.dispatch.c.src` instead.

ping ..
@r-devulap, would you please respond to #21955 (comment)? If you disagree, there are a few changes that will need to be made.
```c
const npy_intp ssrc = steps[0] / lsize;
const npy_intp sdst = steps[1] / lsize;
const npy_intp len = dimensions[0];
if ((ssrc == 1) && (sdst == 1)) {
```
Checking memory overlap is still required even with contiguous strides. Also, are there any reasons for not supporting non-contiguous memory access?
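For illustration, here is a minimal sketch of the kind of overlap predicate the reviewer is asking for (hypothetical helper, not NumPy's actual implementation): two byte ranges overlap iff each one starts before the other ends, and a vectorized loop that reads `src` and writes `dst` in blocks must take a safe path when that holds, even if both strides are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not NumPy's actual helper): two byte ranges
 * [a, a + a_len) and [b, b + b_len) overlap iff each starts before
 * the other ends. A SIMD loop writing dst while reading src must
 * fall back to a scalar/safe path when this returns nonzero. */
static int ranges_overlap(const char *a, size_t a_len,
                          const char *b, size_t b_len)
{
    return (a < b + b_len) && (b < a + a_len);
}
```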
Makes sense. Added it. AFAIK there is no gather/scatter instruction for 16-bit dtype.
x86 gather/scatter supports 16-bit offsets. Theoretically, it can be emulated via two gather/scatter calls for each full memory access.
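A scalar model of that emulation idea (hypothetical, not the PR's code): since AVX512 has no 16-bit-element gather, one 16-lane gather can be split into two passes whose results are interleaved back into a single destination, here modeled as two plain loops over even and odd lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of emulating a 16-bit gather with two
 * passes: the first pass fetches the even lanes, the second the odd
 * lanes, and the results interleave into one destination vector. */
static void gather16_emulated(uint16_t *dst, const uint16_t *src,
                              const int32_t *idx, int n)
{
    for (int i = 0; i < n; i += 2) {  /* first pass: even lanes */
        dst[i] = src[idx[i]];
    }
    for (int i = 1; i < n; i += 2) {  /* second pass: odd lanes */
        dst[i] = src[idx[i]];
    }
}
```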
Force-pushed from 67212ee to 7dfcd39.
Maybe this should get a release note? Or should we try to summarize all the SIMD changes in one note for the release?
Thanks @r-devulap

I just note these as "continuing SIMD improvements" :) A release note wouldn't hurt, as the FP16 improvements are new.
This patch leverages the `vcvtph2ps`/`vcvtps2ph` instructions and the float32 SVML functions to accelerate float16 umath functions. Max ULP error < 1 for all the math functions.