
BUG: Fixes for np.random.zipf #9824

Merged: eric-wieser merged 3 commits into numpy:master from charris:gh-9680 on Oct 10, 2017

Conversation

charris (Member) commented Oct 4, 2017

Cleanup of #9680

This fixes two bugs in np.random.zipf.

  • Possible invalid cast of double to long.
    The current double-to-long cast in the zipf function depends on undefined behavior
    when the double is too big to fit in a long. This is potentially dangerous and makes
    the code fail with tools such as AddressSanitizer.
  • NaNs can slip through the input parameter validation, leading to an infinite loop.
    The validation checks are rewritten so that they work properly even for NaNs.

Checks are added to prevent overflow during the cast and to make sure we get the desired behavior.
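For orientation, here is a minimal sketch of the shape the fixed rejection loop takes (an illustration, not the exact merged code; next_double() is a stand-in for the generator's uniform draw, not numpy's API). The candidate X stays a double and is only cast to long once it is known to be in range, so out-of-range draws are simply rejected.

#include <limits.h>
#include <math.h>
#include <stdlib.h>

/* Stand-in uniform(0, 1) source; numpy uses its own generator. */
static double next_double(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Sketch of an overflow-safe Zipf sampler (Devroye-style rejection). */
static long zipf_sketch(double a)
{
    const double am1 = a - 1.0;
    const double b = pow(2.0, am1);
    /* 2^(bits-1) as a double; (double)LONG_MAX itself may round upward,
     * so it is not a safe upper bound for the cast. */
    const double long_limit = -(double)LONG_MIN;

    for (;;) {
        double U = 1.0 - next_double();
        double V = next_double();
        double X = floor(pow(U, -1.0 / am1));

        /* Casting a double that does not fit in a long is undefined
         * behavior; rejecting such draws keeps the cast safe and in
         * effect truncates the distribution at the largest long. */
        if (X >= long_limit || X < 1.0) {
            continue;
        }

        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b) {
            return (long)X;
        }
    }
}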
Member:

Can you move these declarations inside the loop now?

Member Author:

Yes, can do. I wonder why T is initialized before entry. I suspect an historical artifact ...

Member:

Does inverting this condition have any nan-related effect? What happens if a == 1 before and after this patch?

Member:

Or if you don't want to think about that, just leave it as before but replace the below break with continue, and return (long) X at the end of the loop.

Member Author:

Yes, it works better :) A nan will compare false and the loop will repeat. I don't think any of the distributions were coded with nans in mind, but I don't think we will get any nans here. In any case, theoretically b > 1, and the function computing b is well behaved when b is near 1, so I don't think it is a problem in practice either.

One could rewrite the condition without divisions, and I think that would be safe, but that is a bigger modification.
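For illustration, a sketch of that division-free form in the same variables as the sampling loop (not code from this PR). It follows from multiplying both sides by b*(b - 1.0), which is positive because b = 2^(a-1) > 1 whenever a > 1:

/* Division-free equivalent of the acceptance test, assuming b > 1 so that
 * multiplying both sides by b*(b - 1.0) preserves the inequality:
 *
 *     V*X*(T - 1.0)/(b - 1.0) <= T/b
 * <=> V*X*(T - 1.0)*b        <= T*(b - 1.0)
 */
if (V*X*(T - 1.0)*b <= T*(b - 1.0)) {
    return (long) X;
}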

Member Author:

Of course, the library functions could be buggy, and (b - 1.0) negative, in which case getting rid of the divisions would be an improvement, but rejection based algorithms are always uncertain to the degree that the functions used change between libraries and architectures.

eric-wieser (Member) commented Oct 6, 2017:

> A nan will compare false and the loop will repeat

My concern here is trapping the algorithm in an infinite loop on nan, where previously it would return (admittedly a bad value).

In fact, a == nan would cause that behaviour, right?

Can you just flip the condition to what it was before?

if (V*X*(T - 1.0)/(b - 1.0) > T/b) {
    continue;
}
return (long) X;

Member:

Hmm, seems it's already broken on np.nan in 1.12 anyway (#9829), so I guess this is out of scope.

Member Author:

The validation of a is in the calling routine, so that is where the NaN check should be. I can fix that. I wouldn't be surprised if other routines can also fail with NaN. Long loops for a slightly bigger than 1 indicate a faulty algorithm; rejection should not happen that often in a good algorithm.
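A minimal sketch of the kind of NaN-safe check meant here (illustrative C, not the actual Cython validation in mtrand): requiring the parameter to compare greater than 1 makes NaN fail, because every comparison involving NaN is false.

#include <math.h>
#include <stdio.h>

/* Illustrative check for zipf's parameter `a`.
 * The old-style test `if (a <= 1.0) reject;` lets NaN through, because
 * NaN <= 1.0 is false. Requiring `a > 1.0` to accept rejects NaN too. */
static int zipf_param_ok(double a)
{
    return a > 1.0;
}

int main(void)
{
    printf("%d\n", zipf_param_ok(1.5));      /* 1: accepted */
    printf("%d\n", zipf_param_ok(0.5));      /* 0: rejected */
    printf("%d\n", zipf_param_ok(nan("")));  /* 0: NaN rejected */
    return 0;
}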

charris (Member Author) commented Oct 6, 2017

Checked for NaN input. Can put that in a separate PR if that seems better.

charris changed the title from "BUG: Fix possibly undefined cast of double -> long." to "BUG: Fixes for np.random.zipf" on Oct 6, 2017
charris (Member Author) commented Oct 6, 2017

Apparently there are better algorithms out there that allow a > 0, although I don't know how well they perform near the singularity. I'm thinking that near the singularity the probability that X > max_long, where X is the generated sample, must approach 1, so the problem is inherent. We should probably document that and put a limit on the number of loops so that an error will be raised on failure.

charris (Member Author) commented Oct 6, 2017

However, the algorithm is already very slow for a=1.000001

In [2]: %timeit a = np.random.zipf(1.000001, size=(1,))
1000 loops, best of 3: 1.55 ms per loop

In [3]: %timeit a = np.random.random(size=(1,))
1000000 loops, best of 3: 548 ns per loop

charris (Member Author) commented Oct 6, 2017

The distribution is very long tailed, so looks like some implementations limit the range of outputs. That likely speeds up the computation significantly.

charris (Member Author) commented Oct 6, 2017

Yeah, even limiting it to the size of long should give a big speedup, maybe by ~1e4 for int64. Unfortunately, the size of long varies with platform. I think this function could be greatly improved by adding a truncation point, and maybe a dtype argument whose maximum sets the truncation point.

charris (Member Author) commented Oct 6, 2017

And the full int64 range cannot in general yield correct results, because the double arithmetic used has only 53 bits of precision, hence there are holes in the range of the returned integers.
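As a concrete illustration of those holes (an aside, not part of this patch): above 2^53, consecutive doubles are more than 1 apart, so some int64 values can never be produced by casting a double.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Every integer up to 2^53 has an exact double representation;
     * 2^53 + 1 does not. The literal below rounds back to 2^53, and the
     * next representable double is 2^53 + 2, so the integer 2^53 + 1 can
     * never come out of a double-valued sampler. */
    double x = 9007199254740993.0;            /* intended: 2^53 + 1 */
    printf("%.0f\n", x);                      /* prints 9007199254740992 */
    printf("%.0f\n", nextafter(x, INFINITY)); /* prints 9007199254740994 */
    return 0;
}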

charris (Member Author) commented Oct 6, 2017

Here is a timing for truncated zipf (int64) done better.

In [1]: %timeit a = np.random.zipf(1.000001, size=(1,))
1000000 loops, best of 3: 935 ns per loop

A ~1657× speedup.

eric-wieser merged commit bb5d666 into numpy:master on Oct 10, 2017
eric-wieser (Member):

Wherever fixing zipf leads, this is clearly an improvement as is.
