BUG: Fixes for np.random.zipf#9824
Conversation
Current double to long casting in the zipf function depends on non-standardized behavior when the double is too big to fit in a long. This is potentially dangerous and makes the code fail with tools such as AddressSanitizer. Checks are added here to prevent overflow during casting and make sure we get the desired behavior.
numpy/random/mtrand/distributions.c
Outdated
There was a problem hiding this comment.
Can you move these declarations inside the loop now?
There was a problem hiding this comment.
Yes, can do. I wonder why T is initialized before entry. I suspect an historical artifact ...
numpy/random/mtrand/distributions.c
Outdated
There was a problem hiding this comment.
Does inverting this condition have any nan-related effect? What happens if a == 1 before and after this patch?
There was a problem hiding this comment.
Or if you don't want to think about that, just leave it as before but replace the below break with continue, and return (long) X at the end of the loop.
There was a problem hiding this comment.
Yes, it works better :) A nan will compare false and the loop will repeat. I don't think any of the distributions were coded with nans in mind, but I don't think we will get any nans here. In any case, theoretically b > 1, and the function computing b is well behaved when b is near 1, so I don't think it is a problem in practice either.
One could rewrite the condition without divisions, and I think that would be safe, but that is a bigger modification.
There was a problem hiding this comment.
Of course, the library functions could be buggy, and (b - 1.0) negative, in which case getting rid of the divisions would be an improvement, but rejection based algorithms are always uncertain to the degree that the functions used change between libraries and architectures.
There was a problem hiding this comment.
A nan will compare false and the loop will repeat
My concern here is trapping the algorithm in an infinite loop on nan, where previously it would return (admittedly a bad value).
In fact, a == nan would cause that behaviour, right?
Can you just flip the condition to what it was before?
if (V*X*(T - 1.0)/(b - 1.0) > T/b) {
continue;
}
return (long) X;
There was a problem hiding this comment.
Hmm, seems it's already broken on np.nan in 1.12 anyway(#9829), so I guess this is out of scope
There was a problem hiding this comment.
The validation of a is in the calling routine, so that is where the nan check should be. I can fix that. I wouldn't be surprised if other routines can also fail with nan. Long loops for a slightly bigger than 1 indicates a faulty algorithm. Rejection should not happen that often in a good algorithm.
Avoids infinite loop.
|
Checked for NaN input. Can put that in separate PR is the seems better. |
|
Apparently there are better algorithms out there that allow |
|
However, the algorithm is already very slow for |
|
The distribution is very long tailed, so looks like some implementations limit the range of outputs. That likely speeds up the computation significantly. |
|
Yeah, even limiting it to the size of long should give a big speedup, maybe by ~1e4 for int64. Unfortunately, size of long varies with platform. I think this function could be greatly improved by adding a truncation point, and maybe a dtype with truncation point set by maximum size. |
|
And int64 range cannot in general yield correct results because the double arithmetic used only has 53 bits precision, hence there are holes in the range of the returned integers. |
|
Here is a timing for truncated zipf (int64) done better. x1657 speedup. |
|
Wherever fixing |
Cleanup of #9680
This fixes two bugs in
np.random.zipf.Current double to long casting in the zipf function depends on undefined behavior
when the double is too big to fit in a long. This is potentially dangerous and makes the
code fail with tools such as AddressSanitizer.
The validation checks are rewritten so that they work properly for even for NaNs