
BUG: Fixes for np.random.zipf #9824

Merged: eric-wieser merged 3 commits into numpy:master from charris:gh-9680 on Oct 10, 2017

Conversation

charris (Member) commented Oct 4, 2017

Cleanup of #9680

This fixes two bugs in np.random.zipf.

  • Possible invalid cast of double to long.
    The current double-to-long cast in the zipf function depends on undefined behavior
    when the double is too big to fit in a long. This is potentially dangerous and makes
    the code fail with tools such as AddressSanitizer.
  • NaNs can slip through the input parameter validation, leading to an infinite loop.
    The validation checks are rewritten so that they work properly even for NaNs.

Checks are added to prevent overflow during the cast and to make sure we get the desired behavior.
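For orientation, here is a minimal sketch of the shape the fixed rejection loop takes (an illustration, not the exact merged code; next_double() is a stand-in for the generator's uniform draw, not numpy's API). The candidate X stays a double and is only cast to long once it is known to be in range, so out-of-range draws are simply rejected.

#include <limits.h>
#include <math.h>
#include <stdlib.h>

/* Stand-in uniform(0, 1) source; numpy uses its own generator. */
static double next_double(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Sketch of an overflow-safe Zipf sampler (Devroye-style rejection). */
static long zipf_sketch(double a)
{
    const double am1 = a - 1.0;
    const double b = pow(2.0, am1);
    /* 2^(bits-1) as a double; (double)LONG_MAX itself may round upward,
     * so it is not a safe upper bound for the cast. */
    const double long_limit = -(double)LONG_MIN;

    for (;;) {
        double U = 1.0 - next_double();
        double V = next_double();
        double X = floor(pow(U, -1.0 / am1));

        /* Casting a double that does not fit in a long is undefined
         * behavior; rejecting such draws keeps the cast safe and in
         * effect truncates the distribution at the largest long. */
        if (X >= long_limit || X < 1.0) {
            continue;
        }

        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b) {
            return (long)X;
        }
    }
}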
Member:

Can you move these declarations inside the loop now?

Member Author:

Yes, can do. I wonder why T is initialized before entry. I suspect an historical artifact ...

Member:

Does inverting this condition have any nan-related effect? What happens if a == 1 before and after this patch?

Member:

Or if you don't want to think about that, just leave it as before but replace the below break with continue, and return (long) X at the end of the loop.

Member Author:

Yes, it works better :) A nan will compare false and the loop will repeat. I don't think any of the distributions were coded with nans in mind, but I don't think we will get any nans here. In any case, theoretically b > 1, and the function computing b is well behaved when b is near 1, so I don't think it is a problem in practice either.

One could rewrite the condition without divisions, and I think that would be safe, but that is a bigger modification.
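For illustration, a sketch of that division-free form in the same variables as the sampling loop (not code from this PR). It follows from multiplying both sides by b*(b - 1.0), which is positive because b = 2^(a-1) > 1 whenever a > 1:

/* Division-free equivalent of the acceptance test, assuming b > 1 so that
 * multiplying both sides by b*(b - 1.0) preserves the inequality:
 *
 *     V*X*(T - 1.0)/(b - 1.0) <= T/b
 * <=> V*X*(T - 1.0)*b        <= T*(b - 1.0)
 */
if (V*X*(T - 1.0)*b <= T*(b - 1.0)) {
    return (long) X;
}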

Member Author:

Of course, the library functions could be buggy, and (b - 1.0) negative, in which case getting rid of the divisions would be an improvement, but rejection based algorithms are always uncertain to the degree that the functions used change between libraries and architectures.

eric-wieser (Member) commented Oct 6, 2017:

> A nan will compare false and the loop will repeat

My concern here is trapping the algorithm in an infinite loop on nan, where previously it would return (admittedly a bad value).

In fact, a == nan would cause that behaviour, right?

Can you just flip the condition to what it was before?

if (V*X*(T - 1.0)/(b - 1.0) > T/b) {
    continue;
}
return (long) X;

Member:

Hmm, seems it's already broken on np.nan in 1.12 anyway (#9829), so I guess this is out of scope.

Member Author:

The validation of a is in the calling routine, so that is where the NaN check should be. I can fix that. I wouldn't be surprised if other routines can also fail with NaN. Long loops for a slightly bigger than 1 indicate a faulty algorithm; rejection should not happen that often in a good algorithm.
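A minimal sketch of the kind of NaN-safe check meant here (illustrative C, not the actual Cython validation in mtrand): requiring the parameter to compare greater than 1 makes NaN fail, because every comparison involving NaN is false.

#include <math.h>
#include <stdio.h>

/* Illustrative check for zipf's parameter `a`.
 * The old-style test `if (a <= 1.0) reject;` lets NaN through, because
 * NaN <= 1.0 is false. Requiring `a > 1.0` to accept rejects NaN too. */
static int zipf_param_ok(double a)
{
    return a > 1.0;
}

int main(void)
{
    printf("%d\n", zipf_param_ok(1.5));      /* 1: accepted */
    printf("%d\n", zipf_param_ok(0.5));      /* 0: rejected */
    printf("%d\n", zipf_param_ok(nan("")));  /* 0: NaN rejected */
    return 0;
}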

charris (Member Author) commented Oct 6, 2017

Checked for NaN input. Can put that in a separate PR if that seems better.

charris changed the title from "BUG: Fix possibly undefined cast of double -> long." to "BUG: Fixes for np.random.zipf" on Oct 6, 2017
charris (Member Author) commented Oct 6, 2017

Apparently there are better algorithms out there that allow a > 0, although I don't know how well they perform near the singularity. I'm thinking that near the singularity the probability that X > max_long, where X is the generated sample, must approach 1, so the problem is inherent. We should probably document that and put a limit on the number of loops so that an error will be raised on failure.

charris (Member Author) commented Oct 6, 2017

However, the algorithm is already very slow for a=1.000001

In [2]: %timeit a = np.random.zipf(1.000001, size=(1,))
1000 loops, best of 3: 1.55 ms per loop

In [3]: %timeit a = np.random.random(size=(1,))
1000000 loops, best of 3: 548 ns per loop

charris (Member Author) commented Oct 6, 2017

The distribution is very long tailed, so looks like some implementations limit the range of outputs. That likely speeds up the computation significantly.

charris (Member Author) commented Oct 6, 2017

Yeah, even limiting it to the size of long should give a big speedup, maybe by ~1e4 for int64. Unfortunately, the size of long varies with platform. I think this function could be greatly improved by adding a truncation point, and maybe a dtype argument whose maximum sets the truncation point.

charris (Member Author) commented Oct 6, 2017

And the full int64 range cannot in general yield correct results, because the double arithmetic used has only 53 bits of precision, hence there are holes in the range of the returned integers.
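As a concrete illustration of those holes (an aside, not part of this patch): above 2^53, consecutive doubles are more than 1 apart, so some int64 values can never be produced by casting a double.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Every integer up to 2^53 has an exact double representation;
     * 2^53 + 1 does not. The literal below rounds back to 2^53, and the
     * next representable double is 2^53 + 2, so the integer 2^53 + 1 can
     * never come out of a double-valued sampler. */
    double x = 9007199254740993.0;            /* intended: 2^53 + 1 */
    printf("%.0f\n", x);                      /* prints 9007199254740992 */
    printf("%.0f\n", nextafter(x, INFINITY)); /* prints 9007199254740994 */
    return 0;
}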

charris (Member Author) commented Oct 6, 2017

Here is a timing for truncated zipf (int64) done better.

In [1]: %timeit a = np.random.zipf(1.000001, size=(1,))
1000000 loops, best of 3: 935 ns per loop

A ~1657× speedup.

eric-wieser merged commit bb5d666 into numpy:master on Oct 10, 2017
eric-wieser (Member):

Wherever fixing zipf leads, this is clearly an improvement as is.
