Histogram by LukeMathWalker · Pull Request #8 · rust-ndarray/ndarray-stats

LukeMathWalker · 2018-10-03T11:10:44Z

It took me a while, but I managed to pull together a good candidate for histogram-related functionalities (histogram, histogram2d and histogramdd in Numpy).

Bins are formalized in their own submodule, bins:

Bin1d, basic building block, reuses Range* structs from the standard library;
Bins1d, collection of Bin1d;
BinNd, n-dimensional bin. It supports any region in an n-dimensional space that can be expressed as a product of intervals;
BinsNd, collection of n-dimensional bins.

With BinsNd we can express collections of bins that are more general than the ones supported by NumPy.

Two top-level traits are provided to implement an histogram method, for ArrayBase and one-dimensional ArrayBase. Right now the computational complexity is the same, apart from a slightly different API (point being ArrayBase<Data<Elem=T>> or a &T, but I plan to enhance Bin1d::contains to enable an implementation of Bins1d::find that uses binary search. This will probably require to enforce non-overlapping bins at creation time for Bins1d.

I am now going to implement all the different string options for bins in np.histogram_bin_edges as builders, but I wanted to get the PR out to get your feedback on the direction @jturner314

jturner314 · 2018-10-04T05:41:48Z

Thanks for working on this. Unfortunately, I'm not really convinced with the approach.

With BinsNd we can express collections of bins that are more general than the ones supported by NumPy.

What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this:

+---+-------+-+
|   |       | |
+---+-------+-+
|   |       | |
|   |       | |
|   |       | |
|   |       | |
+---+-------+-+

but generally not this:

+---+-------+-+
|   |       | |
|   +-------+-+
|   |         |
|   |         |
|   |         |
|   |         |
+---+-------+-+

or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use.

In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as edges = Vec<Array1<A>> where edges[0] is the array of edges for the first axis, edges[1] is the array of edges for the second axis, and edges[2] is the array of edges for the third axis. The total number of n-D bins represented this way is (edges[0].len() - 1) * (edges[1].len() - 1) * (edges[2].len() - 1).

I like this NumPy-style representation better than BinsNd for a few reasons:

The histogram counts can be represented as an n-D array. The BinsNd approach works best with a 1-D representation like HashMap (the current PR), BTreeMap or Vec, but this loses the shape information. For example, iterating along a particular axis is difficult. Even iterating over the all bins in a sensible order is difficult with HashMap. (I think you could handle the iteration over all bins in a good order by switching to a BTreeMap and carefully defining the ordering of BinsNd or switching to a Vec the same length as the Vec of bins, but it still would not be as convenient or flexible as an n-D array.)
It's a lot easier to plot a histogram given the edges along each axis instead of BinsNd. For example, consider the problem of determining the locations of the ticks along axis 1. With the NumPy-style representation, this is simply edges[1]. With the BinsNd representation it's not obvious. While we don't have any good plotting libraries in Rust at the moment, I think the tasks involved in plotting a histogram are fairly representative of the types of things you might want to do with a histogram.
The NumPy-style representation only allows axes to be split. It does not allow for "merged" bins (the second picture I provided above) or overlapping bins. This is more restrictive, but it also makes working with the NumPy-style representation simpler because you don't have to worry about those special cases.
The size in memory of BinsNd is O(n^ndim * ndim) where n is the number of bins along each axis (assuming the number of bins along each axis is the same). (There are n^ndim total bins, and each one has size ndim.) In contrast, the size in memory of the NumPy-style representation is O(n*ndim). (There are ndim axes, and each one has n+1 edges.) This is a substantial difference, even for a moderately-sized case like n = 30, ndim = 3. I suppose you could argue that the size in memory of the counts is O(n^ndim), so why worry about the bins having a similar big-O, but I think it still makes sense to keep the memory required for the bins small.
Possible performance differences (keep in mind these are just educated guesses without benchmarks):
- The number of Vecs in the BinsNd representation is 1 + n^ndim. (1 for the outer Vec and n^ndim Vecs for all of the bins.) The large number of Vecs is problematic if the memory allocator doesn't do a good job keeping them close together in memory. (If the Vecs are scattered across memory, caching doesn't work very well. Fortunately, IIUC jemalloc is pretty good at keeping allocations close together, so this may not actually be an issue in practice.) It would be possible to avoid this issue for fixed-dimension BinNd by using an alternative representation for BinNd (e.g. { lower_bounds: D, upper_bounds: D }). The NumPy-style representation doesn't have this issue at all, though; the number of Vecs is 1 + ndim. (1 for the outer Vec, and one for each axis.)
- Interestingly enough, if the bins are sorted, both representations have the same big-O cost of finding the correct bin with binary search (O(log(n^ndim)) = O(ndim * log(n))), but I would guess that the NumPy-style representation would be faster because it's more cache-friendly.

So, what I'd suggest is something like this:

/// Wrapper around `Array1` that makes sure the elements are in ascending order.
struct Edges<A: Ord> {
    edges: Array1<A>,
}

impl<A: Ord> From<Array1<A>> for Edges<A> {
    fn from(mut edges: Array1<A>) -> Self {
        // sort the array in-place
        Edges { edges }
    }
}

impl<A: Ord> Edges<A> {
    fn view(&self) -> ArrayView1<A> {
        self.edges.view()
    }

    /// Returns the index of the bin containing the given value,
    /// or `None` if none of the bins contain the value.
    fn bin_index(&self, value: &A) -> Option<usize> {
        // binary search for the correct bin
    }

    /// Returns the range of the bin containing the given value.
    fn bin_range(&self, value: &A) -> Option<Range<A>>
    where
        A: Clone,
    {
        let i = self.bin_index(value);
        Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() }
    }
}

struct HistogramCounts<A: Ord> {
    counts: ArrayD<usize>,
    edges: Vec<Edges<A>>,
}

struct HistogramDensity<A: Ord> {
    density: ArrayD<A>,
    edges: Vec<Edges<A>>,
}

impl<A: Ord> HistogramCounts<A> {
    pub fn new(edges: Vec<Edges<A>>) -> Self {
        let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>());
        HistogramCounts { counts, edges }
    }

    pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> {
        let bin = observation
            .iter()
            .zip(&self.edges)
            .map(|(v, e)| e.bin_index(v).ok_or(NotFound))
            .collect::<Result<Vec<_>, _>>()?;
        self.counts[bin] += 1;
    }
}

A few possible modifications:

An EdgesNd wrapper type for Vec<Edges<A>> might be nice to provide methods for finding the n-D index of the bin corresponding to a value and other similar operations. Alternatively, we could provide this functionality on the Histogram* types.
We could combine HistogramCounts and HistogramDensity into a single type with a parameter for the element type of the counts/density array. I'm pretty ambivalent about this, because I think a Histogram trait implemented for HistogramCounts and HistogramDensity would work just as well as a common Histogram struct.
We could add the dimensionality to the type of Edges/Histogram*. I think this is probably worth doing.
Instead of Edges containing Array1<A>, it could contain Vec<A> so that we could take advantage of the standard library's sorting and searching methods. (ndarray doesn't have an equivalent to the standard library's .unstable_sort() or .binary_search().)
I've used a requirement of A: Ord for simplicity, which would require using N64 (noisy-float's "not NaN" wrapper type) instead of f64, for example. We might come up with some other way to allow PartialOrd. This is similar to the problem we had with the quantile methods.

What are your thoughts?

LukeMathWalker · 2018-10-04T12:14:16Z

Well, I think is an understatement to say that I didn't think it through and that your comment makes total sense 😅

I wanted to allow arbitrary meshes, but we should definitely prioritize memory-consumption and performance with respect to the most common use cases. I'll start over from your skeleton!

LukeMathWalker added 30 commits October 1, 2018 10:07

Added new submodule

89198b1

Barebone trait HistogramExt added and exported

1d019f1

New module for bin functionalities

231d2bf

Skeleton of the bin system

bedd5d3

Bins have to be generic with respect to element type T

3bd8dc1

Added barebone of histogram trait, both 1d and Nd

979ae11

As far as I can go with this structure

f3f02e0

Getting rid of Bin and Bins traits, for now

d2b29d1

Added bounds to allow storage in HashMap

2806eaa

Let's ignore code duplication for now - histogram 1d implemented

28ef955

Restructured as submodules

dbc4993

Constructors have been implemented for BinNd and BinsNd

a3358ea

Implemented Display for Bins in order to show proper error messages

f6b88ad

ndim method has to be public

c3ab8a8

Added Clone bound

8d43d5e

Implemented find method. Added clone bound

eda210e

Added contains method to Bin1d

1ac42e4

Taking a reference, instead of moving value

0bea875

Implemented contains for BinNd

edfc413

Added contains method implementation for Bin1d

af62d85

Using std::ops::Range* to implement Bin1d as an enum

3f0aacc

Implemented find

bf9965e

Fixed docs

641ed57

Unglobed import

4ea33b7

Rexporting functions

e856951

Added example in docs

782fb1b

Added docstring to contains method

ac9d924

Test time!

be08f35

Testing Bins1d

50db577

Minimizing trait bounds

de277ed

LukeMathWalker added 25 commits October 3, 2018 09:11

Too much effort - let's use arrays

3e37267

Fixed doc

546dea5

Added error message if it panics

8325807

Adding IDE-related files to .gitignore

d2fa588

Fixed doc-test, added new test

ba80459

Added new test for BinNd - dimension mismatch in contains

f799c8b

Added test for contains w matching dimensions

955c04b

Rename test module

addd56f

Added test for BinsNd - empty vec to new

ac5520a

Fixed dim computation in BinsNd::new - added a test for it

e0a3cab

Remove fmt::Debug bound on T given that is not used anymore

03d79a1

Added new test for find with mismatched dimensions

85bab2f

Improved test

b61a6aa

Added new test for BinsNd::find

8306259

Histogram is now updated correctly

d597f15

Added a test for histogram

f42aa48

Fixed doc for HistogramNdExt and added a test for the panic case

2ac7357

Added docstring to Histogram1dExt::histogram

2758e01

Hyperlinks everywhere

7713089

Hyperlinks everywhere

018840e

Docstring for traits in histogram.mod

07803b8

Missing one / to show up in docs

3616a97

Encapsulated bins in bins module

aa46698

Fixed tests

2ec7af4

Added module docs

d2f4beb

LukeMathWalker mentioned this pull request Oct 18, 2018

Histogram (revisited) #9

Merged

LukeMathWalker closed this Nov 13, 2018

LukeMathWalker deleted the histogram branch November 18, 2018 16:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Histogram#8

Histogram#8
LukeMathWalker wants to merge 72 commits intorust-ndarray:masterfrom
LukeMathWalker:histogram

LukeMathWalker commented Oct 3, 2018

Uh oh!

jturner314 commented Oct 4, 2018

Uh oh!

LukeMathWalker commented Oct 4, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LukeMathWalker commented Oct 3, 2018

Uh oh!

jturner314 commented Oct 4, 2018

Uh oh!

LukeMathWalker commented Oct 4, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants