OnlineStats

Online algorithms for statistics.

First Commit: 02/04/2015

Last Touched: 18 days ago

Commit Count: 1289 commits

OnlineStats is a Julia package which provides online algorithms for statistical models. Online algorithms are well suited for streaming data or when data is too large to hold in memory. Observations are processed one at a time and all algorithms use O(1) memory.


Readme Contents

  1. What Can OnlineStats Do?
  2. Basics
  3. Weighting
  4. Series
  5. Merging
  6. Callbacks
  7. Low Level Details

What Can OnlineStats Do?

| Statistic/Model | OnlineStat |
|:--|:--|
| **Univariate Statistics:** | |
| mean | `Mean` |
| variance | `Variance` |
| quantiles via SGD | `QuantileSGD` |
| quantiles via Online MM | `QuantileMM` |
| max and min | `Extrema` |
| skewness and kurtosis | `Moments` |
| sum | `Sum` |
| difference | `Diff` |
| **Multivariate Analysis:** | |
| covariance matrix | `CovMatrix` |
| k-means clustering | `KMeans` |
| multiple univariate statistics | `MV{<:OnlineStat}` |
| **Density Estimation:** | |
| Gaussian mixture | `NormalMix` |
| Beta | `FitBeta` |
| Categorical | `FitCategorical` |
| Cauchy | `FitCauchy` |
| Gamma | `FitGamma` |
| LogNormal | `FitLogNormal` |
| Normal | `FitNormal` |
| Multinomial | `FitMultinomial` |
| MvNormal | `FitMvNormal` |
| **Statistical Learning:** | |
| GLMs with regularization | `StatLearn` |
| Linear (also Ridge) regression | `LinReg` |
| **Other:** | |
| Bootstrapping | `Bootstrap` |
| approximate count of distinct elements | `HyperLogLog` |

Basics

Every OnlineStat is a type

m = Mean()
v = Variance()

OnlineStats are grouped by Series

s = Series(m, v)

Updating a Series updates the OnlineStats

y = randn(100)

for yi in y
    fit!(s, yi)
end

# or more simply:
fit!(s, y)
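
To make the "one observation at a time, O(1) memory" idea concrete, here is a minimal sketch of a running mean and variance in plain Julia using Welford's algorithm. It does not use OnlineStats and is not claimed to be how `Mean` or `Variance` are implemented; the names `RunningStats`, `update!`, and `unbiased_var` are hypothetical.

```julia
# Hypothetical tracker (not part of OnlineStats): three numbers of state,
# regardless of how many observations have been seen.
mutable struct RunningStats
    n::Int          # number of observations so far
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's update: process a single observation, then discard it.
function update!(s::RunningStats, x::Real)
    s.n += 1
    δ = x - s.mean
    s.mean += δ / s.n
    s.m2 += δ * (x - s.mean)
    return s
end

# Unbiased sample variance, computed from the stored state on demand.
unbiased_var(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningStats()
for x in (2.0, 4.0, 4.0, 6.0)
    update!(s, x)
end
```

After the loop, `s.mean` and `unbiased_var(s)` agree with the batch mean and variance of the four values, although the values themselves were never stored.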

Weighting

Series are parameterized by a Weight type that controls the influence the next observation has on the OnlineStats contained in the Series.

s = Series(EqualWeight(), Mean())

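Consider how weights affect the influence the next observation has on an online mean. Many OnlineStats have an update which takes this form, where θ(t) is the estimate after update t, x(t) is the new observation, and γ(t) is the weight:

    θ(t) = (1 - γ(t)) * θ(t-1) + γ(t) * x(t)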

| Constructor | Weight at Update `t` |
|:--|:--|
| `EqualWeight()` | `γ(t) = 1 / t` |
| `ExponentialWeight(λ)` | `γ(t) = λ` |
| `BoundedEqualWeight(λ)` | `γ(t) = max(1 / t, λ)` |
| `LearningRate(r, λ)` | `γ(t) = max(1 / t ^ r, λ)` |
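
The weight rules in the table can be tried out with a small sketch in plain Julia. It assumes the update is the convex combination θ(t) = (1 - γ(t)) * θ(t-1) + γ(t) * x(t) and starts from θ(0) = 0. The function names below are hypothetical stand-ins, not the package's `Weight` types.

```julia
# Each rule maps the update count t to a weight γ(t), mirroring the table.
equal_weight(t) = 1 / t
exponential_weight(λ) = t -> λ
bounded_equal_weight(λ) = t -> max(1 / t, λ)
learning_rate(r, λ) = t -> max(1 / t^r, λ)

# Assumed update form: θ(t) = (1 - γ(t)) * θ(t-1) + γ(t) * x(t)
function weighted_mean(xs, γ)
    θ = 0.0
    for (t, x) in enumerate(xs)
        θ = (1 - γ(t)) * θ + γ(t) * x
    end
    return θ
end

xs = [1.0, 2.0, 3.0, 4.0]
m_equal = weighted_mean(xs, equal_weight)             # ordinary mean of xs
m_exp = weighted_mean(xs, exponential_weight(0.5))    # recent values dominate
```

With γ(t) = 1 / t every observation ends up with the same influence, so `m_equal` is the usual mean. With a constant weight, older observations are discounted geometrically, so `m_exp` sits closer to the most recent values.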

Series

The Series type is the workhorse of OnlineStats. A Series tracks

  1. The Weight
  2. An OnlineStat or tuple of OnlineStats.

Creating a Series

Series(Mean())
Series(Mean(), Variance())

Series(ExponentialWeight(), Mean())
Series(ExponentialWeight(), Mean(), Variance())

y = randn(100)

Series(y, Mean())
Series(y, Mean(), Variance())

Series(y, ExponentialWeight(.01), Mean())
Series(y, ExponentialWeight(.01), Mean(), Variance())

Updating a Series

There are multiple ways to update the OnlineStats in a Series:

- Single observation
  - Note: a single observation is a vector for OnlineStats such as `CovMatrix`.

```julia
s = Series(Mean())
fit!(s, randn())

s = Series(CovMatrix(4))
fit!(s, randn(4))
fit!(s, randn(4))
```

- Single observation, override weight

```julia
s = Series(Mean())
fit!(s, randn(), rand())
```

- Multiple observations
  - Note: multiple observations are a matrix for OnlineStats such as `CovMatrix`. By default, each *row* is considered an observation. However, there are `fit!` methods which use observations in *columns*.

```julia
s = Series(Mean())
fit!(s, randn(100))

s = Series(CovMatrix(4))
fit!(s, randn(100, 4))                  # Observations in rows
fit!(s, randn(4, 100), ObsDim.Last())   # Observations in columns
```

- Multiple observations, use the same weight for all

```julia
s = Series(Mean())
fit!(s, randn(100), .01)
```

- Multiple observations, provide vector of weights

```julia
s = Series(Mean())
fit!(s, randn(100), rand(100))
```

- Multiple observations, update in minibatches
  - OnlineStats which use stochastic approximation (`QuantileSGD`, `QuantileMM`, `KMeans`, etc.) behave differently when they are updated in minibatches.

```julia
s = Series(QuantileSGD())
fit!(s, randn(1000), 7)
```
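
One way to see why minibatches change the result for stochastic-approximation methods is a hand-written SGD quantile update in plain Julia. This is a sketch under assumptions, not the `QuantileSGD` implementation: the step size `γ` is held constant, the estimate starts at 0, and the helper names are hypothetical. A minibatch of `k` observations produces one averaged step instead of `k` separate steps.

```julia
# Hypothetical SGD step for the τ-quantile: average the per-observation
# subgradient (x < q) - τ over the batch, then take a single step.
function sgd_quantile_step(q, batch, τ, γ)
    g = sum((x < q) - τ for x in batch) / length(batch)
    return q - γ * g
end

# Run over the data in consecutive batches of the given size.
function run_sgd(data, τ, γ, batchsize)
    q = 0.0
    for i in 1:batchsize:length(data)
        batch = data[i:min(i + batchsize - 1, end)]
        q = sgd_quantile_step(q, batch, τ, γ)
    end
    return q
end

data = [1.0, 2.0, 3.0, 4.0]
q_single = run_sgd(data, 0.5, 0.1, 1)   # four steps, one per observation
q_batch = run_sgd(data, 0.5, 0.1, 4)    # one step from the whole batch
```

The two runs see the same data but end at different estimates, because the batched run moves only once.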


# Merging

Two Series can be merged if they track the same OnlineStats and those OnlineStats are mergeable. The syntax for in-place merging is

```julia
merge!(series1, series2, arg)
```

where `series1` and `series2` are Series that contain the same OnlineStats and `arg` determines how `series2` is merged into `series1`.

```julia
using OnlineStats

y1 = randn(100)
y2 = randn(100)

s1 = Series(y1, Mean(), Variance())
s2 = Series(y2, Mean(), Variance())

# Treat s2 as a new batch of data. Essentially:
# s1 = Series(Mean(), Variance()); fit!(s1, y1); fit!(s1, y2)
merge!(s1, s2, :append)

# Use weighted average based on nobs of each Series
merge!(s1, s2, :mean)

# Treat s2 as a single observation.
merge!(s1, s2, :singleton)

# Provide the ratio of influence s2 should have.
merge!(s1, s2, .5)
```
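
To illustrate what "mergeable" can mean for a mean and variance, the following plain-Julia sketch combines two batch summaries so that the result matches a single pass over all the data. It uses the standard pairwise combination formula and is only an illustration; it is not claimed to be what `merge!` does internally, and `Summary`, `summarize`, and `combine` are hypothetical names.

```julia
# Hypothetical batch summary: count, mean, and sum of squared deviations.
struct Summary
    n::Int
    mean::Float64
    m2::Float64
end

function summarize(xs)
    n = length(xs)
    μ = sum(xs) / n
    return Summary(n, μ, sum((x - μ)^2 for x in xs))
end

# Combine two summaries as if one tracker had seen both batches.
function combine(a::Summary, b::Summary)
    n = a.n + b.n
    δ = b.mean - a.mean
    μ = a.mean + δ * b.n / n
    m2 = a.m2 + b.m2 + δ^2 * a.n * b.n / n
    return Summary(n, μ, m2)
end

y1 = [1.0, 2.0, 3.0]
y2 = [4.0, 5.0, 6.0, 7.0]
merged = combine(summarize(y1), summarize(y2))
whole = summarize(vcat(y1, y2))
```

Here `merged` and `whole` carry the same count, mean, and squared-deviation sum, which is the property that lets separately computed pieces be joined without revisiting the data.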


# Callbacks

While an OnlineStat is being updated, you may wish to perform an action such as printing intermediate results to a log file or updating a plot. For this purpose, OnlineStats exports a [`maprows`](doc/api.md#maprows) function.

`maprows(f::Function, b::Integer, data...)`

`maprows` works similarly to `Base.mapslices`, but maps `b` rows at a time. It is best used with Julia's do block syntax.
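
The row-batching behavior described above can be imitated with a few lines of plain Julia. The helper below is a hypothetical stand-in written for illustration (it handles a single vector or matrix and is not the package's `maprows`); it shows how a function taken as the first argument pairs with do block syntax.

```julia
# Hypothetical helper: call f on successive blocks of b rows of data.
# The last block may be shorter when b does not divide the row count.
function each_row_block(f, b::Integer, data::AbstractVecOrMat)
    n = size(data, 1)
    for start in 1:b:n
        rows = start:min(start + b - 1, n)
        f(selectdim(data, 1, rows))   # view of the selected rows
    end
    return nothing
end

y = collect(1.0:10.0)
block_sums = Float64[]
each_row_block(4, y) do block
    push!(block_sums, sum(block))     # any per-block action could go here
end
```

With ten rows and `b = 4`, the body runs three times, on blocks of 4, 4, and 2 rows.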

### Example 1
- Input

```julia
y = randn(100)
s = Series(Mean())
maprows(20, y) do yi
    fit!(s, yi)
    info("value of mean is $(value(s))")
end
```

- Output

```
INFO: value of mean is 0.06340121912925167
INFO: value of mean is -0.06576995293439102
INFO: value of mean is 0.05374292238752276
INFO: value of mean is 0.008857939006120167
INFO: value of mean is 0.016199508928045905
```

Low Level Details

OnlineStat{I, O}

  • The abstract type OnlineStat has two parameters:
    • I: The input dimension. The size of one observation
    • O: The output dimension/object. The size/object of value
  • A Series can only manage OnlineStats that share the same input type I. This is because when you call a method like fit!(s, randn(100)), the Series needs to know whether randn(100) should be treated as 100 scalar observations or a single vector observation.

fit! and value

  • fit! updates the "sufficient statistics" of an OnlineStat, but does not necessarily update the parameter of interest.
  • value creates the parameter of interest from the "sufficient statistics"
  • This is the convention in order to avoid extra computation costs when the value is not needed while updating a chunk of data.
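
The split between updating and reading a value can be sketched in plain Julia as follows. This is an illustration under assumptions, not OnlineStats code: `StdTracker`, `observe!`, and `current_value` are hypothetical names, and the sum-of-squares formula is chosen for brevity rather than numerical stability.

```julia
# Hypothetical tracker: stores only sufficient statistics for a standard deviation.
mutable struct StdTracker
    n::Int
    total::Float64   # running sum of observations
    sumsq::Float64   # running sum of squared observations
end
StdTracker() = StdTracker(0, 0.0, 0.0)

# Cheap update: touches the sufficient statistics, computes nothing else.
function observe!(s::StdTracker, x::Real)
    s.n += 1
    s.total += x
    s.sumsq += x^2
    return s
end

# The parameter of interest (population standard deviation) is built on demand.
function current_value(s::StdTracker)
    μ = s.total / s.n
    return sqrt(max(s.sumsq / s.n - μ^2, 0.0))
end

t = StdTracker()
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    observe!(t, x)
end
sd = current_value(t)
```

The square root is evaluated once, after the loop, rather than after each of the eight updates.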
