02/04/2015
Online algorithms for statistics.
OnlineStats is a Julia package which provides online algorithms for statistical models. Online algorithms are well suited for streaming data or when data is too large to hold in memory. Observations are processed one at a time and all algorithms use O(1) memory.
Statistic/Model | OnlineStat |
---|---|
Univariate Statistics: | |
mean | Mean |
variance | Variance |
quantiles via SGD | QuantileSGD |
quantiles via Online MM | QuantileMM |
max and min | Extrema |
skewness and kurtosis | Moments |
sum | Sum |
difference | Diff |
Multivariate Analysis: | |
covariance matrix | CovMatrix |
k-means clustering | KMeans |
multiple univariate statistics | MV{<:OnlineStat} |
Density Estimation: | |
Gaussian mixture | NormalMix |
Beta | FitBeta |
Categorical | FitCategorical |
Cauchy | FitCauchy |
Gamma | FitGamma |
LogNormal | FitLogNormal |
Normal | FitNormal |
Multinomial | FitMultinomial |
MvNormal | FitMvNormal |
Statistical Learning: | |
GLMs with regularization | StatLearn |
Linear (also Ridge) regression | LinReg |
Other: | |
Bootstrapping | Bootstrap |
approximate count of distinct elements | HyperLogLog |
```julia
using OnlineStats

m = Mean()
v = Variance()
s = Series(m, v)

y = randn(100)

# update one observation at a time
for yi in y
    fit!(s, yi)
end

# or more simply:
fit!(s, y)
```
Series are parameterized by a `Weight` type that controls the influence the next observation has on the OnlineStats contained in the Series.
```julia
s = Series(EqualWeight(), Mean())
```
Consider how weights affect the influence the next observation has on an online mean. Many OnlineStats have an update which takes this form:
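For an online mean, a form consistent with the weights listed below is the convex combination shown here. The symbols are assumed notation: θ_t is the estimate after t observations, y_t is the t-th observation, and γ(t) is the weight.

$$
\theta_t = \bigl(1 - \gamma(t)\bigr)\,\theta_{t-1} + \gamma(t)\, y_t
$$

With γ(t) = 1 / t this reduces to the ordinary sample mean of the first t observations.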
Constructor | Weight at Update t |
---|---|
EqualWeight() | γ(t) = 1 / t |
ExponentialWeight(λ) | γ(t) = λ |
BoundedEqualWeight(λ) | γ(t) = max(1 / t, λ) |
LearningRate(r, λ) | γ(t) = max(1 / t ^ r, λ) |
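To get a feel for how these schedules behave, the sketch below writes them as plain Julia functions and applies them to a running mean. It does not use the package: the function names are hypothetical, and the update is assumed to have the convex-combination form θ ← (1 - γ)θ + γy, starting from θ = 0.

```julia
# Weight schedules from the table, as plain functions of the update count t.
# These are illustrative stand-ins, not the package's own Weight types.
equal_weight(t) = 1 / t
exponential_weight(t, λ) = λ
bounded_equal_weight(t, λ) = max(1 / t, λ)
learning_rate(t, r, λ) = max(1 / t^r, λ)

# Assumed update form: move the estimate toward the new observation by γ.
smooth(θ, y, γ) = (1 - γ) * θ + γ * y

# Run the update over a vector, asking `weight` for γ at each step.
function online_mean(y, weight)
    θ = 0.0
    for (t, yt) in enumerate(y)
        θ = smooth(θ, yt, weight(t))
    end
    return θ
end

y = [1.0, 2.0, 3.0, 4.0]
m_equal = online_mean(y, equal_weight)                     # agrees with the ordinary mean
m_exp   = online_mean(y, t -> exponential_weight(t, 0.5))  # recent values count more
```

With a constant weight the estimate leans toward the most recent observations, which is why `m_exp` ends up above `m_equal` for this increasing sequence.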
The `Series` type is the workhorse of OnlineStats. A Series tracks a `Weight` and one or more OnlineStats.
```julia
# Start empty
Series(Mean())
Series(Mean(), Variance())
Series(ExponentialWeight(), Mean())
Series(ExponentialWeight(), Mean(), Variance())

# Start with initial data
y = randn(100)
Series(y, Mean())
Series(y, Mean(), Variance())
Series(y, ExponentialWeight(.01), Mean())
Series(y, ExponentialWeight(.01), Mean(), Variance())
```
There are multiple ways to update the OnlineStats in a Series:
- Single observation
  - Note: a single observation is a vector for OnlineStats such as `CovMatrix`.
```julia
s = Series(Mean())
fit!(s, randn())

s = Series(CovMatrix(4))
fit!(s, randn(4))
fit!(s, randn(4))
```
- Single observation, override weight
```julia
s = Series(Mean())
fit!(s, randn(), rand())
```
- Multiple observations
  - Note: multiple observations are a matrix for OnlineStats such as `CovMatrix`. By default, each *row* is considered an observation. However, there are `fit!` methods which use observations in *columns*.
```julia
s = Series(Mean())
fit!(s, randn(100))

s = Series(CovMatrix(4))
fit!(s, randn(100, 4))                  # Observations in rows
fit!(s, randn(4, 100), ObsDim.Last())   # Observations in columns
```
- Multiple observations, use the same weight for all
```julia
s = Series(Mean())
fit!(s, randn(100), .01)
```
- Multiple observations, provide vector of weights
```julia
s = Series(Mean())
fit!(s, randn(100), rand(100))
```
- Multiple observations, update in minibatches
  - OnlineStats which use stochastic approximation (`QuantileSGD`, `QuantileMM`, `KMeans`, etc.) have different behavior if they are updated in minibatches.
```julia
s = Series(QuantileSGD())
fit!(s, randn(1000), 7)
```
# Merging
Two Series can be merged if they track the same OnlineStats and those OnlineStats are mergeable. The syntax for in-place merging is
```julia
merge!(series1, series2, arg)
```
where `series1`/`series2` are Series that contain the same OnlineStats and `arg` is used to determine how `series2` should be merged into `series1`.
```julia
using OnlineStats

y1 = randn(100)
y2 = randn(100)

s1 = Series(y1, Mean(), Variance())
s2 = Series(y2, Mean(), Variance())

merge!(s1, s2, :append)
merge!(s1, s2, :mean)
merge!(s1, s2, :singleton)
merge!(s1, s2, .5)
```
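As a rough intuition for what merging two running means involves, the sketch below combines two means in plain Julia. The helper `merge_means` and the weight `γ` are hypothetical, and this is not necessarily how `merge!` interprets its third argument. It only illustrates that a merged mean is a weighted average of the two inputs.

```julia
# Hypothetical helper: combine two means, putting weight γ on the second one.
merge_means(m1, m2, γ) = (1 - γ) * m1 + γ * m2

y1 = [1.0, 2.0, 3.0]
y2 = [10.0]
m1 = sum(y1) / length(y1)
m2 = sum(y2) / length(y2)
n1, n2 = length(y1), length(y2)

# Weighting by the share of observations recovers the mean of the pooled data.
pooled = merge_means(m1, m2, n2 / (n1 + n2))

# An explicit weight of 0.5 gives both sides equal influence regardless of size.
halfway = merge_means(m1, m2, 0.5)
```

The choice of weight is what decides how much the second set of data is allowed to move the first.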
# Callbacks
While an OnlineStat is being updated, you may wish to perform an action such as printing intermediate results to a log file or updating a plot. For this purpose, OnlineStats exports a [`maprows`](doc/api.md#maprows) function.
`maprows(f::Function, b::Integer, data...)`
`maprows` works similarly to `Base.mapslices`, but maps `b` rows at a time. It is best used with Julia's do-block syntax.
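The idea of mapping `b` rows at a time can be sketched for a single vector in plain Julia. `sketch_maprows` is a hypothetical stand-in written for illustration; the package's own `maprows` accepts more general `data...` and may handle the final partial batch differently.

```julia
# Hypothetical re-implementation for one vector: call f on consecutive
# batches of at most b elements.
function sketch_maprows(f, b::Integer, y::AbstractVector)
    n = length(y)
    for start in 1:b:n
        stop = min(start + b - 1, n)
        f(view(y, start:stop))   # pass a view to avoid copying the batch
    end
    return nothing
end

y = collect(1.0:100.0)
batch_means = Float64[]

# The do block becomes the first argument `f`.
sketch_maprows(20, y) do yi
    push!(batch_means, sum(yi) / length(yi))
end
```

Because the function argument comes first, the do block reads like a loop body that runs once per batch.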
### Example 1
- Input
```julia
y = randn(100)
s = Series(Mean())
maprows(20, y) do yi
    fit!(s, yi)
    info("value of mean is $(value(s))")
end
```
- Output
```
INFO: value of mean is 0.06340121912925167
INFO: value of mean is -0.06576995293439102
INFO: value of mean is 0.05374292238752276
INFO: value of mean is 0.008857939006120167
INFO: value of mean is 0.016199508928045905
```
# `OnlineStat{I, O}`
`OnlineStat` has two parameters:
- `I`: The input dimension. The size of one observation.
- `O`: The output dimension/object. The size/object of `value`.

All OnlineStats in a Series must share the same input dimension `I`. This is because when you call a method like `fit!(s, randn(100))`, the Series needs to know whether `randn(100)` should be treated as 100 scalar observations or a single vector observation.
# `fit!` and `value`
- `fit!` updates the "sufficient statistics" of an OnlineStat, but does not necessarily update the parameter of interest.
- `value` creates the parameter of interest from the "sufficient statistics".
- `value` is not needed while updating a chunk of data.
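
The split between updating sufficient statistics and computing the parameter of interest can be sketched in plain Julia. Everything here is hypothetical (`SketchVariance`, `sketch_fit!`, `sketch_value`) and is not the package's `Variance` implementation. It assumes the sufficient statistics are a count, a sum, and a sum of squares.

```julia
# Hypothetical stand-in for an OnlineStat that estimates a variance.
mutable struct SketchVariance
    n::Int             # number of observations seen
    total::Float64     # running sum
    total_sq::Float64  # running sum of squares
end
SketchVariance() = SketchVariance(0, 0.0, 0.0)

# Cheap update: only the sufficient statistics change.
function sketch_fit!(o::SketchVariance, y::Real)
    o.n += 1
    o.total += y
    o.total_sq += y^2
    return o
end

# Updating with a chunk never needs the variance itself.
function sketch_fit!(o::SketchVariance, ys::AbstractVector)
    for y in ys
        sketch_fit!(o, y)
    end
    return o
end

# The parameter of interest is derived only when requested.
function sketch_value(o::SketchVariance)
    μ = o.total / o.n
    return (o.total_sq - o.n * μ^2) / (o.n - 1)
end

o = sketch_fit!(SketchVariance(), [1.0, 2.0, 3.0, 4.0])
v = sketch_value(o)
```

The sum-of-squares form is chosen here for brevity. It can lose precision when the mean is large relative to the spread, so a production implementation would likely use a more stable recurrence.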