Naive Bayes classifier. Currently 3 types of NB are supported:

**MultinomialNB**- Assumes variables have a multinomial distribution. Good for text classification. See`examples/nums.jl`

for usage.**GaussianNB**- Assumes variables have a multivariate normal distribution. Good for real-valued data. See`examples/iris.jl`

for usage.**HybridNB**- A hybrid empirical naive Bayes model for a mixture of continuous and discrete features. The continuous features are estimated using Kernel Density Estimation.*Note*: fit/predict methods take`Dict{Symbol/AstractString, Vector}`

rather than a`Matrix`

. Also, discrete features must be integers while continuous features must be floats. If all features are continuous`Matrix`

input is supported.

Since `GaussianNB`

models multivariate distribution, it's not really a "naive" classifier (i.e. no independence assumption is made), so the name may change in the future.

As a subproduct, this package also provides a `DataStats`

type that may be used for incremental calculation of common data statistics such as mean and covariance matrix. See `test/datastatstest.jl`

for a usage example.

Continuous and discrete features as

`Dict{Symbol, Vector}}`

`f_c1 = randn(10) f_c2 = randn(10) f_d1 = rand(1:5, 10) f_d2 = randn(3:7, 10) training_features_continuous = Dict{Symbol, Vector{Float64}}(:c1=>f_c1, :c2=>f_c2) training_features_discrete = Dict{Symbol, Vector{Int}}(:d1=>f_d1, :d2=>f_d2) #discrete features as Int64 hybrid_model = HybridNB(labels) # train the model fit(hybrid_model, training_features_continuous, training_features_discrete, labels) # predict the classification for new events (points): features_c, features_d y = predict(hybrid_model, features_c, features_d)`

Alternatively one can skip declaring the model and train it directly:

`model = train(HybridNB, training_features_continuous, training_features_discrete, labels) y = predict(hybrid_model, features_c, features_d)`

Continuous features only as a

`Matrix`

`X_train = randn(3,400); X_classify = randn(3,10) hybrid_model = HybridNB(labels) # the number of discrete features is 0 so it's not needed fit(hybrid_model, X_train, labels) y = predict(hybrid_model, X_classify)`

Continuous and discrete features as a

`Matrix{Float}`

`#X is a matrix of features # the first 3 rows are continuous training_features_continuous = restructure_matrix(X[1:3, :]) # the last 2 rows are discrete and must be integers training_features_discrete = map(Int, restructure_matrix(X[4:5, :])) # train the model hybrid_model = train(HybridNB, training_features_continuous, training_features_discrete, labels) # predict the classification for new events (points): features_c, features_d y = predict(hybrid_model, features_c, features_d)`

It is useful to train a model once and then use it for prediction many times later. For example, train your classifier on a local machine and then use it on a cluster to classify points in parallel.

There is support for writing `HybridNB`

models to HDF5 files via the methods `write_model`

and `load_model`

. This is useful for interacting with other programs/languages. If the model file is going to be read only in Julia it is easier to use **JLD.jl** for saving and loading the file.

12/07/2014

22 days ago

103 commits