This package contains many statistical recipes for concepts and types introduced in the JuliaStats organization, intended to be used with Plots.jl:
#Pkg.clone("email@example.com:JuliaPlots/StatPlots.jl.git")
using StatPlots
gr(size=(400,300))
DataFrames support allows passing DataFrame columns as symbols. Operations on DataFrame columns can be specified using quoted expressions, e.g.
using DataFrames
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
plot(df, :a, [:b :c])
scatter(df, :a, :b, markersize = :(4 * log(:c + 0.1)))
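To illustrate how a quoted expression such as :(4 * log(:c + 0.1)) can be turned into one value per row, here is a minimal sketch in plain Julia. It is not StatPlots' implementation: the names ALLOWED, evalrow, and evalcolumns are hypothetical, and a Dict of vectors stands in for a DataFrame. The one language fact it relies on is that :c inside a quote is stored as a QuoteNode, which the sketch treats as a column lookup.

```julia
# Functions the toy evaluator is willing to call (an assumption of this sketch).
const ALLOWED = Dict(:+ => +, :- => -, :* => *, :/ => /, :log => log)

# Evaluate a quoted expression for a single row.
# Numeric literals evaluate to themselves.
evalrow(x::Number, row) = x
# Inside a quote, `:c` appears as a QuoteNode; treat it as a column lookup.
evalrow(q::QuoteNode, row) = row[q.value]
# Function calls are evaluated recursively, restricted to ALLOWED functions.
function evalrow(ex::Expr, row)
    ex.head == :call || error("unsupported expression: $ex")
    f = ALLOWED[ex.args[1]]
    return f((evalrow(a, row) for a in ex.args[2:end])...)
end

# Hypothetical helper: apply the expression to every row of a column table.
function evalcolumns(ex, columns)
    n = length(first(values(columns)))
    return [evalrow(ex, Dict(k => v[i] for (k, v) in columns)) for i in 1:n]
end

# A Dict of equal-length vectors stands in for a DataFrame here.
columns = Dict(:b => [1.0, 2.0], :c => [0.9, 9.9])
sizes = evalcolumns(:(4 * log(:c + 0.1)), columns)
```

The resulting vector has one entry per row, which is the shape a per-point attribute like markersize needs.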
If you find an operation not supported by DataFrames, please open an issue. An alternative to the StatPlots syntax is the DataFramesMeta macro @with. Symbols that do not refer to DataFrame columns must be escaped with ^(), e.g.
using DataFramesMeta
@with(df, plot(:a, [:b :c], colour = ^([:red :blue])))
using RDatasets
iris = dataset("datasets","iris")
marginalhist(iris, :PetalLength, :PetalWidth)
M = randn(1000,4)
M[:,2] += 0.8sqrt(abs(M[:,1])) - 0.5M[:,3] + 5
M[:,3] -= 0.7M[:,1].^2 + 2
corrplot(M, label = ["x$i" for i=1:4])
import RDatasets
singers = RDatasets.dataset("lattice","singer")
violin(singers,:VoicePart,:Height,marker=(0.2,:blue,stroke(0)))
boxplot!(singers,:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))
using Distributions
plot(Normal(3,5), fill=(0, .5,:orange))
dist = Gamma(2)
scatter(dist, leg=false)
bar!(dist, func=cdf, alpha=0.3)
groupedbar(rand(10,3), bar_position = :stack, bar_width=0.7)
This is the default:
groupedbar(rand(10,3), bar_position = :dodge, bar_width=0.7)
The groupapply function splits the data according to the keyword argument group, then applies summarize to get the average and variability of a given analysis (density, cumulative, and local regression are supported so far, but you can also supply your own function). There are three ways to compute the average and variability:
compute_error = (:across, col_name), where the data is split according to column col_name before being summarized. compute_error = :across splits across all observations. The default summary is (mean, sem), but it can be changed to any pair of functions with the keyword summarize.
compute_error = (:bootstrap, n_samples), where n_samples fake datasets distributed like the real dataset are generated and then summarized (nonparametric bootstrapping). compute_error = :bootstrap defaults to compute_error = (:bootstrap, 1000). The default summary is (mean, std). This method works with any analysis but is computationally very expensive.
compute_error = :none, where no error is computed or displayed and the analysis is carried out normally.
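The first two options above can be sketched in plain Julia to show what "split, then summarize" and "resample, then summarize" mean. This is an illustration under stated assumptions, not the groupapply implementation: summarize_across and summarize_bootstrap are hypothetical names, sem is defined locally as the standard error of the mean, and the analysis is any function that reduces a vector to one number.

```julia
using Statistics  # mean, std
using Random      # seeded RNG so the bootstrap is reproducible

# Standard error of the mean: sample standard deviation over sqrt(n).
sem(x) = std(x) / sqrt(length(x))

# Hypothetical helper: split `data` by `labels`, run `analysis` on each subset,
# then summarize the per-subset results with a pair of functions.
function summarize_across(analysis, data, labels; summarize = (mean, sem))
    per_subset = [analysis(data[labels .== l]) for l in unique(labels)]
    return summarize[1](per_subset), summarize[2](per_subset)
end

# Hypothetical helper: nonparametric bootstrap. Resample the data with
# replacement `n_samples` times, run `analysis` on each fake dataset, and
# summarize the resulting estimates.
function summarize_bootstrap(analysis, data, n_samples;
                             summarize = (mean, std),
                             rng = MersenneTwister(1))
    n = length(data)
    estimates = [analysis(data[rand(rng, 1:n, n)]) for _ in 1:n_samples]
    return summarize[1](estimates), summarize[2](estimates)
end

# Made-up data: six scores from three schools.
scores  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
schools = ["a", "a", "b", "b", "c", "c"]

center, spread = summarize_across(mean, scores, schools)
boot_center, boot_spread = summarize_bootstrap(mean, scores, 200)
```

The bootstrap helper runs the analysis once per resampled dataset, which is why that option is described as computationally expensive.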
The local regression uses Loess.jl and the density plot uses KernelDensity.jl. When the x variable is categorical, these functions are computed by splitting the data across the x variable and then computing the density or average per bin. The choice of a continuous or discrete axis can be forced via axis_type = :continuous or axis_type = :discrete.
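For a categorical x variable, the per-bin computation described above can be sketched as follows. This is a minimal sketch with a hypothetical helper name (summarize_per_bin) and made-up data, not the package's code: each distinct x value is treated as one bin, the "density" is taken to be the share of observations in that bin, and the average is the mean of y within it.

```julia
using Statistics  # mean

# Hypothetical helper: treat each distinct value of a categorical `x` as a bin,
# then compute the share of observations and the value of `f` on `y` per bin.
function summarize_per_bin(f, x, y)
    bins = sort(unique(x))
    shares = [count(x .== b) / length(x) for b in bins]
    reduced = [f(y[x .== b]) for b in bins]
    return bins, shares, reduced
end

# Made-up data: one categorical variable and one numeric variable.
sex   = ["Male", "Female", "Male", "Female"]
score = [10.0, 12.0, 14.0, 16.0]

bins, shares, averages = summarize_per_bin(mean, sex, score)
```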
using DataFrames
import RDatasets
using StatPlots
gr()
school = RDatasets.dataset("mlmRev","Hsb82");
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path)
Keywords for loess or kerneldensity can be given to groupapply:
df = groupapply(:density, school, :CSES; bandwidth = 1., compute_error = (:bootstrap,500), group = :Minrty)
plot(df, line = :path)
The bar plot:
pool!(school, :Sx)
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)