StatPlots

Statistical plotting recipes for Plots.jl

Primary author: Thomas Breloff (@tbreloff)

This package contains many statistical recipes for concepts and types introduced in the JuliaStats organization, intended to be used with Plots.jl:

  • Types:
    • DataFrames (for DataTables support, check out the DataTables branch)
    • Distributions
  • Recipes:
    • histogram/histogram2d
    • boxplot
    • violin
    • marginalhist
    • corrplot/cornerplot

Initialize:

#Pkg.clone("git@github.com:JuliaPlots/StatPlots.jl.git")
using StatPlots
gr(size=(400,300))

The DataFrames support allows passing DataFrame columns as symbols. Operations on DataFrame columns can be specified using quoted expressions, e.g.

using DataFrames
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
plot(df, :a, [:b :c])
scatter(df, :a, :b, markersize = :(4 * log(:c + 0.1)))

If you find an operation not supported by DataFrames, please open an issue. An alternative to the StatPlots syntax is the DataFramesMeta macro @with. Symbols that do not refer to DataFrame columns must be escaped with ^(), e.g.

using DataFramesMeta
@with(df, plot(:a, [:b :c], colour = ^([:red :blue])))

marginalhist with DataFrames

using RDatasets
iris = dataset("datasets","iris")
marginalhist(iris, :PetalLength, :PetalWidth)


corrplot and cornerplot

M = randn(1000,4)
M[:,2] .+= 0.8 .* sqrt.(abs.(M[:,1])) .- 0.5 .* M[:,3] .+ 5   # dotted operators broadcast the scalar offset
M[:,3] .-= 0.7 .* M[:,1].^2 .+ 2
corrplot(M, label = ["x$i" for i=1:4])

cornerplot(M)

cornerplot(M, compact=true)
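
The correlations that corrplot visualizes can also be inspected numerically. The following is a minimal sketch, assuming Julia 1.x with only the Statistics and Random standard libraries; the fixed seed and the name C are illustrative assumptions and not part of StatPlots.

```julia
using Random, Statistics

Random.seed!(1)                      # fixed seed, only so the sketch is reproducible
M = randn(1000, 4)
M[:, 2] .+= 0.8 .* sqrt.(abs.(M[:, 1])) .- 0.5 .* M[:, 3] .+ 5
M[:, 3] .-= 0.7 .* M[:, 1] .^ 2 .+ 2

# 4×4 matrix of pairwise Pearson correlations between the columns of M
C = cor(M)
```

Columns 2 and 3 should come out negatively correlated, because column 2 was built by subtracting half of column 3.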


boxplot and violin

import RDatasets
singers = RDatasets.dataset("lattice","singer")
violin(singers,:VoicePart,:Height,marker=(0.2,:blue,stroke(0)))
boxplot!(singers,:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))

Asymmetric violin plots can be created using the side keyword (:both, the default, :right, or :left), e.g.:

singers_moscow = deepcopy(singers)
singers_moscow[:Height] = singers_moscow[:Height] .+ 5   # broadcast the scalar over the column
myPlot = violin(singers,:VoicePart,:Height, side=:right, marker=(0.2,:blue,stroke(0)), label="Scala")
violin!(singers_moscow,:VoicePart,:Height, side=:left, marker=(0.2,:red,stroke(0)), label="Moscow")

Distributions


using Distributions
plot(Normal(3,5), fill=(0, .5,:orange))

dist = Gamma(2)
scatter(dist, leg=false)
bar!(dist, func=cdf, alpha=0.3)
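
For reference, the two curves in the Gamma example have simple closed forms. This is a sketch under the assumption that Gamma(2) means shape 2 and scale 1, and that the scatter call shows the density while func=cdf switches to the cumulative distribution; the helper names gamma2_pdf and gamma2_cdf are hypothetical and do not come from Distributions.jl.

```julia
# Closed forms for a Gamma distribution with shape 2 and scale 1
gamma2_pdf(x) = x * exp(-x)              # density
gamma2_cdf(x) = 1 - (1 + x) * exp(-x)    # cumulative distribution

xs = 0.0:0.5:10.0
density_values = gamma2_pdf.(xs)         # values a density plot would show
cumulative_values = gamma2_cdf.(xs)      # values a cdf plot would show
```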

Grouped Bar plots

groupedbar(rand(10,3), bar_position = :stack, bar_width=0.7)

The default is bar_position = :dodge:

groupedbar(rand(10,3), bar_position = :dodge, bar_width=0.7)
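
The difference between the two positions is only where each bar segment starts. The sketch below shows the arithmetic behind stacking on made-up numbers; it illustrates the general idea and is not necessarily how StatPlots computes it internally. The names heights, tops and bottoms are assumptions.

```julia
# 2 categories (rows) × 3 series (columns); made-up numbers
heights = [1.0 2.0 3.0;
           4.0 5.0 6.0]

tops    = cumsum(heights, dims = 2)   # top edge of each stacked segment
bottoms = tops .- heights             # bottom edge of each stacked segment
```

With :dodge, every segment would instead start at zero and the series would sit side by side within each category.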

groupapply for population analysis

The groupapply function splits the data according to the group keyword argument, then applies summarize to obtain the average and variability of a given analysis. Density, cumulative, hazard rate, and local regression are supported so far, and you can also supply your own function. There are three ways to compute the average and variability:

  • compute_error = (:across, col_name), where the data is split according to column col_name before being summarized. compute_error = :across splits across all observations. Default summary is (mean, sem) but it can be changed with keyword summarize to any pair of functions.

  • compute_error = (:bootstrap, n_samples), where n_samples fake datasets distributed like the real dataset are generated and then summarized (nonparametric bootstrapping). compute_error = :bootstrap is shorthand for compute_error = (:bootstrap, 1000). The default summary is (mean, std). This method works with any analysis but is computationally very expensive.

  • compute_error = :none, where no error is computed or displayed and the analysis is carried out normally.
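
The first two options can be sketched in plain Julia. This is an illustration of the idea under stated assumptions, not the groupapply implementation: the helpers sem, across_error and bootstrap_error are hypothetical, the analysis is reduced to a single number (the mean), and the data are made up. It assumes Julia 1.x with the Statistics and Random standard libraries.

```julia
using Random, Statistics

# Standard error of the mean; hypothetical helper so the sketch needs no packages
sem(v) = std(v) / sqrt(length(v))

# (:across, col_name) idea: one analysis result per subgroup,
# then summarize those results (default pair: mean and sem)
function across_error(data, labels; analysis = mean, summarize = (mean, sem))
    per_group = [analysis(data[labels .== l]) for l in unique(labels)]
    return summarize[1](per_group), summarize[2](per_group)
end

# (:bootstrap, n_samples) idea: resample with replacement n_samples times,
# run the analysis on each fake dataset, then summarize (default pair: mean and std)
function bootstrap_error(rng, data, n_samples; analysis = mean, summarize = (mean, std))
    n = length(data)
    stats = [analysis(data[rand(rng, 1:n, n)]) for _ in 1:n_samples]
    return summarize[1](stats), summarize[2](stats)
end

scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # made-up observations
labels = [:a, :a, :b, :b, :c, :c]            # made-up subgroup column

across_mean, across_sem = across_error(scores, labels)
boot_mean, boot_std = bootstrap_error(MersenneTwister(0), scores, 500)
```

The bootstrap loop reruns the analysis once per resample, which is why that option is the expensive one.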

The local regression uses Loess.jl and the density plot uses KernelDensity.jl. If the x variable is categorical, these functions are computed by splitting the data across the x variable and then computing the density/average per bin. The choice of a continuous or discrete axis can be forced via axis_type = :continuous or axis_type = :discrete. axis_type = :binned bins the x axis into equally spaced bins (the number is given by the nbins keyword, which defaults to 30) and continues the analysis with the binned data, treating it as discrete.
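
The binned mode can be pictured as follows. This is a minimal sketch on made-up data, assuming equal-width bins between the smallest and largest x; the helper bin_index is hypothetical and the actual binning in groupapply may differ in details such as edge handling.

```julia
using Statistics

# Hypothetical helper: index (1..nbins) of the equal-width bin containing x
function bin_index(x, lo, hi, nbins)
    i = floor(Int, (x - lo) / (hi - lo) * nbins) + 1
    return clamp(i, 1, nbins)        # x == hi is placed in the last bin
end

x = [0.0, 0.1, 0.4, 0.5, 0.9, 1.0]   # made-up continuous variable
y = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]   # made-up response
nbins = 2

lo, hi = extrema(x)
bins = [bin_index(xi, lo, hi, nbins) for xi in x]

# Treat the bin index as a discrete x variable and average y within each bin
# (an empty bin would yield NaN here)
bin_means = [mean(y[bins .== b]) for b in 1:nbins]
```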

Example use:

using DataFrames
import RDatasets
using StatPlots
gr()
school = RDatasets.dataset("mlmRev","Hsb82");
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path, legend = :topleft)

Keywords for loess or kerneldensity can be given to groupapply:

grp_error = groupapply(:density, school, :CSES; bandwidth = 0.2, compute_error = (:bootstrap,500), group = :Minrty)
plot(grp_error, line = :path)

Bar plot with a categorical x variable:

pool!(school, :Sx)
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)

Density bar plot of binned data versus continuous estimation:

grp_error = groupapply(:density, school, :MAch; axis_type = :binned, nbins = 40, group = :Minrty)
plot(grp_error, line = :bar, color = ["orange" "turquoise"], legend = :topleft)

grp_error = groupapply(:density, school, :MAch; axis_type = :continuous, group = :Minrty)
plot!(grp_error, line = :path, color = ["orange" "turquoise"], label = "")

First commit: 07/10/2016
Last touched: about 1 month ago
Commits: 166