This package contains many statistical recipes for concepts and types introduced in the JuliaStats organization, intended to be used with Plots.jl:

- Types:
    - DataFrames
    - Distributions
- Recipes:
    - histogram/histogram2d
    - boxplot
    - violin
    - marginalhist
    - corrplot/cornerplot
    - andrewsplot

Initialize:

```
#]add StatPlots # install the package if it isn't installed
using StatPlots
gr(size=(400,300))
```

Table-like data structures, including `DataFrames`, `IndexedTables`, `DataStreams`, etc. (see here for an exhaustive list), are supported thanks to the macro `@df`, which allows passing columns as symbols. Those columns can then be manipulated inside the `plot` call, like normal `Arrays`:

```
using DataFrames, IndexedTables
df = DataFrame(a = 1:10, b = 10 .* rand(10), c = 10 .* rand(10))
@df df plot(:a, [:b :c], colour = [:red :blue])
@df df scatter(:a, :b, markersize = 4 .* log.(:c .+ 0.1))
t = table(1:10, rand(10), names = [:a, :b]) # IndexedTable
@df t scatter(2 .* :b)
```

Inside a `@df` macro call, the `cols` utility function can be used to refer to a range of columns:

```
@df df plot(:a, cols(2:3), colour = [:red :blue])
```

or to refer to a column whose symbol is represented by a variable:

```
s = :b
@df df plot(:a, cols(s))
```

`cols()` will refer to all columns of the data table.

In case of ambiguity, symbols not referring to `DataFrame` columns must be escaped by `^()`:

```
df[:red] = rand(10)
@df df plot(:a, [:b :c], colour = ^([:red :blue]))
```

The `@df` macro plays nicely with the new syntax of the Query.jl data manipulation package (v0.8 and above), in that a plot command can be added at the end of a query pipeline, without having to explicitly collect the outcome of the query first:

```
using Query, StatPlots
df |>
    @filter(_.a > 5) |>
    @map({_.b, d = _.c-10}) |>
    @df scatter(:b, :d)
```

The `@df` syntax is also compatible with the Plots.jl grouping machinery:

```
using RDatasets
school = RDatasets.dataset("mlmRev","Hsb82")
@df school density(:MAch, group = :Sx)
```

To group by more than one column, use a tuple of symbols:

```
@df school density(:MAch, group = (:Sx, :Sector), legend = :topleft)
```

To name the legend entries with custom or automatic names (e.g. `Sex = Male, Sector = Public`), use the curly bracket syntax `group = {Sex = :Sx, :Sector}`. Entries with `=` get the custom name you give, whereas entries without `=` take the name of the column.

The old syntax, passing the `DataFrame` as the first argument to the `plot` call, is no longer supported.

A GUI based on the Interact package is available to create plots from a table interactively, using any of the recipes defined below. This small app can be deployed in a Jupyter lab or notebook, the Juno plot pane, a Blink window, or the browser; see here for instructions.

```
import RDatasets
iris = RDatasets.dataset("datasets", "iris")
using StatPlots, Interact
using Blink
w = Window()
body!(w, dataviewer(iris))
```

```
using RDatasets
iris = dataset("datasets","iris")
@df iris marginalhist(:PetalLength, :PetalWidth)
```

```
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth], grid = false)
```

or also:

```
@df iris corrplot(cols(1:4), grid = false)
```

A correlation plot may also be produced from a matrix:

```
M = randn(1000,4)
M[:,2] .+= 0.8sqrt.(abs.(M[:,1])) .- 0.5M[:,3] .+ 5
M[:,3] .-= 0.7M[:,1].^2 .+ 2
corrplot(M, label = ["x$i" for i=1:4])
```

```
cornerplot(M)
```

```
cornerplot(M, compact=true)
```

```
import RDatasets
singers = RDatasets.dataset("lattice","singer")
@df singers violin(:VoicePart,:Height,marker=(0.2,:blue,stroke(0)))
@df singers boxplot!(:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))
```

Asymmetric violin plots can be created using the `side` keyword (`:both` by default, `:right`, or `:left`), e.g.:

```
singers_moscow = deepcopy(singers)
singers_moscow[:Height] = singers_moscow[:Height] .+ 5
@df singers violin(:VoicePart,:Height, side=:right, marker=(0.2, :blue, stroke(0)), label="Scala")
@df singers_moscow violin!(:VoicePart,:Height, side=:left, marker=(0.2, :red, stroke(0)), label="Moscow")
```

The ea-histogram is an alternative histogram implementation, where every 'box' in the histogram contains the same number of sample points and all boxes have the same area. Areas with a higher density of points thus get higher boxes. This type of histogram shows spikes well, but may oversmooth in the tails. The y axis is not intuitively interpretable.

```
a = [randn(100); randn(100) .+ 3; randn(100) ./ 2 .+ 3]
ea_histogram(a, bins = :scott, fillalpha = 0.4)
```
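The construction described above can be sketched with quantiles from the `Statistics` standard library. This is only an illustration under stated assumptions, not the code behind `ea_histogram`: the bin count `nbins` is a hypothetical fixed value, whereas the recipe derives its bins from the `bins` keyword.

```julia
using Statistics

# Sample data: a mixture of three normal components.
a = [randn(100); randn(100) .+ 3; randn(100) ./ 2 .+ 3]

nbins = 10  # hypothetical choice for this sketch

# Bin edges at evenly spaced quantiles, so each bin holds about the same number of points.
edges = quantile(a, range(0, stop = 1, length = nbins + 1))

# Every bin gets the same area, 1/nbins, so its height is that area divided by its width.
widths = diff(edges)
heights = (1 / nbins) ./ widths

# Narrow bins (dense regions) come out tall; wide bins (the tails) come out low.
for i in 1:nbins
    println(round(edges[i], digits = 2), " to ", round(edges[i + 1], digits = 2),
            ": height ", round(heights[i], digits = 3))
end
```

Because the heights are areas divided by widths, they behave like density estimates rather than counts, which is one reason the y axis is hard to read directly.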

Andrews plots are a way to visualize structure in high-dimensional data by depicting each row of an array or table as a line that varies with the values in columns (see https://en.wikipedia.org/wiki/Andrews_plot).

```
using RDatasets
iris = dataset("datasets", "iris")
@df iris andrewsplot(:Species, cols(1:4), legend = :topleft)
```
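For intuition, the curve drawn for a single row can be sketched from the standard Andrews definition given on the Wikipedia page linked above: f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -π and π. The helper name `andrews_curve` is hypothetical, and the recipe's own implementation may differ in details such as scaling.

```julia
# Evaluate the Andrews curve of one observation `x` at angle `t`.
function andrews_curve(x::AbstractVector{<:Real}, t::Real)
    value = x[1] / sqrt(2)
    for i in 2:length(x)
        k = i ÷ 2  # frequency: 1, 1, 2, 2, 3, 3, ...
        # Even positions use sine terms, odd positions use cosine terms.
        value += iseven(i) ? x[i] * sin(k * t) : x[i] * cos(k * t)
    end
    return value
end

# One row with four features, evaluated on a grid over [-π, π].
row = [5.1, 3.5, 1.4, 0.2]
ts = range(-π, stop = π, length = 200)
curve = [andrews_curve(row, t) for t in ts]
```

Rows with similar feature values produce similar curves, which is why groups such as `:Species` tend to show up as bundles of lines.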

```
using Distributions
plot(Normal(3,5), fill=(0, .5,:orange))
```

```
dist = Gamma(2)
scatter(dist, leg=false)
bar!(dist, func=cdf, alpha=0.3)
```

The `qqplot` function compares the quantiles of two distributions, and accepts either a vector of sample values or a `Distribution`. `qqnorm` is a shorthand for comparing a distribution to the normal distribution. If the distributions are similar, the points will be on a straight line.

```
x = rand(Normal(), 100)
y = rand(Cauchy(), 100)
plot(
    qqplot(x, y, qqline = :fit), # qqplot of two samples, show a fitted regression line
    qqplot(Cauchy, y), # compare with a Cauchy distribution fitted to y; pass an instance (e.g. Normal(0,1)) to compare with a specific distribution
    qqnorm(x, qqline = :R) # the :R default line passes through the 1st and 3rd quartiles of the distribution
)
)
```
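The straight-line property can be checked by hand with plain quantiles. The following is a minimal sketch using only the `Statistics` standard library, not the code behind `qqplot`; the probability grid `p` is an arbitrary choice made for the illustration.

```julia
using Statistics

x = randn(100)
y = 2 .* x .+ 1  # same shape as x, only rescaled and shifted

# Probabilities at which the two samples are compared (hypothetical grid).
p = range(0.05, stop = 0.95, length = 19)

qx = quantile(x, p)
qy = quantile(y, p)

# The pairs (qx[i], qy[i]) are the points a Q-Q plot draws.
# Since y is an affine transform of x, they fall on the line qy = 2qx + 1.
on_line = all(isapprox.(qy, 2 .* qx .+ 1, atol = 1e-10))
println(on_line)
```

Two samples from differently shaped distributions, such as the normal and Cauchy samples above, would instead bend away from a line in the tails.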

```
groupedbar(rand(10,3), bar_position = :stack, bar_width=0.7)
```

The default is `bar_position = :dodge`:

```
groupedbar(rand(10,3), bar_position = :dodge, bar_width=0.7)
```

The `group` syntax is also possible in combination with `groupedbar`:

```
ctg = repeat(["Category 1", "Category 2"], inner = 5)
nam = repeat("G" .* string.(1:5), outer = 2)
groupedbar(nam, rand(5, 2), group = ctg, xlabel = "Groups", ylabel = "Scores",
    title = "Scores by group and category", bar_width = 0.67,
    lw = 0, framestyle = :box)
```

```
using Clustering
D = rand(10, 10)
D += D'
hc = hclust(D, :single)
plot(hc)
```

Population analysis on table-like data structures can be done using the highly recommended GroupedErrors package.

This external package, in combination with StatPlots, greatly simplifies the creation of two types of plots:

- Some simple summary statistics are computed for each experimental subject (mean is default but any scalar valued function would do) and then plotted against some other summary statistics, potentially splitting by some categorical experimental variable.
- Some statistical analysis is computed at the single subject level (for example the density/hazard/cumulative of some variable, or the expected value of a variable given another) and the analysis is summarized across subjects (taking for example mean and s.e.m.), potentially splitting by some categorical experimental variable.
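
The two-level idea behind both plot types (summarize within each subject, then across subjects) can be sketched in plain Julia. This does not use the GroupedErrors API; the toy data and the variable names `subjects` and `scores` are invented for the illustration.

```julia
using Statistics

# Hypothetical toy data: one (subject, score) pair per observation.
subjects = [1, 1, 1, 2, 2, 3, 3, 3, 3]
scores   = [1.0, 2.0, 3.0, 4.0, 6.0, 2.0, 2.0, 4.0, 4.0]

# Step 1: one summary statistic (here the mean) per subject.
subject_means = [mean(scores[subjects .== s]) for s in unique(subjects)]

# Step 2: summarize across subjects with the mean and the standard error of the mean.
across_mean = mean(subject_means)
across_sem  = std(subject_means) / sqrt(length(subject_means))

println("per-subject means: ", subject_means)
println("mean across subjects: ", across_mean, ", s.e.m.: ", across_sem)
```

Averaging within subjects first keeps subjects with many observations from dominating the across-subject estimate and its error bar.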

For more information please refer to the README.

A GUI based on QML and the GR Plots.jl backend to simplify the use of StatPlots.jl and GroupedErrors.jl even further can be found here (usable but still in alpha stage).
