Author: Ritchie Lee, Carnegie Mellon University Silicon Valley, firstname.lastname@example.org
In the face of big data, gaining insights by manually sifting through data is no longer practical. Machine learning methods typically rely on statistical models. Although these may provide good input/output behavior, the results are not conducive to human understanding. We explore machine learning tasks guided by a problem-specific grammar that the user provides. We learn expressions derived from the provided grammar, making the results intuitive and interpretable to a human.
GrammarExpts is a collection of modules for experimenting with grammar-guided expression discovery on various problems.
Julia v0.5, 64-bit is required.
Initial tests can be useful in detecting install/dependency problems.
Optional, more in-depth test that includes data processing. Requires the RLESCAS package.
    using GrammarExpts, PipelineTest
    ptest = pipelinetest() #produces data under GrammarExpts/test/PipelineTest
    #inspect dataset under Datasets/data/exampledata, exampledata_meta, and exampledatafilt
    cleanup(ptest) #remove created artifacts, except results
In general, first call `using GrammarExpts` to make all the submodules globally visible, then call the submodule you want.
    cd(Pkg.dir("GrammarExpts/results")) #output directory
    using GrammarExpts, ACASX_CE #Cross-Entropy method, other algorithms available
    config = configure(ACASX_CE, "normal", "nvn_dasc") #load configs
    acasx_ce(; config...) #run. Outputs to results directory
    cd(Pkg.dir("GrammarExpts/results"))
    using GrammarExpts, ACASX_CE_Tree #Cross entropy method, other algorithms available
    config = configure(ACASX_CE_Tree, "normal", "nvn_dasc") #load configs.
    acasx_ce_tree(; config...) #run. By default will output to current directory
    #Create the dataset and filtered dataset
    #Only have to do this once
    using GrammarExpts, CASJson2DataFrame
    process_jsons("mydataset", "/path/to/jsonfiles/")

    #Learn an expression
    using GrammarExpts, ACASX_CE
    config = configure(ACASX_CE, "normal")
    config[:data] = "mydataset"
    acasx_ce(; config...) #run. Outputs to results directory

    #Learn a tree
    using GrammarExpts, ACASX_CE_Tree
    config = configure(ACASX_CE_Tree, "normal")
    config[:data] = "mydatasetfilt"
    acasx_ce_tree(; config...) #run. By default will output to current directory
These are automatically fetched by the build script:
An expression search problem is defined in a problem module that includes:
The following problems are currently implemented:
The build script automatically copies the default datasets into the Datasets/data folder. The source folders for the data are located under PKGDIR/GrammarExpts/data/datasets.
This is not needed in normal operation, but is useful to know (e.g., to inspect processed data).
GrammarExpts uses Datasets.jl to manage its data. The data is stored one file per encounter in subfolders of `PKGDIR/Datasets/data`. To load an entire dataset (collection of encounters), use:
    using Datasets
    data = dataset("dasc") #dataset name is also folder name
which loads a DFSet object (a collection of DataFrames).
To load a specific encounter file, use:
    using Datasets
    D = dataset("dasc", "1") #load dasc dataset encounter 1 into a DataFrame
Many of the main entry points such as `ACASX_CE_Tree` use keyword arguments to set configuration parameters. Default parameters are typically set for a quick test run.
For convenience, some modules implement a configuration feature. Config files are stored in a subfolder in the corresponding module. For example,
Each config file is a Julia file containing a vector of Symbol/Any pairs, for example:
    [
        #tree
        (:maxsteps, 20),

        #CE
        (:num_samples, 5000),
        (:iterations, 100),
        (:elite_frac, 0.6),
        (:w_new, 0.4),
        (:w_prior, 0.1),
        (:maxsteps, 40),
        (:default_code, :(eval(false)))
    ]
that specifies parameter/value pairs. To load the configuration file, use:
    config = configure(ACASX_CE_Tree, filename)
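As a rough illustration of how a file in this format can become a configuration, the sketch below writes a small config file to a temporary folder, evaluates it with `include` (which returns the value of the last expression in the file), and converts the resulting vector into a `Dict`. The helper `load_config` and the file name `example.jl` are hypothetical and not part of GrammarExpts; the actual `configure` implementation may work differently.

```julia
# Hypothetical helper, not part of GrammarExpts: evaluate a config file and
# turn its vector of (Symbol, Any) tuples into a Dict keyed by parameter name.
function load_config(path::AbstractString)
    entries = include(path)          # value of the last expression in the file
    return Dict{Symbol,Any}(entries)
end

# Write an example config file to a temporary folder and load it back.
config = mktempdir() do dir
    path = joinpath(dir, "example.jl")
    open(path, "w") do io
        write(io, "[(:num_samples, 5000), (:iterations, 100), (:elite_frac, 0.6)]")
    end
    load_config(path)
end

println(config[:num_samples])   # 5000
```

Keeping the configuration as plain Julia data means a config file needs no separate parser, and values can be arbitrary Julia objects, including quoted expressions.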
Sometimes it is convenient to split up configurations into multiple pieces, for example one file for search and another for data. To load multiple config files into the same config dict, use:

    config = configure(ACASX_CE_Tree, "normal", "nvn_dasc")

which loads "normal.jl" and "nvn_dasc.jl".
At this point you can inspect the parameters in the config object or even overwrite them:
    config[:param] = newvalue
To use the configuration, splat it into the keyword arguments of the function call:

    acasx_ce_tree(; config...)
Values not specified by the configuration take on the defaults specified in the function definition.
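The overall pattern (merge several parameter lists, override a value, then splat into a function whose keyword defaults fill in the rest) can be sketched in a self-contained way. Here `run_search` and the two parameter lists are hypothetical stand-ins for an entry point such as `acasx_ce_tree` and its config files; the sketch shows only the Julia mechanics, not the package's implementation.

```julia
# Hypothetical stand-in for an entry point that takes keyword arguments.
# Defaults apply to any parameter the configuration does not mention.
function run_search(; num_samples=100, iterations=10, data="default")
    return (num_samples, iterations, data)
end

# Two parameter lists, e.g. one for the search and one for the data.
search_params = [(:num_samples, 5000), (:iterations, 100)]
data_params = [(:data, "dasc")]

# Merge both lists into a single config dictionary.
config = Dict{Symbol,Any}()
for params in (search_params, data_params)
    merge!(config, Dict{Symbol,Any}(params))
end

config[:iterations] = 50          # overwrite a value before running

result = run_search(; config...)  # splat the Dict into keyword arguments
println(result)                   # (5000, 50, "dasc")

delete!(config, :data)            # unspecified parameters fall back to defaults
println(run_search(; config...))  # (5000, 50, "default")
```

Because the configuration is an ordinary `Dict`, it can be inspected or modified with the usual dictionary operations before it is passed to the function.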
Decision tree and visualization:
PDF generation can be turned off with `plotpdf=false`, if available in the keyword arguments. The PDF can later be produced from the .json file by calling:
    using TikzQTrees
    plottree("acasx_ce_tree_log_vis_decisiontree.json", outfileroot="acasx_ce_tree_log_vis_decisiontree")
Output logs are in TaggedDFLogger (RLESUtils.Loggers) format, which is a light wrapper around DataFrames. To load the data, call:
    using RLESUtils.Loggers
    logs = load_log("acasx_ce_tree_log.txt") #recursively loads the .csv.gz files
    keys(logs) #see available logs
    logs["parameters"] #Parameters log as a dataframe
Alternatively, you can open specific logs as dataframes directly:
    using DataFrames
    D = readtable("acasx_ce_tree_log_result.csv.gz")
You may have noticed that DataFrames just stores its data in CSV format, so another way to access the data is to decompress the .csv.gz file into a .csv file and open it as ASCII text. This is especially useful when just taking a quick look at a particular log.