vignettes/NDR_object_specification.Rmd
The NeuroDecodeR (NDR) package is designed around five abstract object types which are:
Datasources (DS): Generate training and test splits of the data.
Feature preprocessors (FP): Learn parameters on the training set and apply transformations to the training and test sets.
Classifiers (CL): Learn the relationship between experimental conditions (i.e., “labels”) and neural data on a training set, and then predict experimental conditions from neural data in a test set.
Result metrics (RM): Aggregate results across cross-validation splits and over resample runs, then compute and plot final decoding accuracy metrics.
Cross-validators (CV): Take the DS, FP, CL and RM objects and run a cross-validation decoding procedure.
By having a standard set of object types, one can easily use different instances of these five object types to do different types of analyses.
For most common analyses, one can use instances of these different object types that come with the NDR. However, in some cases, one might want to extend the functionality of the NDR to gain additional insights. For example, one might want to try a different classifier to gain a better understanding of how populations of neurons code information (e.g., see Meyers, Borzello, Freiwald and Tsao, J Neurosci 2015).
The following document describes the methods and data formats that need to be implemented to create valid DS, FP, CL, RM, and CV object types. By creating new classes of objects that conform to these interfaces, one can easily extend the NDR to try new analyses.
Datasources are used to generate training and test splits of data.
All datasources must implement a get_data() method that returns a data frame with the following variables:
train_labels: The label levels that occur on each trial in the training data set.
test_labels: The label levels that occur on each trial in the test data set.
time_bin: The time in the experiment where the test data comes from.
site_XXX: A collection of variables that each has data from one site (e.g., neuron, EEG channel, etc.).
CV_XXX: A collection of variables, one per cross-validation split, that indicate whether a given row is in the training set or the test set for that split.
Like all NDR objects, DS objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set, to allow for reproducible research.
Here is an example of the variables in the data returned by the ds_basic() datasource:
library(NeuroDecodeR)
data_file_name <- system.file(file.path("extdata", "ZD_150bins_50sampled.Rda"), package="NeuroDecodeR")
ds <- ds_basic(data_file_name, 'stimulus_ID', 18)
## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total).
names(get_data(ds))
## [1] "train_labels" "test_labels" "time_bin" "site_0001" "site_0002"
## [6] "site_0003" "site_0004" "site_0005" "site_0006" "site_0007"
## [11] "site_0008" "site_0009" "site_0010" "site_0011" "site_0012"
## [16] "site_0013" "site_0014" "site_0015" "site_0016" "site_0017"
## [21] "site_0018" "site_0019" "site_0020" "site_0021" "site_0022"
## [26] "site_0023" "site_0024" "site_0025" "site_0026" "site_0027"
## [31] "site_0028" "site_0029" "site_0030" "site_0031" "site_0032"
## [36] "site_0033" "site_0034" "site_0035" "site_0036" "site_0037"
## [41] "site_0038" "site_0039" "site_0040" "site_0041" "site_0042"
## [46] "site_0043" "site_0044" "site_0045" "site_0046" "site_0047"
## [51] "site_0048" "site_0049" "site_0050" "site_0051" "site_0052"
## [56] "site_0053" "site_0054" "site_0055" "site_0056" "site_0057"
## [61] "site_0058" "site_0059" "site_0060" "site_0061" "site_0062"
## [66] "site_0063" "site_0064" "site_0065" "site_0066" "site_0067"
## [71] "site_0068" "site_0069" "site_0070" "site_0071" "site_0072"
## [76] "site_0073" "site_0074" "site_0075" "site_0076" "site_0077"
## [81] "site_0078" "site_0079" "site_0080" "site_0081" "site_0082"
## [86] "site_0083" "site_0084" "site_0085" "site_0086" "site_0087"
## [91] "site_0088" "site_0089" "site_0090" "site_0091" "site_0092"
## [96] "site_0093" "site_0094" "site_0095" "site_0096" "site_0097"
## [101] "site_0098" "site_0099" "site_0100" "site_0101" "site_0102"
## [106] "site_0103" "site_0104" "site_0105" "site_0106" "site_0107"
## [111] "site_0108" "site_0109" "site_0110" "site_0111" "site_0112"
## [116] "site_0113" "site_0114" "site_0115" "site_0116" "site_0117"
## [121] "site_0118" "site_0119" "site_0120" "site_0121" "site_0122"
## [126] "site_0123" "site_0124" "site_0125" "site_0126" "site_0127"
## [131] "site_0128" "site_0129" "site_0130" "site_0131" "site_0132"
## [136] "CV_1" "CV_2" "CV_3" "CV_4" "CV_5"
## [141] "CV_6" "CV_7" "CV_8" "CV_9" "CV_10"
## [146] "CV_11" "CV_12" "CV_13" "CV_14" "CV_15"
## [151] "CV_16" "CV_17" "CV_18"
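The get_data() and get_properties() contract described above can be sketched with a small S3 class. This is a minimal illustration under stated assumptions, not the implementation of ds_basic(): the class name ds_toy, its constructor arguments, and the property names are hypothetical, and the two generics are redefined locally only so the sketch runs without NeuroDecodeR (with the package loaded, its own generics would be used, and their exact signatures are assumed here). Unlike a real datasource, the toy version has a single time bin.

```r
# Stand-in generics so the sketch runs on its own (signatures are assumptions)
get_data <- function(ds_obj) UseMethod("get_data")
get_properties <- function(ndr_obj) UseMethod("get_properties")

# Hypothetical constructor: `site_data` is a trials x sites numeric matrix,
# `labels` has one entry per trial, and all trials share one time bin
ds_toy <- function(site_data, labels, num_cv_splits, time_bin = "time.1_100") {
  stopifnot(nrow(site_data) == length(labels))
  structure(
    list(site_data = site_data, labels = labels,
         num_cv_splits = num_cv_splits, time_bin = time_bin),
    class = "ds_toy"
  )
}

get_data.ds_toy <- function(ds_obj) {
  n_trials <- nrow(ds_obj$site_data)
  # randomly assign each trial to exactly one cross-validation split
  split_id <- sample(rep_len(seq_len(ds_obj$num_cv_splits), n_trials))

  sites <- as.data.frame(ds_obj$site_data)
  names(sites) <- sprintf("site_%04d", seq_len(ncol(sites)))

  # one CV_k column per split: trials in split k are "test", the rest "train"
  cv_cols <- lapply(seq_len(ds_obj$num_cv_splits), function(k) {
    ifelse(split_id == k, "test", "train")
  })
  names(cv_cols) <- paste0("CV_", seq_len(ds_obj$num_cv_splits))

  cbind(
    data.frame(train_labels = ds_obj$labels, test_labels = ds_obj$labels,
               time_bin = ds_obj$time_bin, stringsAsFactors = FALSE),
    sites,
    as.data.frame(cv_cols, stringsAsFactors = FALSE)
  )
}

# one-row data frame listing the settings needed to reproduce the analysis
get_properties.ds_toy <- function(ndr_obj) {
  data.frame(ds.num_cv_splits = ndr_obj$num_cv_splits,
             ds.time_bin = ndr_obj$time_bin)
}

set.seed(1)
toy_ds <- ds_toy(matrix(rnorm(12 * 3), nrow = 12),
                 labels = rep(c("car", "face"), 6), num_cv_splits = 3)
toy_data <- get_data(toy_ds)
names(toy_data)
```

Each row is marked "test" in exactly one CV_XXX column and "train" in the others, which is the property a cross-validator relies on when it filters rows for a given split.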
Feature preprocessors learn a set of parameters from the training data and modify both the training and the test data based on these parameters, prior to the data being sent to the classifier. Feature preprocessor objects must use only the training data to learn the preprocessing parameters, in order to prevent contamination between the training and test data, which could bias the results.
All feature preprocessors must implement preprocess_data(). This method takes two data frames called training_set and test_set that have the following variables:
The training_set data frame contains:
train_labels: The labels used to train the classifier.
site_X: A group of variables that has data from multiple sites.
The test_set data frame contains:
test_labels: The labels used to test the classifier.
site_X: A group of variables that has data from multiple sites.
time_bin: Character strings listing which times different rows correspond to.
The preprocess_data() method returns a list with the two data frames training_set and test_set, but the data in these data frames has been preprocessed based on parameters learned from the training_set.
Like all NDR objects, FP objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set, to allow for reproducible research.
If you want to implement a new FP object yourself, below is an example of how the FP object gets and returns data.
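# load the packages used below (dplyr supplies %>% and starts_with())
library(NeuroDecodeR)
library(dplyr)
# create a ds_basic to get the data
data_file_name <- system.file(file.path("extdata", "ZD_150bins_50sampled.Rda"), package="NeuroDecodeR")
ds <- ds_basic(data_file_name, 'stimulus_ID', 18)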
## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total).
cv_data <- get_data(ds)
# an example of splitting the data into a training and test set,
# this is done in the cross-validator
training_set <- dplyr::filter(cv_data,
                              time_bin == "time.100_250",
                              CV_1 == "train") %>% # get data from the first CV split
  dplyr::select(starts_with("site"), train_labels)
test_set <- dplyr::filter(cv_data, CV_1 == "test") %>% # get data from the first CV split
  dplyr::select(starts_with("site"), test_labels, time_bin)
# use the fp object to normalize the data
fp <- fp_zscore()
processed_data <- preprocess_data(fp, training_set, test_set)
# prior to z-score normalizing the mean (e.g. for site 1) is not 0
mean(training_set$site_0001)
## [1] 0.002969188
# after normalizing the data the mean is pretty much 0
mean(processed_data$training_set$site_0001)
## [1] -8.864674e-17
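To show what writing a new FP object might involve, here is a minimal sketch of a preprocessor that subtracts each site's training-set mean from both data sets. It is an illustration under stated assumptions rather than the code behind fp_zscore(): the class name fp_center and its property name are hypothetical, and the generics are redefined locally (with assumed signatures) only so the sketch runs without NeuroDecodeR.

```r
# Stand-in generics so the sketch runs on its own (signatures are assumptions)
preprocess_data <- function(fp, training_set, test_set) UseMethod("preprocess_data")
get_properties <- function(ndr_obj) UseMethod("get_properties")

# Hypothetical feature preprocessor with no settings
fp_center <- function() structure(list(), class = "fp_center")

preprocess_data.fp_center <- function(fp, training_set, test_set) {
  site_cols <- grep("^site", names(training_set), value = TRUE)
  # learn the parameters (per-site means) from the training set only
  train_means <- colMeans(training_set[site_cols])
  # apply the same shift to both the training and the test set
  training_set[site_cols] <- scale(training_set[site_cols],
                                   center = train_means, scale = FALSE)
  test_set[site_cols] <- scale(test_set[site_cols],
                               center = train_means, scale = FALSE)
  list(training_set = training_set, test_set = test_set)
}

get_properties.fp_center <- function(ndr_obj) {
  data.frame(fp_center.fp_center = "center sites using training-set means")
}

# small hand-made inputs in the format described above
training_set <- data.frame(site_0001 = c(1, 2, 3), site_0002 = c(10, 20, 30),
                           train_labels = c("car", "face", "car"))
test_set <- data.frame(site_0001 = c(4, 0), site_0002 = c(20, 40),
                       test_labels = c("face", "car"), time_bin = "time.1_100")

processed <- preprocess_data(fp_center(), training_set, test_set)
colMeans(processed$training_set[c("site_0001", "site_0002")])  # both 0
processed$test_set$site_0001                                   # 2 -2
```

The test set is shifted by the training means (2 and 20), not by its own means, which is what keeps information from the test data out of the learned parameters.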
Classifiers take a set of training data and training labels, and learn a model of the relationship between the training data and the labels from the different classes. Once this model has been learned (i.e., once the classifier has been trained), the classifier is then used to make predictions about what labels were present in a new set of test data.
Objects that are classifiers must implement the get_predictions() method. This method takes two data frames called training_set and test_set that have the following variables:
The training_set data frame contains:
train_labels: The labels used to train the classifier.
site_X: A group of variables that has data from multiple sites.
The test_set data frame contains:
test_labels: The labels used to test the classifier.
site_X: A group of variables that has data from multiple sites.
time_bin: Character strings listing which times different rows correspond to.
The get_predictions() method returns a data frame that has the following variables:
test_time: A character vector indicating the times which the rows come from.
actual_labels: The labels that were actually present on each trial.
predicted_labels: The labels that the classifier predicted.
decision_vals.X (optional): A group of variables that has values that indicate how strongly the classifier rates a test point as coming from a particular class.
Like all NDR objects, CL objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set, to allow for reproducible research.
If you want to implement a new CL object yourself, below is an example of how the CL object gets and returns data.
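# load the packages used below (dplyr supplies %>% and starts_with())
library(NeuroDecodeR)
library(dplyr)
# create a ds_basic to get the data
data_file_name <- system.file(file.path("extdata", "ZD_150bins_50sampled.Rda"), package="NeuroDecodeR")
ds <- ds_basic(data_file_name, 'stimulus_ID', 18)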
## Automatically selecting sites_IDs_to_use. Since num_cv_splits = 18 and num_label_repeats_per_cv_split = 1, all sites that have 18 repetitions have been selected. This yields 132 sites that will be used for decoding (out of 132 total).
cv_data <- get_data(ds)
# an example of splitting the data into a training and test set,
# this is done in the cross-validator
training_set <- dplyr::filter(cv_data,
                              time_bin == "time.100_250",
                              CV_1 == "train") %>% # get data from the first CV split
  dplyr::select(starts_with("site"), train_labels)
test_set <- dplyr::filter(cv_data, CV_1 == "test") %>% # get data from the first CV split
  dplyr::select(starts_with("site"), test_labels, time_bin)
# use the cl object to make predictions
cl <- cl_max_correlation()
predictions <- get_predictions(cl, training_set, test_set)
names(predictions)
## [1] "test_time" "actual_labels" "predicted_labels"
## [4] "decision_vals.car" "decision_vals.couch" "decision_vals.face"
## [7] "decision_vals.flower" "decision_vals.guitar" "decision_vals.hand"
## [10] "decision_vals.kiwi"
# see how accurate the predictions are (chance is 1/7)
predictions_at_100ms <- dplyr::filter(predictions, test_time == "time.100_250")
mean(predictions_at_100ms$actual_labels == predictions_at_100ms$predicted_labels)
## [1] 1
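As a sketch of what a new CL object might look like, the block below implements a nearest-class-mean classifier that returns the variables listed above. It is a swapped-in toy technique, not the algorithm used by cl_max_correlation(): the class name cl_nearest_mean is hypothetical, the decision values are simply negative squared Euclidean distances to each class mean, and the generics are redefined locally (with assumed signatures) only so the sketch runs without NeuroDecodeR.

```r
# Stand-in generics so the sketch runs on its own (signatures are assumptions)
get_predictions <- function(cl_obj, training_set, test_set) UseMethod("get_predictions")
get_properties <- function(ndr_obj) UseMethod("get_properties")

# Hypothetical classifier with no settings
cl_nearest_mean <- function() structure(list(), class = "cl_nearest_mean")

get_predictions.cl_nearest_mean <- function(cl_obj, training_set, test_set) {
  site_cols <- grep("^site", names(training_set), value = TRUE)
  train_x <- as.matrix(training_set[site_cols])
  test_x <- as.matrix(test_set[site_cols])

  # training: one mean vector (prototype) per class, classes in rows
  class_names <- sort(unique(as.character(training_set$train_labels)))
  prototypes <- do.call(rbind, lapply(class_names, function(cls) {
    colMeans(train_x[training_set$train_labels == cls, , drop = FALSE])
  }))
  rownames(prototypes) <- class_names

  # testing: decision value = negative squared distance to each prototype
  decision_vals <- do.call(cbind, lapply(class_names, function(cls) {
    -rowSums(sweep(test_x, 2, prototypes[cls, ])^2)
  }))
  colnames(decision_vals) <- paste0("decision_vals.", class_names)

  # the predicted class is the one with the largest decision value
  predicted <- class_names[max.col(decision_vals, ties.method = "first")]

  cbind(
    data.frame(test_time = test_set$time_bin,
               actual_labels = test_set$test_labels,
               predicted_labels = predicted, stringsAsFactors = FALSE),
    as.data.frame(decision_vals)
  )
}

get_properties.cl_nearest_mean <- function(ndr_obj) {
  data.frame(cl.cl_nearest_mean = "nearest class mean, Euclidean distance")
}

# small hand-made inputs in the format described above
training_set <- data.frame(site_0001 = c(0, 1, 10, 11), site_0002 = c(0, 1, 10, 11),
                           train_labels = c("car", "car", "face", "face"))
test_set <- data.frame(site_0001 = c(9, 2), site_0002 = c(12, -1),
                       test_labels = c("face", "car"), time_bin = "time.1_100")

predictions <- get_predictions(cl_nearest_mean(), training_set, test_set)
names(predictions)
mean(predictions$actual_labels == predictions$predicted_labels)
```

Returning the optional decision_vals.X columns costs little here and gives result metrics more to work with than the predicted labels alone.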
Result metrics take the predictions made by a classifier and aggregate the results so that they can be interpreted.
To create a result metric, two methods must be implemented: aggregate_CV_split_results(), which aggregates the results after all the cross-validation splits in one resample run have been completed, and aggregate_resample_run_results(), which aggregates the final results across all the resample runs.
The aggregate_CV_split_results() method takes a data frame that is a concatenation of the prediction data frames made by the classifier (CL) objects across all times and cross-validation splits in one resample run. Thus the input data frame to the aggregate_CV_split_results() method has similar variables to the output of the CL get_predictions() method, namely:
CV: A number indicating which cross-validation split the current row comes from.
train_time: The train time that the current row comes from.
test_time: The test time that the current row comes from.
actual_labels: The labels that were actually present on each trial.
predicted_labels: The labels that the classifier predicted.
decision_vals.X (optional): A group of variables that has values that indicate how strongly the classifier rates a test point as coming from a particular class.
The output of aggregate_CV_split_results() is an RM object of the same type that inherits from a data frame, so that the results can be aggregated together (e.g., via rbind) across resample runs. The variables in the data frame can be anything that is useful to capture about the classification performance.
The aggregate_resample_run_results() method takes result metric data frames that have been aggregated together (e.g., via rbind) across resample runs. Thus this input data frame has the same variables as the data frame returned by aggregate_CV_split_results(), along with one additional variable indicating which resample run each row comes from.
The output of this method should be an RM object of the same type that is a data frame, which will most likely be smaller than the input.
Like all NDR objects, RM objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set, to allow for reproducible research.
RM objects can also have plot methods to allow the different aggregated results to be plotted.
Examples of using result metrics can be seen by going through the Introduction tutorial.
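The two aggregation steps described above can be sketched with a toy result metric that reports the proportion of correct predictions. This is an illustration under stated assumptions, not one of the result metrics shipped with the NDR: the class name rm_accuracy, the accuracy and resample_run column names, and the generic signatures are all hypothetical stand-ins, defined locally so the sketch runs without NeuroDecodeR.

```r
# Stand-in generics so the sketch runs on its own (signatures are assumptions)
aggregate_CV_split_results <- function(rm_obj, prediction_results) {
  UseMethod("aggregate_CV_split_results")
}
aggregate_resample_run_results <- function(resample_run_results) {
  UseMethod("aggregate_resample_run_results")
}

# Hypothetical RM object: a data frame with an extra class for dispatch
rm_accuracy <- function(values = data.frame()) {
  structure(values, class = c("rm_accuracy", "data.frame"))
}

aggregate_CV_split_results.rm_accuracy <- function(rm_obj, prediction_results) {
  correct <- as.character(prediction_results$actual_labels) ==
    as.character(prediction_results$predicted_labels)
  # proportion correct for every train time x test time pair in this run
  accuracy <- aggregate(
    list(accuracy = as.numeric(correct)),
    by = list(train_time = prediction_results$train_time,
              test_time = prediction_results$test_time),
    FUN = mean
  )
  rm_accuracy(accuracy)
}

aggregate_resample_run_results.rm_accuracy <- function(resample_run_results) {
  # average over resample runs, which drops the resample_run variable
  averaged <- aggregate(
    list(accuracy = resample_run_results$accuracy),
    by = list(train_time = resample_run_results$train_time,
              test_time = resample_run_results$test_time),
    FUN = mean
  )
  rm_accuracy(averaged)
}

# toy predictions from one resample run (2 CV splits, 1 train/test time)
make_predictions <- function(predicted) {
  data.frame(CV = c(1, 1, 2, 2),
             train_time = "time.1_100", test_time = "time.1_100",
             actual_labels = c("car", "face", "car", "face"),
             predicted_labels = predicted)
}
run_1 <- aggregate_CV_split_results(rm_accuracy(),
                                    make_predictions(c("car", "face", "car", "car")))
run_2 <- aggregate_CV_split_results(rm_accuracy(),
                                    make_predictions(c("car", "face", "car", "face")))

# combine the runs via rbind, adding a variable that records the resample run
all_runs <- rm_accuracy(rbind(
  cbind(resample_run = 1, as.data.frame(run_1)),
  cbind(resample_run = 2, as.data.frame(run_2))
))
final <- aggregate_resample_run_results(all_runs)
final$accuracy  # 0.875, the mean of 0.75 and 1
```

Keeping the RM object a data frame subclass is what lets the per-run results be stacked with rbind and still dispatch to the right aggregation method afterwards.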
Cross-validators take a classifier (CL), a datasource (DS), feature preprocessor (FP) objects, and result metric (RM) objects, and run a cross-validation decoding scheme by training and testing the classifier with data generated from the datasource object (and possibly fed through the feature preprocessors first).
All cross-validators must implement a run_decoding() method. This method does not take any additional arguments (apart from the cross-validator itself).
The cross-validator returns a list, DECODING_RESULTS, which contains different RM objects that can be used to assess how accurately the classifier made predictions at different points in time.
Like all NDR objects, CV objects must also implement a get_properties() method which returns a data frame with one row that lists all the properties that have been set and also pulls all properties from the other NDR objects (e.g., from the DS, FP, CL and RM objects), to allow for reproducible research.
Examples of using the cv_standard cross-validator can be seen by going through the Introduction tutorial.
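To make the flow between the five object types concrete, here is a minimal sketch of a run_decoding() loop. It is an illustration under stated assumptions, not the implementation of cv_standard: the classes cv_toy, ds_fixed, cl_threshold and rm_acc are hypothetical, the generic signatures are assumed and redefined locally so the sketch runs without NeuroDecodeR, and the loop only trains and tests within the same time bin. A real cross-validator may do considerably more.

```r
# Stand-in generics (signatures are assumptions) so the sketch runs on its own
get_data <- function(ds_obj) UseMethod("get_data")
preprocess_data <- function(fp, training_set, test_set) UseMethod("preprocess_data")
get_predictions <- function(cl_obj, training_set, test_set) UseMethod("get_predictions")
aggregate_CV_split_results <- function(rm_obj, prediction_results) {
  UseMethod("aggregate_CV_split_results")
}
aggregate_resample_run_results <- function(resample_run_results) {
  UseMethod("aggregate_resample_run_results")
}
run_decoding <- function(cv_obj) UseMethod("run_decoding")

# Hypothetical cross-validator that trains and tests within the same time bin
cv_toy <- function(datasource, classifier, feature_preprocessors = list(),
                   result_metrics, num_resample_runs = 2) {
  structure(list(ds = datasource, cl = classifier, fps = feature_preprocessors,
                 rms = result_metrics, num_resample_runs = num_resample_runs),
            class = "cv_toy")
}

run_decoding.cv_toy <- function(cv_obj) {
  run_results <- lapply(seq_len(cv_obj$num_resample_runs), function(run) {
    cv_data <- get_data(cv_obj$ds)  # a real DS would reshuffle the splits here
    site_cols <- grep("^site", names(cv_data), value = TRUE)
    predictions <- list()
    for (cv_name in grep("^CV_", names(cv_data), value = TRUE)) {
      for (bin in unique(cv_data$time_bin)) {
        in_bin <- cv_data$time_bin == bin
        training_set <- cv_data[in_bin & cv_data[[cv_name]] == "train",
                                c(site_cols, "train_labels")]
        test_set <- cv_data[in_bin & cv_data[[cv_name]] == "test",
                            c(site_cols, "test_labels", "time_bin")]
        for (fp in cv_obj$fps) {  # parameters are learned from training_set only
          processed <- preprocess_data(fp, training_set, test_set)
          training_set <- processed$training_set
          test_set <- processed$test_set
        }
        split_predictions <- get_predictions(cv_obj$cl, training_set, test_set)
        predictions[[length(predictions) + 1]] <-
          cbind(CV = cv_name, train_time = bin, split_predictions)
      }
    }
    all_predictions <- do.call(rbind, predictions)
    lapply(cv_obj$rms, function(rm_obj) {
      run_result <- aggregate_CV_split_results(rm_obj, all_predictions)
      run_result$resample_run <- run
      run_result
    })
  })
  DECODING_RESULTS <- lapply(seq_along(cv_obj$rms), function(i) {
    per_run <- lapply(run_results, `[[`, i)
    combined <- do.call(rbind, per_run)
    class(combined) <- class(per_run[[1]])  # keep the RM class for dispatch
    aggregate_resample_run_results(combined)
  })
  names(DECODING_RESULTS) <- vapply(cv_obj$rms, function(rm_obj) class(rm_obj)[1],
                                    character(1))
  DECODING_RESULTS
}

# Toy DS, CL and RM classes, included only to exercise the loop above
get_data.ds_fixed <- function(ds_obj) ds_obj$data
get_predictions.cl_threshold <- function(cl_obj, training_set, test_set) {
  class_means <- tapply(training_set$site_0001,
                        as.character(training_set$train_labels), mean)
  above <- test_set$site_0001 > mean(class_means)
  data.frame(test_time = test_set$time_bin, actual_labels = test_set$test_labels,
             predicted_labels = ifelse(above, names(which.max(class_means)),
                                       names(which.min(class_means))))
}
rm_acc <- function(values = data.frame()) {
  structure(values, class = c("rm_acc", "data.frame"))
}
aggregate_CV_split_results.rm_acc <- function(rm_obj, prediction_results) {
  correct <- as.character(prediction_results$actual_labels) ==
    as.character(prediction_results$predicted_labels)
  rm_acc(data.frame(accuracy = mean(correct)))
}
aggregate_resample_run_results.rm_acc <- function(resample_run_results) {
  rm_acc(data.frame(accuracy = mean(resample_run_results$accuracy)))
}

toy_data <- data.frame(
  train_labels = rep(c("car", "face"), 4), test_labels = rep(c("car", "face"), 4),
  time_bin = "time.1_100",
  site_0001 = c(0.1, 9.8, 0.3, 10.1, -0.1, 10.2, 0, 9.7),
  CV_1 = rep(c("test", "train"), each = 4), CV_2 = rep(c("train", "test"), each = 4)
)
cv <- cv_toy(structure(list(data = toy_data), class = "ds_fixed"),
             structure(list(), class = "cl_threshold"),
             result_metrics = list(rm_acc()))
DECODING_RESULTS <- run_decoding(cv)
DECODING_RESULTS$rm_acc$accuracy
```

Because the cross-validator talks to the other objects only through their generics, any DS, FP, CL or RM class that honors the interfaces described in this document can be swapped in without changing the loop.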