y = atan((x2 x3 - 1/(x2 x4)) / x1) + error

The [mlbench](http://cran.r-project.org/web/packages/mlbench/index.html) package has this in R code, as well as other simulation systems.

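If I'm remembering the mlbench naming correctly, the equation above is the third Friedman benchmark, so a few lines will generate data from it (a sketch; the correspondence to `mlbench.friedman3` is my assumption):

```{r friedman3Sketch}
# Data from the mlbench version of this system (assuming the equation above
# is the one implemented by mlbench.friedman3)
library(mlbench)
set.seed(1)
fried <- mlbench.friedman3(n = 200, sd = 0.1)
head(fried$x)   # the four predictors
head(fried$y)   # the simulated outcome
```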

I've been looking for a system that can be used to test a few different aspects of classification models:

* class imbalances
* non-informative predictors
* correlation among the predictors
* linear and nonlinear signals

I spent a few hours developing one. It models the log-odds of a binary event as a function of informative and non-informative predictors. The true signals are additive "sets" of a few different types. First, there are two main effects and an interaction:

intercept - 4A + 4B + 2AB

The intercept is a parameter for the simulation and can be used to control the amount of class imbalance.

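For instance, lowering the intercept from its default should make the classes more imbalanced. A quick sketch (the `intercept` argument name is an assumption here, mirroring the default of -5 mentioned later in the post):

```{r interceptSketch}
# Sketch: a lower intercept should yield more imbalanced classes
# (the `intercept` argument name is an assumption)
imbal <- twoClassSim(10000, intercept = -12)
table(imbal$Class) / nrow(imbal)
```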

The second set of effects is linear, with coefficients that alternate in sign and have values between 2.5 and 0.25. For example, if there were six predictors in this set, the contribution to the log-odds would be

-2.50C + 2.05D - 1.60E + 1.15F - 0.70G + 0.25H

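One way to get coefficients like these (a guess at the rule; the exact code may differ) is an evenly spaced sequence with alternating signs:

```{r linCoefSketch}
# A guess at how the alternating-sign coefficients could be generated;
# the exact rule in the simulation code may differ
linearVars <- 6
seq(2.5, 0.25, length = linearVars) * rep(c(-1, 1), length.out = linearVars)
# [1] -2.50  2.05 -1.60  1.15 -0.70  0.25
```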

The third set is a nonlinear function of a single predictor ranging between [0, 1], called J here:

(J^3) + 2 exp(-6 (J - 0.3)^2)

I saw this in one of [Radford Neal](http://www.cs.utoronto.ca/~radford/)'s presentations but I couldn't find an exact reference for it. The equation produces an interesting trend:

```{r nonlin,message=FALSE,results='hide',fig.height=4.25,fig.width=6, echo = FALSE}
library(ggplot2)
nonlinData <- data.frame(Predictor = seq(-1, 1, length = 500))
nonlinData$Log_Odds <- (nonlinData$Predictor^3) + 2 * exp(-6 * (nonlinData$Predictor - 0.3)^2)
qplot(Predictor, Log_Odds, data = nonlinData, geom = "line")
```

The fourth set of informative predictors is copied from one of Friedman's systems and uses a set of two (`K` and `L`):

2 sin(K L)

All of these effects are added up to model the log-odds. The log-odds are used to calculate the probability of a sample being in the first class, and a random uniform number is used to make the actual class assignment.

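As a rough sketch of that last step (illustrative only; `lp` below is a stand-in for the summed log-odds, not the simulation's internal code):

```{r assignSketch}
# Illustrative sketch of the class assignment: `lp` stands in for the summed
# log-odds (intercept plus the four signal sets above)
set.seed(143)
lp <- rnorm(5)                              # pretend these are the log-odds
prob <- 1 / (1 + exp(-lp))                  # probability of the first class
cls <- ifelse(runif(length(prob)) <= prob, "Class1", "Class2")
data.frame(log_odds = lp, probability = prob, class = cls)
```
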
We can also add non-informative predictors to the data. These are random standard normal predictors and can optionally be added in two ways: a specified number of independent predictors, or a set number of predictors that follow a particular correlation structure. The only two correlation structures that I've implemented are:

* compound symmetry (aka exchangeable), where there is a constant correlation between all the predictors
* first-order autoregressive [AR(1)]. While there is no time component to these data, we can use this structure to add predictors with varying levels of correlation.

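To make the two structures concrete, here is a small sketch of the correlation matrices for four predictors, and of drawing correlated normal predictors from one of them (the simulation function's internals may differ):

```{r corrSketch}
# The two correlation structures for four predictors with rho = 0.75
# (illustrative; the simulation function's internals may differ)
library(MASS)                                 # for mvrnorm
p <- 4
rho <- 0.75
csMat <- matrix(rho, p, p)
diag(csMat) <- 1                              # compound symmetry
ar1Mat <- rho^abs(outer(1:p, 1:p, "-"))       # AR(1): correlation is rho^|i - j|
corrPred <- mvrnorm(500, mu = rep(0, p), Sigma = ar1Mat)
round(cor(corrPred), 2)
```
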
For example, simulating ten predictors with a correlation parameter of 0.75 yields the following between-predictor correlation structure:

```{r corrplot,message=FALSE,results='hide',fig.height=5,fig.width=5, echo = FALSE}
library(corrplot)
set1 <- twoClassSim(2000, corrVars = 10, corrValue = .75)
cor1 <- cor(set1[, grepl("Corr", names(set1))])
corrplot(cor1)
```

To demonstrate, let's take a set of data and see how a support vector machine performs:

```{r sim,cache=TRUE}
set.seed(468)
training <- twoClassSim(  300, noiseVars = 100, corrVars = 100, corrValue = .75)
testing  <- twoClassSim(  300, noiseVars = 100, corrVars = 100, corrValue = .75)
large    <- twoClassSim(10000, noiseVars = 100, corrVars = 100, corrValue = .75)
```

The default for the number of informative linear predictors is 10, and the default intercept of -5 makes the class frequencies fairly balanced:

```{r class,cache=TRUE}
table(large$Class)/nrow(large)
```

We'll use the `train` function to tune and train the model:

```{r fullSet,cache=TRUE}
library(caret)
ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(1254)
fullModel <- train(Class ~ ., data = training,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 8,
                   metric = "ROC",
                   trControl = ctrl)
print(fullModel, digits = 3)
```

Cross-validation estimates the best area under the ROC curve to be `r I(round(caret:::getTrainPerf(fullModel)["TrainROC"], 3)[,1])`. Is this an accurate estimate? The test set has:

```{r fullSetTest,cache=TRUE}
library(pROC)
fullTest <- roc(testing$Class,
                predict(fullModel, testing, type = "prob")[,1],
                levels = rev(levels(testing$Class)))
fullTest
```

For this small test set, the estimate is `r I(round(abs(caret:::getTrainPerf(fullModel)["TrainROC"][,1] - as.vector(auc(fullTest))), 3))` larger than the resampled version. How do both of these compare to our approximation of the truth?

```{r fullSetLArge,cache=TRUE}
fullLarge <- roc(large$Class,
                 predict(fullModel, large, type = "prob")[,1],
                 levels = rev(levels(testing$Class)))
fullLarge
```

How much did the presence of the non-informative predictors affect this model? We know the true model, so we can fit that and evaluate it in the same way:

```{r trueSet,cache=TRUE}
realVars <- names(training)
realVars <- realVars[!grepl("(Corr)|(Noise)", realVars)]
set.seed(1254)
trueModel <- train(Class ~ ., data = training[, realVars],
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 8,
                   metric = "ROC",
                   trControl = ctrl)
print(trueModel, digits = 3)
```

Much higher! Is this verified by the other estimates?

```{r trueSetred}
trueTest <- roc(testing$Class,
                predict(trueModel, testing, type = "prob")[,1],
                levels = rev(levels(testing$Class)))
trueTest

trueLarge <- roc(large$Class,
                 predict(trueModel, large, type = "prob")[,1],
                 levels = rev(levels(testing$Class)))
trueLarge
```

At this point, we might want to see what would happen if all 200 non-informative predictors were uncorrelated, and so on. At least we now have a testing tool for making objective statements.

Code to create this can be found here and will end up making its way into the [caret](http://cran.r-project.org/web/packages/caret/index.html) package. Any suggestions for simulation systems?