Question: The following code uses cross-validation in order to estimate predictive accuracy for a linear model of days-to-remission as a function of gene expression in ALL

The following code uses cross-validation in order to estimate predictive accuracy for a linear model of days-to-remission as a function of gene expression in ALL dataset. It runs to completion without errors but produces a number of warnings (shown below) about "differing numbers of rows" and "mismatches in object lengths."

Please explain the source of those warnings and how they can be cleaned up. Please also explain how whatever caused those warnings affects the output (if at all), and how and why (and if) the output changes upon fixing the code. (Hint: in order to observe these warnings you do not have to go through all 12K genes at each step of cross-validation - one percent of that amount is plenty - and it will save you a lot of time you would otherwise waste watching it run; remember also that R is an interpreter, so you can run commands one at a time if you need to and examine their outputs).

library(ALL) data(ALL) 
set.seed(1234) 
# calculate days-to-remission: 
ALL.pdat <- pData(ALL) date.cr.chr <- as.character(ALL.pdat$date.cr) diag.chr <- as.character(ALL.pdat$diagnosis) date.cr.t <- strptime(date.cr.chr,"%m/%d/%Y") diag.t <- strptime(diag.chr,"%m/%d/%Y") ALL.pdat$D2R <- as.numeric(date.cr.t - diag.t) 
# prepare the data structures: 
ALL.exprs <- exprs(ALL)[,!is.na(ALL.pdat$D2R)] ALL.pdat <- ALL.pdat[!is.na(ALL.pdat$D2R),] n.xval <- 5 s2.xval <- numeric() 
xval.grps <- sample(1:dim(ALL.pdat)[1]%%n.xval+1) 
# run over each cross-validation: 
for ( i.xval in 1:n.xval ) { min.pval <- 1.0 
 min.id <- NA train.exprs <- ALL.exprs[,xval.grps!=i.xval] train.d2r <- ALL.pdat[xval.grps!=i.xval,"D2R"]  # evaluate each gene in the training dataset to find the one # most associated with the outcome:  for( i in 1:dim(train.exprs)[1]) {  ###for( i in 1:100 ) { 
 p.val <- anova(lm(train.d2r~train.exprs[i,],))[1,"Pr(>F)"] if ( p.val < min.pval ) { 
 min.pval <- p.val 

min.id <- i }

}

 # print the gene found: 
 cat(rownames(train.exprs)[min.id],min.pval,fill=T) 
 # refit the model for best gene found on training dataset: 
 best.lm.xval <- lm(train.d2r~train.exprs[min.id,]) 
 # calculate predictions on test dataset: 
 test.exprs <- ALL.exprs[,xval.grps==i.xval] test.d2r <- ALL.pdat[xval.grps==i.xval,"D2R"] test.pred <- predict( 
 best.lm.xval,data.frame(t(test.exprs),test.d2r) ) 
 # accumulate squared errors of prediction: 
 s2.xval <- c(s2.xval,(test.pred-test.d2r)^2) } 
40176_at 1.433363e-05 35296_at 8.721938e-07 1213_at 3.760985e-06 34852_g_at 2.161217e-06 33901_at 1.399374e-06 Warning messages: 
1: 'newdata' had 19 rows but variables found have 77 rows 2: In test.pred - test.d2r : 
 longer object length is not a multiple of shorter object length ... 
# print average squared error in cross-validation: 
 mean(s2.xval) 

[1] 332.7707

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!