Question: Source of data: Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene

Source of data: Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by
gene expression monitoring, Science, Vol. 286:531-537.
The data set golub consists of the expression levels of 3051 genes for 38 tumor mRNA samples. Each tumor
mRNA sample comes from one patient (i.e., 38 patients in total); 27 of these tumor samples correspond to
acute lymphoblastic leukemia (ALL) and the remaining 11 to acute myeloid leukemia (AML).
You will need to discover how many genes can be used to differentiate the tumor types (meaning that their
expression level differs between the two tumor types) using
(i) the uncorrected p-values,
(ii) the Holm-Bonferroni correction, and
(iii) the Benjamini-Hochberg correction.
Feel free to use libraries for multiple hypothesis testing in R or python.
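The two corrections named above can be sketched in plain NumPy to show what a library routine does internally. This is a minimal illustration, not the course's reference solution: the function names `holm_bonferroni` and `benjamini_hochberg` and the array `toy_p` are hypothetical, and the toy p-values are made up for the example rather than taken from the golub data.

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by Holm's step-down procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value (k = 0..m-1) with alpha / (m - k).
    passed = p[order] <= alpha / (m - np.arange(m))
    # Step down: stop at the first failure; nothing after it is rejected.
    n_reject = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value (k = 1..m) with k * alpha / m.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    # Step up: reject everything up to the largest k that passes.
    n_reject = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

# Made-up p-values, only to show how the three counts can differ.
toy_p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
n_uncorrected = int(np.sum(toy_p < 0.05))          # 5
n_holm = int(holm_bonferroni(toy_p).sum())         # 1
n_bh = int(benjamini_hochberg(toy_p).sum())        # 2
```

On these toy values the uncorrected threshold keeps the most hypotheses, Holm-Bonferroni the fewest, and Benjamini-Hochberg falls in between, which is the ordering to expect in general because Holm controls the family-wise error rate while Benjamini-Hochberg controls the false discovery rate.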
If you are using Python, you can use the following code to load the data:
import zipfile
import numpy as np
with zipfile.ZipFile("statsreview_release1.zip") as zip_file:
    golub_data, golub_classnames = (
        np.genfromtxt(zip_file.open('data_and_materials/golub_data/{}'.format(fname)),
                      delimiter=',', names=True,
                      converters={0: lambda s: int(s.strip(b'"'))})
        for fname in ['golub.csv', 'golub_cl.csv'])
Part (a)
0.0/2.0 points (graded)
Let $\bar{x}_{ALL,i}$ be the mean of the expression levels for gene $i$ across the ALL mRNA samples. Similarly, let
$\bar{x}_{AML,i}$ be the same but for the AML mRNA samples instead.
Here $N_{ALL}$ and $N_{AML}$ are the numbers of mRNA samples for the ALL tumors and AML tumors,
respectively.
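If $s_{ALL,i}^2$ is the sample variance for gene $i$ across the ALL mRNA observations, then the corresponding
variance for $\bar{x}_{ALL,i}$ is
$$s_{\bar{x}_{ALL,i}}^2 = \frac{s_{ALL,i}^2}{N_{ALL}}$$
and likewise, for the AML samples,
$$s_{\bar{x}_{AML,i}}^2 = \frac{s_{AML,i}^2}{N_{AML}}$$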
We can use $x_i = \bar{x}_{ALL,i} - \bar{x}_{AML,i}$ as a metric for the difference in expression levels for gene $i$. The
variance of this metric is
$$s_{x_i}^2 = s_{\bar{x}_{ALL,i}}^2 + s_{\bar{x}_{AML,i}}^2$$
This allows us to use the following test statistic:
$$t_{Welch,i} = \frac{\bar{x}_{ALL,i} - \bar{x}_{AML,i}}{\sqrt{\frac{s_{ALL,i}^2}{N_{ALL}} + \frac{s_{AML,i}^2}{N_{AML}}}}$$
which you can recognize as similar to the t-test statistic. The test based on it is known as the Welch unequal
variances t-test.
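The statistic can be computed for all genes at once with vectorized NumPy operations. The sketch below uses synthetic data in place of the golub matrix, and it assumes a layout with genes in rows and samples in columns; the names `x_all`, `x_aml`, and `welch_t` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the expression data: rows are genes, columns are samples.
n_genes, n_all, n_aml = 5, 27, 11
x_all = rng.normal(0.0, 1.0, size=(n_genes, n_all))
x_aml = rng.normal(0.5, 2.0, size=(n_genes, n_aml))

def welch_t(a, b):
    """Welch t statistic per row for two (genes x samples) arrays."""
    mean_diff = a.mean(axis=1) - b.mean(axis=1)
    # ddof=1 gives the unbiased sample variance s^2; dividing by the
    # number of samples gives the variance of the sample mean.
    var_of_mean_a = a.var(axis=1, ddof=1) / a.shape[1]
    var_of_mean_b = b.var(axis=1, ddof=1) / b.shape[1]
    return mean_diff / np.sqrt(var_of_mean_a + var_of_mean_b)

t_welch = welch_t(x_all, x_aml)  # one statistic per gene
```

Note the `ddof=1` argument: NumPy's default (`ddof=0`) divides by N rather than N - 1 and would not match the sample variance used in the formula.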
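The distribution of the Welch test statistic can be approximated by a t-distribution, but with a modified
number of degrees of freedom. The number of degrees of freedom is approximately
$$\nu_i \approx \frac{\left(\frac{s_{ALL,i}^2}{N_{ALL}} + \frac{s_{AML,i}^2}{N_{AML}}\right)^2}{\frac{1}{\nu_{ALL}}\left(\frac{s_{ALL,i}^2}{N_{ALL}}\right)^2 + \frac{1}{\nu_{AML}}\left(\frac{s_{AML,i}^2}{N_{AML}}\right)^2}$$
where $\nu_{ALL} = N_{ALL} - 1$ and $\nu_{AML} = N_{AML} - 1$.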
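As a rough sketch of how the degrees of freedom turn into p-values, the code below applies the Welch-Satterthwaite approximation per gene and then evaluates a two-sided tail probability with `scipy.stats.t.sf`. The data are synthetic and the helper names (`welch_df`, `welch_p_values`) are hypothetical; the genes-in-rows layout is an assumption.

```python
import numpy as np
from scipy import stats

def welch_df(a, b):
    """Welch-Satterthwaite degrees of freedom per row."""
    va = a.var(axis=1, ddof=1) / a.shape[1]   # s_a^2 / N_a
    vb = b.var(axis=1, ddof=1) / b.shape[1]   # s_b^2 / N_b
    # (va + vb)^2 / (va^2 / (N_a - 1) + vb^2 / (N_b - 1))
    return (va + vb) ** 2 / (va ** 2 / (a.shape[1] - 1) + vb ** 2 / (b.shape[1] - 1))

def welch_p_values(a, b):
    """Two-sided p-values of the Welch test per row."""
    va = a.var(axis=1, ddof=1) / a.shape[1]
    vb = b.var(axis=1, ddof=1) / b.shape[1]
    t = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(va + vb)
    # Two-sided test: double the upper-tail probability of |t|.
    return 2.0 * stats.t.sf(np.abs(t), df=welch_df(a, b))

rng = np.random.default_rng(0)
x_all = rng.normal(0.0, 1.0, size=(5, 27))   # synthetic "ALL" samples
x_aml = rng.normal(0.5, 2.0, size=(5, 11))   # synthetic "AML" samples
nu = welch_df(x_all, x_aml)
p_values = welch_p_values(x_all, x_aml)
```

The resulting degrees of freedom are generally not integers and always lie between the smaller of the two per-group values (here 10) and their sum (here 36). The same p-values should come out of `scipy.stats.ttest_ind(x_all, x_aml, axis=1, equal_var=False)`, which is a convenient cross-check.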
Use the Welch t-test to find the number of significantly associated genes (at a significance level of 0.05) using
uncorrected p-values.
How many genes are significant? (Please enter the value with a precision of at least two significant figures,
your answer will be graded with a 10% tolerance.)
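Counting the significant genes then amounts to thresholding the p-values. The sketch below does this on synthetic data with SciPy's built-in Welch test (`equal_var=False`); the matrix sizes, the injected shift, and the variable names are all made up for illustration, so the resulting count says nothing about the golub data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes = 200
# Synthetic data: rows are genes, columns are samples (27 "ALL", 11 "AML").
all_samples = rng.normal(0.0, 1.0, size=(n_genes, 27))
aml_samples = rng.normal(0.0, 1.0, size=(n_genes, 11))
aml_samples[:20] += 3.0  # only the first 20 genes truly differ between groups

# equal_var=False selects the Welch unequal variances t-test.
result = stats.ttest_ind(all_samples, aml_samples, axis=1, equal_var=False)
significant = result.pvalue < 0.05
n_significant = int(significant.sum())
print(n_significant)
```

Because each of the 180 unchanged genes still has roughly a 5% chance of falling below 0.05, the uncorrected count is expected to exceed the number of genes that truly differ, which is the motivation for the multiple-testing corrections.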