Question: In this project, we will analyze a single-cell RNA-seq dataset, with the goal of unveiling hierarchical structure and discovering important

In this project, we will analyze a single-cell RNA-seq dataset, with the goal of unveiling hierarchical structure and discovering important genes. The datasets provided are all different subsets of a larger single-cell RNA-seq dataset compiled by the Allen Institute. The data contains cells from the mouse neocortex, a region of the brain that governs higher-level functions such as perception and cognition.
The single-cell RNA-seq data comes in the form of a counts matrix, where
each row corresponds to a cell, and
each column corresponds to the normalized transcript compatibility count (TCC) of an equivalence class of short RNA sequences, rescaled to units of counts per million. You can think of the TCC entry at location (i, j) of the data matrix as the level of expression of the j-th gene in the i-th cell.
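The row and column convention above can be illustrated with a minimal sketch. The matrix here is a small synthetic stand-in, not the real data, and the sizes and index values are chosen arbitrarily for illustration.

```python
import numpy as np

# Synthetic stand-in for the counts matrix (NOT the real data):
# 5 cells (rows) by 4 genes (columns).
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(5, 4)).astype(float)

n_cells, n_genes = X.shape

i, j = 2, 1                # arbitrary cell index i and gene index j
expression_ij = X[i, j]    # expression level of gene j in cell i
cell_profile = X[i, :]     # all gene expression levels for cell i
gene_profile = X[:, j]     # expression of gene j across all cells
```

With the real data, X would instead come from np.load on one of the provided .npy files.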
Download gene_analysis_data.tar.gz. (If you don't know how to open it, try WinZip or 7-Zip.) The data is provided in three folders:
p1, which is a small, labeled subset of the data. It contains the count matrix along with "ground truth" clustering labels, which were obtained by scientists using domain knowledge and statistical testing. This is for use in Problem 1.
p2_unsupervised, which contains only a count matrix. This is for use in Problem 2.
p2_evaluation, which contains a labeled training and test set. This is for use in Problem 2 to evaluate feature selection.
The p2_unsupervised_reduced and p2_evaluation_reduced folders contain datasets with a reduced number of genes, in case you are unable to run some of the procedures on the larger versions. In particular, a full logistic regression could take 1 or 2 GB of memory to run.
In Problem 1 (autograded), you will explore a small subset of the data, using visualization and clustering methods to discover its structure.
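As a rough illustration of the visualization step, here is a minimal sketch of a 2-D projection using PCA computed via the SVD. PCA is an assumption here (the problem statement does not name a specific method), and the data is synthetic: 60 made-up cells, half of them shifted along one gene.

```python
import numpy as np

# Synthetic stand-in (NOT the real data): 60 cells x 10 genes,
# with the second half of the cells shifted along gene 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
X[30:, 0] += 8.0

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt[:2].T                  # 60 x 2 coordinates for a scatter plot
explained = S**2 / np.sum(S**2)    # fraction of variance per component
```

Plotting Z[:, 0] against Z[:, 1] (for example with matplotlib) would show the two groups as separate point clouds along the first component.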
In Problem 2 (written report/peer review), you will use the tools from Problem 1 to explore a larger subset of the data. Using clustering combined with logistic regression, you will discover informative features that can be used to distinguish cells of different types.
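One possible way to combine clustering with logistic regression is sketched below, assuming scikit-learn is available. The specific choices (K-means with two clusters, default L2-regularized logistic regression with C=0.1, ranking genes by absolute coefficient) are illustrative assumptions, not the method prescribed by the assignment, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in (NOT the real data): 100 cells x 20 genes,
# where the two cell groups differ only in genes 0 and 1.
rng = np.random.default_rng(0)
n_per, n_genes = 50, 20
X = rng.normal(size=(2 * n_per, n_genes))
X[n_per:, :2] += 5.0

# Step 1: unsupervised clustering provides pseudo-labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: logistic regression trained to predict the cluster labels.
clf = LogisticRegression(C=0.1, max_iter=1000).fit(X, labels)

# Step 3: rank genes by the magnitude of their coefficients.
importance = np.abs(clf.coef_).sum(axis=0)
top_genes = np.argsort(importance)[::-1][:2]
```

On this toy data the two genes that actually separate the groups receive the largest coefficients. On real data the regularization strength and the number of selected features would need to be checked against a labeled evaluation set such as the one in p2_evaluation.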
Finally, in Problem 3 (written report/peer review), you will revisit open-ended decisions you made in your analyses, such as t-SNE hyperparameters or the number of clusters chosen, and explore how robust your end results are to these potentially ambiguous decisions.
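A robustness check on the number of clusters could look roughly like the following sketch, assuming scikit-learn is available. The silhouette score is used here as one possible quality measure (an assumption, not something the assignment specifies), and the data is a synthetic set of three well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in (NOT the real data): three well-separated
# groups of 40 cells each, measured on 5 genes.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(size=(40, 5)) for c in centers])

# Sweep the number of clusters and record a quality score for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The same loop structure could be reused for other open-ended choices, for example by sweeping the t-SNE perplexity and comparing the resulting embeddings.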
Hint: The data are only available in .npy files. If you prefer the .csv format, you can convert .npy files to .csv files with a few lines of Python, as below.
    import numpy as np
    import pandas as pd
    X = np.load("data/p1/X.npy")
    pd.DataFrame(X).to_csv("data/p1/X.csv")
