Question: In this project, we will analyze a single-cell RNA-seq dataset, with the goal of unveiling hierarchical structure and discovering important genes. The datasets provided are all different subsets of a larger single-cell RNA-seq dataset compiled by the Allen Institute. This data contains cells from the mouse neocortex, a region of the brain that governs higher-level functions such as perception and cognition.
The single-cell RNA-seq data comes in the form of a counts matrix, where
each row corresponds to a cell;
each column corresponds to the normalized transcript compatibility count (TCC) of an equivalence class of short RNA sequences, rescaled to units of counts per million. You can think of the TCC entry at location (i, j) of the data matrix as the level of expression of the j-th gene in the i-th cell.
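To make the layout concrete, here is a minimal sketch on a small synthetic matrix (not the project data). The array names and sizes are made up for illustration; only the row/column convention and the counts-per-million rescaling come from the description above.

```python
import numpy as np

# Synthetic stand-in for the counts matrix: 5 cells (rows) x 4 genes (columns).
# The real data would be loaded from the provided npy files instead.
rng = np.random.default_rng(0)
raw = rng.poisson(lam=20, size=(5, 4)).astype(float)

# Rescale every cell (row) so that its entries sum to one million,
# i.e. units of counts per million.
X = raw / raw.sum(axis=1, keepdims=True) * 1e6

# Entry (i, j): expression level of gene j in cell i.
i, j = 2, 3
expression_ij = X[i, j]

n_cells, n_genes = X.shape
print(n_cells, n_genes, expression_ij)
```

Indexing a whole row, `X[i]`, gives the expression profile of one cell, while a whole column, `X[:, j]`, gives one gene across all cells.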
Download geneanalysisdata.tar.gz. If you don't know how to open it, try WinZip or 7-Zip. The data is provided in three folders:
p, which is a small, labeled subset of the data. It contains the count matrix along with "ground truth" clustering labels, which were obtained by scientists using domain knowledge and statistical testing. This is for use in Problem 1.
punsupervised, which contains only a count matrix. This is for use in Problem 2.
pevaluation, which contains a labeled training and test set. This is for use in Problem 2 to evaluate feature selection.
The punsupervisedreduced and pevaluationreduced folders contain datasets with a reduced number of genes, in case you are unable to run some of the procedures on the larger versions. In particular, a full logistic regression could require gigabytes of memory to run.
In Problem 1 (autograded), you will explore a small subset of the data, using visualization and clustering methods to discover its structure.
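As a rough illustration of "visualization and clustering", the sketch below projects synthetic data onto two principal components and clusters the projection with a small k-means loop. The specific methods (PCA via SVD, Lloyd's k-means), the helper name `kmeans`, and all sizes are assumptions chosen for the example; the assignment does not prescribe them here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 groups of 50 "cells", each with 20 "genes".
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

# PCA via SVD of the column-centered data; keep the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                    # 2-D coordinates, e.g. for a scatter plot
explained = S**2 / np.sum(S**2)      # fraction of variance per component


def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper for this sketch)."""
    local_rng = np.random.default_rng(seed)
    centroids = data[local_rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


labels, centroids = kmeans(Z, k=3)
```

Plotting `Z[:, 0]` against `Z[:, 1]`, colored by `labels`, gives a first look at the group structure. Like any k-means run, the result depends on the random initialization.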
In Problem 2 (written report, peer review), you will use the tools from Problem 1 to explore a larger subset of the data. Using clustering combined with logistic regression, you will discover informative features which can be used to distinguish cells of different types.
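The idea of using logistic regression to find informative features can be sketched as follows: fit a classifier that predicts cluster labels from gene expression, then rank genes by the magnitude of their coefficients. Everything below is an assumption made for illustration: the data is synthetic, the labels are simulated rather than produced by a clustering step, and `fit_logistic` is a hypothetical gradient-descent helper rather than a library routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_genes = 100, 30

# Simulated cluster labels for two cell types. By construction, only
# genes 0 and 1 differ between the two types; the rest are pure noise.
y = np.repeat([0, 1], n_per_class)
X = rng.normal(size=(2 * n_per_class, n_genes))
X[y == 1, 0] += 3.0
X[y == 1, 1] -= 3.0

# Standardize genes so that coefficient magnitudes are comparable.
X = (X - X.mean(axis=0)) / X.std(axis=0)


def fit_logistic(X, y, lr=0.1, n_iter=500, l2=0.01):
    """Binary logistic regression with an L2 penalty, by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


w, b = fit_logistic(X, y)

# Rank genes by absolute coefficient and keep the two largest.
top_genes = np.argsort(-np.abs(w))[:2]
accuracy = np.mean(((X @ w + b) > 0) == (y == 1))
print(sorted(top_genes.tolist()), accuracy)
```

With more than two clusters, one common choice is a one-vs-rest or multinomial model, taking the top-ranked genes for each class. An evaluation on a separate labeled set (as in the pevaluation folder) would then check whether the selected genes generalize.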
Finally, in Problem 3 (written report, peer review), you will revisit open-ended decisions you made in your analyses, such as t-SNE hyperparameters or the number of clusters chosen, and explore how robust your end results are to these potentially ambiguous decisions.
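One simple way to quantify robustness is to compare the cluster assignments produced under two different settings. The sketch below uses the plain (unadjusted) Rand index, the fraction of cell pairs on which two clusterings agree. The choice of this metric and the toy label vectors are assumptions for illustration; scikit-learn's `adjusted_rand_score` is a chance-corrected alternative.

```python
import numpy as np


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way
    (both put the pair together, or both keep it apart)."""
    a = np.asarray(a)
    b = np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)   # each unordered pair once
    return np.mean(same_a[upper] == same_b[upper])


# Toy labelings of six cells, standing in for runs with different settings.
labels_run1 = np.array([0, 0, 1, 1, 2, 2])
labels_run2 = np.array([1, 1, 0, 0, 2, 2])   # same partition, clusters renamed
labels_run3 = np.array([0, 0, 0, 1, 1, 1])   # a coarser partition with 2 clusters

same_partition = rand_index(labels_run1, labels_run2)
coarser = rand_index(labels_run1, labels_run3)
print(same_partition, coarser)
```

Because the index only looks at pairs, it is unaffected by how the clusters happen to be numbered. The same idea applies to feature selection, for example by measuring the overlap between the gene sets selected under different settings.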
Hint: The data are only available in npy files. For people who prefer csv format, you can use a few lines of Python to convert npy files to csv files as below.
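    import numpy as np
    import pandas as pd

    # Adjust both paths to the folder where you extracted the data.
    X = np.load("data/p/X.npy")               # read the count matrix
    pd.DataFrame(X).to_csv("data/p/X.csv")    # write it out as CSV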
