

Question: Your task will be to analyse the data in a three-step pipeline and then focus on the interpretation of results in step 4.

Step 1: For each variety (X and Y), combine the gene expression data and process into a single file (12%).

Step 2: Perform initial processing to get the gene expression data into a usable format (12%).

Step 3: Manipulate and visualise the data to address the aims of the study (24%).

Step 4: Complete an MCQ based on your results and their biological interpretation (12%).

Input Files

You will see that the Unit8/Matrix/ directory contains the following files and folders you will need for this assessment: The gene expression data is currently split up into several files within various subdirectories of the /rds/homes/u/username/module-1-intro-to-biology-and-programming/Unit8/Matrix directory.

Write a Bash script that can be run within the Matrix/ directory to process the gene expression matrices to produce the following two output files:

A single csv file containing the gene expression matrix for variety X, including the header line. You should name this file GEM_X.csv (GEM stands for Gene Expression Matrix).

A single csv file containing the gene expression matrix for variety Y, including the header line. You should name this file GEM_Y.csv.

In the file headers, each sample name for the columns of the matrix currently has the following format: Variety, followed by condition code (C, 1, 2 or 3) and biological replicate (a, b or c). C is the control, and 1-3 are the stress conditions. For example, [...] denotes Variety X, Treatment C and replicate a.

Your gene expression matrix files should:

Contain only unique genes located on the 12 chromosomes, with genes sorted by chromosome.

Change the column labels in the header to a more convenient format, replacing a with Rep.1, b with Rep.2 and c with Rep.3. In the above example, [...] becomes [...].

Your submitted Linux script should be called Linux_your initials.sh (e.g. Linux_LC.sh).
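The submitted script for this step has to be Bash, but the relabelling rule itself is easy to check in isolation. The following is a minimal Python sketch of that rule only. It assumes each sample name ends with the replicate letter; the `X_C_a` style labels and the function names `relabel` and `relabel_header` are hypothetical, since the original example label is not preserved in the question text.

```python
import re

# Replicate letter -> new label, as listed in the task text.
REPLICATE_LABELS = {"a": "Rep.1", "b": "Rep.2", "c": "Rep.3"}


def relabel(sample_name: str) -> str:
    """Replace a trailing replicate letter (a, b or c) with its Rep label."""
    return re.sub(r"[abc]$", lambda m: REPLICATE_LABELS[m.group(0)], sample_name)


def relabel_header(header_line: str, sep: str = ",") -> str:
    """Relabel every sample column; the first column (gene ID) is left alone."""
    first, *samples = header_line.rstrip("\n").split(sep)
    return sep.join([first] + [relabel(s) for s in samples])


# Hypothetical header: the real sample-name layout may differ.
print(relabel_header("gene,X_C_a,X_C_b,X_C_c,X_1_a"))
```

In a Bash script the same substitution would typically be applied to the header line only, so that gene identifiers ending in a, b or c are not altered.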

Now you have collected all the data into a single gene expression matrix for each Variety, you should use Python to select the gene expression data from two conditions only and change the format into something that will be easier to analyse in R.

Create a Jupyter Notebook or a plain Python script which takes the following inputs:

The two gene expression matrices you created in Step 1 for varieties X and Y.

The two files containing information on the differentially expressed genes.

And gives the following outputs:

Two gene expression matrices (one for each Variety), each with a header, and containing gene expression from two conditions only: the control condition (in columns 2-4; replicates 1-3) and stress treatment condition 1 (columns 5-7; replicates 1-3). You should name your files all_VarX_TwoTimePoints.csv and all_VarY_TwoTimePoints.csv.

Two files, each with a header, containing information on the differentially expressed genes for each variety, and containing a subset of columns in this order: gene_name, log2FoldChange, padj, Athaliana_geneID and Gene_Function, followed by an additional six columns for the expression of the three replicates in each of the two conditions in the order given above. You should name your files Leaf_DEGs_VarX.csv and Leaf_DEGs_VarY.csv.
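One way to produce the two-condition matrices is to pick columns by name and write them in the required order (control replicates 1-3, then stress condition 1 replicates 1-3). The sketch below uses only the standard library and works on CSV text held in memory. The column labels of the form `X_C_Rep.1`, the function name `select_two_conditions` and the toy values are assumptions for illustration, not taken from the assessment files.

```python
import csv
import io


def select_two_conditions(csv_text: str, variety: str) -> str:
    """Keep the gene column plus control (C) and stress condition 1 columns.

    Output column order: gene, C Rep.1-3 (columns 2-4), 1 Rep.1-3 (columns 5-7).
    Assumes hypothetical labels such as 'X_C_Rep.1'.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Build the list of wanted column names in the required output order.
    wanted = [header[0]]
    for condition in ("C", "1"):
        for rep in ("Rep.1", "Rep.2", "Rep.3"):
            wanted.append(f"{variety}_{condition}_{rep}")

    # Look up each wanted column by name; raises ValueError if one is missing.
    indices = [header.index(name) for name in wanted]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(wanted)
    for row in reader:
        writer.writerow([row[i] for i in indices])
    return out.getvalue()


# Toy matrix with made-up values, only to show the reordering.
toy = (
    "gene,X_1_Rep.1,X_C_Rep.1,X_C_Rep.2,X_C_Rep.3,X_1_Rep.2,X_1_Rep.3,X_2_Rep.1\n"
    "g1,4,1,2,3,5,6,7\n"
)
print(select_two_conditions(toy, "X"))
```

Selecting by name rather than by position means the result does not depend on the column order of the Step 1 file. In a real script the text would be read from GEM_X.csv and the result written to all_VarX_TwoTimePoints.csv.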

The meaning of each selected column relating to differential gene expression is given below:

log2FoldChange: The logarithm to the base 2 of the expression ratio between the treatment condition and the control condition. This is a convenient way to quantify the change in expression between two conditions because a log2FC = 1 means the expression has doubled in the treatment condition whereas log2FC = -1 would mean the expression has halved.

padj: An adjusted p-value. You will learn more about the p-value in a future module. For now, this probability value gives a measure of how likely it is that the gene expression change is "real" instead of being seen by chance. The lower the p-value, the more "significant" we deem the change in expression to be.

Athaliana_geneID: The corresponding gene in the model plant Arabidopsis thaliana. This can be very useful information for downstream functional analysis of genes when working with crops like potatoes.

Gene_Function: Functional information available for the gene (what does it do?).
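The doubling and halving interpretation of log2FoldChange can be checked with a few lines of arithmetic. This is only a sketch of the definition given above: the function name `log2_fold_change` and the expression values are made up, and differential expression tools usually estimate this quantity from normalised counts rather than from a raw ratio.

```python
import math


def log2_fold_change(treatment: float, control: float) -> float:
    """Base-2 logarithm of the treatment/control expression ratio."""
    return math.log2(treatment / control)


print(log2_fold_change(200.0, 100.0))  # 1.0  (expression doubled)
print(log2_fold_change(50.0, 100.0))   # -1.0 (expression halved)
print(log2_fold_change(100.0, 100.0))  # 0.0  (no change)
```

Because the scale is symmetric around zero, a doubling and a halving have the same magnitude with opposite signs.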
