Question: In python: This data set comes from: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data After a biopsy of a tumor tissue tests are run on the tumor cells to determine a

In Python:

This data set comes from:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

After a biopsy of tumor tissue, tests are run on the tumor cells to determine a diagnosis of benign or malignant. The tests result in 30 different cell attribute measurement values. Some of the measured aspects are radius_mean, perimeter_mean, and area_mean, which measure the mean value of cell radius (distance from center point), perimeter, and area. Looking at the web page and the data set, you can see the other 27 values. Based on these 30 values, a formula is applied to determine whether the tumor is malignant or benign.

The two following files show the data:

breastCancerDataReducedDimensions.csv: Only the first 4 attributes (you can read this file instead of the file containing the entire set)

breastCancerData.csv: The full data set

Note, the first column is the sample id. The second column is the diagnosis for the sample, where M means malignant and B means benign.
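A minimal sketch of reading this layout with Python's standard csv module is shown here. It assumes the file has a header row with the column names used on the Kaggle page (id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean). The sample rows and the helper name load_samples are made up for illustration; with the real file you would pass open("breastCancerData.csv", newline="") instead of the io.StringIO stand-in.

```python
import csv
import io

# Toy stand-in for the CSV file; these rows are made up for illustration.
SAMPLE_CSV = """id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean
101,M,17.5,20.1,115.0,950.0
102,B,11.2,15.3,71.0,380.0
103,B,13.4,19.0,86.5,520.0
104,M,12.1,21.4,78.0,450.0
105,M,12.0,16.0,77.0,440.0
"""

# The four metrics the simplified process looks at (assumed header names).
FIELDS = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean"]


def load_samples(handle):
    """Return {sample_id: (diagnosis, {field: value})} keeping only FIELDS."""
    samples = {}
    for row in csv.DictReader(handle):
        metrics = {name: float(row[name]) for name in FIELDS}
        samples[row["id"]] = (row["diagnosis"], metrics)
    return samples


samples = load_samples(io.StringIO(SAMPLE_CSV))
print(len(samples), samples["101"][0])
```

Keeping the ids as strings avoids any surprises when they are written back out; they can be converted with int() when numeric sorting is needed.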

At lunch one day, you and a medical technician come up with the idea that all this data and the complicated formula are not needed. Instead, you decide you just need to look at the means of the first four metrics {radius, texture, perimeter, area}. The process is as follows:

a) strip the data to only consider those 4 values

b) Create four data files:

q3_gte_13: third attribute - those data samples whose radius value is >= 13

q4_gte_18: fourth attribute - those data samples whose texture value is >= 18

q5_gte_85: fifth attribute - those data samples whose perimeter value is >= 85

q6_gte_500: sixth attribute - those data samples whose area value is >= 500

c) Find the data ids that are in each of these four files. The idea is that if a data sample meets or exceeds the threshold (13, 18, 85, and 500) for each of these 4 attributes, then the tumor is malignant. If the data sample does not reach any of these thresholds, then the tumor is benign. If it reaches some, but not all, of these thresholds, then the tumor could be either benign or malignant.

To do this, you need to take the intersection of 4 files where the files have the ids of these data sets.
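Steps (b) and (c) can be sketched with Python sets, one set of ids per threshold file. This is only an illustration under stated assumptions: the rows are made up, the tuple layout (diagnosis, radius, texture, perimeter, area) and the helper name ids_at_or_above are hypothetical, and the sets are kept in memory rather than written to files.

```python
# Made-up rows: id -> (diagnosis, radius, texture, perimeter, area).
rows = {
    "101": ("M", 17.5, 20.1, 115.0, 950.0),
    "102": ("B", 11.2, 15.3, 71.0, 380.0),
    "103": ("B", 13.4, 19.0, 86.5, 520.0),
    "104": ("M", 12.1, 21.4, 78.0, 450.0),
    "105": ("M", 12.0, 16.0, 77.0, 440.0),
}

# File name -> (position of the metric in the row tuple, threshold).
THRESHOLDS = {
    "q3_gte_13": (1, 13.0),
    "q4_gte_18": (2, 18.0),
    "q5_gte_85": (3, 85.0),
    "q6_gte_500": (4, 500.0),
}


def ids_at_or_above(rows, index, threshold):
    """Ids of samples whose metric at `index` is >= threshold."""
    return {sid for sid, row in rows.items() if row[index] >= threshold}


id_sets = {
    name: ids_at_or_above(rows, index, threshold)
    for name, (index, threshold) in THRESHOLDS.items()
}

# NewResult: ids present in all four threshold sets.
new_result = set.intersection(*id_sets.values())
print(sorted(new_result))  # ['101', '103']
```

In this toy data, sample 103 lands in the intersection even though it is labeled B, which is the kind of disagreement the later comparison steps are meant to expose.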

The four files (q3_gte_13, q4_gte_18, q5_gte_85, q6_gte_500) also have the diagnosis of B or M from the original test included. You want to test the quality of your process.

To do this create 2 versions for each of the 4 dimensions:

q3_B, q3_M

Where B means the data has been diagnosed as Benign and M means it was diagnosed as Malignant.

Likewise for columns 4, 5, and 6 giving files:

q4_B, q4_M

q5_B, q5_M

q6_B, q6_M
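One possible way to build the eight B/M subsets is to split each threshold set by the recorded diagnosis. The sketch below assumes the threshold files have already been reduced to sets of ids; the ids, diagnoses, and the helper name split_by_diagnosis are made up for illustration.

```python
# Made-up diagnoses keyed by sample id (illustration only).
diagnosis = {"101": "M", "102": "B", "103": "B", "104": "M", "105": "M"}

# Made-up contents of the four threshold files, as sets of ids.
threshold_ids = {
    "q3": {"101", "103"},
    "q4": {"101", "103", "104"},
    "q5": {"101", "103"},
    "q6": {"101", "103"},
}


def split_by_diagnosis(ids, diagnosis):
    """Return (benign_ids, malignant_ids) for the given set of ids."""
    benign = {sid for sid in ids if diagnosis[sid] == "B"}
    malignant = {sid for sid in ids if diagnosis[sid] == "M"}
    return benign, malignant


# Build q3_B, q3_M, ..., q6_B, q6_M as entries of one dictionary.
split = {}
for name, ids in threshold_ids.items():
    benign, malignant = split_by_diagnosis(ids, diagnosis)
    split[name + "_B"] = benign
    split[name + "_M"] = malignant

print(sorted(split["q4_M"]))  # ['101', '104']
```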

With these files you can now test how well your new easier process works.

Let file NewResult contain the intersection of ids from the four files (q3_gte_13, q4_gte_18, q5_gte_85, q6_gte_500), i.e. regardless of whether the original methods said M or B.

There are two ways to test this new process.

Method 1:

Compare NewResult to the data found in the 4 files you created of ids of M data: q3_M, q4_M, q5_M, and q6_M. Let SubsetMResult contain the data that is the union of the four files (q3_M, q4_M, q5_M, and q6_M). Then, calculate:

Difference_1 = SubsetMResult - NewResult

If your new method captures all the same data, then Difference_1 should be the empty set.
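Method 1 maps directly onto Python's set union and difference operators. The following sketch uses made-up id sets in place of the real file contents; the lowercase variable names are just Python-style spellings of SubsetMResult, NewResult, and Difference_1.

```python
# Made-up ids standing in for the contents of the four M files.
q3_M = {"101"}
q4_M = {"101", "104"}
q5_M = {"101"}
q6_M = {"101"}

# Made-up NewResult (intersection of the four threshold files).
new_result = {"101", "103"}

# SubsetMResult: union of the four M files.
subset_m_result = q3_M | q4_M | q5_M | q6_M

# Difference_1: malignant ids seen in some threshold file but not in all four.
difference_1 = subset_m_result - new_result
print(sorted(difference_1))  # ['104']
```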

Method 2:

From the original data set, breastCancerData.csv, find all the ids that are marked M. Call this set OriginalResult.

Difference_2 = OriginalResult - NewResult

Again, if your new method captures all the same data, then Difference_2 should be the empty set.
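Method 2 can be sketched the same way, starting from every id labeled M in the full data set. The diagnoses and NewResult below are made up for illustration, with the full data set represented as a plain dictionary.

```python
# Made-up diagnoses for the full data set, keyed by sample id.
diagnosis = {"101": "M", "102": "B", "103": "B", "104": "M", "105": "M"}

# Made-up NewResult (intersection of the four threshold files).
new_result = {"101", "103"}

# OriginalResult: every id marked M in the original data.
original_result = {sid for sid, label in diagnosis.items() if label == "M"}

# Difference_2: malignant ids that the four-threshold rule did not flag.
difference_2 = original_result - new_result
print(sorted(difference_2))  # ['104', '105']
```

In this toy data, OriginalResult includes an M sample (105) that is assumed to reach none of the thresholds, so it would never appear in any of the q*_M files; with inputs like that, the two differences can have different lengths.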

To turn in for Q4, write files containing:

contents of difference 1 (in sorted order)

contents of difference 2 (in sorted order)
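Writing a set of ids in sorted order might look like the sketch below. It assumes the ids are integer-like strings, so sorting uses key=int (a plain string sort would put "104" before "99"). The file name, the ids, and the helper name write_sorted_ids are hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile


def write_sorted_ids(path, ids):
    """Write one id per line in numeric order; assumes integer-like id strings."""
    ordered = sorted(ids, key=int)
    with open(path, "w") as out:
        for sid in ordered:
            out.write(sid + "\n")
    return ordered


difference_2 = {"105", "104", "99"}  # made-up ids

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "difference_2.txt")  # hypothetical file name
    write_sorted_ids(path, difference_2)
    with open(path) as handle:
        written = handle.read().splitlines()

print(written)  # ['99', '104', '105']
```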

Written Responses:

What is the proportion of observations in Original Result with DIAGNOSIS = M?

What is the length of SubsetMResult and NewResult?

What is the length of difference 1 and difference 2?

If the length of difference_1 and difference_2 are not the same, what accounts for this difference?

What are the implications of selecting Method 1 (SubsetMResult) vs. Method 2 (OriginalResult), given the type of data used in this problem?

