Question: Project - 3 : Predicting Algae Blooms Problem Description and Objectives High concentrations of certain harmful algae in rivers constitute a serious ecological problem with
Project: Predicting Algae Blooms
Problem Description and Objectives
High concentrations of certain harmful algae in rivers constitute a serious ecological problem with a strong impact not only on river lifeforms, but also on water quality. Being able to monitor and perform an early forecast of algae blooms is essential to improving the quality of rivers.
With the goal of addressing this prediction problem, several water samples were collected in different European rivers at different times during a period of approximately year. For each water sample, different chemical properties were measured as well as the frequency of occurrence of seven harmful algae. Some other characteristics of the water collection process were also stored, such as the season of the year, the river size, and the river speed.
One of the main motivations behind this application lies in the fact that chemical monitoring is cheap and easily automated, while the biological analysis of the samples to identify the algae that are present in the water involves microscopic examination, requires trained manpower, and is therefore both expensive and slow. As such, obtaining models that can accurately predict the algae frequencies based on chemical properties would facilitate the creation of cheap and automated systems for monitoring harmful algae blooms.
Another objective of this study is to provide a better understanding of the factors influencing the algae frequencies. Namely, we want to understand how these frequencies are related to certain chemical attributes of water samples as well as other characteristics of the samples like season of the year, type of river, etc.
Data Description
Training Data: The training data Analysistxt consists of data for water samples. Each observation contains information on variables. Three of these variables are nominal and describe the season of the year when the water samples to be aggregated were collected, as well as the size and speed of the river in question. The eight remaining variables are values of different chemical parameters measured in the water samples forming the aggregation, namely:
Maximum value
Minimum value of oxygen
Mean value of chloride
Mean value of nitrates
Mean value of ammonium
Mean of orthophosphate
Mean of total POphosphate
Mean of chlorophyll
Associated with each of these parameters are seven frequency numbers of different harmful algae found in the respective water samples. No information is given regarding the names of the algae that were identified.
The second dataset Evaltxt contains information on extra observations. It uses the same basic structure, but it does not include information concerning the seven harmful algae frequencies. These extra observations can be regarded as a kind of test set. The main goal of our study is to predict the frequencies of the seven algae for these water samples. This means that we are facing a predictive data mining task. This is one among the diverse set of problems tackled in data mining. In this type of task, our main goal is to obtain a model that allows us to predict the value of a certain target variable given the values of a set of predictor variables. This model may also provide indications on which predictor variables have a larger impact on the target variable; that is the model may provide a comprehensive description of the factors that influence the target variable.
Note: Unknown values are indicated with the string XXXXXXX
The Experiment
In this project you must do the following tasks either individually or with a partner
Task: Convert the text files into CSV files and convert them into Pandas' data frames: Analysis.txt into training data frame and Eval.txt into testing data frame.
Task: Remove the unknown values indicated with the string XXXXXXX and make them as NA values in your data frames.
Task: Use the statistical summary to conduct exploratory data analysis. Summarize your findings about the dataset including normality
Task: Use the graphical plots Histogram QQ plot, Boxplot, etc. to conduct exploratory data analysis. Summarize your findings about the dataset normality outliers, equality of variance, etc.,
Task: Remove the NA values in the data: Use mean value to replace the NA if the data is normally distributed; if the data is skewed then use the median to replace the data.
Task: Use the training dataset to build models to find the frequencies of each Algae types based on the chemical components present in the water.
Task: Using the trained model to find the frequencies of each Algae types present in the testing data set for the given chemical composition of the water.
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
