Question:

Note that examples of problems for you to find and solve can be:
- Identify which suburb/location had the biggest growth in SalePrice by plotting and
examining the sale prices across different suburbs;
- Analyse a possible pattern of SalePrice vs YrSold/MoSold, LotArea and/or some other
variables which can reasonably be included;
- Use predictions from your final model to compare suburbs which have shown
varying growth, or to identify which suburbs have been growing the most over the
last few years.
UG students (unit 11374): Generate and address at least five problems.
G students (unit 11517): Generate and address at least seven problems, including the last
problem listed above which uses predictions from your final model, e.g. find a way to
compare the predictions (maybe median?) between suburbs (could be the top 5 suburbs)
which have shown varying growth from your time series plots of growth over time.
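As a starting point for the suburb growth problems above, the following is a minimal sketch rather than a prescribed method. It assumes the data has a column named `Suburb` (a hypothetical name; the brief does not give one) alongside `YrSold` and `SalePrice`, and it uses made-up toy values. Growth is measured here as the percentage change in median SalePrice between two chosen years, which is only one of several reasonable definitions.

```python
import pandas as pd

# Toy data for illustration only. "Suburb" is an assumed column name;
# YrSold and SalePrice are the names used in the brief.
sales = pd.DataFrame({
    "Suburb":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "YrSold":    [2015, 2015, 2017, 2017, 2015, 2015, 2017, 2017],
    "SalePrice": [400_000, 420_000, 500_000, 520_000,
                  300_000, 320_000, 310_000, 330_000],
})


def median_growth(df, first_year, last_year):
    """Percentage change in median SalePrice per suburb between two years."""
    # Rows are suburbs, columns are years, cells are median sale prices.
    medians = (
        df.groupby(["Suburb", "YrSold"])["SalePrice"]
        .median()
        .unstack("YrSold")
    )
    growth = (medians[last_year] / medians[first_year] - 1.0) * 100.0
    return growth.sort_values(ascending=False)


growth = median_growth(sales, 2015, 2017)
top_suburb = growth.index[0]  # suburb with the largest median growth
print(growth.round(1))
```

The same `medians` table can be plotted as one line per suburb to produce the time series plots of growth over time that the brief mentions.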
2. Data preprocessing:
In this section you should:
- Preprocess your data, treat missing values, etc.
- Note at least one key observation, e.g. possible missing values or outliers
for a particular area/suburb or year (e.g. 2016 is significantly higher), or perhaps one
column that is missing more than 50% of its values.
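The missing-value checks described above could look something like the sketch below. The data is invented, `PoolArea` is a hypothetical column added only to show a mostly empty field, and both the 50% threshold and the median imputation are illustrative choices, not requirements of the brief.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only. "PoolArea" is a hypothetical sparse column.
df = pd.DataFrame({
    "SalePrice": [300_000, 450_000, 520_000, 610_000],
    "LotArea":   [500.0, np.nan, 650.0, 700.0],
    "PoolArea":  [np.nan, np.nan, np.nan, 40.0],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Columns missing more than half of their values are candidates for removal.
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
cleaned = df.drop(columns=sparse_cols)

# Fill the remaining gaps in LotArea with the column median (one simple option).
cleaned["LotArea"] = cleaned["LotArea"].fillna(cleaned["LotArea"].median())

print(missing_share)
print(cleaned)
```

Printing `missing_share` before dropping anything gives a ready-made key observation to report.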
3. EDA:
In this section you should:
- Include tasks such as determining which variables are significant, which observations
may be outliers etc., and other EDA goals.
- Find as much insight as possible to support your modelling decisions later on.
- Use data visualisation techniques taught in the unit to answer your chosen problems
of interest.
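Two of the EDA tasks above, gauging which variables relate to the target and spotting outliers, can be sketched as follows. The values are invented, the helper `iqr_outliers` is a hypothetical name, and the 1.5 x IQR rule is just one common convention for flagging outliers. A correlation coefficient is also only a rough screen, since it captures linear association and not statistical significance.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "LotArea":   [400, 450, 500, 550, 600, 650, 700, 3000],
    "YrSold":    [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016],
    "SalePrice": [310_000, 335_000, 360_000, 390_000,
                  410_000, 440_000, 465_000, 480_000],
})

# Pearson correlation of each predictor with the target.
correlations = df.corr()["SalePrice"].drop("SalePrice")


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


lot_outliers = iqr_outliers(df["LotArea"])
print(correlations)
print(lot_outliers)
```

Scatter plots of each predictor against SalePrice remain the better evidence for the report; the numbers here mainly help decide which plots to look at first.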
4. Further preprocessing:
In this section you should:
- Select the final variables for your model based on your EDA (basically remove the
non-significant variables).
- Create any new variables which you think may help based on your EDA in this
section.
- Justify your decisions and provide EDA evidence as to how a variable is insignificant
(e.g. no observable relationship to target variable in scatter plot).
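As an illustration of creating new variables, the sketch below derives two candidate features from columns named in the brief. `SaleTime` and `LogLotArea` are hypothetical names, and whether either feature actually helps is something the EDA would need to support.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "YrSold":  [2015, 2016, 2017],
    "MoSold":  [1, 7, 12],
    "LotArea": [400.0, 650.0, 900.0],
})

# Combine year and month into one continuous time variable (fractional years),
# so a trend over time can be modelled with a single predictor.
df["SaleTime"] = df["YrSold"] + (df["MoSold"] - 1) / 12

# Log-transform LotArea to reduce the influence of very large lots.
df["LogLotArea"] = np.log(df["LotArea"])

print(df)
```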
5. Modelling:
In this section you should:
- Fit and evaluate a linear model to describe the relationship between your target
variable and a number of selected significant predictors.
- Use your model to predict the prices of properties described by your test dataset.
Alternatively, you may use another, more advanced model of your choice. If you do use a
linear model, remember its preferences, such as an approximately normal distribution of the
target variable (a transformation of the target may help here).
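The modelling step can be sketched with ordinary least squares. This example uses NumPy's `np.linalg.lstsq` purely to stay dependency-light; it is not the tool the unit prescribes, and a modelling library would serve equally well. The data is synthetic, the train/test split is arbitrary, and taking the log of SalePrice is one common way to bring a skewed target closer to normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: price scales with lot area,
# with multiplicative noise. This relationship is an assumption.
lot_area = rng.uniform(300, 900, size=200)
sale_price = 1_000 * lot_area * np.exp(rng.normal(0, 0.1, size=200))

# Simple hold-out split: first 150 rows to train, last 50 to test.
train, test = slice(0, 150), slice(150, 200)


def design(x):
    """Design matrix with an intercept column and log(LotArea)."""
    return np.column_stack([np.ones_like(x), np.log(x)])


# Fit the linear model on the log of the target.
coef, *_ = np.linalg.lstsq(
    design(lot_area[train]), np.log(sale_price[train]), rcond=None
)

# Predict on the test rows, then transform back to the price scale.
predicted_price = np.exp(design(lot_area[test]) @ coef)

print("intercept, slope:", coef)
```

Predictions must be transformed back with `np.exp` before they are compared with actual prices, otherwise the error is measured on the log scale.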
6. Evaluation:
You should:
- Evaluate your model against the metric RMSE given the actual values in the test
dataset.
- Plot the residuals similar to that shown in the Week 10 slides. Pick a suitable cut-off
value for the red dots.
The data science methodology is an iterative process. Try to minimise your RMSE, so always
go back and think about what improvements can be made, then fit another model, and find
your second RMSE, and so on, noting what works and what does not. Compare at least two
different models you considered, noting their differences.
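For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it on invented numbers and flags large residuals. The cut-off of 1.5 times the RMSE is an arbitrary illustrative choice, not the value from the Week 10 slides; the flagged points are the ones that would be drawn as red dots in a residual plot.

```python
import numpy as np

# Toy actual and predicted sale prices, for illustration only.
actual = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
predicted = np.array([310_000.0, 440_000.0, 600_000.0, 600_000.0])

residuals = actual - predicted

# Root mean squared error.
rmse = np.sqrt(np.mean(residuals ** 2))

# Flag residuals whose magnitude exceeds the chosen cut-off.
cutoff = 1.5 * rmse  # assumed cut-off; pick one that suits your data
flagged = np.abs(residuals) > cutoff

print(f"RMSE: {rmse:,.0f}")
print("Flagged observations:", np.flatnonzero(flagged))
```

Recording the RMSE of each model you try in a small table makes the required comparison of at least two models straightforward.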
7. Recommendations and final conclusions:
You should:
- Summarise your findings and present your solutions to your problems of
interest. Note anything you found particularly interesting and useful to your project.
- State the best RMSE you obtained and why/how (i.e. what variables you used, any
applied transformations etc.).
- State any improvements you could make and why/how you could achieve such
improvements in future works.
8. References:
You should:
- Include a reference list and cite your references via in-text referencing or footnotes.
