Before applying any algorithm to a real-world problem, it is common practice to first understand its behavior on synthetic data. Let's generate a synthetic data set for a regression problem: given input-output pairs, we aim to learn a function that maps inputs to outputs.
Suppose we have input $x$ and output $y$ that are related by the equation $y = f(x)$.
However, in practice, we can only observe noisy data. That is, given input $x$, we observe output $\tilde{y}$ that is related to $y$ by the equation $\tilde{y} = y + \epsilon$, where $\epsilon$ is a random noise term. A common model for $\epsilon$ is a Gaussian random variable with mean 0 and variance $\sigma^2$.
Suppose
. Generate 10 data points for
uniformly randomly distributed in the range of
. Make a plot that contains the following elements:
(1) The ground truth function.
(2) The noisy data points.
(3) Add a legend to the plot, where the ground truth is labeled "Ground truth" and the noisy data points are labeled "Noisy data". Label the x-axis "x" and the y-axis "y".
A sample figure is shown below.
# code here
../_images/a3011438376c90858f866272f856ad29e463b2dbf1ef86e4999129fa739fae47.png
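One way the data generation and plot could be sketched is shown below. The ground-truth function `sin(2*pi*x)`, the noise level `sigma=0.1`, the input range `[0, 1)`, the seed, and the helper names are all assumptions made for illustration, because the assigned values are not reproduced in the text above; substitute the assigned ones.

```python
import numpy as np


def ground_truth(x):
    """Hypothetical ground-truth function; replace with the assigned one."""
    return np.sin(2 * np.pi * x)


def make_noisy_data(n=10, low=0.0, high=1.0, sigma=0.1, seed=0):
    """Draw n inputs uniformly from [low, high) and add N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n)
    # rng.normal takes the standard deviation sigma, not the variance.
    y_noisy = ground_truth(x) + rng.normal(0.0, sigma, size=n)
    return x, y_noisy


def plot_regression_data(x, y_noisy, low=0.0, high=1.0):
    """Plot the ground truth as a curve and the noisy samples as points."""
    # Imported here so the data helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    grid = np.linspace(low, high, 200)
    fig, ax = plt.subplots()
    ax.plot(grid, ground_truth(grid), label="Ground truth")
    ax.scatter(x, y_noisy, color="red", label="Noisy data")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return fig, ax
```

Calling `x, y = make_noisy_data()` and then `plot_regression_data(x, y)` builds the figure; the ground truth is drawn on a dense grid so it appears as a smooth curve rather than as ten connected points.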
Q2
We usually denote a normal distribution with mean $\mu$ and variance $\sigma^2$ as $\mathcal{N}(\mu, \sigma^2)$.
Let's generate a synthetic data set for a classification problem. Let X be a random variable that follows a normal distribution
and Y be a random variable that follows a normal distribution
. X and Y can represent one feature measured in two groups, for example the height of high school students and the height of college students. Different groups can have different distributions of the same feature.
Generate 1000 samples for X and 500 samples for Y. This models the scenario where we have more data for one group than the other.
(1) Plot the histograms of X and Y in the same figure.
(2) Label the histogram of X as "X samples" and the histogram of Y as "Y samples". Add a legend to the plot.
(3) The two histograms should have different colors, with the transparency set to 0.5 so that the overlap of the two histograms is visible.
Hint: the transparency parameter is named alpha in most plotting libraries.
A sample figure is shown below.
# code here
../_images/54d9c5e95db5a97e323e3c204399d24d054405d09aa9d5e42b31d9e4ba1fabef.png
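A possible sketch of the two overlapping histograms follows. The distribution parameters (X with mean 0 and Y with mean 2, both with unit variance), the seed, the bin count, the colors, and the helper names are assumptions for illustration only, since the assigned parameters are not reproduced above.

```python
import numpy as np


def sample_groups(n_x=1000, n_y=500, seed=0):
    """Draw samples for the two groups (parameters are assumed, see below)."""
    rng = np.random.default_rng(seed)
    # Assumed distributions: X ~ N(0, 1) and Y ~ N(2, 1).
    # rng.normal expects the standard deviation (scale), not the variance.
    x = rng.normal(loc=0.0, scale=1.0, size=n_x)
    y = rng.normal(loc=2.0, scale=1.0, size=n_y)
    return x, y


def plot_histograms(x, y, bins=30):
    """Overlay both histograms with alpha=0.5 so the overlap stays visible."""
    # Imported here so sample_groups works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(x, bins=bins, alpha=0.5, color="tab:blue", label="X samples")
    ax.hist(y, bins=bins, alpha=0.5, color="tab:orange", label="Y samples")
    ax.legend()
    return fig, ax
```

Because the notation above is parameterized by the variance while NumPy's `scale` argument is the standard deviation, a variance of 4 would be passed as `scale=2.0`.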
Q3. Let's bring our data science skills to Wall Street. One model of stock prices is the random walk model:
Suppose $X_t$ is the stock price at day $t$. $X_0$ is the initial stock price. At each day, the change of stock price is a random variable $Z_t$, which is normally distributed with mean $\mu$ and variance $\sigma^2$. The stock price at day $t+1$ is $X_{t+1} = X_t + Z_t$.
(1) Write a function stock_price_simulation that takes the following inputs:
X0: the initial stock price
mu: the mean of the normal distribution
sigma: the standard deviation of the normal distribution
n: the number of days
Return a list (or numpy array) of the stock prices at each day, $X_0, X_1, \dots, X_n$.
# code here
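A minimal sketch of such a function, assuming the additive random walk described above. Two details are assumptions of this sketch rather than requirements from the text: the returned array includes the initial price (so it has `n + 1` entries), and the optional `rng` argument is a hypothetical addition that makes runs reproducible.

```python
import numpy as np


def stock_price_simulation(X0, mu, sigma, n, rng=None):
    """Simulate an additive Gaussian random walk for n days.

    Returns an array of length n + 1 with the prices on days 0..n.
    Including day 0 is an assumption of this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Daily changes are i.i.d. normal; sigma is the standard deviation.
    changes = rng.normal(mu, sigma, size=n)
    # The price on day t is X0 plus the sum of the first t changes.
    return X0 + np.concatenate(([0.0], np.cumsum(changes)))
```

Using `np.cumsum` avoids an explicit Python loop over days; a loop that appends `prices[-1] + change` at each step would produce the same trajectory.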
(2) Take
. Sample 10 trajectories of the stock price and plot them in the same graph.
A sample figure is shown below.
# code here
../_images/4d3532783f6bbe597cb4c1585510a55c09c4ae61867a938093bafac287afb5d9.png
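Sampling several trajectories at once could look like the sketch below. The helper names and the seed are hypothetical, and any parameter values passed in (for example `X0=100`, `mu=0`, `sigma=1`, `n=100`) are assumptions, since the assigned values are not reproduced above.

```python
import numpy as np


def sample_trajectories(X0, mu, sigma, n, n_paths, seed=0):
    """Return an array of shape (n_paths, n + 1), one random walk per row."""
    rng = np.random.default_rng(seed)
    # One row of i.i.d. N(mu, sigma^2) daily changes per trajectory.
    changes = rng.normal(mu, sigma, size=(n_paths, n))
    start = np.full((n_paths, 1), float(X0))
    # Prepend day 0, then accumulate the changes along each row.
    return np.hstack([start, X0 + np.cumsum(changes, axis=1)])


def plot_trajectories(paths):
    """Draw every trajectory on the same axes."""
    # Imported here so sample_trajectories works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    days = np.arange(paths.shape[1])
    for path in paths:
        ax.plot(days, path)
    ax.set_xlabel("day")
    ax.set_ylabel("stock price")
    return fig, ax
```

For example, `plot_trajectories(sample_trajectories(100.0, 0.0, 1.0, 100, 10))` draws ten paths that all start at the same initial price and spread out over time.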
(3) Estimate the expectation and standard deviation of the stock price on day 100, using 1000 samples.
# code here
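A Monte Carlo estimate can be sketched as follows. The parameters `X0=100`, `mu=0`, `sigma=1` and the seed are assumptions for illustration. For an additive random walk with independent Gaussian steps, the day-`n` price has mean `X0 + n*mu` and standard deviation `sigma*sqrt(n)`, which gives a reference value to compare against.

```python
import numpy as np

X0, mu, sigma = 100.0, 0.0, 1.0   # assumed parameters, not from the assignment
n_days, n_samples = 100, 1000     # day 100 and 1000 samples, as in the question

rng = np.random.default_rng(0)    # seeded for reproducibility (an assumption)
# Each row holds the 100 daily changes of one simulated trajectory.
changes = rng.normal(mu, sigma, size=(n_samples, n_days))
final_prices = X0 + changes.sum(axis=1)

mean_est = final_prices.mean()
std_est = final_prices.std(ddof=1)  # sample standard deviation

# Reference values for the sum of independent Gaussian steps.
mean_exact = X0 + n_days * mu
std_exact = sigma * np.sqrt(n_days)

print(f"estimated mean {mean_est:.2f}, reference {mean_exact:.2f}")
print(f"estimated std  {std_est:.2f}, reference {std_exact:.2f}")
```

Only the final price matters here, so the sketch sums the changes instead of storing whole trajectories.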
(4) (Challenge, not graded) A call option is a contract that allows you to buy a stock at a fixed price at a future date. Suppose you own a call option that allows you to buy a stock at day 100 at a price of 105 (this is called the strike price).
If the stock price at day 100 is above 105, you can exercise the option, pay 105 to get the stock, and sell it at the market price to make a profit. Otherwise, you don't exercise the option and don't make a profit.
Estimate the probability that you can make a profit using the call option. Whether you are the seller or the buyer of this call option, estimate what its fair price should be.
# code here
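One possible approach is sketched below, under stated assumptions: the parameters `X0=100`, `mu=0`, `sigma=1`, the seed, and the sample size are made up for illustration, and the "fair price" is taken to be the expected payoff `max(X_100 - 105, 0)` with no interest or discounting. Other definitions of a fair price are possible.

```python
import numpy as np

X0, mu, sigma = 100.0, 0.0, 1.0   # assumed parameters, not from the assignment
n_days, strike = 100, 105.0       # day and strike price from the question
n_samples = 20_000                # assumed Monte Carlo sample size

rng = np.random.default_rng(0)    # seeded for reproducibility (an assumption)
changes = rng.normal(mu, sigma, size=(n_samples, n_days))
final_prices = X0 + changes.sum(axis=1)

# The option pays off only when the day-100 price exceeds the strike.
prob_profit = np.mean(final_prices > strike)

# Payoff is the amount by which the price exceeds the strike, or zero.
payoff = np.maximum(final_prices - strike, 0.0)
fair_price = payoff.mean()        # expected payoff, ignoring discounting

print(f"P(profit) ~ {prob_profit:.3f}, fair price ~ {fair_price:.3f}")
```

At this price neither side gains on average: the buyer's expected payoff equals what was paid, and the seller's expected payout equals what was received.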
Q4
One model of wealth inequality is the Pareto distribution.
Let's generate N = 1000 samples from a Pareto distribution with parameter a = 20 using the following code:
import numpy as np
N = 1000
a = 20
x = np.random.pareto(a, N)
You can think of x as samples of wealth of a population.
(1) Plot the histogram of the samples.
# code here
(2) The k-quantile of a distribution is the value such that k% of the samples are less than or equal to the value. For example, the 50-quantile is the median.
What are the median and the mean of the samples? What percentage of the population has above-average wealth?
# code here
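These summary statistics could be computed as in the sketch below. It uses a seeded `Generator` instead of the global `np.random.pareto` call, which is an assumption made so that the numbers are reproducible. Both calls draw from the Pareto II (Lomax) distribution.

```python
import numpy as np

N = 1000
a = 20
rng = np.random.default_rng(0)   # seeded generator (an assumption)
x = rng.pareto(a, N)             # Pareto II (Lomax) samples, all non-negative

median = np.percentile(x, 50)    # the 50-quantile in the sense defined above
mean = x.mean()

# Share of samples strictly above the sample mean, as a percentage.
pct_above_mean = 100.0 * np.mean(x > mean)

print(f"median {median:.4f}, mean {mean:.4f}")
print(f"{pct_above_mean:.1f}% of the population is above average")
```

Because the distribution is right-skewed, the mean is pulled above the median by a few large values, so fewer than half of the samples lie above the mean.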
(3) Estimate the percentage of the population that owns more than 80% of the wealth.
Hint: you can sort the array such that $x_1 \ge x_2 \ge \dots \ge x_N$, and compute the cumulative sum of the array: $S_i = \sum_{j=1}^{i} x_j$. Then $S_i$ is the total wealth of the top $i$ people.
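The hint can be turned into code roughly as follows. The function name `top_share_fraction` is hypothetical, and the sketch reads the question as asking for the smallest group of richest people whose combined wealth exceeds the given share. It assumes non-negative wealth values and a share strictly below 1.

```python
import numpy as np


def top_share_fraction(wealth, share=0.8):
    """Smallest fraction of people (richest first) owning more than `share`."""
    sorted_wealth = np.sort(wealth)[::-1]    # richest first
    cum_wealth = np.cumsum(sorted_wealth)    # cum_wealth[i - 1]: top i people
    threshold = share * cum_wealth[-1]
    # First index whose cumulative wealth is strictly above the threshold;
    # adding 1 converts the zero-based index into a head count.
    n_people = np.searchsorted(cum_wealth, threshold, side="right") + 1
    return n_people / len(wealth)


rng = np.random.default_rng(0)               # seeded generator (an assumption)
x = rng.pareto(20, 1000)
print(f"{100 * top_share_fraction(x):.1f}% of the population owns over 80%")
```

`np.searchsorted` is valid here because the cumulative sum of non-negative values is non-decreasing; a plain loop over `cum_wealth` would give the same answer.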
