Question: (Using python) (Time-)Series data and quantiles Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X

(Using python)

(Time-)Series data and quantiles

Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X = x0, x1, .., x0, which we assume to be a representative sample of some (stochastic) process, and a frequency threshold F. We want to determine a value xF such that values equal to xF or greater appear in the sequence no more often than, on average, at every F:th position in the series. There are two ways we can determine this threshold value:

Method (a) Find the smallest value xF in X such that for any two values xi and xj in X, if xi >= xF and xj >= xFthen abs(j - i) >= F. In other words, any two values that are greater than or equal to xF are at least F steps apart.

Method (b) Find the smallest value xF in Xn / F values in X are greater than or equal to xF. In other words, xF is the (n / F):th largest value in the series.

The value computed by method (b) is the top 1/F quantile of the data. If the sequence is sampled from a stationary random process, we expect to see values at or above the 1/F quantile on average every F sample.

The two methods of calculating the threshold value xF will often give different results.

Both methods above are defined as identifying exceptionally (once every F) high values in the series. We can of course also find the threshold for exceptionally (once every F) low values in an analogous way, by reversing the inequality.

Weather data

The Australian Bureau of Meteorology provides access to some of their weather observations, including some historical observations (as far as their records go back). Here are some example files:

Rainfall_Sydney_066062.csv: Daily total rainfall at Observatory Hill in Sydney (station number 066062), since the middle of 1858.

Rainfall_Canberra_070247.csv: Daily total rainfall at the National Botanic Gardens, Canberra (station number 070247), since late 1968.

Rainfall_Queanbeyan_070072.csv: Daily total rainfall at Queanbeyan bowling club (station number 070072), since late of 1870, but only to 2015. (This seems to be the longest-running measurement series taken in a single location near Canberra.)

The data files are in comma-separated value (CSV) format. They are text files, where each line, except the first, corresponds to one (daily) observation. Each line has eight fields, which are separated by a comma (','). The first line is a header, which has headings for each of the fields. The fields are as follows:

Column	Explanation
0	Product code (This is a string that identifies the type of data.)
1	Station number
2	Date of observation: Year
3	Date of observation: Month
4	Date of observation: Day
5	Observation data This is either the daily maximum temperature, in degrees Celsius, or daily total rainfaill, in millimetres, depending on the type of data file.
6	Number of days over which data was recorded. See detailed explanation below.
7	Quality assurance. This is either 'Y' (indicating that the data has been quality checked), or 'N' (if it has not).

Note that entries are sometimes incomplete: in particular, the data, number of days, and quality assurance fields are may be missing. Incomplete entires must be ignored when reading the file. However, you can rely on the date always being present. Also, the dates are always in increasing order, with no date repeated. Data that has not been quality checked is not necessarily wrong; in fact, you should assume its correct.

Observations are not always made daily. Sometimes, there is no observation for some days, followed by one entry where the number of days field is greater than one. Here is an example from the Canberra rainfall data:

Year	Month	Day	Data	#days
1971	12	01
1971	12	02	13.2	2
1971	12	03
1971	12	04
1971	12	05
1971	12	06
1971	12	07	2.5	5
1971	12	08	6.1	1
1971	12	09	10.4	1

This means that on the 2nd of December, a total rainfall of 13.2mm was recorded for the last two days, i.e., the 1st and the 2nd; then, on the 7th a total of 2.5mm was recorded for the five days from the 3rd to the 7th; and so on.

Rainfall data from other weather stations (in this format) is available from this page.

Aggregating observations

The weather data files have daily observations. From this, we can calculate aggregate time series, such as total monthly rainfall. Specifically, we can calculate:

The total for each month. This gives a series of monthly values.

The total for a specific month of the year. This gives a series of yearly values.

The total for each year. This gives a series of yearly values.

When computing the total for a month or year, it is important to check that data for the time period is complete - that is, that there is an observation for every day of the month or year. In cases where we only have observations of the total rainfall over a number of days, we can still determine a monthly total if the observation intervals align with the beginning and end of the month, i.e., if there is no multi-day observation that spans more than one month. For example, the data shown for December 1971 in Canberra above allows us to determine the total rainfall over the first 9 days of the month, even though we do not know exactly how much fell on each day.

On the other hand, to obtain a series of daily observations we have to use only those observations that are taken on a single day.

Task

Write a program that reads in a data file in the format described above, calculates a time series (daily, monthly, or for a specific month of the year), and calculates the threshold value xF for once in F using both methods (a) and (b) described above.

The user of the program should be able to:

provide the path to the file to be read;

select the type of time series aggregation;

select whether to compute a threshold for exceptionally high or low values;

and specify the frequency F.

The program should output the computed threshold values (clearly showing which was calculated according to which method) and the dates (or months, or years) at which the data exceeds it.

Using your program, a user should be able to obtain answers to questions such as:

How much is a once in 20 years rainfall for the month of June, at each of the three stations listed above, and on which years did this occur?

What is the once in a 1000 days amount of rain in a single day, for each of the three stations above, and on which dates did this happen?

What is the a once in 20 years driest (least rain) year in Queanbeyan, and in which years did this occur?

Example

As an example, here is how we would find the answer to the first question for the Canberra (botanical gardens) station: First, we read the data file, sum up the total for each month, and select only the entries for the month of June (where those are complete). This gives us the following time series:

June 1968	28.70mm
June 1969	56.00mm
June 1970	26.70mm
June 1971	5.40mm
June 1972	11.80mm
June 1973	47.00mm
June 1974	24.40mm
June 1975	70.70mm
June 1976	11.10mm
June 1977	36.80mm
June 1978	43.20mm
June 1979	5.60mm
June 1980	46.60mm
June 1981	136.20mm
June 1982	14.00mm
June 1983	44.00mm
June 1984	6.60mm
June 1985	37.20mm
June 1986	7.90mm
June 1987	33.20mm
June 1988	62.00mm
June 1989	35.20mm
June 1990	14.80mm
June 1991	143.80mm
June 1992	20.60mm
June 1993	33.40mm
June 1994	36.30mm
June 1995	52.80mm
June 1996	52.00mm
June 1997	115.60mm
June 1998	135.00mm
June 1999	46.00mm
June 2000	29.00mm
June 2001	42.00mm
June 2002	49.70mm
June 2003	47.80mm
June 2004	20.40mm
June 2005	93.60mm
June 2006	73.80mm
June 2007	74.60mm
June 2008	31.40mm
June 2009	37.30mm
June 2010	29.60mm
June 2011	10.80mm
June 2012	48.70mm
June 2013	89.20mm
June 2014	73.20mm
June 2015	72.20mm
June 2016	148.30mm
June 2017	0.40mm

Lets start with the simpler method, (b): There are 50 entries in the series, so the once-in-20-years threshold is the 50*(1/20)th highest value. Rounding the fraction down, thats the 2nd highest value in the series, which is 143.8. The two years in which the June rainfall equals or exceeds this threshold are 1991 and 2016. This is illustrated in the following graph:

The dashed line is the threshold; we can see that only two bars reach it.

Method (a) is trickier, and its for you to figure out how to compute the threshold value following its definition. In this particular example, it gives the same threshold (143.8mm). We can see from the graph that the two peaks that reach this value are more than 20 years apart, and also that if we lower the threshold to the next smaller value (from 1981) there would be two peaks less than 20 years apart. If we were to choose a more frequent interval, such as once every 10 years, the two methods would give different thresholds.

Plotting the data

Generating graphs like the one above is not a required part of the assignment, but it is a useful tool for validating and debugging your code. The graph above was drawn using matplotlib as follows:

import numpy as np import matplotlib.pyplot as mpl years = [1968, 1969, ..., 2016, 217] # list of years in the table above y = [28.7, 56.0, ..., 148.3, 0.4] # corresponding list of total rainfall values x = np.arange(0, len(y)) mpl.bar(x + 0.25, y, 0.5, color='blue') mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') # the threshold line mpl.xticks(x + 0.5, years, rotation=90) mpl.show()

A different plot, using only points instead of vertical bars for each data point, can be done as follows:

# years, x and y as above mpl.plot(x, y, '.b', markersize=5) mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') mpl.xticks(x, years, rotation=90) mpl.show()

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

Question: Evaluate the two forecasting models described in the case for predicting daily check-in volume. What are the strengths and weaknesses of each one? Do you find any of the results surprising?...

SUMMARY this article for about 2 pages with single spacing. Thank you. ECONOMETRICS: MEASUREMENT I N ECONOMICS1 I As the first incumbent of a new Chair in this University, I thought that in this...

summarize the main idea of each article, discuss issues being highlighted briefly, give opinion pertaining the coverage of each article and provide recommendations for each issue in the article....

PLEASE HELP ME IN SOLVING THIS. (40)1. A service has five tasks, performed in sequence. In the instance when there is more than one worker assigned to a task, each worker performs the entire task and...

Portray in words what transforms you would have to make to your execution to some degree (a) to accomplish this and remark on the benefits and detriments of this thought.You are approached to compose...

1. A player throws a fair die and simultaneously flips a fair coin. If the coin lands heads, then she wins twice, and if tails, then one-half of the value that appears on the die. Determine her...

Briefly describe ASCII and Unicode and draw attention to any relationship between them. [3 marks] (b) Briefly explain what a Reader is in the context of reading characters from data. [3 marks] A...

Math 104A Homework #3 Instructor: Lihui Chai General Instructions: Please write your homework papers neatly. You need to turn in both full printouts of your codes and the appropriate runs you made....

Suppose that R(A, B, C) is a relational schema with functional dependencies F = {A, B C, C B}. (i) Is this schema in 3NF? Explain. [2 marks] (ii) Is this schema in BCNF? Explain. [2 marks] (b)...

The Unix utility whose can be used to find the domain name corresponding to an organization or vice versa. Read the man page documentation for whois and experiment with it. Try whois princeton.edu...

Soft is a software co. established 6 years ago by PCC (a construction firm). PCC cannot give a very high salary as it would not be consistent with their other business. The ultimate aim of MSoft is...

The information below pertains to Tamarisk Company for 2 0 2 6 . There were no changes during 2 0 2 6 in the number of common shares, preferred shares, or convertible bonds outstanding. There is no...

Watermill Pools firma un acuerdo de 5 aos para arrendar equipos de Southampton Pool Supply. El equipo tiene una vida til de 6 aos. Cual de los siguientes es verdadero? 1-El arrendamiento debe...