Question: (Using python) (Time-)Series data and quantiles Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X

(Using python)

(Time-)Series data and quantiles

Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X = x0, x1, .., x0, which we assume to be a representative sample of some (stochastic) process, and a frequency threshold F. We want to determine a value xF such that values equal to xF or greater appear in the sequence no more often than, on average, at every F:th position in the series. There are two ways we can determine this threshold value:

Method (a) Find the smallest value xF in X such that for any two values xi and xj in X, if xi >= xF and xj >= xFthen abs(j - i) >= F. In other words, any two values that are greater than or equal to xF are at least F steps apart.

Method (b) Find the smallest value xF in Xn / F values in X are greater than or equal to xF. In other words, xF is the (n / F):th largest value in the series.

The value computed by method (b) is the top 1/F quantile of the data. If the sequence is sampled from a stationary random process, we expect to see values at or above the 1/F quantile on average every F sample.

The two methods of calculating the threshold value xF will often give different results.

Both methods above are defined as identifying exceptionally (once every F) high values in the series. We can of course also find the threshold for exceptionally (once every F) low values in an analogous way, by reversing the inequality.

Weather data

The Australian Bureau of Meteorology provides access to some of their weather observations, including some historical observations (as far as their records go back). Here are some example files:

Rainfall_Sydney_066062.csv: Daily total rainfall at Observatory Hill in Sydney (station number 066062), since the middle of 1858.

Rainfall_Canberra_070247.csv: Daily total rainfall at the National Botanic Gardens, Canberra (station number 070247), since late 1968.

Rainfall_Queanbeyan_070072.csv: Daily total rainfall at Queanbeyan bowling club (station number 070072), since late of 1870, but only to 2015. (This seems to be the longest-running measurement series taken in a single location near Canberra.)

The data files are in comma-separated value (CSV) format. They are text files, where each line, except the first, corresponds to one (daily) observation. Each line has eight fields, which are separated by a comma (','). The first line is a header, which has headings for each of the fields. The fields are as follows:

Column Explanation
0 Product code (This is a string that identifies the type of data.)
1 Station number
2 Date of observation: Year
3 Date of observation: Month
4 Date of observation: Day
5 Observation data This is either the daily maximum temperature, in degrees Celsius, or daily total rainfaill, in millimetres, depending on the type of data file.
6 Number of days over which data was recorded. See detailed explanation below.
7 Quality assurance. This is either 'Y' (indicating that the data has been quality checked), or 'N' (if it has not).

Note that entries are sometimes incomplete: in particular, the data, number of days, and quality assurance fields are may be missing. Incomplete entires must be ignored when reading the file. However, you can rely on the date always being present. Also, the dates are always in increasing order, with no date repeated. Data that has not been quality checked is not necessarily wrong; in fact, you should assume its correct.

Observations are not always made daily. Sometimes, there is no observation for some days, followed by one entry where the number of days field is greater than one. Here is an example from the Canberra rainfall data:

Year Month Day Data #days
1971 12 01
1971 12 02 13.2 2
1971 12 03
1971 12 04
1971 12 05
1971 12 06
1971 12 07 2.5 5
1971 12 08 6.1 1
1971 12 09 10.4 1

This means that on the 2nd of December, a total rainfall of 13.2mm was recorded for the last two days, i.e., the 1st and the 2nd; then, on the 7th a total of 2.5mm was recorded for the five days from the 3rd to the 7th; and so on.

Rainfall data from other weather stations (in this format) is available from this page.

Aggregating observations

The weather data files have daily observations. From this, we can calculate aggregate time series, such as total monthly rainfall. Specifically, we can calculate:

The total for each month. This gives a series of monthly values.

The total for a specific month of the year. This gives a series of yearly values.

The total for each year. This gives a series of yearly values.

When computing the total for a month or year, it is important to check that data for the time period is complete - that is, that there is an observation for every day of the month or year. In cases where we only have observations of the total rainfall over a number of days, we can still determine a monthly total if the observation intervals align with the beginning and end of the month, i.e., if there is no multi-day observation that spans more than one month. For example, the data shown for December 1971 in Canberra above allows us to determine the total rainfall over the first 9 days of the month, even though we do not know exactly how much fell on each day.

On the other hand, to obtain a series of daily observations we have to use only those observations that are taken on a single day.

Task

Write a program that reads in a data file in the format described above, calculates a time series (daily, monthly, or for a specific month of the year), and calculates the threshold value xF for once in F using both methods (a) and (b) described above.

The user of the program should be able to:

provide the path to the file to be read;

select the type of time series aggregation;

select whether to compute a threshold for exceptionally high or low values;

and specify the frequency F.

The program should output the computed threshold values (clearly showing which was calculated according to which method) and the dates (or months, or years) at which the data exceeds it.

Using your program, a user should be able to obtain answers to questions such as:

How much is a once in 20 years rainfall for the month of June, at each of the three stations listed above, and on which years did this occur?

What is the once in a 1000 days amount of rain in a single day, for each of the three stations above, and on which dates did this happen?

What is the a once in 20 years driest (least rain) year in Queanbeyan, and in which years did this occur?

Example

As an example, here is how we would find the answer to the first question for the Canberra (botanical gardens) station: First, we read the data file, sum up the total for each month, and select only the entries for the month of June (where those are complete). This gives us the following time series:

June 1968 28.70mm
June 1969 56.00mm
June 1970 26.70mm
June 1971 5.40mm
June 1972 11.80mm
June 1973 47.00mm
June 1974 24.40mm
June 1975 70.70mm
June 1976 11.10mm
June 1977 36.80mm
June 1978 43.20mm
June 1979 5.60mm
June 1980 46.60mm
June 1981 136.20mm
June 1982 14.00mm
June 1983 44.00mm
June 1984 6.60mm
June 1985 37.20mm
June 1986 7.90mm
June 1987 33.20mm
June 1988 62.00mm
June 1989 35.20mm
June 1990 14.80mm
June 1991 143.80mm
June 1992 20.60mm
June 1993 33.40mm
June 1994 36.30mm
June 1995 52.80mm
June 1996 52.00mm
June 1997 115.60mm
June 1998 135.00mm
June 1999 46.00mm
June 2000 29.00mm
June 2001 42.00mm
June 2002 49.70mm
June 2003 47.80mm
June 2004 20.40mm
June 2005 93.60mm
June 2006 73.80mm
June 2007 74.60mm
June 2008 31.40mm
June 2009 37.30mm
June 2010 29.60mm
June 2011 10.80mm
June 2012 48.70mm
June 2013 89.20mm
June 2014 73.20mm
June 2015 72.20mm
June 2016 148.30mm
June 2017 0.40mm

Lets start with the simpler method, (b): There are 50 entries in the series, so the once-in-20-years threshold is the 50*(1/20)th highest value. Rounding the fraction down, thats the 2nd highest value in the series, which is 143.8. The two years in which the June rainfall equals or exceeds this threshold are 1991 and 2016. This is illustrated in the following graph:

The dashed line is the threshold; we can see that only two bars reach it.

Method (a) is trickier, and its for you to figure out how to compute the threshold value following its definition. In this particular example, it gives the same threshold (143.8mm). We can see from the graph that the two peaks that reach this value are more than 20 years apart, and also that if we lower the threshold to the next smaller value (from 1981) there would be two peaks less than 20 years apart. If we were to choose a more frequent interval, such as once every 10 years, the two methods would give different thresholds.

Plotting the data

Generating graphs like the one above is not a required part of the assignment, but it is a useful tool for validating and debugging your code. The graph above was drawn using matplotlib as follows:

import numpy as np import matplotlib.pyplot as mpl years = [1968, 1969, ..., 2016, 217] # list of years in the table above y = [28.7, 56.0, ..., 148.3, 0.4] # corresponding list of total rainfall values x = np.arange(0, len(y)) mpl.bar(x + 0.25, y, 0.5, color='blue') mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') # the threshold line mpl.xticks(x + 0.5, years, rotation=90) mpl.show() 

A different plot, using only points instead of vertical bars for each data point, can be done as follows:

# years, x and y as above mpl.plot(x, y, '.b', markersize=5) mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') mpl.xticks(x, years, rotation=90) mpl.show() 

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!