Question: (Using python) (Time-)Series data and quantiles Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X
(Using python)
(Time-)Series data and quantiles
Abstractly, the problem we consider can be described as follows: We are given a sequence of data points, X = x0, x1, .., x0, which we assume to be a representative sample of some (stochastic) process, and a frequency threshold F. We want to determine a value xF such that values equal to xF or greater appear in the sequence no more often than, on average, at every F:th position in the series. There are two ways we can determine this threshold value:
Method (a) Find the smallest value xF in X such that for any two values xi and xj in X, if xi >= xF and xj >= xFthen abs(j - i) >= F. In other words, any two values that are greater than or equal to xF are at least F steps apart.
Method (b) Find the smallest value xF in Xn / F values in X are greater than or equal to xF. In other words, xF is the (n / F):th largest value in the series.
The value computed by method (b) is the top 1/F quantile of the data. If the sequence is sampled from a stationary random process, we expect to see values at or above the 1/F quantile on average every F sample.
The two methods of calculating the threshold value xF will often give different results.
Both methods above are defined as identifying exceptionally (once every F) high values in the series. We can of course also find the threshold for exceptionally (once every F) low values in an analogous way, by reversing the inequality.
Weather data
The Australian Bureau of Meteorology provides access to some of their weather observations, including some historical observations (as far as their records go back). Here are some example files:
Rainfall_Sydney_066062.csv: Daily total rainfall at Observatory Hill in Sydney (station number 066062), since the middle of 1858.
Rainfall_Canberra_070247.csv: Daily total rainfall at the National Botanic Gardens, Canberra (station number 070247), since late 1968.
Rainfall_Queanbeyan_070072.csv: Daily total rainfall at Queanbeyan bowling club (station number 070072), since late of 1870, but only to 2015. (This seems to be the longest-running measurement series taken in a single location near Canberra.)
The data files are in comma-separated value (CSV) format. They are text files, where each line, except the first, corresponds to one (daily) observation. Each line has eight fields, which are separated by a comma (','). The first line is a header, which has headings for each of the fields. The fields are as follows:
| Column | Explanation |
| 0 | Product code (This is a string that identifies the type of data.) |
| 1 | Station number |
| 2 | Date of observation: Year |
| 3 | Date of observation: Month |
| 4 | Date of observation: Day |
| 5 | Observation data This is either the daily maximum temperature, in degrees Celsius, or daily total rainfaill, in millimetres, depending on the type of data file. |
| 6 | Number of days over which data was recorded. See detailed explanation below. |
| 7 | Quality assurance. This is either 'Y' (indicating that the data has been quality checked), or 'N' (if it has not). |
Note that entries are sometimes incomplete: in particular, the data, number of days, and quality assurance fields are may be missing. Incomplete entires must be ignored when reading the file. However, you can rely on the date always being present. Also, the dates are always in increasing order, with no date repeated. Data that has not been quality checked is not necessarily wrong; in fact, you should assume its correct.
Observations are not always made daily. Sometimes, there is no observation for some days, followed by one entry where the number of days field is greater than one. Here is an example from the Canberra rainfall data:
| Year | Month | Day | Data | #days |
| 1971 | 12 | 01 | ||
| 1971 | 12 | 02 | 13.2 | 2 |
| 1971 | 12 | 03 | ||
| 1971 | 12 | 04 | ||
| 1971 | 12 | 05 | ||
| 1971 | 12 | 06 | ||
| 1971 | 12 | 07 | 2.5 | 5 |
| 1971 | 12 | 08 | 6.1 | 1 |
| 1971 | 12 | 09 | 10.4 | 1 |
This means that on the 2nd of December, a total rainfall of 13.2mm was recorded for the last two days, i.e., the 1st and the 2nd; then, on the 7th a total of 2.5mm was recorded for the five days from the 3rd to the 7th; and so on.
Rainfall data from other weather stations (in this format) is available from this page.
Aggregating observations
The weather data files have daily observations. From this, we can calculate aggregate time series, such as total monthly rainfall. Specifically, we can calculate:
The total for each month. This gives a series of monthly values.
The total for a specific month of the year. This gives a series of yearly values.
The total for each year. This gives a series of yearly values.
When computing the total for a month or year, it is important to check that data for the time period is complete - that is, that there is an observation for every day of the month or year. In cases where we only have observations of the total rainfall over a number of days, we can still determine a monthly total if the observation intervals align with the beginning and end of the month, i.e., if there is no multi-day observation that spans more than one month. For example, the data shown for December 1971 in Canberra above allows us to determine the total rainfall over the first 9 days of the month, even though we do not know exactly how much fell on each day.
On the other hand, to obtain a series of daily observations we have to use only those observations that are taken on a single day.
Task
Write a program that reads in a data file in the format described above, calculates a time series (daily, monthly, or for a specific month of the year), and calculates the threshold value xF for once in F using both methods (a) and (b) described above.
The user of the program should be able to:
provide the path to the file to be read;
select the type of time series aggregation;
select whether to compute a threshold for exceptionally high or low values;
and specify the frequency F.
The program should output the computed threshold values (clearly showing which was calculated according to which method) and the dates (or months, or years) at which the data exceeds it.
Using your program, a user should be able to obtain answers to questions such as:
How much is a once in 20 years rainfall for the month of June, at each of the three stations listed above, and on which years did this occur?
What is the once in a 1000 days amount of rain in a single day, for each of the three stations above, and on which dates did this happen?
What is the a once in 20 years driest (least rain) year in Queanbeyan, and in which years did this occur?
Example
As an example, here is how we would find the answer to the first question for the Canberra (botanical gardens) station: First, we read the data file, sum up the total for each month, and select only the entries for the month of June (where those are complete). This gives us the following time series:
| June 1968 | 28.70mm |
| June 1969 | 56.00mm |
| June 1970 | 26.70mm |
| June 1971 | 5.40mm |
| June 1972 | 11.80mm |
| June 1973 | 47.00mm |
| June 1974 | 24.40mm |
| June 1975 | 70.70mm |
| June 1976 | 11.10mm |
| June 1977 | 36.80mm |
| June 1978 | 43.20mm |
| June 1979 | 5.60mm |
| June 1980 | 46.60mm |
| June 1981 | 136.20mm |
| June 1982 | 14.00mm |
| June 1983 | 44.00mm |
| June 1984 | 6.60mm |
| June 1985 | 37.20mm |
| June 1986 | 7.90mm |
| June 1987 | 33.20mm |
| June 1988 | 62.00mm |
| June 1989 | 35.20mm |
| June 1990 | 14.80mm |
| June 1991 | 143.80mm |
| June 1992 | 20.60mm |
| June 1993 | 33.40mm |
| June 1994 | 36.30mm |
| June 1995 | 52.80mm |
| June 1996 | 52.00mm |
| June 1997 | 115.60mm |
| June 1998 | 135.00mm |
| June 1999 | 46.00mm |
| June 2000 | 29.00mm |
| June 2001 | 42.00mm |
| June 2002 | 49.70mm |
| June 2003 | 47.80mm |
| June 2004 | 20.40mm |
| June 2005 | 93.60mm |
| June 2006 | 73.80mm |
| June 2007 | 74.60mm |
| June 2008 | 31.40mm |
| June 2009 | 37.30mm |
| June 2010 | 29.60mm |
| June 2011 | 10.80mm |
| June 2012 | 48.70mm |
| June 2013 | 89.20mm |
| June 2014 | 73.20mm |
| June 2015 | 72.20mm |
| June 2016 | 148.30mm |
| June 2017 | 0.40mm |
Lets start with the simpler method, (b): There are 50 entries in the series, so the once-in-20-years threshold is the 50*(1/20)th highest value. Rounding the fraction down, thats the 2nd highest value in the series, which is 143.8. The two years in which the June rainfall equals or exceeds this threshold are 1991 and 2016. This is illustrated in the following graph:
The dashed line is the threshold; we can see that only two bars reach it.
Method (a) is trickier, and its for you to figure out how to compute the threshold value following its definition. In this particular example, it gives the same threshold (143.8mm). We can see from the graph that the two peaks that reach this value are more than 20 years apart, and also that if we lower the threshold to the next smaller value (from 1981) there would be two peaks less than 20 years apart. If we were to choose a more frequent interval, such as once every 10 years, the two methods would give different thresholds.
Plotting the data
Generating graphs like the one above is not a required part of the assignment, but it is a useful tool for validating and debugging your code. The graph above was drawn using matplotlib as follows:
import numpy as np import matplotlib.pyplot as mpl years = [1968, 1969, ..., 2016, 217] # list of years in the table above y = [28.7, 56.0, ..., 148.3, 0.4] # corresponding list of total rainfall values x = np.arange(0, len(y)) mpl.bar(x + 0.25, y, 0.5, color='blue') mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') # the threshold line mpl.xticks(x + 0.5, years, rotation=90) mpl.show()
A different plot, using only points instead of vertical bars for each data point, can be done as follows:
# years, x and y as above mpl.plot(x, y, '.b', markersize=5) mpl.plot([0, len(y) + 1], [143.8, 143.8], '--k') mpl.xticks(x, years, rotation=90) mpl.show()
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
