Mark Naylor’s Blog

Mark Naylor’s Blog

I want to know how the Earth works…

Working with the Edinburgh University Met Station Data in Python

In this post, I thought I would show you how to load some time series data into a Pandas DataFrame.

The obvious choice of data is the School of GeoScience’s weather station, archive data is available for it on the web. The met station is positioned on the top of JCMB which has the view in the banner above 🙂

You can find the accompanying Jupyter Notebook here which contains runnable code and a more complete description.

1 Loading the weather station data to a Pandas DataFrame

The data is archived both in monthly snapshots and annual compilations – we will work with the latter.

The files are stored either as monthly or annual data. for example:

  • Annual filenames:
  • Monthly filenames:

If you look at one of the annual files, you can see:

  • The top line of the file has column names
  • The entries are comma separated
  • The first column is date-time information
  • The other columns all contain floats

Because of the mix of data types, we will need to use a Pandas DataFrame.

We want to create a single data frame which contains the met data for the years 2011-2014 inclusive, but the data is stored in separate files for each year. To do this, we use a loop to load each annual file into a data frame and then we append it to a data frame that will contain the whole sequence, which we call df_full.

df_full = pd.DataFrame()

years = [2011, 2012, 2013, 2014]

for year in years:
    url = ""+str(year)+".csv"
    df_tmp = pd.read_csv(url)
    df_full = df_full.append(df_tmp)

Let’s have a quick look at some of the features of this the DataFrame.

1.1 We can find the number of rows and columns using:


Tells us that there are 8 columns and 2,081,999 rows. This is quite a lot of data.

1.2 To see the top few lines of the DataFrame:

By running the head() method on the data frame we can see that there are so many rows because the data is sampled every minute. We can also see the different types of data stored in the data frame including temperature, pressure and windspeed information. Also note that we have a date-time column that we want to index the data by (i.e. instead of having the integer numbers in the leftmost column, we want date times – this allows for easier processing of the time series as we will see in section 3)


1.3 To get summary statistics of each column:

The describe method is really useful at summarising what is in each column. It also highlights potential problems with a dataset that need cleaning. In the example below, have a look at the min row. The minimum value in many columns is either -9999.0 or -6999.0. This will be the default value recorded when the sensor has not measured it correctly. We will need to filter these out.


2 Cleaning the data

The quick look at the data above highlights two thing we need to fix:

  1. The describe method highlighted that there are the values -9999.0 and -6999.0 in the dataset. Whilst I have not checked whilst these are there, I suspect that these are the default values saved when the sensors have not recorded properly. These will need to be removed, we will cover them to NaNs
  2. The data-time column a series of strings which we will want to convert to a date-time object and then use that as the index.

2.1 Removing the spurious values

To convert the entries with values -9999.0 and -6999.0 to NaNs is easy:

df_full = df_full.replace(-9999.0,np.NaN)
df_full = df_full.replace(-6999.0,np.NaN)

We can check this has worked by by re-running `df_full.describe()`

2.2 Setting the index to be the date-time

The first thing I tried was:

df_full['date-time'] = pd.to_datetime(df_full['date-time'], format='%Y/%m/%d %H:%M')

But this threw the error

ValueError: time data 2011/01/01 24:00 doesn't match format specified

I assume that this is because the definition of the 24 hour clock in Python does not actually include 24:00

So, to deal with this I coerce the conversion to datatime to return a NaNs whenever it fails so that I can just remove these entries:

df_full['date-time'] = pd.to_datetime(df_full['date-time'], format='%Y/%m/%d %H:%M', errors='coerce')
df_full = df_full.dropna()

The date-time column now contains date-time objects. All that I need to do now is set the index to be the date-time and check that this has worked using the head method:

df_full = df_full.set_index('date-time')

3. Plotting and subsetting the time series

I am not going to do any analysis of the data in this post, just show you how to plot and resample the time series.

3.1 Plotting all the data

Since we have this data inside a DataFrame, and we have set the index to be a Date-Time object – Pandas will do most of the work for formatting the date-times on the x-axis.

For example, we can simply plot all the data in the temperature column. The data looks quite noise because it was measured every minute!

df_full.plot(y='surface temperature (C)', style='.', markersize=0.5)

And similarly for windspeed:

df_full.plot(y='wind speed (m/s)', style='.', markersize=0.5)

3.2 Extracting subsets based on the date-time

The use of date-times as an index also makes it trivial to select subsets of the data based on time slices.

For example, to extract a and plot single year:

df_full['2012'].plot(y='surface temperature (C)', style='.', markersize=0.5)

And to extract and plot an arbitrary date range:

df_full['2012-06':'2014-06'].plot(y='surface temperature (C)', style='.', markersize=0.5)

3.3 Resampling the data at daily intervals

If you look back at any of the head() calls above, you will see that the data is recorded at minute intervals and that this means we are working with over 4 million rows.

  • One approach to reduce the size of the data is to remove the daily cycles by resampling at daily intervals.
  • This is easy now we have the data stored as a time series:
dailyResampling_max = df_full.resample('D').max()
dailyResampling_median = df_full.resample('D').median() 
dailyResampling_min = df_full.resample('D').min()

The example above extracts the max, min and median values that occur each day for each column.


Notice that the time intervals in the index column on the left are now daily.

Let’s plot the temperature data resampled at daily intervals. If you compare it with the plot above, you can see that there is significantly less scatter because we have removed the daily cycles.

  • The Maximum values, presumably in the day time, are very clean and range from 0 to 27 ish.
  • The Minimum values are typically less than 12 degrees and it looks like there is one spurious point at -40 degrees.












4. Summary

In this post I have show you how to load, clean, plot and resample weather station data.

In future posts, I’ll show you how to analyse and forecast from this data.


Leave a reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>


Report this page

To report inappropriate content on this page, please use the form below. Upon receiving your report, we will be in touch as per the Take Down Policy of the service.

Please note that personal data collected through this form is used and stored for the purposes of processing this report and communication with you.

If you are unable to report a concern about content via this form please contact the Service Owner.

Please enter an email address you wish to be contacted on. Please describe the unacceptable content in sufficient detail to allow us to locate it, and why you consider it to be unacceptable.
By submitting this report, you accept that it is accurate and that fraudulent or nuisance complaints may result in action by the University.