I got rained on the other day so I decided to create a machine learning weather forecasting algorithm. I’ve often wondered what accuracy one can attain when forecasting temperature; now I can find out for myself. In this post I will describe how to forecast maximum temperatures using R.
There are two challenges involved in building such an algorithm:
1. Getting the data.
2. Knowing what to do with it.
Fortunately, it is relatively easy to find weather data these days. We’ll be using data from the excellent meteorologists at the Australian Bureau of Meteorology, or BoM for short. They’ve meticulously set up weather stations all across Australia, the output data of which they feed into a random number generator to forecast weather.
The Data
The data is contained in files organised in a folder hierarchy. Each file contains one month of data, the data goes back about 10 years, and there are several hundred weather stations. We can download and unpack everything with the following code in R:
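    # Download and unpack the BoM climate data archive
    link_address <- "ftp://ftp.bom.gov.au/anon/gen/clim_data/IDCKWCDEA0.tgz"
    dir.create("data", showWarnings = FALSE)  # make sure the target folder exists
    download.file(link_address, "data/weather.tgz", mode = "wb")  # binary mode for the archive
    untar("data/weather.tgz", exdir = "data/")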
We’ll need to build a function that can parse the bizarrely formatted data file, then apply this function to each file using a loop – concatenating the data as we go.
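    library(dplyr)      # %>% and mutate()
    library(lubridate)  # dmy()

    # Read one monthly file: drop the final line and skip the first 12 lines
    weather_readr <- function(file_name) {
      df_names <- c("Station", "Date", "Etrans", "rain", "Epan", "max_Temp",
                    "min_Temp", "Max_hum", "Min_hum", "Wind", "Rad")
      read.csv(text = paste0(head(readLines(file_name), -1), collapse = "\n"),
               skip = 12, col.names = df_names)
    }

    file_loc <- "data/tables/vic/melbourne_airport/"
    df <- data.frame()
    # list.files() takes a regular expression, so match the .csv extension explicitly
    for (file in list.files(file_loc, full.names = TRUE, pattern = "\\.csv$")) {
      dfday <- weather_readr(file)
      df <- rbind(df, dfday)
    }

    # Parse the day/month/year strings into proper dates
    df <- df %>% mutate(Date = dmy(Date))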
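The read-then-concatenate pattern can also be written without growing a data frame inside a loop. The following is a minimal sketch, not the code used in this post: it builds two tiny made-up files in a temporary folder (the `DEMO` station, the `Totals:` footer and the simplified three-column layout are all assumptions for illustration, not the real BoM format), reads each one with a hypothetical `read_one()` helper, and stacks the results in one step.

```r
# Hypothetical miniature files: a header row, data rows, and a one-line footer.
tmp_dir <- tempfile("weather_")
dir.create(tmp_dir)
writeLines(c("Station,Date,max_Temp",
             "DEMO,01/01/2009,19.9",
             "DEMO,02/01/2009,17.8",
             "Totals:"),
           file.path(tmp_dir, "demo-200901.csv"))
writeLines(c("Station,Date,max_Temp",
             "DEMO,01/02/2009,30.1",
             "Totals:"),
           file.path(tmp_dir, "demo-200902.csv"))

# Drop the footer line, then parse what remains as CSV.
read_one <- function(file_name) {
  body <- head(readLines(file_name), -1)
  read.csv(text = paste0(body, collapse = "\n"))
}

# Read every file into a list, then bind the list into one data frame.
files <- list.files(tmp_dir, full.names = TRUE, pattern = "\\.csv$")
demo <- do.call(rbind, lapply(files, read_one))
print(demo)
```

Binding once at the end avoids copying the accumulated data frame on every iteration, which starts to matter when there are hundreds of monthly files.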
Now we have a table that looks like this:
Station | Date | Etrans | rain | Epan | max_Temp | min_Temp | Max_hum | Min_hum | Wind | Rad |
---|---|---|---|---|---|---|---|---|---|---|
MELBOURNE AIRPORT | 2009-01-01 | 5.4 | 0 | 5.6 | 19.9 | 11.2 | 91 | 28 | 7.35 | 22.34 |
MELBOURNE AIRPORT | 2009-01-02 | 5.4 | 1.2 | 7.2 | 17.8 | 7.8 | 77 | 35 | 7.1 | 28.45 |
MELBOURNE AIRPORT | 2009-01-03 | 5.4 | 0 | 6.2 | 21.1 | 6.3 | 85 | 29 | 3.44 | 29.99 |
MELBOURNE AIRPORT | 2009-01-04 | 7.5 | 0 | 6.4 | 29.2 | 8.1 | 89 | 16 | 3.68 | 32.69 |
MELBOURNE AIRPORT | 2009-01-05 | 7.5 | 0 | 7.4 | 29 | 9.7 | 83 | 25 | 3.85 | 33.96 |
Most of these fields are self-explanatory, but there are also fields describing evaporation (Etrans and Epan), radiation (Rad) and humidity (the _hum fields). We will concern ourselves with max_Temp, the daily maximum temperature. Well, now that we have the data, what do we do with it?
Preprocessing
Let’s start with just a proof of concept:
Can we forecast the maximum temperature for a location based on the previous day’s weather?
This isn’t intended to be accurate, only to show that a simple predictive pipeline can be built – we can improve it later. We’ll predict the weather for the Melbourne Airport weather station.
For a simple predictor, we can include yesterday’s temperatures:
    df <- df %>%
      arrange(Date) %>%  # lag() assumes rows are in date order
      select(Date, max_Temp, min_Temp) %>%
      mutate(TempMax1 = lag(max_Temp, n = 1),
             TempMin1 = lag(min_Temp, n = 1))
And in the same manner, we include temperatures from two days ago:
    df <- df %>%
      mutate(TempMax2 = lag(max_Temp, n = 2),
             TempMin2 = lag(min_Temp, n = 2))
We can now dispense with any incomplete rows and the date field. We will also remove the minimum temperature field: it is recorded on the same day we are predicting, so including it would constitute a data leak, i.e. cheating. In any case, we won’t have access to this value when we are predicting future temperatures.
    df <- df %>%
      na.omit() %>%
      select(-Date, -min_Temp)
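To see what the lagging and filtering steps do, here is a small sketch on the first five days of the Melbourne Airport table above. It uses a hypothetical base R helper, `shift()`, as a stand-in for dplyr's `lag()` so that it runs without any packages; the result should match the first three rows of the transformed dataset.

```r
# Daily maxima and minima from the first five rows of the table above.
toy <- data.frame(
  max_Temp = c(19.9, 17.8, 21.1, 29.2, 29.0),
  min_Temp = c(11.2, 7.8, 6.3, 8.1, 9.7)
)

# Base R stand-in for dplyr::lag(): shift a vector down by n, padding with NA.
shift <- function(x, n) c(rep(NA, n), head(x, -n))

toy$TempMax1 <- shift(toy$max_Temp, 1)
toy$TempMin1 <- shift(toy$min_Temp, 1)
toy$TempMax2 <- shift(toy$max_Temp, 2)
toy$TempMin2 <- shift(toy$min_Temp, 2)

# The first two days have no two-day history, so na.omit() drops them.
toy <- na.omit(toy)

# Remove the same-day minimum to avoid the data leak.
toy$min_Temp <- NULL
print(toy)
```

Each remaining row pairs one day's maximum with the observations from the two days before it, which is exactly the shape a supervised model needs.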
Let’s take a look at the transformed dataset:
max_Temp | TempMax1 | TempMin1 | TempMax2 | TempMin2 |
---|---|---|---|---|
21.1 | 17.8 | 7.8 | 19.9 | 11.2 |
29.2 | 21.1 | 6.3 | 17.8 | 7.8 |
29 | 29.2 | 8.1 | 21.1 | 6.3 |
31.7 | 29 | 9.7 | 29.2 | 8.1 |
21.4 | 31.7 | 13.5 | 29 | 9.7 |
18.4 | 21.4 | 15.8 | 31.7 | 13.5 |
The first column on the left, max_Temp, is the value we will try to predict: the maximum temperature of the day. The other fields are the minimum and maximum temperatures of the two previous days, and these will inform the model. Now imagine we’re trying to forecast the temperature for tomorrow. We presumably know today’s maximum and minimum temperature, and yesterday’s maximum and minimum. If we build a model based only on these fields, there is no reason why we can’t forecast tomorrow’s temperature. That’s what we’re going to do now.
The Model
We will use the algorithms provided by the good people at h2o.ai. The data set is ready to go, so the remaining steps are trivial:
1. launch h2o machine learning server
2. convert data to h2o object
3. split data into testing and training data sets
4. train model on training data set
5. test model on testing data set
    library("h2o")
    h2o.init()                                   # launch machine learning server
    hex_df <- as.h2o(df)                         # convert to h2o object
    hex_df <- h2o.splitFrame(hex_df)             # split into training and test
    weather_model <- h2o.gbm(training_frame = hex_df[[1]], y = 1)  # train model
    h2o.performance(weather_model, hex_df[[2]])  # test model
    # ...
    # MAE: 3.10
    # ...
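If you want to see the split, train and test steps without starting an h2o server, the same shape can be sketched in base R. This is only an illustration under loud assumptions: the data is synthetic (generated below, not BoM observations), the 75/25 split is chosen for illustration, and a plain linear regression via `lm()` stands in for the gradient boosting model. The names `toy`, `toy_model` and `toy_mae` are made up for this sketch.

```r
# Synthetic stand-in data (NOT BoM observations): today's max loosely follows yesterday's.
set.seed(42)
n <- 200
TempMax1 <- runif(n, min = 10, max = 35)
toy <- data.frame(
  max_Temp = 0.7 * TempMax1 + 6 + rnorm(n, sd = 3),
  TempMax1 = TempMax1
)

# Step 3: random split into training (75%) and testing (25%) rows.
train_idx <- sample(n, size = floor(0.75 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Step 4: fit on the training rows only (lm() stands in for h2o.gbm()).
toy_model <- lm(max_Temp ~ TempMax1, data = train)

# Step 5: score on the held-out rows the model has never seen.
pred <- predict(toy_model, newdata = test)
toy_mae <- mean(abs(test$max_Temp - pred))
print(toy_mae)
```

The important design point carries over to h2o unchanged: the error is measured only on rows that were withheld from training, so it estimates how the model behaves on days it has not seen.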
This spits out a few different measures of accuracy. To keep things simple we’ll only consider the mean absolute error (MAE), since it is easy to understand: it is the average size of the gap between the predicted and the observed maximum temperature. So an MAE of 3.1 means that our model is, on average, about 3 degrees off.
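To make the arithmetic concrete, here is a small worked example of the MAE calculation. It uses the six rows of the transformed dataset shown earlier and treats yesterday's maximum (TempMax1) as if it were the forecast. That naive forecast is an assumption made purely to have something to score; it is not the h2o model, and six rows are far too few to compare against the 3.1 figure.

```r
# Six rows of the transformed dataset shown earlier.
actual   <- c(21.1, 29.2, 29.0, 31.7, 21.4, 18.4)  # max_Temp
previous <- c(17.8, 21.1, 29.2, 29.0, 31.7, 21.4)  # TempMax1

# MAE = mean of |actual - predicted|; here the "prediction" is yesterday's max.
mae <- mean(abs(actual - previous))
print(round(mae, 2))  # 4.6
```

The individual errors range from 0.2 to 10.3 degrees, which shows why an average is a convenient summary but can hide the occasional large miss.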
Now that we’ve proved out the methodology, we can go about adding features to improve the accuracy of the model. This will be the subject of the next post.