Traffic Forecasting for GOV.UK

A simple approach to anomaly detection in non-stationary time-series data. Machine learning algorithms are used to predict hourly traffic to the entire GOV.UK domain. Anomalous points are indicated in light blue and are defined as instances where the model residual is greater than 3σ from the actual value. This threshold is somewhat arbitrary and potentially overly sensitive in this instance. Future work will compare this model against other, more standard anomaly detection algorithms and address the interesting issue of threshold setting.
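The 3σ residual rule described above can be sketched as follows. This is a minimal illustration, not the tool's actual code; the function name, the injected spike, and the use of the residuals' own standard deviation as σ are assumptions.

```python
import numpy as np

def flag_anomalies(actual, predicted, n_sigma=3.0):
    """Flag points where the model residual exceeds n_sigma standard
    deviations of the residual distribution."""
    residuals = actual - predicted
    sigma = residuals.std()
    return np.abs(residuals) > n_sigma * sigma

# Hypothetical week of hourly traffic with one injected spike.
rng = np.random.default_rng(0)
actual = 1000 + rng.normal(0, 10, 168)   # 168 hours of counts
predicted = np.full(168, 1000.0)
actual[100] += 200                        # simulated anomaly
flags = flag_anomalies(actual, predicted)
```

With a single large spike, only that hour crosses the 3σ threshold; lowering `n_sigma` makes the detector more sensitive, which is exactly the threshold-setting trade-off mentioned above.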

Numerical Model

As a first attempt, a basic model is used which predicts future traffic from historical trends. The model uses 10 lagged components of the time-series to inform the latest-hour forecast, with a maximum lag of 72 hours; more work is needed to quantify the impact of additional lag values. Three approaches are contrasted in this tool, namely:
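Turning a raw hourly series into 10 lagged inputs with a 72-hour maximum lag might look like the sketch below. The particular choice of lag offsets is hypothetical; the source does not specify which 10 lags are used.

```python
import numpy as np

def make_lag_features(series, lags):
    """Build a design matrix of lagged values: row t contains
    series[t - lag] for each lag, and the target is series[t]."""
    max_lag = max(lags)
    X = np.column_stack([series[max_lag - lag: len(series) - lag]
                         for lag in lags])
    y = series[max_lag:]
    return X, y

# Assumed spread of 10 lags reaching back 72 hours.
lags = [1, 2, 3, 4, 5, 6, 12, 24, 48, 72]
series = np.arange(200, dtype=float)      # stand-in for hourly pageviews
X, y = make_lag_features(series, lags)    # X: (128, 10), y: (128,)
```

Each row of `X` then feeds one of the three models below to produce a one-hour-ahead forecast.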

Random Forests
A non-parametric method, averaged over 10 forests with 500 trees in each. This aggregation is performed to account for the inherent randomness of the algorithm, which leads to fluctuations in fit quality (MSE).
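Averaging over independently seeded forests could be implemented as below. This is a sketch assuming scikit-learn's `RandomForestRegressor`; the toy data and the reduced forest sizes in the usage example (for speed) are not from the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def averaged_forest_predict(X_train, y_train, X_test,
                            n_forests=10, n_trees=500):
    """Average predictions over several independently seeded forests to
    smooth out run-to-run fluctuations in fit quality (MSE)."""
    preds = []
    for seed in range(n_forests):
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(X_train, y_train)
        preds.append(rf.predict(X_test))
    return np.mean(preds, axis=0)

# Hypothetical small example (fewer trees than the real tool, for speed).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 50)
pred = averaged_forest_predict(X[:40], y[:40], X[40:],
                               n_forests=2, n_trees=20)
```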

Neural Networks
A feed-forward multilayer perceptron with a single hidden layer of 40 neurons and one output, mapping from the 10 lag inputs. Work on optimising the architecture of the network is ongoing, focussing on neuron count and the number of hidden layers.
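The stated architecture could be expressed with scikit-learn's `MLPRegressor`, as a hedged sketch; the solver settings, iteration cap, and toy data are assumptions, not the tool's actual configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One hidden layer of 40 neurons mapping 10 lag inputs to a single output.
model = MLPRegressor(hidden_layer_sizes=(40,), max_iter=2000, random_state=0)

# Hypothetical stand-in for the lag-feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X.sum(axis=1)
model.fit(X, y)
pred = model.predict(X[:5])
```

Varying `hidden_layer_sizes` (e.g. `(40, 40)` for two hidden layers) is the natural way to explore the architecture question raised above.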

Linear Regression
The most standard approach, which does not allow for non-linear interactions. The lagged components are individually weighted during the training stage, and the weights remain fixed over the week-long test period.
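The train-then-freeze scheme could look like the following, assuming scikit-learn's `LinearRegression`; the split sizes (four training weeks, one held-out week of 168 hours) are illustrative guesses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical: ~5 weeks of hourly lag features, last week held out.
rng = np.random.default_rng(2)
X = rng.normal(size=(840, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.1, 840)
train, test = slice(0, 672), slice(672, 840)

# Weights are learned once on the training window and then held fixed
# while scoring the week-long test window.
lr = LinearRegression().fit(X[train], y[train])
pred = lr.predict(X[test])
```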

Anomaly Heatmap

While the top two charts show the complete hourly GOV.UK traffic over the period of a week, the bottom tiled heatmap is an attempt to highlight anomalies at the individual page level. Clearly an anomaly in the aggregate traffic may not be attributable to a single page; however, the distribution of pageviews across the hosted content is 'top-heavy'. This means that a significant proportion of traffic can be explained by a small set of themes such as those relating to jobs, visas, passports, pensions and bank holidays.

The heatmap shows the last 24 hours across the 25 most popular pages. The colour of each tile shows how accurately the model predicted the pageviews at that point in time. The map can be viewed as a condensed version of the residual plot and is an attempt to aid pattern detection across a large number of individual pages. It is important to note that a new model is trained for each page, as their pageview profiles are often surprisingly diverse. A consequence of this is that only the linear model is displayed here, due to the time cost of bootstrapping the Random Forest approach for every page. This is an area for further work.
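One way the tile colours could be derived is from per-page scaled residuals, as sketched below. Scaling each page's residuals by that page's own residual spread is an assumption on my part (it lets diverse pageview profiles share one colour scale); the source does not state the exact colour mapping.

```python
import numpy as np

def residual_heatmap(actual, predicted):
    """Per-page, per-hour residuals scaled by each page's own residual
    spread, giving candidate colour values for a 25x24 tile map."""
    residuals = actual - predicted                  # shape (pages, hours)
    sigma = residuals.std(axis=1, keepdims=True)
    return residuals / np.where(sigma == 0, 1.0, sigma)

# Hypothetical data: 25 pages over the last 24 hours.
rng = np.random.default_rng(3)
actual = rng.poisson(500, size=(25, 24)).astype(float)
predicted = np.full((25, 24), 500.0)
tiles = residual_heatmap(actual, predicted)
```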