WORLD METEOROLOGICAL ORGANIZATION
_______________
COMMISSION FOR BASIC SYSTEMS
EXPERT TEAM MEETING ON
ENSEMBLE PREDICTION SYSTEMS
TOKYO, JAPAN, 15-19 OCTOBER 2001
CBS ET/EPS/Doc 3(6)
(21.IX.2001)
ITEMS: 3 and 4
______
Original: ENGLISH
REPORT ON THE OPERATIONAL USE OF EPS, TO FORECAST SEVERE WEATHER AND EXTREME EVENTS
(Submitted by K. Mylne, C. Woolcock, J. Denholm-Price, T. Legg and R. Darvell, UK)
____________________________________________________
Summary and purpose of document
Development of two major new applications based on the EPS
over the year 2000/01.
_____________________________________________________
Action proposed
The expert team is invited to consider the document and to make
proposals based on this information.
APPLICATIONS OF THE EPS AT THE MET OFFICE
1. Introduction
The Met Office has developed two major new applications based on the EPS over the year 2000/01:

Improved site-specific probability forecasts using Kalman Filter and Calibration systems

Early Warnings of Severe Weather
These will both be described briefly, and some sample verification results presented.
2. Site-specific Probability Forecasts
Site-specific ensemble forecasts were first introduced on the Met Office intranet in August 1997. These allowed forecasters to view the spread of temperatures, rainfall amounts and wind speeds predicted by the ECMWF Ensemble Prediction System (EPS) for 41 sites around the UK, and also probabilities of a range of pre-defined "events" such as Temperature < 0°C (Legg et al., 2001). These displays have been quite useful to Medium-Range forecasters and as a training tool for people learning about ensemble forecasting, but suffered from a number of weaknesses as an operational tool for forecast production. A recent project has upgraded the system to improve the quality of forecasts and also to make the data available in a form suitable for direct insertion into forecast products, either automatically or as a first-guess for forecasters. The following sections describe the upgrades installed.
2.1 Data Storage
Under the old system, forecasts were only available as images on the web. This made forecast production laborious and expensive, and in practice it did not happen much. To overcome this, ensemble data have been placed in a database from where they can be used to generate products automatically. This also makes forecast production much more flexible to customer requirements. Ensemble data are stored for a large number of sites around the UK, Europe and N. America, and a smaller number elsewhere around the world. The weather parameters available are:

2m Temperature, including daily maximum and minimum temperatures

10m Wind Speed

Precipitation - 12-hour accumulations
Two types of data are stored:

raw data interpolated directly from the EPS model fields to the local site;

Kalman Filtered data.
Data and products are now available at 6-hourly intervals, compared to 12-hourly before.
2.2 Kalman Filter
The old system used values of weather parameters interpolated directly from model grid points to local sites. This resulted in some severe site-specific biases, particularly around the coasts and near hills. It also relied on the model's surface-layer parametrisation to generate surface temperatures and wind speeds, which could cause severe errors, particularly in stable nocturnal conditions. To help overcome these problems a Kalman Filter system (KFMOS) has been introduced. This is a form of statistical post-processing of model output which derives forecast weather parameters by minimising the error between forecasts and observations, using recursive regression with a memory of approximately 60 days. KFMOS provides several advantages:

It corrects for local site-specific biases such as a consistent over-prediction of 10m wind speed at night. (KFMOS will not correct biases due to model drift during the forecast, as it uses the same correction, based on analysis fields or very short-period forecasts, for all forecast lead-times. Use of correction by forecast time would damage the useful spread of the ensemble.)

KFMOS is used to statistically derive Maximum and Minimum Temperatures, whereas previously temperatures were only available at 00 and 12 UTC.

The KFMOS statistics applied to derive temperature and wind speed from the model each take several model parameters as input to improve the derived values. For example, the KFMOS for Maximum Temperature uses model 2m temperature, 10m wind speed and wind direction. Thus for a coastal site, for example, it can make some allowance for whether the wind is coming from the land or the sea.
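The recursive bias correction described above can be illustrated with a minimal scalar Kalman filter. This is only a sketch and not the KFMOS implementation: it tracks a single additive bias, whereas KFMOS takes several model parameters as predictors. The function name kalman_bias_update, the noise variances q and r, and the wind speed data are all assumptions made for illustration.

```python
def kalman_bias_update(bias, var, forecast, observation, q=0.01, r=1.0):
    """One recursive update of an additive forecast-bias estimate.

    bias, var   : current bias estimate and its error variance
    forecast    : raw model value interpolated to the site
    observation : verifying observation at the site
    q, r        : assumed process and observation noise variances
    """
    var_pred = var + q                        # allow the bias to drift slowly
    innovation = (forecast - observation) - bias
    gain = var_pred / (var_pred + r)          # Kalman gain, between 0 and 1
    new_bias = bias + gain * innovation
    new_var = (1.0 - gain) * var_pred
    return new_bias, new_var


# Hypothetical data: the model over-predicts 10m wind speed by 2 m/s.
forecasts = [8.0, 9.5, 7.0, 10.0, 8.5] * 20
observations = [f - 2.0 for f in forecasts]

bias, var = 0.0, 1.0
for f, o in zip(forecasts, observations):
    bias, var = kalman_bias_update(bias, var, f, o)

# The same correction would be applied to every member at every lead-time.
corrected = forecasts[-1] - bias
```

In this sketch a smaller q gives a smaller steady-state gain and hence a longer memory of past errors; the value used here is arbitrary and is not tuned to the 60-day memory quoted above.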
2.3 Calibration of Probabilities
A common problem with ensemble systems is that they do not manage to generate sufficient spread (dispersion) to cover the full forecast uncertainty. This results in overconfident probability forecasts and is most clearly illustrated in a Rank Histogram (e.g. figure 1). The Rank Histogram shows the frequency with which the observation falls into each rank defined by the spread of members in the ensemble forecast. For example, the leftmost bar of the histogram shows the relative frequency with which the observed value is lower than all the forecast values from the ensemble. The example shown is fairly typical, and it is clear that both outlier bins are severely overpopulated. This indicates that the ensemble spread is too small to cover the full uncertainty in the forecast. This occurs with all forecast weather parameters, but is most acute for surface parameters at specific sites. To reduce these problems we have introduced a calibration of the spread, and of the resulting probabilities.
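The construction of a Rank Histogram described above can be sketched as follows. The function name and the three-member sample data are hypothetical; ties between the observation and a member are handled in the simplest possible way (the observation is counted as lying below the tied member).

```python
from bisect import bisect_left


def rank_histogram(ensembles, observations):
    """Relative frequency with which the observation falls into each rank.

    For n members there are n + 1 bins: bin 0 means the observation lies
    below every member, bin n means it lies above every member.
    """
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # Number of members strictly below the observation gives the rank.
        rank = bisect_left(sorted(members), obs)
        counts[rank] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical 3-member temperature forecasts and verifying observations.
ensembles = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [4.0, 5.0, 6.0], [2.0, 2.5, 3.0]]
observations = [3.6, 0.1, 4.5, 3.4]
hist = rank_histogram(ensembles, observations)
```

For a 51-member ensemble the same function would return 52 relative frequencies; a perfectly dispersed ensemble would give values close to 1/52 in every bin.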
A simple calibration of probabilities for a particular event can be provided from a verification reliability diagram by identifying the probability of the event really occurring for a given ensemble probability. However, we required a more flexible system which could be applied to the needs of different customers, requiring forecasts of different events. For this we required a system which would calibrate the ensemble spread and allow the corrected probabilities to be calculated. This can be provided from Rank Histograms. The ranking allows a weighting to be applied to each ensemble member (or strictly speaking to the gaps between the members) to better calibrate the ensemble probabilities. Because it simply alters the relative weightings of the members, this method can be used to calibrate the probabilities of any forecast "event" the customer may require.
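One way to read the weighting of the gaps between members is sketched below for an event of the form "value below a threshold". This is an illustration under stated assumptions, not the operational code: probability is spread uniformly within each interior gap, and each outlier weight is placed at the extreme member because no tail model is applied. The function name and sample numbers are hypothetical.

```python
def calibrated_prob_below(members, weights, threshold):
    """Calibrated probability that the outcome is below `threshold`.

    members : ensemble values for one site and lead-time
    weights : n + 1 rank-histogram frequencies (summing to 1), one per gap
              between the sorted members, outlier gaps included
    """
    m = sorted(members)
    if threshold <= m[0]:
        return 0.0                      # assumed: no mass below the lowest member
    if threshold > m[-1]:
        return 1.0                      # assumed: no mass above the highest member
    prob = weights[0]                   # lower outlier weight sits at m[0]
    for i in range(1, len(m)):
        lo, hi = m[i - 1], m[i]
        if threshold >= hi:
            prob += weights[i]          # whole gap lies below the threshold
        else:
            if hi > lo:
                # Threshold falls inside this gap: take a linear share of it.
                prob += weights[i] * (threshold - lo) / (hi - lo)
            break
    return prob


# Hypothetical 3-member ensemble with over-populated outlier bins.
members = [-1.0, 1.0, 3.0]
weights = [0.3, 0.2, 0.2, 0.3]
raw = sum(1 for x in members if x < 0.0) / len(members)
calibrated = calibrated_prob_below(members, weights, 0.0)
```

Because only the weights change, the same function can be called with any threshold a customer requires, which is the flexibility argued for above.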
There is still one significant limitation of this calibration. From figure 1 it can be seen that a very large weighting will be applied to each of the outlier bins, so that calibrated forecasts will simply give a large probability that the actual temperature will be outside the range of the ensemble. The resulting PDF (probability density function) is shown in red in figure 2. The PDF now has large peaks at either end of the distribution, which is clearly not meteorologically realistic, and the total spread of the PDF is unaltered. Thus for the second part of the calibration we generate statistics of how far the outliers lie beyond the most extreme member of the ensemble. In order to use these statistics for calibration we fit a parametrised distribution to these outliers. The function used is a Weibull distribution, which was chosen for its ability to fit a range of different functional shapes. An example is shown in figure 3. Using these fitted Weibull distributions we are able to calibrate the tail of the ensemble distribution beyond the actual spread of the ensemble, spreading the high probabilities in the peaks at the ends of the red PDF in figure 2 out to include a greater spread of temperatures, as shown by the blue curve. The use of the Weibull tails does not fully eliminate the peaks at the extremes of the calibrated PDF, but they are reduced and the total spread of the PDF is now increased. Not all outlier distributions are as well-fitted by a Weibull distribution as the example in figure 3, but careful significance testing has shown that errors due to poor fitting are not serious. Software for calculation of the statistics includes a number of checks to highlight any problems with fitting of the distributions. In the worst cases, where a reasonable fit cannot be achieved, Weibull tails are simply not available for a few parameters and lead-times. This mostly occurs when there are few outliers and hence little data to define the tails. In these cases the tails are not important anyway and the calibration without Weibull tails can be used as a good alternative.

Figure 1: Sample rank histogram as used for calibration. Each of the 52 bars on the histogram represents one of the ranked intervals between the 51 ensemble members, including the outlier intervals outside the entire range of the ensemble. The histogram shows the relative frequency with which the verifying observations fall into each ranked bin; for a perfect ensemble all bins should be equally populated.

Figure 2: A sample temperature PDF (Probability Density Function) before (black) and after (red) calibration using the simple weights from the rank histogram shown in figure 1. The blue curve shows the calibrated PDF further modified by applying fitted Weibull distributions to the ensemble outliers.

Figure 3: A sample distribution of outlier observations (blue histogram) fitted with a Weibull distribution (black curve) for use in calibration.

At first sight the calibrated PDFs, with or without the fitted Weibull tails (figure 2), appear less physically realistic than the raw PDFs produced by the ensemble. However, although the raw PDFs look qualitatively realistic and meteorologically sensible, we know from past verification that the resulting probabilities are far from perfect. There is no direct evidence that the detailed shapes of the PDFs are realistic, but we do know that the raw ensemble does not spread sufficiently to cover the full uncertainty, and that probabilities are generally overconfident. Thus the shapes of the raw PDFs may not be much more physically realistic than the calibrated ones.
The project to develop the new system included a comprehensive verification system to compare the skill of different levels of post-processing. An example reliability diagram from this system is shown in figure 4, for probabilities of wind speeds of force 6 or more. It is clear that for the raw data (green) the event is consistently over-forecast (probabilities too high). This is largely corrected by the KFMOS (red), which bias-corrects the wind speeds that are too strong, but the forecasts are still overconfident as the slope of the curve is less than the ideal 45°. After calibration, either with the Weibull tails (dark blue) or without (pale blue), the slope is very close to ideal. Clearly in this case the calibrated forecasts are much better than the raw forecasts. This result is broadly typical, and for the vast majority of events the calibrated forecasts provide the best probabilities. Another example is shown in figure 5, which presents Brier Scores for 2m Temperature at 12 UTC < 5°C at various lead-times. Bearing in mind that the ideal Brier Score is zero, the KFMOS clearly improves the forecasts considerably from the raw data, and the calibration improves them further, at least up to T+144.
The only general exception is that the calibration is often less useful for the more extreme weather events; this is perhaps not surprising, since the statistics on which the calibration is based are dominated by non-extreme weather.
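The verification measures used in this section can be sketched in a few lines. The code below is a minimal illustration with hypothetical function names and made-up data: the Brier Score is the mean squared difference between forecast probability and outcome, and the reliability table gives, for each probability bin, the mean forecast probability, the observed frequency and the number of forecasts (the last being the quantity shown in a sharpness diagram).

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def reliability_table(probs, outcomes, n_bins=5):
    """Per non-empty bin: (mean forecast probability, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the top bin
        bins[idx].append((p, o))
    table = []
    for entries in bins:
        if entries:
            mean_p = sum(p for p, _ in entries) / len(entries)
            freq = sum(o for _, o in entries) / len(entries)
            table.append((mean_p, freq, len(entries)))
    return table


# Hypothetical forecasts of an event such as T < 5 deg C, with outcomes.
probs = [0.9, 0.9, 0.1, 0.1, 0.5, 0.5]
outcomes = [1, 1, 0, 0, 1, 0]
score = brier_score(probs, outcomes)
table = reliability_table(probs, outcomes)
```

Plotting observed frequency against mean forecast probability from the table gives the reliability curve; points on the 45° diagonal indicate perfectly reliable probabilities.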
2.4 Graphical Displays
Although the new system is geared to automatic forecast production, there is still a requirement for graphical displays. Two options are available: EPS Meteograms (figure 6) and PDFs (figure 7). The Meteogram software is based on the ECMWF EPS Meteograms adapted to use the calibrated ensemble output; options are included to plot forecasts from additional models, such as the Met Office's Unified Model, alongside the EPS output or incorporated into the Meteogram as additional ensemble members.
3. First Guess Early Warnings of Severe Weather
This project aims to generate Early Warnings of severe weather in support of the UK National Severe Weather Warning Service (NSWWS) using probabilities estimated from the EPS. NSWWS Early Warnings should be issued up to 5 days in advance when the probability of an event “somewhere in the UK” is 60% or more. In addition to an overall UK probability, local probabilities are given for 12 UK regions. In practice warnings are most often issued around 24h in advance when there is a high degree of certainty, with the result that the miss rate is high. This project aims to:

Encourage earlier issue of warnings

Increase the number of warnings issued, to reduce the Miss Rate

Improve the use of probabilities
Figure 4: Sample reliability (top) and sharpness (bottom) diagrams for Wind Speed of Beaufort Force 6 or more. Colours indicate the different levels of statistical processing applied as shown in the key. The sharpness diagram shows how frequently each forecast probability was issued.

Figure 5: Brier Scores plotted against forecast lead-time for T_{12} < 5°C for approximately 35 sites in the UK during winter 2000/01. Colours indicate the different levels of statistical processing applied as shown in the key.

Warnings are issued for the following events:

Severe Gales - gusts of 70 mph or more

Heavy Snow - 2cm/hour or more for at least two hours

Blizzards/drifting - snow with winds of 30 mph or more

Heavy rain - at least 15mm within a 3-hour period
These events are very demanding for an NWP model, and proxy events had to be defined to represent these in the model output. A system for scanning the ensemble and estimating probabilities of each of these events was run in an operational trial during the autumn and winter of 2000/01. Alerts were issued to forecasters when forecast probabilities of severe weather exceeded 20%, and recommendations to issue warnings were made when they exceeded 60%. Early Warnings from the system were verified against Flash Warnings, which are issued for the same events at very short notice when forecasters have a high degree of confidence.
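The scanning step described above can be sketched as a simple count of ensemble members that meet a proxy event, followed by a comparison with the 20% and 60% levels used in the trial. The proxy threshold of 60 mph, the gust values and the function names are hypothetical; the actual proxy definitions are not given in this document.

```python
ALERT_THRESHOLD = 0.20      # alert forecasters above this probability
WARNING_THRESHOLD = 0.60    # recommend issuing an Early Warning above this


def event_probability(member_values, proxy_threshold):
    """Fraction of ensemble members at or above the proxy event threshold."""
    hits = sum(1 for v in member_values if v >= proxy_threshold)
    return hits / len(member_values)


def advice(probability):
    """Map a forecast probability to the action suggested to the forecaster."""
    if probability > WARNING_THRESHOLD:
        return "recommend warning"
    if probability > ALERT_THRESHOLD:
        return "alert"
    return "none"


# Hypothetical maximum gusts (mph) from 10 members and an assumed proxy of 60 mph.
gusts = [45, 52, 61, 70, 38, 64, 58, 66, 49, 72]
p_gale = event_probability(gusts, proxy_threshold=60)
action = advice(p_gale)
```

Tuning the proxy threshold changes how often members count as hits, which is the kind of adjustment that the later optimisation of the severe-weather thresholds relies on.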


Figure 6: Sample EPS Meteogram generated from Kalman Filtered ensemble data on FSSSI. Coloured lines show deterministic forecasts from the EPS Control and ECMWF TL511 runs, and two runs of the Met Office model.

Figure 7: A sample PDF generated from data in FSSSI by the upgraded PREVIN system. This example is generated using the fully Calibrated option. Other options are available.

Figure 8 shows ROC (Relative Operating Characteristic) curves for warnings of Severe Gales issued by the EPS-based Early Warnings system at 1, 2, 3 and 4 days ahead in pale blue. Warnings issued by forecasters are shown in dark blue for comparison. It is immediately clear that the EPS forecasts at 4 days ahead are much better than those at shorter range, as the ROC curve is bowed far more towards the top left corner. Results for heavy rain and snow warnings are very similar. It should be noted that these ROC curves are plotted with points at probability thresholds of 0.01, 0.03, 0.05, 0.09, 0.13, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9. The extra points from low probability thresholds make a significant contribution to the area under the ROC curve, so much of the skill is at low probabilities; these forecasts may nevertheless be useful for alerting forecasters to the possibility of severe weather.
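The construction of a ROC curve from the probability thresholds listed above can be sketched as follows. The function names and the sample forecasts are hypothetical; the area is computed with the trapezium rule after closing the curve at (0, 0) and (1, 1).

```python
THRESHOLDS = [0.01, 0.03, 0.05, 0.09, 0.13, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def roc_points(probs, outcomes, thresholds=THRESHOLDS):
    """Return (false alarm rate, hit rate) pairs, one per probability threshold."""
    n_events = sum(outcomes)
    n_non_events = len(outcomes) - n_events
    points = []
    for t in thresholds:
        # A warning is counted whenever the forecast probability reaches t.
        hits = sum(1 for p, o in zip(probs, outcomes) if p >= t and o == 1)
        false_alarms = sum(1 for p, o in zip(probs, outcomes) if p >= t and o == 0)
        points.append((false_alarms / n_non_events, hits / n_events))
    return points


def roc_area(points):
    """Area under the ROC curve, closed at (0, 0) and (1, 1)."""
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical forecast probabilities and outcomes (1 = severe gale occurred).
probs = [0.02, 0.04, 0.1, 0.25, 0.35, 0.65, 0.0, 0.0]
outcomes = [0, 0, 1, 0, 1, 1, 0, 0]
points = roc_points(probs, outcomes)
area = roc_area(points)
```

In this made-up sample the points at thresholds below 0.1 trace out the upper right part of the curve, which illustrates how the low probability thresholds contribute to the area.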
Figure 8: ROC curves for warnings of Severe Gales issued by the EPS-based system (pale blue) and Met Office forecasters (dark blue) at 1, 2, 3 and 4 days (see captions).
Further verification showed that the system used in the operational trial was seriously over-forecasting severe weather, and the severe-weather thresholds have since been tuned to optimise the performance. Figure 9 shows reliability diagrams for the optimised system for severe gales and heavy rain warnings at 2 and 4 days ahead. Because the system has been tuned to these data, the graphs can only represent the maximum potential skill of the system, but they nevertheless indicate some useful potential. The result seen in the ROC graphs of figure 8, that day 4 forecasts are much better than shorter ranges, is seen consistently regardless of the tuning applied. Bearing in mind the very small samples of high probability warnings (as shown by the sharpness diagrams), which tend to lead to noisy reliability diagrams, the results at 4 days ahead are encouraging. There is quite good resolution and reliability.
Figure 9: Reliability (top) and Sharpness (bottom) diagrams for warnings of Severe Gales at 2 and 4 days ahead (left) and Heavy Rain at 2 and 4 days (right).
By contrast, the reliability diagrams for 2-day forecasts (or 3-day forecasts) show no resolution at all: reliability curves are almost horizontal. The only positive feature is that when the forecast probability is zero then severe weather is very unlikely.
Overall results for the Early Warnings system, based on 6½ months of data since the latest upgrade to the ECMWF EPS, show that the operational system has good resolution in 4-day forecasts, although the thresholds used operationally led to quite severe over-forecasting. Using a process of ‘calibration by assessment’ it has been shown that this over-forecasting can be effectively eliminated, giving a potential for good probabilistic forecasts at D+4. This calibration has, however, not been tested using independent data, due to the small data samples available for analysis, so results in the coming season are unlikely to be as good as shown here. It must also be noted that these results may not be typical of other periods, because for a substantial part of the assessment period a bug, discovered subsequently, was affecting the spread of the EPS. Thus the true skill of the recalibrated system can only be assessed over the coming winter season.
Results for shorter forecast periods of 1 to 3 days were less good. Indeed the system has no skill at D+1, and at 2 and 3 days has only a limited ability to discriminate occasions when there is no risk of severe weather from occasions when there is some risk. It may therefore be useful in issuing alerts to forecasters at this range, but not in assessing the actual probabilities of severe weather.
It is interesting to consider why the FGEW system performs so much better at day 4 than at earlier times. The EPS is purposely designed for medium-range use, and at D+1 the perturbations are still very small (although growing rapidly), so poor performance here is unsurprising. At D+2 and D+3 the perturbations should have completed their period of rapid growth and be representative of typical forecast errors, but the performance is still poor. The singular vector perturbations used are designed to look for maximum error growth over the first 48 hours of the forecast, so they represent far from a random sampling of the forecast PDF at that time. However, without a random sampling of the PDF, we should not expect to get reliable estimates of forecast probabilities. It is not until the effects of non-linearity are able to mix up the forecasts beyond about 48 hours that we can expect the ensemble to give us a quasi-random sampling of the forecast PDF, and it is believed that this is why the probability forecasts are much better at day 4.
References
Legg, T.P., Mylne, K.R. and Woolcock, C., 2001: The use of medium-range ensembles at the Met Office I: PREVIN - a system for the production of probabilistic forecast information from the ECMWF EPS. Met Office Forecasting Research Tech. Report No. 333. Also submitted to Meteorol. Appl.
Mylne, K.R., 2001: Decision-making from Probability Forecasts using Calculations of Forecast Value. Met Office Forecasting Research Tech. Report No. 335. Also submitted to Meteorol. Appl.
Persson, 2001: User Guide to ECMWF forecast products. ECMWF.
________________________________ 