Accuracy

Note: I recently completed a larger study. I hope to post results here sometime soon, but meanwhile some analysis can be seen in the manual with the WXSIM download.

Due to the optionally interactive nature of the program, a truly objective study on the accuracy of the program's output is difficult. I have attempted to adopt a very standard set of testing procedures, however, and have devoted a lot of time over the years to documenting WXSIM's accuracy, biases, strengths, and weaknesses. I did this at least as much to improve the program as to determine its validity as a forecast tool.


What follows here is an abbreviated presentation of the major findings. (To download a more detailed presentation of this data, in the form of a 49KB WordPad document, click here).


The output analyzed was forecast maximum (7 AM-7 PM) and minimum (7 PM-8 AM) temperatures at the Atlanta airport (KATL), out to 48 or 72 hours, depending on the study. In all the studies, morning forecasts were made based on data available before 8 AM EST (average ... it varied somewhat, even as early as the midnight before) and afternoon forecasts generally before 6 PM EST (sometimes as early as 1 PM). The actual forecast runs were usually completed within an hour of the time of the data. Comparisons were made with NWS forecasts issued around 5 AM or 5 PM, and with NGM (and in some cases GFS) MOS runs initialized on data from 00Z and 12Z, respectively, available several hours before the WXSIM runs and NWS forecasts. Temperatures were recorded in degrees Fahrenheit.


The largest and most recent study consists of 126 forecasts (48 AM and 78 PM), each out to 72 hours, spanning the period March 24, 2002 to January 26, 2003. The 'competitors' here were Nested Grid Model Model Output Statistics (NGM MOS), GFS (formerly known as AVN) MOS, the official National Weather Service coded city forecast (released about the same times as the WXSIM runs were produced), the latest forecast from The Weather Channel, and me! I made my forecast just after looking at the other sources, adding my judgement and experience to a consensus of the other numbers.

 

Mean Absolute and Net Errors (degrees F)

Source  12 hour  24 hour  36 hour  48 hour  60 hour  72 hour  12-48 hr  12-48 hr
                                                              average   net
NGM     2.88     2.94     2.90     3.29     --       --       3.00      1.21
GFS     2.27     2.43     2.67     3.15     3.30     --       2.63      -0.27
NWS     2.17     2.33     2.48     2.91     3.47     --       2.47      0.18
TWC     1.99     2.04     2.33     3.11     4.00     3.59     2.37      -0.14
WXSIM   1.83     2.49     2.95     3.42     4.03     3.92     2.67      0.10
ME      1.55     2.05     2.29     2.98     3.31     3.31     2.22      0.06
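
For readers who want to see how the two statistics in this table are computed, here is a minimal Python sketch. The function names and the sample temperatures are made up for illustration, and the sign convention for net error (forecast minus observed, so a positive value means the forecasts ran warm) is an assumption, since it is not spelled out above.

```python
def mean_absolute_error(forecasts, observations):
    """Average size of the forecast error, ignoring its sign (degrees F)."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    return sum(abs(e) for e in errors) / len(errors)


def net_error(forecasts, observations):
    """Average signed error (bias). ASSUMED convention: forecast minus observed."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


# Hypothetical forecast and observed maximum temperatures, not study data.
forecast_highs = [71, 68, 75, 80]
observed_highs = [73, 68, 74, 83]

mae = mean_absolute_error(forecast_highs, observed_highs)
bias = net_error(forecast_highs, observed_highs)
print(f"MAE: {mae:.2f} F, net error: {bias:+.2f} F")
```

A small net error alongside a larger mean absolute error means the misses were roughly balanced between too warm and too cold.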
  

I also conducted 3 earlier significant studies (two even earlier ones are not included here due to small sample size and inconsistent methodology). Considering only data from my standard AM and PM time windows (before 8 AM and 6 PM), these were 80 forecasts from 11/29/95-2/3/96, 87 from 12/17/97-2/26/98, and 50 from 1/14/99-4/3/99. These used various 4.x and 5.x versions of the program, the earliest of which was actually a DOS version.

The earlier studies did not include me or The Weather Channel. They did include some data from the old AVN MOS product, which tended to be a bit cold-biased and had about the same overall accuracy as NGM MOS (though perhaps a bit better beyond 24 hours and not as good before). All studies did have in common, though, NGM MOS, NWS, and WXSIM, out to 48 hours, with the same basic methodology, including fairly standard AM and PM forecast times. Presented below are averages for all four data sets: 153 forecasts initialized at an average of 5:24 AM and 190 forecasts initialized at an average of 3:53 PM.


All 343 Forecasts: Mean Absolute Error (degrees F)

Source  12 hour  24 hour  36 hour  48 hour  ALL
NGM     3.14     3.18     3.37     3.95     3.41
NWS     2.65     2.67     2.94     3.39     2.91
WXSIM   2.15     2.81     3.16     3.65     2.94
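
The ALL column appears to be the simple mean of the four forecast horizons. A quick check in Python, using the values from the table above (the dictionary layout and variable names are just for illustration):

```python
# Mean absolute errors (degrees F) at 12, 24, 36, and 48 hours, copied from the table.
mae_by_source = {
    "NGM": [3.14, 3.18, 3.37, 3.95],
    "NWS": [2.65, 2.67, 2.94, 3.39],
    "WXSIM": [2.15, 2.81, 3.16, 3.65],
}

# Simple (unweighted) mean across the four horizons, rounded to two decimals.
overall = {
    source: round(sum(values) / len(values), 2)
    for source, values in mae_by_source.items()
}

for source, value in overall.items():
    print(f"{source:<6} {value:.2f}")
```

The results match the ALL column for all three sources.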
 

The most general statement to make here is that WXSIM's overall accuracy, on average, was significantly better than NGM MOS, and almost exactly on a par with the newer GFS MOS, out through 48 hours. It was very nearly as good as the humans at the National Weather Service and the Weather Channel through this period.

Note that at 12 hours, WXSIM clearly beat all these sources, both computer and human. Also, note that my personal forecasts were best of all. This is partly from the experience of having watched the weather in my area for the last 30 years, but also from adding WXSIM's output to the list of sources to consider when making my forecast. Having used it for a long time, I tend to know its strengths, weaknesses, and small biases, and can therefore make the most of its output.

One other interesting item to note is that NWS, NGM, and WXSIM all seemed to improve in the recent data as compared to the combined data, which includes forecasts going back several years. The improvement in the WXSIM results would seem to make sense, because a number of improvements to the program were made during this time. NGM and NWS actually showed greater improvement, however. It's quite plausible that the humans gained more tools and experience, thereby increasing forecast accuracy, and even that NGM MOS itself may have improved slightly (more statistics to work with, maybe?).

I suspect, though, that another factor was that this last set of forecasts, which included warmer months, such as May and June, and settled months, like October, was simply easier to forecast. These conditions actually expose one relative weakness of WXSIM: it doesn't (yet, at least) have a feedback mechanism to 'learn' from its mistakes. I specifically recall some periods during this data set with very persistent weather patterns, with almost no day-to-day change. After a day or two of this, the humans 'caught on' and would virtually 'nail' the forecast, while WXSIM would make consistent, small errors - like under-forecasting the high temperature by two degrees each day. I suspect MOS products also thrive in such an environment. The more active winter season here may erase some of this advantage.
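
To make the idea of a feedback mechanism concrete, here is a minimal Python sketch of a running bias correction. It is purely illustrative and is not how WXSIM works; the function name, the three-day window, and the sample temperatures are all assumptions.

```python
def bias_corrected(raw_forecast, recent_forecasts, recent_observations, window=3):
    """Shift a raw forecast by the mean signed error of the last `window` days.

    Hypothetical helper, not part of WXSIM. Error is forecast minus observed.
    """
    pairs = list(zip(recent_forecasts, recent_observations))[-window:]
    if not pairs:
        return raw_forecast  # no history yet, nothing to learn from
    mean_error = sum(f - o for f, o in pairs) / len(pairs)
    return raw_forecast - mean_error


# Made-up persistent pattern: the model runs 2 degrees F low on each high.
past_forecasts = [84, 85, 84]
past_observed = [86, 87, 86]

corrected = bias_corrected(85, past_forecasts, past_observed)
print(corrected)
```

In a settled pattern, a correction like this would remove a steady two-degree miss after a few days. In changeable weather it could just as easily add error, so the window length would need care.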

One last note: WXSIM seems to be slightly more reliable when initialized in the afternoon (as compared with other times of day or night), both in absolute terms and relative to other sources. This is probably because it initializes better in the well-mixed daytime air as compared with night, when temperature inversions may be common. Advection data are also more reliable in the daytime, for the same reason. These differences are usually slight, however, and the program generally performs quite well at any hour of the day or night.