The Hallmarks of Quality Metrics
In a previous article I discussed some of the shortcomings of OPS as an advanced metric, which naturally leads to the question: “What are the characteristics of good advanced metrics?” While the relative importance of each criterion is debatable (the list that follows is in no particular order), the considerations should be relatively non-controversial. I’ve used the term “metric” to refer to any statistic or derived category, which is not precise terminology:
1. Clear purpose
Before designing a metric or using it to answer a question, the question of interest must be defined. What is the metric setting out to measure? Most metrics in use, even those that are not in favor with sabermetricians, do fairly well on this score. Counting statistics, regardless of their ultimate utility, are largely clear in terms of definition and meaning. Some, like hits or strikeouts, are inherently obvious. Those with more involved definitions often have a clear purpose even if the execution of that idea is somewhat muddled (like errors).
2. Developed with a theory in mind
This criterion is closely related to a clear purpose, but takes it a step further by questioning the thought process that went into developing the metric. OPS doesn’t fail here, as it is based on the reasonable notion that hitting can be broken down into the broad categories of getting runners on base (OBA) and advancing them (SLG). However, because the two statistics are combined in a somewhat arbitrary way, OPS does not match up to metrics like wOBA and True Average, which are based on a linear weight model of the run scoring process. Other proposed metrics fail spectacularly by simply combining statistical columns without any particular rhyme or reason. Thankfully, most of these fail to gain traction, but some even have their own Wikipedia pages. Metrics of this type may appear to “work” by producing reasonable leaderboards, but the same could be said for any haphazard combination of positive events and categories.
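To make the contrast concrete, here is a minimal sketch that computes OPS and a linear weights run estimate from the same batting line. The `BattingLine` class and function names are hypothetical, OBA is simplified to ignore hit-by-pitch and sacrifice flies, and the run values in `ILLUSTRATIVE_WEIGHTS` are round-number placeholders in the general range of published linear weights, not the coefficients of wOBA, True Average, or any specific system.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Hypothetical container for one batting line (illustrative only)."""
    ab: int
    singles: int
    doubles: int
    triples: int
    hr: int
    bb: int

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.hr

    @property
    def total_bases(self) -> int:
        return self.singles + 2 * self.doubles + 3 * self.triples + 4 * self.hr


def oba(b: BattingLine) -> float:
    # Simplified on base average: ignores HBP and sacrifice flies.
    return (b.hits + b.bb) / (b.ab + b.bb)


def slg(b: BattingLine) -> float:
    return b.total_bases / b.ab


def ops(b: BattingLine) -> float:
    # OPS adds two rates with different denominators and no run-based weighting.
    return oba(b) + slg(b)


# ASSUMPTION: placeholder run values per event, chosen only for illustration.
ILLUSTRATIVE_WEIGHTS = {
    "bb": 0.33, "single": 0.47, "double": 0.78,
    "triple": 1.09, "hr": 1.40, "out": -0.25,
}


def linear_weight_runs(b: BattingLine, w=ILLUSTRATIVE_WEIGHTS) -> float:
    # Each event gets an explicit run value; the total is in runs above average.
    outs = b.ab - b.hits
    return (w["bb"] * b.bb + w["single"] * b.singles + w["double"] * b.doubles
            + w["triple"] * b.triples + w["hr"] * b.hr + w["out"] * outs)


line = BattingLine(ab=500, singles=100, doubles=30, triples=5, hr=15, bb=50)
print(f"OPS: {ops(line):.3f}  linear weight runs: {linear_weight_runs(line):.1f}")
```

The point of the comparison is that the linear weights version states its theory openly: every coefficient is a claim about how many runs an event is worth, and each can be checked against data. OPS makes no such claim about its implied weights.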
3. Accurate
A metric should result in an accurate estimate of whatever it is designed to measure. For instance, a metric that attempts to measure offensive productivity should correlate strongly with team runs scored, since scoring runs is the prime objective of an offense. The best-performing models for estimating team runs scored tend to be based on either dynamic models of the run scoring process (e.g., David Smyth’s Base Runs) or linear weight models, pioneered by George Lindsey and Pete Palmer and now in wide use. Thus it stands to reason that metrics built on linear weights (such as wOBA) are better tools for evaluating offensive production than alternatives that are less highly correlated with runs scored.
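One simple way to run such a check is to correlate each candidate metric with team runs scored and compare the results. The sketch below assumes you already have team-level values in parallel lists; the function names are hypothetical, and the numbers in the usage example are invented purely to show the mechanics, not real team data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(metrics, runs):
    """Return (name, r) pairs sorted from highest to lowest correlation with runs."""
    scores = {name: pearson_r(values, runs) for name, values in metrics.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# INVENTED toy data: five hypothetical teams, two hypothetical metrics.
team_runs = [650, 700, 720, 780, 810]
candidates = {
    "metric_a": [0.700, 0.720, 0.735, 0.760, 0.775],
    "metric_b": [0.310, 0.335, 0.320, 0.330, 0.340],
}
for name, r in rank_by_correlation(candidates, team_runs):
    print(f"{name}: r = {r:.3f}")
```

Correlation is only one yardstick. An error measure such as root mean square error against actual runs would be a reasonable alternative, particularly for metrics that are already expressed in runs.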
Sometimes, though, it is not easy to measure accuracy due to a lack of data to verify against or a desire to use the metric to address a similar but subtly distinct question. For example, metrics validated against team results are often used to measure individual performance, which leads to the next criterion.
4. Adaptable over a wide range of contexts
Although there is nothing inherently wrong with a metric that is designed to work only under a limited set of conditions–so long as said metric is not stretched beyond its capabilities–it is preferable to be confident that the metric will produce reasonable results for a broad set of questions.
Sometimes metrics work well over normal ranges of performance and thus provide reasonable answers for most questions. For example, the common rule of thumb that 10 runs = 1 win is quite accurate at predicting the win totals of major league teams from their runs scored and runs allowed. However, the actual relationship between runs and wins is not linear—it only appears to be linear because the conversion is calibrated over a narrow set of possible outcomes. When the model is applied to more extreme conditions (which in this case could be an average level of runs scored per game much different than major league norms or teams with very low or very high run differentials), the accuracy will suffer. A dynamic model of estimated winning percentage (such as Pythagenport) can maintain accuracy over a wider range of scenarios.
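The difference can be sketched in a few lines. The linear estimate below is the 10-runs-per-win rule written as a winning percentage, and the dynamic estimate uses the commonly cited Pythagenport exponent, 1.50 × log10(runs per game for both teams) + 0.45. The function names and the example run totals are assumptions made for illustration, and both functions assume positive run totals.

```python
import math


def linear_win_pct(runs_scored, runs_allowed, games, runs_per_win=10.0):
    # Rule of thumb: every `runs_per_win` runs of differential adds one win
    # to a .500 record. Nothing keeps the result between 0 and 1.
    return 0.5 + (runs_scored - runs_allowed) / (runs_per_win * games)


def pythagenport_win_pct(runs_scored, runs_allowed, games):
    # The exponent grows with the run environment, so the runs-to-wins
    # conversion adapts to the scoring context instead of being fixed.
    runs_per_game = (runs_scored + runs_allowed) / games
    exponent = 1.50 * math.log10(runs_per_game) + 0.45
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical seasons: one ordinary team, one absurdly dominant team.
for label, rs, ra in [("ordinary", 800, 700), ("extreme", 1500, 400)]:
    lin = linear_win_pct(rs, ra, 162)
    pyt = pythagenport_win_pct(rs, ra, 162)
    print(f"{label}: linear = {lin:.3f}, Pythagenport = {pyt:.3f}")
```

For the ordinary team the two estimates agree closely. For the extreme team the linear rule returns a winning percentage above 1.000, which is impossible, while the ratio form always stays between 0 and 1.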
A related but slightly different issue occurs when some metrics that are designed for use with team data are applied to individuals. A classic example is Bill James’ original version(s) of Runs Created, which recognizes the dynamic relationship between getting runners on base and advancing them. When applied to an individual’s statistics, though, the implication is that the player is reaching base, then advancing himself around the bases, whereas he actually interacts with his teammates. The resulting distortion requires that caution be used when interpreting Runs Created estimates for individual players.
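The distortion can be illustrated with the basic Runs Created formula, (H + BB) × TB / (AB + BB). The sketch below compares the estimate for a star hitter's line taken on its own with the change in the team estimate when that line is added to his teammates' totals. All numbers are invented, the function name is hypothetical, and the "marginal" comparison is just one way to show the effect rather than James' own adjustment.

```python
def runs_created(ab, h, bb, tb):
    """Basic Runs Created: (on-base events) * (total bases) / (opportunities)."""
    return (h + bb) * tb / (ab + bb)


# INVENTED lines: combined totals for eight teammates, plus one star hitter.
teammates = {"ab": 4900, "h": 1300, "bb": 450, "tb": 2050}
star = {"ab": 500, "h": 180, "bb": 120, "tb": 380}

# Team totals with the star included (simple sum of each category).
team_with_star = {key: teammates[key] + star[key] for key in teammates}

rc_star_alone = runs_created(**star)
rc_marginal = runs_created(**team_with_star) - runs_created(**teammates)

print(f"Star evaluated in isolation: {rc_star_alone:.1f} runs")
print(f"Star's marginal effect on the team: {rc_marginal:.1f} runs")
```

Because the formula multiplies on-base events by total bases, evaluating the star in isolation credits him with driving in runners who reach base at his own elevated rate. In the lineup he mostly advances teammates who reach base less often, so his marginal contribution comes out lower than the isolated estimate.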
5. Expressed in meaningful units
Ideally, the metric should return a result that has a logical, interpretable baseball meaning. Metrics expressed in terms of runs and wins are ideal since the connection to the objective of the game is made clear, but any number of other expressions can be meaningful. On Base Average, for instance, represents the percentage of plate appearances in which a batter reaches safely, which is easy to explain and easy to think about in terms of on-field implications.
In some rare instances, it is next to impossible to express a result in meaningful units and so a nebulous value must suffice. One example is Bill James’ Speed Score, which estimates a player’s speed skill by taking into account a number of categories related to speed (such as stolen base attempt frequency, rate of triples per ball in play, defensive range, etc.). Since there is no single manifestation of speed on the field and no obvious units to capture baseball speed, James uses an abstract scale.
6. Not needlessly complex
It is certainly tempting to say that metrics should be simple, but simplicity need not be a goal unto itself. What is important is that the metric not make things more complicated than they need to be.
However, describing complex processes may necessitate the use of complex models. The key is to avoid both phony precision and complexity for its own sake. The end use and user of the metric should also be considered: if a “quick and dirty” estimate is all that is needed, a simple metric will do, while a more complex metric can be reserved for cases where a true best estimate is required.
7. Catchy name
This final entry is somewhat tongue-in-cheek, as it is irrelevant to the quality of a metric, but there’s no denying that marketing matters when it comes to mainstream acceptance. To bring things full circle, a good name succinctly references the intended purpose and use of the metric while providing a minimum amount of ammunition to those looking to mock the field. Whether any sabermetric measures score particularly well on this front will be left as a rhetorical question for the reader.
Photo by Sean Winters