Fans respond to winning, but different fan bases respond differently. Fan response is most easily measured by a team's revenue stream, the largest factor of which is home attendance - essentially a measurement of demand. It follows that if one can understand the relationship between wins and attendance, then one can reasonably predict revenue at different win levels.
When plotted on a graph of wins versus revenue, the function that predicts these points is called a win-curve. (NOTE: The win-curve is actually discrete rather than continuous. Since modern-era Major League Baseball games do not end in ties, there are no fractional wins. The line itself serves strictly as an illustration because win totals will always be integers.)
The concepts discussed here are elaborated on in far greater detail in Vince Gennaro's Diamond Dollars - a book that I highly recommend if you find that this article sparks your interest (check the Offline Reading list). Mr. Gennaro developed win-curves for all 30 teams as a part of his research into these relationships. Using 37 years of historical data, I attempted to build my own for the Texas Rangers.
This is Part I: Wins vs. Attendance.
PREPARING THE DATA
Each team's win-curve is different. The trick to building an accurate win-curve is identifying trends in the relevant market. In Mr. Gennaro's model, he used a 50-50 weighted average of the previous year's wins and the current year's wins, and he compared that to a per-game average of the current year's home attendance.
He accounted for a "new franchise halo", overall industry growth, new stadium effects, and work stoppages. Due to the small sample size, these effects were identified and generalized for Major League Baseball as a whole, rather than for individual teams.
Based on the assumption that each market behaves differently, it is unreasonable to assume these effects will be the same from market to market, but they do represent a reasonable approximation.
In my attempts to recreate Mr. Gennaro's work, I struggled to capture these effects. Without doing similar studies for the other 29 teams, I tried to find other ways to account for these other attendance factors.
As Gennaro surely did, I played with several different values for my wins variable. Each was represented as a winning percentage to help adjust for seasons of different lengths. Here is the list of my various definitions for wins:
- Gennaro's average of previous wins and current wins
- Separate variables for previous wins and current wins
- Average wins of the three most recent seasons
- Separate variables for the wins of the three most recent seasons
For each definition of wins, I tried different weights. In nearly every model, the only significant variable from the group was current wins, though the previous wins variable was significant in a few. In the final model, I chose current winning percentage as my wins variable.
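As a rough illustration of the candidate definitions listed above, here is a minimal Python sketch. The season records are made-up placeholders (not actual Rangers results), the helper names are hypothetical, and the sketch assumes "the three most recent seasons" includes the current one.

```python
# Hypothetical (wins, losses) records for four consecutive seasons, oldest first.
# These are made-up placeholders, not actual Rangers results.
records = [(80, 82), (75, 87), (89, 73), (79, 83)]


def win_pct(wins, losses):
    """Winning percentage, which adjusts for seasons of different lengths."""
    return wins / (wins + losses)


pcts = [win_pct(w, l) for w, l in records]


def blended_wins(pcts, i, weights):
    """Weighted average of season i and the seasons before it.

    weights[0] applies to season i, weights[1] to season i - 1, and so on.
    """
    total = sum(weights)
    return sum(wt * pcts[i - k] for k, wt in enumerate(weights)) / total


i = len(pcts) - 1  # index of the "current" season

# 1. Gennaro-style 50-50 average of previous and current seasons
gennaro = blended_wins(pcts, i, [0.5, 0.5])
# 2. Separate variables for previous and current seasons
separate_two = (pcts[i - 1], pcts[i])
# 3. Equal-weight average of the three most recent seasons (assumed to include the current one)
three_year = blended_wins(pcts, i, [1, 1, 1])
# 4. Separate variables for the three most recent seasons
separate_three = (pcts[i - 2], pcts[i - 1], pcts[i])
# The definition kept in the final model: current winning percentage alone
current_only = pcts[i]
```

Trying different weights amounts to changing the `weights` list, for example `[0.7, 0.3]` to lean more heavily on the current season.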
"New Franchise Halo"
This effect was minimal at best with the Rangers. In their first two years, 1972 and 1973, the Rangers averaged fewer than 4,000 fans per home game. Early on, the Rangers weren't very good, but after several years in Arlington, attendance climbed to well above 15,000 fans per game. This is counter to the typical new franchise halo, where a team sees an early boom that tapers off after a few seasons.
The Metroplex area has only had one Major League Baseball franchise, so there is only one from which to develop a model, leaving too few samples to effectively quantify the "new franchise halo" for Dallas/Fort Worth.
To measure industry growth, a simple counting variable was added - a value of 0 in 1972, up to 36 in 2008. Effectively, this functioned the same as a "Years in Area" variable. In the final model, this variable is not significant - p = 0.255 (approx.) - but its inclusion in the model resulted in a smaller standard error and better R-square and Adjusted R-square values than when it was excluded.
New Stadium Effect / Work Stoppages
The Rangers are a unique franchise in this respect. In 1994, the year they opened the only new stadium in their history, the MLBPA went on strike and the World Series was canceled. The strike continued into 1995, sharply reducing, and effectively negating, the new stadium effect seen before the strike. As a result, in every regression that was run, 1994 was a positive outlier and 1995 was a negative outlier.
Until 2008, these two seasons were the only outliers in every single model tested.
In 1996, the Rangers began the winningest period in the franchise's history, and in each of the models I ran, the results suggested that this winning period was responsible for the high attendance experienced during that period rather than the new stadium. The coincidence is somewhat striking, and in actuality, it was probably a combination of the two that resulted in the high attendance averages.
I added a value for stadium age (in years) to try to capture the new stadium effect, but even after figuring in Arlington Stadium's prior existence as Turnpike Stadium (making it 7 years old in 1972), the variable failed to be significant.
I tried several other variables to help build a better statistical model. Though none turned out to be significant, the following variables were included at some point during testing:
- Years since playoff appearance
- Made playoffs (1 or 0)
I also tried to include TMR's Fan Cost Index, but I was only able to find data for the 18 most recent of the 37 seasons. The lack of sample data for this variable resulted in its exclusion from the tested models.
THE FINAL MODEL
When narrowing my model down to the most relevant sample data, I greatly simplified my thinking. Instead of trying to identify individual factors that affect attendance, I realized that, historically, all of this information already existed as a single variable: attendance. It is definitely not the perfect solution, but last season's attendance is the most significant indicator of attendance for the current season.
None of the models I ran was able to predict the huge drop-off in attendance experienced in 2008. Something that I think might be responsible is the price of gasoline and the increased reliance on gas-guzzling vehicles. Not only did families have less disposable income to spend on baseball games, but the trip to the games became more expensive. Without an effective public transportation solution as an alternative means to get to the ballpark, the mostly commuter fan-base spent their money elsewhere.
After including the previous year's attendance in my models, its significance was immediately apparent, but two other variables remained viable: current winning percentage and growth factor (the counting variable discussed above). Logically, this makes sense.
With this three-variable model, three seasons stood out as dramatic outliers: 1994 (new stadium), 1995 (strike), and 2008 (transportation cost).
It is statistically questionable to eliminate outliers, but in this case, I think it makes sense. I understand that this brings the study as a whole into question, but I'm going to run with it anyway.
In 2009, the Rangers will not be moving into a new stadium; there is no labor conflict on the horizon; and for now, the price of gasoline has returned to a reasonable level.
Because another spike in gasoline prices is possible, 2008 was not removed from the data set. 1972, 1981, and 1995 were all removed because of work stoppages, and 1994 was removed because of the new stadium. (The years that followed still used accurate previous year attendance, so the effects of these events were carried forward to future years even though the immediate effects were not a part of the model.)
The final model included data from 1973 through 2008, skipping over 1981, 1994, and 1995. The dependent variable, of course, was current season average home attendance. The independent variables were the previous season's average home attendance, the current season's winning percentage, and the growth factor described above.
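The data set described here could be assembled along the following lines. This is only a sketch: the `seasons` layout and the `build_rows` helper are hypothetical, and the attendance and winning-percentage values are placeholders rather than real Rangers figures. The one detail taken directly from the text is that a season following an excluded year still uses that year's actual attendance as its "previous attendance" value.

```python
# Seasons left out of the final model, per the article:
# 1972, 1981, 1995 (work stoppages) and 1994 (new stadium).
EXCLUDED = {1972, 1981, 1994, 1995}
FIRST_YEAR = 1972  # the growth counter is 0 in this season


def build_rows(seasons):
    """Return (year, y, [prev_attendance, win_pct, growth]) per modeled season.

    `seasons` maps year -> (avg_home_attendance, win_pct). This layout is
    hypothetical, chosen only for the sketch.
    """
    rows = []
    for year in sorted(seasons):
        if year in EXCLUDED or (year - 1) not in seasons:
            continue  # skip excluded seasons and any season with no prior year
        attendance, pct = seasons[year]
        # Use the actual prior-year attendance even when that year is excluded,
        # so the effect of the event carries forward into the next season.
        prev_attendance = seasons[year - 1][0]
        growth = year - FIRST_YEAR
        rows.append((year, attendance, [prev_attendance, pct, growth]))
    return rows


# Placeholder data: NOT real Rangers attendance or winning percentages.
seasons = {y: (10000.0 + 100.0 * (y - 1972), 0.500) for y in range(1972, 2009)}
rows = build_rows(seasons)
```

The resulting rows can then be handed to any ordinary least squares routine, with attendance as the dependent variable and the three-element list as the predictors.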
The model says that for 2009, Texas Rangers home attendance can be estimated to within about 2,646 attendees per game (the model's standard error) using a chosen win level, the 2008 attendance per game (24,021), and the growth factor (37).
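To show how a win-curve for 2009 follows from a model of this form, here is a hedged sketch. The 2008 attendance, growth factor, and standard error come from the article, but the fitted coefficients are not reported, so the `placeholder_coefs` values below are invented purely for illustration and carry no meaning. The function name and the 162-game conversion from wins to winning percentage are also assumptions.

```python
GAMES = 162              # regular-season games, used to turn wins into a percentage
STD_ERROR = 2646         # standard error reported for the final model
PREV_ATTENDANCE = 24021  # 2008 average home attendance
GROWTH_2009 = 37         # value of the counting variable for 2009


def predict_attendance(wins, coefs, prev_attendance=PREV_ATTENDANCE,
                       growth=GROWTH_2009):
    """Point estimate of average home attendance at a given win total.

    coefs = (intercept, b_prev_attendance, b_win_pct, b_growth)
    """
    intercept, b_prev, b_pct, b_growth = coefs
    return (intercept
            + b_prev * prev_attendance
            + b_pct * (wins / GAMES)
            + b_growth * growth)


# PLACEHOLDER coefficients, invented for illustration only.
# The article does not report the fitted values.
placeholder_coefs = (0.0, 0.8, 10000.0, 10.0)

# One point per integer win total, since fractional wins do not exist.
# Each entry is (wins, estimate - SE, estimate, estimate + SE).
curve = []
for wins in range(60, 101):
    estimate = predict_attendance(wins, placeholder_coefs)
    curve.append((wins, estimate - STD_ERROR, estimate, estimate + STD_ERROR))
```

Treating one standard error on either side of the estimate as a band is only a rough reading of "within 2,646"; a proper prediction interval would be somewhat wider.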
The graph below represents the relationship between wins and attendance for 2009.
The yellow dot marks the 2008 average home attendance, and the red dot marks last season's win total. According to the model, the Rangers should see an increase in attendance over last season for as few as 69 wins.
I think the transportation cost will be a huge factor going into the 2009 season, since the most viable method of getting to the ballpark is to drive.
RELEVANT STATISTICS NOTES
For the final model, the R-square value is 0.904, the adjusted R-square is 0.894, and the standard error is 2,646 attendees per game.
The independent variables have the following p-values: previous attendance average < 0.000002, winning percentage < 0.0008, and growth factor approximately 0.255.
As discussed above, the removal of the growth factor variable resulted in smaller R-square values and a higher standard error, so it was left in the final model.
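As a quick consistency check on the figures above, the adjusted R-square can be recomputed from the reported R-square, the number of seasons in the final model, and the number of independent variables. The sketch below uses the standard adjusted R-square formula; the variable names are illustrative.

```python
# Seasons in the final model: 1973 through 2008, skipping 1981, 1994, and 1995
seasons = [y for y in range(1973, 2009) if y not in {1981, 1994, 1995}]
n = len(seasons)   # number of observations
k = 3              # independent variables: previous attendance, win pct, growth
r_squared = 0.904  # reported R-square

# Standard adjustment for the number of predictors
adjusted = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(n, round(adjusted, 3))  # prints: 33 0.894
```

With 33 seasons and three variables, the result matches the reported adjusted R-square of 0.894.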
IN PART II
In Part II, I will tackle the topic of post-season probability at different win levels. Combined with this article, it will be possible to start turning these numbers into dollars.