Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 245-251

Published online September 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.3.245

© The Korean Institute of Intelligent Systems

Pythagorean Exponents Induced by Mathematical and Statistical Methods in the Major League Baseball

Jin Hee Yoon1 and Seung Hoe Choi2

1Department of Mathematics and Statistics, Sejong University, Seoul, Korea
2Liberal Arts and Sciences, Korea Aerospace University, Goyang, Korea

Correspondence to: Seung Hoe Choi (shchoi@kau.ac.kr)

Received: July 27, 2021; Revised: September 19, 2022; Accepted: September 20, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This study introduces two kinds of Pythagorean exponent and their applications. We propose a mathematical and a statistical analysis of the exponent and suggest that both can be applied to sports data related to the winning percentage. The first exponent, Mpe, is obtained by mathematical analysis and is calculated from a deterministic model. The second, Spe, is obtained by statistical analysis and is derived from a probabilistic model. As an application, we compare the two results using the records of teams participating in Major League Baseball (MLB) from 1901 to the present. The value of Mpe calculated by the mathematical method is approximately 2.039, and the value of Spe inferred by the statistical method from a regression model of the winning percentage is 1.854. Furthermore, the value of Mpe for the 16 teams that have participated in MLB since 1901 varies statistically with the number of post-season advances and the actual winning percentage. The adequacy of the two results is compared using the mean absolute percentage error (MAPE) and the correlation between the predicted and observed winning percentages.

Keywords: Pythagorean exponent, MLB, Winning percentage, Regression, MAPE

1. Introduction

In ball games played with the body, such as soccer, basketball, volleyball, American football, and handball, and in ball games played with equipment, such as baseball, table tennis, cricket, water polo, and lacrosse, points scored and points conceded are the basis for victory [1,2]. Because the winner of such a game is the side that scores more points than it concedes, points scored and points conceded affect the winning percentage. For this reason, the prediction and assessment of athletic performance have been widely studied [3,4].

Bill James [5], a founder of sabermetrics, proposed the Pythagorean expectation, which can be compared with the actual winning percentage of the 16 teams in Major League Baseball (MLB). He estimated the Pythagorean winning rate Wp by dividing the square of each team's runs scored by the sum of the squares of its runs scored and runs allowed. In general form, the Pythagorean winning rate is equal to

$$
W_p = \frac{R_s^{\gamma}}{R_s^{\gamma} + R_a^{\gamma}},
$$

that is, the ratio of the γ-th power of runs scored Rs to the sum of the γ-th powers of runs scored and runs allowed Ra, where Wp denotes the winning percentage and γ the Pythagorean exponent. Although Bill James suggested a Pythagorean exponent of 2, the exponent is known to differ between sports. Miller [6] modeled the runs scored and runs allowed of MLB teams with the Weibull distribution and found that the Pythagorean winning rate agreed well with the actual winning rate when the exponent was 1.83. In addition to baseball, the Pythagorean exponent has been studied in various ball games such as soccer, basketball, hockey, tennis, and cricket [7–11]. A single Pythagorean exponent may not fit every team equally well, depending on each team's performance [12]. The winning percentage of a team with good offensive ability might depend more on runs scored, while that of a team with good defensive ability might be affected more by runs allowed [13]. In addition, runs scored, runs allowed, and the difference between them may each be related to the Pythagorean exponent independently [14,15].

The exponent can be estimated in two ways: (1) a numerical method that calculates the Pythagorean exponent by treating runs scored and runs allowed as fixed values, and (2) a statistical method that treats runs scored and runs allowed as statistical variables and infers the exponent [16,17]. The former calculates the exponent directly from a deterministic model, and the latter infers it from a stochastic model. It is also meaningful to compare the statistical characteristics of the Pythagorean exponent according to each team's situation. Because the Pythagorean exponent can serve as an indicator of a team's characteristics, comparing the mathematical and statistical results is of interest. The results of this study provide a basis for comparing mathematical and statistical results in different ball games.
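As a minimal sketch of the Pythagorean winning rate defined above, the following Python function evaluates Wp for a given exponent. The function name `pythagorean_expectation` is an illustrative assumption, not part of the original study; the sample inputs are the New York Yankees' average runs scored and allowed from Table 1.

```python
def pythagorean_expectation(runs_scored: float, runs_allowed: float,
                            gamma: float = 2.0) -> float:
    """Return Wp = Rs^gamma / (Rs^gamma + Ra^gamma)."""
    if runs_scored <= 0 or runs_allowed <= 0:
        raise ValueError("runs scored and runs allowed must be positive")
    rs_pow = runs_scored ** gamma
    ra_pow = runs_allowed ** gamma
    return rs_pow / (rs_pow + ra_pow)


# Average runs scored and allowed of the New York Yankees (Table 1).
rs, ra = 4.864, 4.173
for gamma in (2.0, 1.83):  # James's exponent [5] and the value cited from Miller [6]
    print(f"gamma = {gamma}: Wp = {pythagorean_expectation(rs, ra, gamma):.3f}")
```

A team that scores exactly as many runs as it allows gets Wp = 0.5 for any exponent, so the exponent only controls how quickly Wp moves away from 0.5 as the run ratio changes.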

This study introduces the Pythagorean exponent related to the winning percentage Wp and compares the exponent Mpe, calculated from a deterministic model, with the exponent Spe, estimated by a statistical method, using the records of the teams participating in the MLB since 1901. The value of Mpe calculated by mathematical and numerical analysis is 2.039, and the value of Spe inferred from regression analysis is 1.854.

Mpe differed significantly depending on the number of post-season advances, and Spe was correlated with the regression coefficients of runs scored and runs allowed. Furthermore, the difference between the actual winning percentage W and the predicted Pythagorean expectation Wp was smaller for the statistical model than for the deterministic model. However, the correlation between the actual and estimated values did not differ between the two methods.

2. Mathematical and Statistical Methods

2.1 Pythagorean Exponent Using Mathematical Analysis

Several conditions are required to calculate directly the Pythagorean exponent γ that Bill James suggested. Because no team in the MLB has ever won or lost 100% of its games, the winning percentage W lies strictly between 0 and 1. When the number of wins and defeats is identical, the Pythagorean expectation Wp becomes 0.5. In addition, because no team has ever gone an entire season without scoring, Rs ≠ 0 can be assumed. Therefore, the assumptions required to calculate the Pythagorean exponent γ are as follows.

$$
1)\; 0 < W < 1, \qquad 2)\; R_s \neq R_a, \qquad 3)\; R_s \neq 0.
$$

Since all of the teams involved in the MLB have recorded a positive number of runs scored every season, their winning percentage, runs scored and runs allowed satisfy

$$
\ln\left(\frac{1-W}{W}\right) = \gamma \ln\left(\frac{R_a}{R_s}\right).
$$
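The log-linear relation above can be recovered from the Pythagorean winning rate in a few steps. The derivation below is a sketch that assumes the observed winning percentage W is identified with Wp; the last line, which solves for γ, is where the assumption Rs ≠ Ra is needed.

```latex
\begin{aligned}
W &= \frac{R_s^{\gamma}}{R_s^{\gamma} + R_a^{\gamma}}
  \quad\Longrightarrow\quad
  1 - W = \frac{R_a^{\gamma}}{R_s^{\gamma} + R_a^{\gamma}}, \\
\frac{1 - W}{W} &= \left(\frac{R_a}{R_s}\right)^{\gamma}, \\
\ln\left(\frac{1 - W}{W}\right) &= \gamma \ln\left(\frac{R_a}{R_s}\right)
  \quad\Longrightarrow\quad
  \gamma = \frac{\ln\bigl((1 - W)/W\bigr)}{\ln\left(R_a / R_s\right)} .
\end{aligned}
```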

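From the above equation, the Pythagorean exponent Mpe can be calculated directly using a deterministic model as the average

$$
M_{pe} = \frac{1}{kn} \sum_{i=1}^{k} \sum_{j=1}^{n} M_{pe}(i,j),
$$

where Mpe(i, j) is the value of γ obtained by solving the above equation for team i in season j, k is the number of teams, and n is the number of seasons.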
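The direct calculation can be sketched in Python as follows. The function names are illustrative assumptions; the sample records are long-run team averages taken from Table 1 and serve only as input for demonstration, whereas the study applies the calculation to team-season records.

```python
import math


def exponent_from_record(win_pct: float, runs_scored: float,
                         runs_allowed: float) -> float:
    """Solve ln((1 - W) / W) = gamma * ln(Ra / Rs) for gamma."""
    if not 0.0 < win_pct < 1.0:
        raise ValueError("W must lie strictly between 0 and 1")
    if runs_scored <= 0 or runs_allowed <= 0 or runs_scored == runs_allowed:
        raise ValueError("need positive runs with Rs != Ra")
    return math.log((1.0 - win_pct) / win_pct) / math.log(runs_allowed / runs_scored)


def mean_exponent(records):
    """Average the exponent over all (W, Rs, Ra) records."""
    values = [exponent_from_record(w, rs, ra) for (w, rs, ra) in records]
    return sum(values) / len(values)


# (W, Rs, Ra) long-run averages of two teams from Table 1 (sample input only).
sample = [(0.570, 4.864, 4.173),   # New York Yankees
          (0.465, 4.226, 4.585)]   # Philadelphia Phillies
print(mean_exponent(sample))
```

The input checks mirror the three assumptions listed above: a record with W equal to 0 or 1, or with Rs equal to Ra, has no well-defined exponent and is rejected rather than silently skipped.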

2.2 Pythagorean Exponent Using Statistical Analysis

Runs scored and runs allowed are recorded in every ball game and can be treated as independent variables, while a team's winning percentage can be treated as a statistical variable as well [4]. On this basis, we designed a statistical model in which runs scored and runs allowed affect the winning percentage. A team wins a game when it scores more runs than it allows. However, this does not mean that scoring more runs leads directly to a higher winning percentage, because allowing few runs is another key to winning. Runs scored and runs allowed can therefore be said to affect the winning percentage independently.

From 1901 to 2019, the runs scored and runs allowed by all teams in MLB did not have a statistically linear relationship with one another (p-value > 0.1) [17]. The regression model of winning percentage using runs scored and allowed was

$$
W_{ij} = \beta_{1j} R_{s,ij} + \beta_{2j} R_{a,ij} + \varepsilon_{ij}, \qquad i = 1, \ldots, 119, \quad j = 1, \ldots, 16.
$$

Similarly, the regression model that describes the effect of the difference between runs scored and runs allowed on the winning percentage is

$$
W_{ij} = \frac{1}{2} + \alpha \left(R_{s,ij} - R_{a,ij}\right) + \eta_{ij}, \qquad i = 1, \ldots, 119, \quad j = 1, \ldots, 16,
$$

where ɛij and ηij are error terms that follow a normal distribution. In addition, a regression model that can be used to estimate the Pythagorean exponent suggested by Bill James [5] is

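$$
\ln\left(\frac{1 - W_{ij}}{W_{ij}}\right) = \gamma \ln\left(\frac{R_{a,ij}}{R_{s,ij}}\right) + \varepsilon_{ij}, \qquad i = 1, \ldots, 119, \quad j = 1, \ldots, 16.
$$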
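Both the run-difference model and the log-ratio model above are linear with a known intercept (1/2 and 0, respectively), so their single coefficient can be estimated by least squares through the origin. The sketch below assumes ordinary least squares with no weighting; the function names are illustrative and are not taken from the original study, and any scaling of the variables used for the values in Table 1 is not reproduced here.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope b for the model y = b * x + error (no intercept)."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def estimate_gamma(win_pct, runs_scored, runs_allowed):
    """Estimate gamma in ln((1 - W) / W) = gamma * ln(Ra / Rs) + error."""
    x = [math.log(ra / rs) for rs, ra in zip(runs_scored, runs_allowed)]
    y = [math.log((1.0 - w) / w) for w in win_pct]
    return slope_through_origin(x, y)


def estimate_alpha(win_pct, runs_scored, runs_allowed):
    """Estimate alpha in W = 1/2 + alpha * (Rs - Ra) + error."""
    x = [rs - ra for rs, ra in zip(runs_scored, runs_allowed)]
    y = [w - 0.5 for w in win_pct]
    return slope_through_origin(x, y)


# Long-run averages of three teams from Table 1 (sample input only).
w = [0.570, 0.465, 0.526]
rs = [4.864, 4.226, 4.307]
ra = [4.173, 4.585, 4.076]
print(estimate_gamma(w, rs, ra), estimate_alpha(w, rs, ra))
```

Unlike the direct calculation, which solves for one exponent per record and then averages, this estimator pools all records into one fit, so records with a run ratio close to 1 carry little weight instead of producing extreme individual exponents.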

3. Data Analysis

3.1 Dataset

Since the Cincinnati Red Stockings, the first professional baseball team, were founded in 1869, many baseball teams have been established, and 30 teams currently play in the MLB. The modern form of the major leagues began in 1901 with the birth of the American League (AL), and in 1903 the World Series between the two league champions took place for the first time [18,19]. In 1969, when the mound was lowered to 10 inches, the MLB was divided into western and eastern divisions, and in 1994 it was reorganized into western, central, and eastern divisions. This study used the records of 16 teams: 15 teams that have participated in the MLB since 1901 and the New York Yankees, which joined in 1903. The data used in this paper were obtained from MLB.com, Fangraphs, and Baseball-reference.

Table 1 shows the regression coefficients estimated by the least squares method according to the regression model (3), (4), and (5).

Table 1 also shows the average runs scored and runs allowed by the MLB teams from 1901 to 2019. The Pythagorean exponent can be calculated from the averages of the winning percentage, runs scored, and runs allowed. The Pythagorean exponent value of 2.039 is calculated from each team's average runs scored and runs allowed according to Eq. (1). Another method is to calculate the Pythagorean exponent of each team for each season since 1901 and then average these values, which gives

$$
M_{pe} = \frac{1}{kn} \sum_{i=1}^{k} \sum_{j=1}^{n} M_{pe}(i,j) = 2.061.
$$

Here, i and j index the team and the year, respectively, with k = 16 and n = 119. The first method resulted in a lower value than the second method.
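The two values 2.039 and 2.061 differ because solving for the exponent is a nonlinear operation, so averaging before solving and solving before averaging generally give different answers. The following sketch illustrates the two orders of operation; the function names are assumptions, and the three records are team averages from Table 1 used only as sample input, so the printed values are not expected to reproduce the figures reported above.

```python
import math


def exponent(w, rs, ra):
    """gamma that solves ln((1 - W) / W) = gamma * ln(Ra / Rs)."""
    return math.log((1.0 - w) / w) / math.log(ra / rs)


def exponent_of_averages(records):
    """Average W, Rs and Ra first, then solve for gamma once."""
    n = len(records)
    w_bar = sum(r[0] for r in records) / n
    rs_bar = sum(r[1] for r in records) / n
    ra_bar = sum(r[2] for r in records) / n
    # Fails with ZeroDivisionError if the averaged Rs and Ra coincide.
    return exponent(w_bar, rs_bar, ra_bar)


def average_of_exponents(records):
    """Solve for gamma record by record, then average the results."""
    return sum(exponent(*r) for r in records) / len(records)


# (W, Rs, Ra) averages of three teams from Table 1 (sample input only).
records = [(0.570, 4.864, 4.173), (0.526, 4.307, 4.076), (0.465, 4.226, 4.585)]
print(exponent_of_averages(records), average_of_exponents(records))
```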

3.2 Statistical Analysis of Pythagorean Exponent

The variables that explain the winning percentage may be statistically related to each other. First, we examined the linear relationships between the calculated Pythagorean exponents and the quantities derived from each team's runs scored and runs allowed. The correlation between the Pythagorean exponent calculated using the deterministic model and the one estimated using the stochastic model, {(Mpe(j), Spe(j)) : j = 1, ⋯, 16}, was not statistically significant (p = 0.32). In addition, the correlation coefficient between the statistically estimated Pythagorean exponent and α̂, the estimated regression coefficient of the difference between runs scored and runs allowed on the winning percentage, was 0.748 (p < 0.05). The correlation coefficient between the numerically calculated Pythagorean exponent and the regression coefficient of runs allowed on the winning percentage was −0.540, a significant negative correlation (p < 0.05).
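The first of these correlations can be checked approximately from the per-team exponents listed in Table 1. The sketch below computes the Pearson correlation coefficient and the corresponding t statistic with n − 2 degrees of freedom; it assumes the reported test is the usual two-sided t test for a correlation, which the text does not state explicitly. The critical value 2.145 is the two-sided 5% point of the t distribution with 14 degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Per-team exponents copied from Table 1, in the table's team order.
spe = [1.844, 1.898, 1.903, 1.808, 1.746, 1.859, 1.788, 1.837,
       1.823, 1.882, 1.883, 1.915, 1.882, 1.875, 1.792, 1.818]
mpe = [2.016, 1.901, 2.085, 1.197, 1.889, 2.239, 1.958, 2.191,
       1.991, 1.865, 2.814, 2.477, 1.923, 1.848, 1.276, 3.297]

r = pearson_r(mpe, spe)
t = t_statistic(r, len(mpe))
# |t| > 2.145 would indicate significance at the 5% level (two-sided, 14 d.o.f.).
print(f"r = {r:.3f}, t = {t:.3f}, significant at 5%: {abs(t) > 2.145}")
```

With these table values the t statistic stays well below the critical value, which is in line with the non-significant correlation reported above.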

It is meaningful to compare the regression coefficients of runs scored and runs allowed in the regression model (2) with the Pythagorean exponent. The aim is to see how the Pythagorean exponent differs between teams that are influenced more by runs scored and teams that are influenced more by runs allowed. In Table 1, the Pythagorean exponent (2.335) of the teams that were more affected by runs scored (Boston, Cleveland, Dodgers, Minnesota, New York, and St. Louis) was greater than the Pythagorean exponent (1.896) of the teams that were more affected by runs allowed (p < 0.05).

Neither the calculated Pythagorean exponent Mpe nor the estimated Pythagorean exponent Spe differed significantly between teams with a high winning percentage (over 0.5) and teams with a low winning percentage (less than 0.5) (p > 0.1). The exponents were also compared across teams with a high number of post-season advances (more than 28), a low number (fewer than 14), and an intermediate number (fewer than 28 and more than 15). There was no statistically significant difference in the Pythagorean exponent Spe (p > 0.1). However, the median test indicated that the numerically calculated Pythagorean exponent Mpe differs depending on the number of post-season advances (p = 0.051). The main reason appears to be the influence of the New York Yankees, who have far more post-season advances than any other team. Because the Yankees' post-season count in Table 1 is clearly higher than that of the other teams, the median test, a non-parametric method, was used in this study.
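A median test of this kind can be sketched as follows. The code implements Mood's median test for exactly three groups, where the chi-square statistic has 2 degrees of freedom and its p-value equals exp(−χ²/2). The grouping thresholds used in the example (at least 28 advances as high, at most 14 as low) are an assumption made for illustration, since the boundaries quoted in the text leave some teams unassigned; the result is therefore not expected to reproduce the reported p = 0.051.

```python
import math
from statistics import median


def median_test_three_groups(groups):
    """Mood's median test for exactly three non-empty groups.

    Returns (chi2, p_value). Values equal to the grand median are counted
    as "not above". With 2 degrees of freedom, p = exp(-chi2 / 2).
    """
    if len(groups) != 3 or any(len(g) == 0 for g in groups):
        raise ValueError("this sketch needs exactly three non-empty groups")
    pooled = [v for g in groups for v in g]
    grand_median = median(pooled)
    above_total = sum(v > grand_median for v in pooled)
    below_total = len(pooled) - above_total
    if above_total == 0 or below_total == 0:
        raise ValueError("all values fall on one side of the grand median")
    chi2 = 0.0
    for g in groups:
        above = sum(v > grand_median for v in g)
        below = len(g) - above
        exp_above = len(g) * above_total / len(pooled)
        exp_below = len(g) * below_total / len(pooled)
        chi2 += (above - exp_above) ** 2 / exp_above
        chi2 += (below - exp_below) ** 2 / exp_below
    return chi2, math.exp(-chi2 / 2.0)


# (post-season advances, Mpe) for the 16 teams, copied from Table 1.
teams = [(25, 2.016), (14, 1.901), (24, 2.085), (20, 1.197), (9, 1.889),
         (15, 2.239), (14, 1.958), (16, 2.191), (33, 1.991), (16, 1.865),
         (55, 2.814), (28, 2.477), (14, 1.923), (17, 1.848), (26, 1.276),
         (29, 3.297)]
high = [m for a, m in teams if a >= 28]       # assumed threshold
low = [m for a, m in teams if a <= 14]        # assumed threshold
middle = [m for a, m in teams if 14 < a < 28]
chi2, p_value = median_test_three_groups([high, middle, low])
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

With only 16 teams the expected cell counts are small, so the chi-square approximation is rough and the p-value should be read as indicative only.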

This suggests that each team's winning percentage can be predicted from its runs scored, its runs allowed, and the estimated Pythagorean exponent, because the statistically estimated exponent of each team is not affected by its winning percentage or its number of post-season advances. By contrast, the Pythagorean exponent calculated by numerical analysis differs statistically with the team's level, as measured by post-season advances. Therefore, when a directly calculated Pythagorean exponent is used, it may be better to predict winning percentages separately according to each team's number of post-season advances.

3.3 Efficiency of Predicted Values

Since Bill James [5] suggested the relationship between the Pythagorean exponent and the expectation, it is necessary to compare the winning percentage of each team predicted with the Mpe calculated from the deterministic model and with the Spe estimated from the stochastic model. If the difference between the winning percentage estimated with the Pythagorean exponent and the actual winning percentage is small, then runs scored, runs allowed, and the Pythagorean exponent can be used to predict the team's winning percentage.

The efficiency of prediction can be described with two metrics: (1) the error between the predicted value ŷt and the observed value yt, and (2) the correlation coefficient between the predicted value ŷt and the observed value yt. The mean absolute percentage error (MAPE), which measures the error of the predicted values, is

$$
\mathrm{Mape} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{\hat{y}_t - y_t}{y_t} \right| \times 100.
$$

A metric based on the linear relationship between the predicted value ŷt and the observed value yt is

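$$
\mathrm{Corr} = \frac{\sum_{t=1}^{n} (\hat{y}_t - \bar{\hat{y}})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (\hat{y}_t - \bar{\hat{y}})^2 \, \sum_{t=1}^{n} (y_t - \bar{y})^2}},
$$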

where $\bar{\hat{y}}$ and $\bar{y}$ are the averages of the predicted values ŷt and the observed values yt, respectively. The closer Mape is to 0 and the closer Corr is to 1, the more efficient the estimated results are considered. In other words, a good estimate has a small error and a distribution similar to that of the observed values.
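The two efficiency measures can be sketched in Python as follows. The MAPE here is averaged over the n observations, as its name implies; the function names and the three sample teams (long-run averages from Table 1) are illustrative assumptions, so the printed values are not the entries of Table 2.

```python
import math


def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    n = len(observed)
    return 100.0 / n * sum(abs((p - o) / o) for p, o in zip(predicted, observed))


def corr(predicted, observed):
    """Pearson correlation between predicted and observed values."""
    n = len(observed)
    mp, mo = sum(predicted) / n, sum(observed) / n
    num = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    den = math.sqrt(sum((p - mp) ** 2 for p in predicted)
                    * sum((o - mo) ** 2 for o in observed))
    return num / den


# Observed winning percentages and (Rs, Ra) averages from Table 1:
# New York Yankees, Philadelphia Phillies, LA Dodgers (sample input only).
observed = [0.570, 0.465, 0.526]
runs = [(4.864, 4.173), (4.226, 4.585), (4.307, 4.076)]
for gamma in (2.039, 1.854):  # Mpe and Spe reported in the text
    predicted = [rs ** gamma / (rs ** gamma + ra ** gamma) for rs, ra in runs]
    print(gamma, round(mape(predicted, observed), 3), round(corr(predicted, observed), 3))
```

Because Corr is unchanged by any increasing linear rescaling of the predictions, two exponents can give nearly identical Corr values while differing clearly in Mape, which is the pattern described for the two methods in this section.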

Table 2 shows the efficiency of the predicted values based on the numerical method and the statistical method. The Mape, which measures the error between the predicted and observed values, was 4.259 for the statistical method, which appears more efficient than the numerical method with 5.234. Corr, which measures the linear relationship between the predicted and observed values, did not differ between the two methods.

The difference in Mape between the values predicted by the deterministic method and by the probabilistic method can be attributed to the influence of the New York Yankees and St. Louis, which recorded numerous post-season advances.

This study investigated the relationship between the actual winning percentage and the Pythagorean exponent proposed by baseball statistician Bill James [5]. The Pythagorean exponent was determined with deterministic and stochastic models from the runs scored and runs allowed of the 16 teams that participated in the MLB from 1901 to 2019. As a result, the value Mpe calculated by numerical analysis was approximately 2.039, and the value Spe estimated by the statistical method using the regression model was 1.854. In addition, the statistical relationships among the Pythagorean exponent, runs scored, runs allowed, and their difference were examined. Statistical analysis showed that the Pythagorean exponent differed depending on the number of post-season advances, and that the regression coefficients of runs scored and runs allowed on the winning percentage were related to the Pythagorean exponent. Furthermore, a comparison of the actual winning percentage with the predicted Pythagorean expectation showed that the values predicted by the statistical method had a smaller error than those calculated with the deterministic model.

Since this paper demonstrated that the statistically estimated Pythagorean expectation can predict winning percentage, it may be meaningful to compare the results of other prediction methods with that of Pythagorean exponent in future studies.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1A01011131 and 2021R1F1A1057507).
Table 1. Coefficient of regression and Pythagorean exponent of each team.

| Team | Average winning percentage | Average runs scored | Average runs allowed | α | β1 | β2 | γ (statistical Pythagorean exponent) | Mpe (numerical Pythagorean exponent) | Post-season advances |
|---|---|---|---|---|---|---|---|---|---|
| Atlanta Braves | 0.486 | 4.180 | 4.310 | 0.964 | 0.664 | 0.742 | 1.844 | 2.016 | 25 |
| Baltimore Orioles | 0.474 | 4.309 | 4.594 | 0.957 | 0.748 | 0.948 | 1.898 | 1.901 | 14 |
| Boston Redsox | 0.519 | 4.664 | 4.472 | 0.956 | 0.82 | 0.798 | 1.903 | 2.085 | 24 |
| Chicago Cubs | 0.505 | 4.351 | 4.292 | 0.941 | 0.65 | 0.845 | 1.808 | 1.197 | 20 |
| Chicago Whitesox | 0.502 | 4.358 | 4.340 | 0.94 | 0.847 | 1.043 | 1.746 | 1.889 | 9 |
| Cincinnati Reds | 0.500 | 4.304 | 4.326 | 0.911 | 0.741 | 0.788 | 1.859 | 2.239 | 15 |
| Cleveland Indians | 0.513 | 4.536 | 4.417 | 0.93 | 0.964 | 0.916 | 1.788 | 1.958 | 14 |
| Detroit Tigers | 0.504 | 4.613 | 4.583 | 0.942 | 0.787 | 0.909 | 1.837 | 2.191 | 16 |
| LA Dodgers | 0.526 | 4.307 | 4.076 | 0.943 | 0.968 | 0.8 | 1.823 | 1.991 | 33 |
| Minnesota Twins | 0.480 | 4.389 | 4.578 | 0.951 | 0.891 | 0.801 | 1.882 | 1.865 | 16 |
| New York Yankees | 0.570 | 4.864 | 4.173 | 0.964 | 1.047 | 0.653 | 1.883 | 2.814 | 55 |
| Oakland Athletics | 0.488 | 4.433 | 4.568 | 0.965 | 0.715 | 0.758 | 1.915 | 2.477 | 28 |
| Philadelphia Phillies | 0.465 | 4.226 | 4.585 | 0.964 | 0.709 | 0.948 | 1.882 | 1.923 | 14 |
| Pittsburgh Pirates | 0.508 | 4.362 | 4.274 | 0.941 | 0.735 | 0.791 | 1.875 | 1.848 | 17 |
| San Francisco Giants | 0.534 | 4.469 | 4.149 | 0.939 | 0.765 | 0.793 | 1.792 | 1.276 | 26 |
| St. Louis Cardinals | 0.521 | 4.442 | 4.249 | 0.94 | 0.855 | 0.743 | 1.818 | 3.297 | 29 |

Table 2. Efficiencies of the predicted values.

| Team | Mape: Ms | Mape: Mm | CORR: Cs | CORR: Cm |
|---|---|---|---|---|
| Atlanta Braves | 4.548 | 4.766 | 0.959 | 0.959 |
| Baltimore Orioles | 4.864 | 4.864 | 0.951 | 0.951 |
| Boston Redsox | 3.606 | 3.795 | 0.961 | 0.961 |
| Chicago Cubs | 4.465 | 6.056 | 0.946 | 0.945 |
| Chicago Whitesox | 4.062 | 4.190 | 0.941 | 0.941 |
| Cincinnati Reds | 4.503 | 5.024 | 0.923 | 0.923 |
| Cleveland Indians | 4.296 | 4.269 | 0.928 | 0.928 |
| Detroit Tigers | 4.166 | 4.716 | 0.943 | 0.942 |
| LA Dodgers | 3.879 | 3.933 | 0.941 | 0.941 |
| Minnesota Twins | 4.036 | 4.042 | 0.955 | 0.955 |
| New York Yankees | 3.717 | 7.265 | 0.943 | 0.945 |
| Oakland Athletics | 4.414 | 6.669 | 0.970 | 0.970 |
| Philadelphia Phillies | 4.878 | 4.885 | 0.952 | 0.952 |
| Pittsburgh Pirates | 4.390 | 4.400 | 0.948 | 0.948 |
| San Francisco Giants | 4.072 | 5.118 | 0.933 | 0.933 |
| St. Louis Cardinals | 4.082 | 9.785 | 0.945 | 0.946 |
| MLB League | 4.280 | 5.236 | 0.946 | 0.946 |

*Ms and Cs denote the efficiency of the statistical method; Mm and Cm denote the efficiency of the mathematical calculation.


  1. Wikipedia. (2022). Ball game. Available: https://en.wikipedia.org/wiki/Ball_game
  2. Wikipedia. (2022). List of racket. Available: https://en.wikipedia.org/wiki/List_of_racket
  3. Liu, H, Gomez, MA, Lago-Penas, C, and Sampaio, J (2015). Match statistics related to winning in the group stage of 2014 Brazil FIFA World Cup. Journal of Sports Sciences. 33, 1205-1213. https://doi.org/10.1080/02640414.2015.1022578
  4. Loughin, TM, and Bargen, JL (2008). Assessing pitcher and catcher influences on base stealing in Major League Baseball. Journal of Sports Sciences. 26, 15-20. https://doi.org/10.1080/02640410701287255
  5. James, B (1983). The Bill James Baseball Abstract. New York, NY: Ballantine Books
  6. Miller, SJ (2007). A derivation of the Pythagorean won-loss formula in baseball. Chance. 20, 40-48. https://doi.org/10.1080/09332480.2007.10722831
  7. Hamilton, HH (2011). An extension of the Pythagorean expectation for association football. Journal of Quantitative Analysis in Sports. 7. article no 2
  8. Kubatko, J, Oliver, D, Pelton, K, and Rosenbaum, DT (2007). A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports. 3. article no 3
  9. Cochran, JJ, and Blackstock, R (2009). Pythagoras and the national hockey league. Journal of Quantitative Analysis in Sports. 5. article no 2
  10. Kovalchik, SA (2016). Is there a Pythagorean theorem for winning in tennis?. Journal of Quantitative Analysis in Sports. 12, 43-49. https://doi.org/10.1515/jqas-2015-0057
  11. Vine, AJ (2016). Using Pythagorean expectation to determine luck in the KFC Big Bash league. Economic Papers. 35, 269-281. https://doi.org/10.1111/1759-3441.12139
  12. Tung, DD. (2010). Confidence intervals for the Pythagorean formula in baseball. Available: https://vixra.org/abs/1005.0020
  13. Pavitt, C (2011). An estimate of how hitting, pitching, fielding, and basestealing impact team winning percentages in baseball. Journal of Quantitative Analysis in Sports. 7. article no 4
  14. Chen, J, and Li, T (2016). The shrinkage of the Pythagorean exponents. Journal of Sports Analytics. 2, 37-48. https://doi.org/10.3233/JSA-160017
  15. Palmer, P (2017). Calculating skill and luck in major league baseball. The Baseball Research Journal. 46, 56-61.
  16. Lee, JT (2014). Estimation of exponent value for Pythagorean method in Korean pro-baseball. Journal of the Korean Data and Information Science Society. 25, 493-499. https://doi.org/10.7465/jkdi.2014.25.3.493
  17. Dayaratna, KD, and Miller, SJ. (2012). First order approximations of the Pythagorean won-loss formula for predicting MLB teams’ winning percentages. Available: https://arxiv.org/abs/1205.4750
  18. Lee, WJ, Jhang, HJ, Lee, SH, and Choi, SH (2020). A comparison between Arima, Grey and LSTM for the winning percentage in the MLB. Journal of the Korean Institute Intelligent Systems. 30, 303-308. https://doi.org/10.5391/JKIIS.2020.30.4.303
  19. Lee, WJ, Jhang, HJ, Lee, SH, and Choi, SH (2020). Forecasting winning rates in major league baseball based on fuzzy logic. Journal of the Korean Institute Intelligent Systems. 30, 366-372. https://doi.org/10.5391/JKIIS.2020.30.5.366

Jin Hee Yoon received her B.S., M.S., and Ph.D. degrees in mathematics from Yonsei University, Korea. She is currently a faculty member in the department of mathematics and statistics at Sejong University, Seoul, Korea. Her research interests include fuzzy regression analysis, fuzzy time series, optimization, intelligent systems, and machine learning. She is a board member of the Korean Institute of Intelligent Systems (KIIS) and has been working as an associate editor, guest editor, and editorial board member of several journals, including SCI and SCIE. In addition, she has been working as an organizer and committee member of several international conferences.


Seung Hoe Choi obtained his Ph.D. degree in Mathematical Statistics from Yonsei University, Korea, in 1994. Since 1996, he has been a full professor at Korea Aerospace University. His main research interests are mathematical prediction methods using soft computing and statistical prediction in sports such as soccer, baseball, and basketball.


Article

Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 245-251

Published online September 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.3.245

Copyright © The Korean Institute of Intelligent Systems.

Pythagorean Exponents Induced by Mathematical and Statistical Methods in the Major League Baseball

Jin Hee Yoon1 and Seung Hoe Choi2

1Department of Mathematics and Statistics, Sejong University, Seoul, Korea
2Liberal Arts and Sciences, Korea Aerospace University, Goyang, Korea

Correspondence to:Seung Hoe Choi (shchoi@kau.ac.kr)

Received: July 27, 2021; Revised: September 19, 2022; Accepted: September 20, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This study introduces two kinds of Pythagorean exponent and their applications. And we propose the mathematical and statistical analysis for them, and suggest they can be applied to sports data, which is related to the winning percentage. The first one is Mpe which is obtained from mathematical analysis, and calculated based on a deterministic model. The second one is a Spe obtained from statistical analysis, and it is derived based on probabilistic model. For the application, we compare the two results using the records of teams participating in Major League Baseball (MLB) from 1901 to the present. The value of Mpe calculated by the mathematical method is approximately 2:039 and the value of Spe inferred from the statistical method using the regression model of the winning percentage is 1:854. Furthermore, the value of Mpe of the 16 teams which have participated in MLB since 1901 statistically varies depending on post season advances and actual winning percentage, and the sufficiency of the two results compared using mean absolute percentage error (MAPE) and correlation between the predicted and observed winning percentage.

Keywords: Pythagorean exponent, MLB, Winning percentage, Regression, MAPE

1. Introduction

In a game where a ball is controlled using the human body, such as soccer, basketball, volleyball, American football, or handball, or in a ball game using equipment such as baseball, table tennis, cricket, water polo, or lacrosse, scores and loss are the basis for victory [1,2]. Since the winner of those games is supposed to get more scores than losses, scoring, and losing points affect the winning percentage. Thus, athletic performance is widely studied by predicting and assessing it [3,4].

A creator of Sabermetrics, Bill James [5] suggested Pythagorean expectation which can be compared with the actual winning percentage of 16 teams in Major League Baseball (MLB). He estimated Pythagorean winning rate Wp by dividing the squares of each team’s score by the sum of the squares of the runs scored and squares of runs allowed. The Pythagorean winning rate is equal to

Wp=RsγRsγ+Raγ,

the ratio of runs scored Rs and to the sum of runs scored and runs allowed Ra of the teams participating in the ball game, where Wp means winning percentage and γ means Pythagorean exponent. Although Bill James suggested the Pythagorean exponent was 2, it was known that the exponent was not the same in all sports. Miller [6] showed that the runs scored and runs allowed of the MLB team followed the Weibull distribution, and when the Pythagorean exponent was 1.83, the Pythagorean winning rate was suitable for the actual winning rate. In addition to baseball, the Pythagorean exponent has been studied in various ball games such as soccer, basketball, hockey, tennis, and cricket [711]. The Pythagorean exponent, which is related to the winning percentage of ball sports, may not fit depending on the performance of each team [12]. A winning percentage of a team with good offensive ability might be more dependent on runs scored, while the team with good defensive ability is more affected by runs allowed [13]. In addition, the runs scored and allowed, and the difference of scoring and losing points be related to the Pythagorean exponent independently [14,15]. The exponent can be estimated in two ways: (1) numerical Method for calculating Pythagorean exponent by viewing runs scored and runs allowed as specific values, (2) statistical method which assumes that runs allowed and allowed are two statistical variables to infer the exponent [16,17]. The former performs calculation directly based on the deterministic model and the latter performs inference based on the stochastic model. Meanwhile, it is meaningful to compare the statistical characteristics of the Pythagorean exponent according to each team’s situation. Since the Pythagorean exponent can be an indicator of the characteristics of the teams, comparing mathematical and statistical results is interesting. The results of this study provide a basis for comparing mathematical and statistical results in different ball games.

This study introduces the Pythagorean exponent related to the winning percentage Wp, and noted the comparison between the Pythagorean exponent Mpe calculated using a deterministic model and the other Spe estimated by statistical method using the records of the teams participating in the MLB since 1901. Mpe calculated by mathematical and numerical analysis is 2.039 and Spe inferred from regression analysis is 1.854.

Mpe showed a significant difference depending on the number of the post-season advances and Spe was correlated with the regression coefficients of runs scored and runs allowed. Furthermore, the difference between the actual winning percentage Wand predicted Pythagorean expectation Wp was smaller when the statistical model was used, compared to that of the deterministic model. But there was no difference in the correlation between the actual value and the estimate in two ways.

2. Mathematical and Statistical Methods

2.1 Pythagorean Exponent Using Mathematical Analysis

In order to directly calculate the Pythagorean exponent γ that Bill James suggested, several conditions are required. As none of the teams involved in the MLB has ever recorded 100% of wins or defeats, the winning percentage Wis in the range of 0 to 1. When the number of wins and defeats is identical, Pythagorean expectation Wp becomes 0.5. In addition, none of the teams has ever recorded 0 runs during every season, Rs ≠ 0 can be assumed. Therefore, the required assumptions to calculate the Pythagorean exponent γ are as below.

1)0<W<1,         2)RsRa,         3)Rs0.

Since all of the teams involved in the MLB have recorded a positive number of runs scored every season, their winning percentage, runs scored and runs allowed satisfy

ln (1-WW)=γ ln (RaRs).

From the above equation, we can directly calculate the Pythagorean exponent Mpe using a deterministic model as follows:

Mpe=1kni=1kj=1nMpe(i,j).

2.2 Pythagorean Exponent Using Statistical Analysis

Runs scored and runs allowed are recorded in ball games and these can be considered independent variables. Additionally, a team’s winning percentage can be considered a statistical variable as well [4]. In this sense, a statistical model that runs scored and runs allowed to affect the winning percentage was designed. If runs scored are higher than those allowed in a ball game, then the game can be won. However, this does not mean that higher runs scored directly lead to a higher winning percentage. As low runs allowed is still another key to winning as well, it can be said that runs scored and runs allowed affect winning percentage independently.

From 1901 to 2019, the runs scored and runs allowed by all teams in MLB did not have a statistically linear relationship with one another (p-value > 0.1) [17]. The regression model of winning percentage using runs scored and allowed was

Wij=β1jRsij+β2jRaij+ɛij,i=1,,119,j=1,,16.

Also, the regression model that indicates the effects that the difference of runs scored and allowed is

Wij=12+α(Rsij-Raij)+ηij,i=1,,119,j=1,,16,

where, ɛij and ηij are error terms that follow normal distribution. In addition, a statistical regression model that is able to estimate the Pythagorean exponent that Bill James [5] suggested is

ln (1-WijWij)=γln (RaijRsij)+ɛij,i=1,,119,j=1,,16.

3. Data Analysis

3.1 Dataset

Since Cincinnati Red Stockings, the first professional baseball team was founded in 1869, a number of baseball teams have been established, and 30 teams are currently in the MLB. The modern form of the major league began in 1901 with the birth of the American League (AL), and in 1903 the world series between the two champion teams took place for the first time [18,19]. In 1969, when the mound was lowered to 10 inches, the MLB was divided into western and eastern regions, and since 1994 the western, central, and eastern regions were reorganized. This study was carried out using the records of 16 teams including 15 teams that have been participating in the MLB since 1901 and the New York Yankees which joined in 1903. The whole dataset used in this paper was offered by MLB.com, Fangraphs, and Baseball-reference.

Table 1 shows the regression coefficients estimated by the least squares method according to regression models (3), (4), and (5).

Table 1 also shows the average runs scored and runs allowed by the MLB teams from 1901 to 2019. The Pythagorean exponent can be calculated using the averages of winning percentage, runs scored, and runs allowed. The Pythagorean exponent value (2.039) is calculated using each team’s average runs scored and runs allowed according to Eq. (1). Another method is to calculate the Pythagorean exponent of each team for each year since 1901 and then average these values, which gives

$$Mpe = \frac{1}{kn} \sum_{i=1}^{k} \sum_{j=1}^{n} Mpe(i, j) = 2.061.$$

Here i and j index the team and the year, respectively, with k = 16 and n = 119. The first method gives a lower value (2.039) than the second (2.061).
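The two calculations can be sketched as follows. This is an illustrative sketch under stated assumptions: Eq. (1) is taken to be the standard Pythagorean formula W = Rs^x / (Rs^x + Ra^x), which is not restated in this section, and the function name `pythagorean_exponent` is hypothetical. The season-level records are not reproduced here, so the averaging step uses the per-team Mpe column of Table 1 instead of the full double sum.

```python
import math

def pythagorean_exponent(win_pct, runs_scored, runs_allowed):
    """Solve W = Rs^x / (Rs^x + Ra^x) for x. Undefined when Rs == Ra."""
    return math.log(win_pct / (1 - win_pct)) / math.log(runs_scored / runs_allowed)

# Round-trip check on a hypothetical season generated with exponent 2.
w = 5.0 ** 2 / (5.0 ** 2 + 4.0 ** 2)
x = pythagorean_exponent(w, 5.0, 4.0)

# Mpe column of Table 1, one value per team in table order.
mpe_by_team = [
    2.016, 1.901, 2.085, 1.197, 1.889, 2.239, 1.958, 2.191,
    1.991, 1.865, 2.814, 2.477, 1.923, 1.848, 1.276, 3.297,
]
mean_mpe = sum(mpe_by_team) / len(mpe_by_team)
print(f"recovered exponent = {x:.3f}, mean of team Mpe = {mean_mpe:.3f}")
```

Averaging the 16 rounded team values gives about 2.060, in line with the reported 2.061 up to rounding of the table entries.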

3.2 Statistical Analysis of Pythagorean Exponent

The variables that explain the winning percentage may be statistically related to each other. First, we examined the linear relationship between the Pythagorean exponents calculated from each team’s runs scored and runs allowed. The correlation between the Pythagorean exponent calculated using the deterministic model and the one estimated using the stochastic model, {(Mpe(j), Spe(j)) : j = 1, ⋯, 16}, was not statistically significant (p = 0.32). In addition, the correlation coefficient between the statistically estimated Pythagorean exponent and α̂, the regression coefficient of the difference between runs scored and runs allowed for the winning percentage, was 0.748 (p < 0.05). The correlation coefficient between the numerically calculated Pythagorean exponent and the regression coefficient of runs allowed for the winning percentage was −0.540, indicating a negative correlation (p < 0.05).

It is also informative to compare the regression coefficients of runs scored and runs allowed in regression model (2) with the Pythagorean exponent, in order to see how the exponent differs between teams whose winning percentage is influenced more by runs scored and teams influenced more by runs allowed. In Table 1, the mean Pythagorean exponent (2.335) of the teams that were more affected by runs scored (Boston, Cleveland, Dodgers, Minnesota, New York Yankees, and St. Louis) was greater than the mean Pythagorean exponent (1.896) of the teams that were more affected by runs allowed (p < 0.05).
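The two group values quoted above can be checked directly against Table 1. The sketch below assumes that "more affected by runs scored" means β1 > β2 and that the quoted values are group means of the Mpe column; both are assumptions about the grouping rule rather than something the text states. With the Table 1 values, this rule selects six teams, including the New York Yankees.

```python
# (team, beta1, beta2, Mpe) copied from Table 1.
TABLE_1 = [
    ("Atlanta Braves", 0.664, 0.742, 2.016),
    ("Baltimore Orioles", 0.748, 0.948, 1.901),
    ("Boston Red Sox", 0.82, 0.798, 2.085),
    ("Chicago Cubs", 0.65, 0.845, 1.197),
    ("Chicago White Sox", 0.847, 1.043, 1.889),
    ("Cincinnati Reds", 0.741, 0.788, 2.239),
    ("Cleveland Indians", 0.964, 0.916, 1.958),
    ("Detroit Tigers", 0.787, 0.909, 2.191),
    ("LA Dodgers", 0.968, 0.8, 1.991),
    ("Minnesota Twins", 0.891, 0.801, 1.865),
    ("New York Yankees", 1.047, 0.653, 2.814),
    ("Oakland Athletics", 0.715, 0.758, 2.477),
    ("Philadelphia Phillies", 0.709, 0.948, 1.923),
    ("Pittsburgh Pirates", 0.735, 0.791, 1.848),
    ("San Francisco Giants", 0.765, 0.793, 1.276),
    ("St. Louis Cardinals", 0.855, 0.743, 3.297),
]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Assumed grouping rule: compare the two regression coefficients.
scoring = [row for row in TABLE_1 if row[1] > row[2]]
allowing = [row for row in TABLE_1 if row[1] <= row[2]]

mean_scoring = mean([row[3] for row in scoring])
mean_allowing = mean([row[3] for row in allowing])

print([row[0] for row in scoring])
print(f"{mean_scoring:.3f} vs {mean_allowing:.3f}")
```

Under this rule the group means round to 2.335 and 1.896, matching the values in the text.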

Neither the calculated Pythagorean exponent Mpe nor the estimated Pythagorean exponent Spe differed significantly between teams with a high winning percentage (over 0.5) and teams with a low winning percentage (less than 0.5) (p > 0.1). The estimated Pythagorean exponents were also compared across teams with a high number of post-season advances (more than 28 times), a low number of advances (less than 14 times), and a normal number of advances (less than 28 times and more than 15 times). There was no statistically significant difference in the Pythagorean exponent Spe (p > 0.1). However, the median test showed that the numerically calculated Pythagorean exponent Mpe differs statistically depending on the number of post-season advances (p = 0.051). The main reason appears to be the influence of the New York Yankees, who have more post-season advances than any other team. Because the Yankees’ post-season count in Table 1 is far higher than that of the other teams, the median test, a non-parametric method, was used in this study.

This shows that the winning percentage of each team can be predicted from its runs scored, its runs allowed, and the estimated Pythagorean exponent, because the statistical Pythagorean exponent of each team is not affected by its winning percentage or its number of post-season advances. In contrast, the Pythagorean exponent calculated by numerical analysis differs statistically depending on the team’s number of post-season advances. Therefore, when a directly calculated Pythagorean exponent is used, it may be better to predict winning percentages separately for groups of teams with different numbers of post-season advances.

3.3 Efficiency of Predicted Values

Since Bill James [5] suggested the relationship between the Pythagorean exponent and the expected winning percentage, it is necessary to compare the winning percentages predicted for each team using the Mpe calculated from the deterministic model and the Spe estimated from the stochastic model. If the difference between the winning percentage estimated with the Pythagorean exponent and the actual winning percentage is small, then runs scored, runs allowed, and the Pythagorean exponent can be used to predict the team’s winning percentage.
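As a rough illustration of this comparison, the sketch below applies the two league-wide exponents to one row of Table 1. It assumes that Eq. (1) is the standard Pythagorean formula, and it plugs a team's long-run average runs into that formula, which is a simplification of the season-by-season evaluation reported in Table 2. The function name `pythagorean_expectation` is hypothetical.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent):
    """Predicted winning percentage: Rs^x / (Rs^x + Ra^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

MPE = 2.039  # exponent from the mathematical method
SPE = 1.854  # exponent from the statistical method

# New York Yankees row of Table 1: average runs scored, runs allowed, winning percentage.
rs, ra, actual = 4.864, 4.173, 0.570

pred_m = pythagorean_expectation(rs, ra, MPE)
pred_s = pythagorean_expectation(rs, ra, SPE)
err_m = abs(pred_m - actual)
err_s = abs(pred_s - actual)

print(f"Mpe prediction = {pred_m:.3f}, Spe prediction = {pred_s:.3f}, actual = {actual:.3f}")
```

For this single row, the prediction with Spe lies closer to the recorded winning percentage than the prediction with Mpe; one row of averages is of course not a substitute for the full comparison.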

The efficiency of a prediction can be assessed with two metrics: (1) the error between the predicted value ŷt and the observed value yt, and (2) the correlation coefficient between the predicted value ŷt and the observed value yt. The mean absolute percentage error (MAPE), which measures the error of the predicted values, is

$$Mape = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{\hat{y}_t - y_t}{y_t} \right| \times 100.$$

The metric based on the linear relationship between the predicted value ŷt and the observed value yt is

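$$Corr = \frac{\sum_{t=1}^{n} \left(\hat{y}_t - \bar{\hat{y}}\right)\left(y_t - \bar{y}\right)}{\sqrt{\sum_{t=1}^{n} \left(\hat{y}_t - \bar{\hat{y}}\right)^2 \sum_{t=1}^{n} \left(y_t - \bar{y}\right)^2}},$$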

where $\bar{\hat{y}}$ and $\bar{y}$ are the averages of the predicted values ŷt and the observed values yt, respectively. The closer Mape is to 0 and the closer Corr is to 1, the more efficient the estimated results are considered to be. In other words, an efficient estimate has a small error and a distribution similar to that of the observed values.
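Both metrics are straightforward to compute. The following minimal sketch uses only the Python standard library; the function names and the three pairs of winning percentages are hypothetical and serve only to exercise the formulas.

```python
import math

def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    n = len(observed)
    return 100.0 / n * sum(abs((p - o) / o) for p, o in zip(predicted, observed))

def corr(predicted, observed):
    """Pearson correlation coefficient between predicted and observed values."""
    n = len(observed)
    p_bar = sum(predicted) / n
    o_bar = sum(observed) / n
    cov = sum((p - p_bar) * (o - o_bar) for p, o in zip(predicted, observed))
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    ss_o = sum((o - o_bar) ** 2 for o in observed)
    return cov / math.sqrt(ss_p * ss_o)

# Hypothetical winning percentages (illustrative only, not from the paper).
observed = [0.50, 0.40, 0.60]
predicted = [0.55, 0.36, 0.66]

m = mape(predicted, observed)
c = corr(predicted, observed)
print(f"Mape = {m:.3f}, Corr = {c:.3f}")
```

In this toy example every prediction is off by 10% of the observed value, so Mape is 10, while Corr stays close to 1 because the predictions still track the ordering and spread of the observations. This is why the two metrics are reported side by side.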

Table 2 shows the efficiency of the predicted values based on the numerical method and the statistical method. For Mape, which indicates the error between the predicted and observed values, the statistical method gave 4.259, which is more efficient than the 5.234 given by the numerical method. Corr, which shows the linearity between the predicted and observed values, did not differ between the two methods.

The difference in Mape between the values predicted by the deterministic method and those predicted by the probabilistic method can be seen as reflecting the influence of the New York Yankees and St. Louis, which recorded numerous post-season advances.

4. Conclusion

This study investigated the relationship between the actual winning percentage and the Pythagorean exponent proposed by baseball statistician Bill James [5]. Using the runs scored and runs allowed of the 16 teams participating in the MLB from 1901 to 2019, the Pythagorean exponent was determined with both a deterministic model and a stochastic model. As a result, the value Mpe calculated by the numerical analysis method was approximately 2.039, and the value Spe estimated by the statistical method using the regression model was 1.854. In addition, the statistical relationships among the Pythagorean exponent, runs scored, runs allowed, and their difference were examined. Statistical analysis showed that the numerically calculated Pythagorean exponent differed depending on the number of post-season advances, and that the regression coefficients of runs scored and runs allowed for the winning percentage were related to the Pythagorean exponent. Furthermore, a comparison of the actual winning percentage with the predicted Pythagorean expectation showed that the value predicted by the statistical method had a smaller error than the value calculated using the deterministic model.

Since this paper demonstrated that the statistically estimated Pythagorean expectation can predict winning percentage, it may be meaningful in future studies to compare the results of other prediction methods with those of the Pythagorean exponent.

Table 1. Regression coefficients and Pythagorean exponent of each team.

| Team | Average winning percentage | Average runs scored | Average runs allowed | α | β1 | β2 | γ (statistical Pythagorean exponent) | Mpe (numerical Pythagorean exponent) | Post-season advances |
|---|---|---|---|---|---|---|---|---|---|
| Atlanta Braves | 0.486 | 4.180 | 4.310 | 0.964 | 0.664 | 0.742 | 1.844 | 2.016 | 25 |
| Baltimore Orioles | 0.474 | 4.309 | 4.594 | 0.957 | 0.748 | 0.948 | 1.898 | 1.901 | 14 |
| Boston Red Sox | 0.519 | 4.664 | 4.472 | 0.956 | 0.82 | 0.798 | 1.903 | 2.085 | 24 |
| Chicago Cubs | 0.505 | 4.351 | 4.292 | 0.941 | 0.65 | 0.845 | 1.808 | 1.197 | 20 |
| Chicago White Sox | 0.502 | 4.358 | 4.340 | 0.94 | 0.847 | 1.043 | 1.746 | 1.889 | 9 |
| Cincinnati Reds | 0.500 | 4.304 | 4.326 | 0.911 | 0.741 | 0.788 | 1.859 | 2.239 | 15 |
| Cleveland Indians | 0.513 | 4.536 | 4.417 | 0.93 | 0.964 | 0.916 | 1.788 | 1.958 | 14 |
| Detroit Tigers | 0.504 | 4.613 | 4.583 | 0.942 | 0.787 | 0.909 | 1.837 | 2.191 | 16 |
| LA Dodgers | 0.526 | 4.307 | 4.076 | 0.943 | 0.968 | 0.8 | 1.823 | 1.991 | 33 |
| Minnesota Twins | 0.480 | 4.389 | 4.578 | 0.951 | 0.891 | 0.801 | 1.882 | 1.865 | 16 |
| New York Yankees | 0.570 | 4.864 | 4.173 | 0.964 | 1.047 | 0.653 | 1.883 | 2.814 | 55 |
| Oakland Athletics | 0.488 | 4.433 | 4.568 | 0.965 | 0.715 | 0.758 | 1.915 | 2.477 | 28 |
| Philadelphia Phillies | 0.465 | 4.226 | 4.585 | 0.964 | 0.709 | 0.948 | 1.882 | 1.923 | 14 |
| Pittsburgh Pirates | 0.508 | 4.362 | 4.274 | 0.941 | 0.735 | 0.791 | 1.875 | 1.848 | 17 |
| San Francisco Giants | 0.534 | 4.469 | 4.149 | 0.939 | 0.765 | 0.793 | 1.792 | 1.276 | 26 |
| St. Louis Cardinals | 0.521 | 4.442 | 4.249 | 0.94 | 0.855 | 0.743 | 1.818 | 3.297 | 29 |

Table 2. Efficiencies of the predicted values.

| Team | Mape: Ms | Mape: Mm | Corr: Cs | Corr: Cm |
|---|---|---|---|---|
| Atlanta Braves | 4.548 | 4.766 | 0.959 | 0.959 |
| Baltimore Orioles | 4.864 | 4.864 | 0.951 | 0.951 |
| Boston Red Sox | 3.606 | 3.795 | 0.961 | 0.961 |
| Chicago Cubs | 4.465 | 6.056 | 0.946 | 0.945 |
| Chicago White Sox | 4.062 | 4.190 | 0.941 | 0.941 |
| Cincinnati Reds | 4.503 | 5.024 | 0.923 | 0.923 |
| Cleveland Indians | 4.296 | 4.269 | 0.928 | 0.928 |
| Detroit Tigers | 4.166 | 4.716 | 0.943 | 0.942 |
| LA Dodgers | 3.879 | 3.933 | 0.941 | 0.941 |
| Minnesota Twins | 4.036 | 4.042 | 0.955 | 0.955 |
| New York Yankees | 3.717 | 7.265 | 0.943 | 0.945 |
| Oakland Athletics | 4.414 | 6.669 | 0.970 | 0.970 |
| Philadelphia Phillies | 4.878 | 4.885 | 0.952 | 0.952 |
| Pittsburgh Pirates | 4.390 | 4.400 | 0.948 | 0.948 |
| San Francisco Giants | 4.072 | 5.118 | 0.933 | 0.933 |
| St. Louis Cardinals | 4.082 | 9.785 | 0.945 | 0.946 |
| MLB League | 4.280 | 5.236 | 0.946 | 0.946 |

*Ms and Cs denote the efficiency of the statistical method, and Mm and Cm denote the efficiency of the mathematical calculation.


References

  1. Wikipedia. (2022). Ball game. Available: https://en.wikipedia.org/wiki/Ball_game
  2. Wikipedia. (2022). List of racket. Available: https://en.wikipedia.org/wiki/List_of_racket
  3. Liu, H, Gomez, MA, Lago-Penas, C, and Sampaio, J (2015). Match statistics related to winning in the group stage of 2014 Brazil FIFA World Cup. Journal of Sports Sciences. 33, 1205-1213. https://doi.org/10.1080/02640414.2015.1022578
  4. Loughin, TM, and Bargen, JL (2008). Assessing pitcher and catcher influences on base stealing in Major League Baseball. Journal of Sports Sciences. 26, 15-20. https://doi.org/10.1080/02640410701287255
  5. James, B (1983). The Bill James Baseball Abstract. New York, NY: Ballantine Books.
  6. Miller, SJ (2007). A derivation of the Pythagorean won-loss formula in baseball. Chance. 20, 40-48. https://doi.org/10.1080/09332480.2007.10722831
  7. Hamilton, HH (2011). An extension of the Pythagorean expectation for association football. Journal of Quantitative Analysis in Sports. 7, article no. 2.
  8. Kubatko, J, Oliver, D, Pelton, K, and Rosenbaum, DT (2007). A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports. 3, article no. 3.
  9. Cochran, JJ, and Blackstock, R (2009). Pythagoras and the national hockey league. Journal of Quantitative Analysis in Sports. 5, article no. 2.
  10. Kovalchik, SA (2016). Is there a Pythagorean theorem for winning in tennis? Journal of Quantitative Analysis in Sports. 12, 43-49. https://doi.org/10.1515/jqas-2015-0057
  11. Vine, AJ (2016). Using Pythagorean expectation to determine luck in the KFC Big Bash league. Economic Papers. 35, 269-281. https://doi.org/10.1111/1759-3441.12139
  12. Tung, DD (2010). Confidence intervals for the Pythagorean formula in baseball. Available: https://vixra.org/abs/1005.0020
  13. Pavitt, C (2011). An estimate of how hitting, pitching, fielding, and basestealing impact team winning percentages in baseball. Journal of Quantitative Analysis in Sports. 7, article no. 4.
  14. Chen, J, and Li, T (2016). The shrinkage of the Pythagorean exponents. Journal of Sports Analytics. 2, 37-48. https://doi.org/10.3233/JSA-160017
  15. Palmer, P (2017). Calculating skill and luck in major league baseball. The Baseball Research Journal. 46, 56-61.
  16. Lee, JT (2014). Estimation of exponent value for Pythagorean method in Korean pro-baseball. Journal of the Korean Data and Information Science Society. 25, 493-499. https://doi.org/10.7465/jkdi.2014.25.3.493
  17. Dayaratna, KD, and Miller, SJ (2012). First order approximations of the Pythagorean won-loss formula for predicting MLB teams’ winning percentages. Available: https://arxiv.org/abs/1205.4750
  18. Lee, WJ, Jhang, HJ, Lee, SH, and Choi, SH (2020). A comparison between Arima, Grey and LSTM for the winning percentage in the MLB. Journal of the Korean Institute of Intelligent Systems. 30, 303-308. https://doi.org/10.5391/JKIIS.2020.30.4.303
  19. Lee, WJ, Jhang, HJ, Lee, SH, and Choi, SH (2020). Forecasting winning rates in major league baseball based on fuzzy logic. Journal of the Korean Institute of Intelligent Systems. 30, 366-372. https://doi.org/10.5391/JKIIS.2020.30.5.366
