Have you ever felt like your chess rating doesn't represent your actual playing strength? Sometimes we want to be able to estimate playing strength based on individual games rather than rating (which changes more slowly).
During the past few months, I've been taking a number of online courses and learning Python for data analysis. In one of the courses, the final project allowed me to choose my own dataset. So surprise surprise! I chose something chess-related. (Not really surprised, are you?)
When we play games online, getting a computer evaluation is just a few clicks away. And a commonly used statistic is the average centipawn loss, or simply the average deviation from the computer's best move. Many of us tend to think that centipawn loss (CPL) is a good estimate of playing strength. And, of course, it gives some indication, but it's far from a perfect predictor.
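To make the definition concrete, here is a minimal Python sketch of one way to approximate average CPL. It assumes you already have the engine evaluation after every half-move, in centipawns from White's point of view. The function name `average_cpl`, the input format and the starting evaluation of 0 are illustrative assumptions, not the exact method used by lichess or in this project.

```python
def average_cpl(evals_cp, start_eval_cp=0):
    """Approximate average centipawn loss for each side.

    evals_cp[i] is the engine evaluation (centipawns, White's view)
    after half-move i; half-move 0 is White's first move.
    """
    losses = {"white": [], "black": []}
    before = start_eval_cp
    for ply, after in enumerate(evals_cp):
        if ply % 2 == 0:
            # White just moved: a drop in the evaluation is White's loss.
            losses["white"].append(max(0, before - after))
        else:
            # Black just moved: a rise in the evaluation is Black's loss.
            losses["black"].append(max(0, after - before))
        before = after
    # Average the per-move losses; a side with no moves gets 0.0.
    return {
        side: (sum(values) / len(values) if values else 0.0)
        for side, values in losses.items()
    }
```

A move that improves the evaluation for the player who made it is counted as zero loss here, rather than as a negative loss.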
Fellow chess/statistics blogger Patrick Coulombe has investigated the correlation between rating and CPL and concluded that the correlation is not very strong. I therefore concluded that other factors need to be taken into consideration when trying to estimate playing strength.
My initial plan was to download all the games played during April from the lichess database, but when I realized that the file was about 160 GB, I changed my mind. I chose a smaller dataset, and built my analysis on about 5000 games from the lichess yearly classical arena, played two weeks ago. The advantage of choosing this as a dataset is that all the games have the same time control. Sure, using millions of games would have been fun, but that amount of data would be impractical for a normal laptop computer.
A simple plot of rating vs CPL produces a result similar to what Patrick found in his analysis. However, the large number of data points makes a normal scatterplot difficult to read, so I chose a different kind of plot.
In this plot, the shading indicates the "concentration" of data points. A darker color means more games. The plot has a blob-like shape, which suggests that the correlation between the variables is not very strong. But there is a clear orientation to the plot, and the green line indicates the main relationship between rating and CPL. I was of course tempted to use the slope of this line to try to predict playing strength, and at the end of this post, you can see how that turned out.
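The idea behind the green line can be sketched with an ordinary least-squares fit. This is only an illustration: the numbers below are made up, not taken from the arena games, and it assumes the line is fitted with rating as a function of CPL. The helper names `fit_line` and `predict_rating` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables need some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def predict_rating(cpl, slope, intercept):
    """Read a rating estimate off the fitted line for a given CPL."""
    return intercept + slope * cpl


# Made-up toy data: average CPL in a game and the player's rating.
toy_cpl = [95, 70, 80, 55, 60, 35]
toy_rating = [1200, 1400, 1600, 1800, 2000, 2200]

toy_slope, toy_intercept, toy_r = fit_line(toy_cpl, toy_rating)
print(f"slope={toy_slope:.1f}, intercept={toy_intercept:.0f}, r={toy_r:.2f}")
```

With real data the points scatter far more widely around the line than in this toy example, which is exactly why a prediction based on the line alone is shaky.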
Another attempt at understanding the data is to add the opponent's rating to the analysis. In the diagram below, the players' ratings are given on each axis, and the average CPL is indicated by colors. Just a reminder, an average CPL of 300 means that a player, on average, blunders the equivalent of a piece on every move.
As the diagram shows, the red end of the color spectrum is concentrated around the lower rating levels, and the darker shades of blue are mostly found at the higher rating levels. However, there are many red spots scattered around the entire plot, which shows that even strong players can make horrible blunders.
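A diagram like this boils down to grouping games into rating cells and averaging CPL within each cell. The sketch below shows that grouping step only, under the assumption that each game is available as a (player rating, opponent rating, CPL) tuple. The function name `mean_cpl_grid` and the bin width of 200 points are illustrative choices, not the settings used for the diagram.

```python
from collections import defaultdict


def mean_cpl_grid(games, bin_width=200):
    """Average CPL per (player rating bin, opponent rating bin) cell.

    games is an iterable of (player_rating, opponent_rating, cpl) tuples.
    Each bin is labelled by its lower edge, e.g. 1390 falls in bin 1200.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of CPL, count]
    for player, opponent, cpl in games:
        cell = (
            player // bin_width * bin_width,
            opponent // bin_width * bin_width,
        )
        totals[cell][0] += cpl
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}
```

Each value in the returned dictionary corresponds to the color of one cell in a plot of this kind.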
Another statistic that could be a predictor is the blunder rate. In this case, I have defined a blunder as a move that gives a CPL of 150 (1.5 pawns) or more. I have counted the number of blunders and the number of moves, and the blunder rate is simply the average number of blunders per move.
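Counting blunders this way takes only a few lines of Python. The sketch below assumes you have a list of per-move centipawn losses for one player in one game. The threshold of 150 comes from the definition above, while the names `BLUNDER_THRESHOLD_CP` and `blunder_stats` are made up for the example.

```python
BLUNDER_THRESHOLD_CP = 150  # 1.5 pawns, as defined in the text


def blunder_stats(move_losses_cp):
    """Return (nmoves, nblunders, blunder_rate) for one player's moves."""
    nmoves = len(move_losses_cp)
    nblunders = sum(1 for loss in move_losses_cp if loss >= BLUNDER_THRESHOLD_CP)
    # Guard against an empty game to avoid dividing by zero.
    rate = nblunders / nmoves if nmoves else 0.0
    return nmoves, nblunders, rate
```

A blunder rate of 0.5 from this function would mean that every other move loses 1.5 pawns or more.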
As you can see from the plot, the scale goes up to 0.5, which means that every other move is a blunder. Here, we see a slightly different picture. Strong players are almost exclusively in the blue zone, which indicates blunder rates of 10-20%. Players below 1500 are mostly in the yellow and red parts.
This reminds me of a quote from Garry Kasparov:
Masters blunder three times per game, amateurs blunder three times per move
In the final part of my project, I did a multiple regression analysis to see how well the playing strength can be predicted with more variables. I won't go into details here, but the final formula is as follows:
Rating = 1655 - 0.20*CPL - 0.45*RatingDiff + 8.55*nmoves - 22*nblunders
RatingDiff is the difference in rating between the players, nmoves is the number of moves, and nblunders is the number of blunders. This means that 1655 is a baseline, and for each move that is played, your estimated strength increases by roughly 8.5 points, and for each blunder it drops by 22 points.
I tested this model on a number of my own games, and found that it is fairly good (from a statistical point of view).
The diagram below shows how the rating varies in my own games (observed), in the regression estimate and in the prediction based on the green line in the first diagram (see above). The boxes indicate where the majority of games are located.
We can see that the regression estimate gives a somewhat higher result compared to my actual ratings, but approximately the same variation. However, the estimates that are based on CPL alone give quite extreme values, which suggests that CPL on its own has very poor accuracy.
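The regression formula quoted earlier can be written as a small function, which makes it easy to try on your own games. This is a direct transcription of the coefficients in the formula, with a hypothetical function name. Note that the post does not say which way the rating difference is taken, so `rating_diff` is passed through as given and its sign convention is left open.

```python
def estimate_rating(cpl, rating_diff, nmoves, nblunders):
    """Rating estimate for one game, using the coefficients from the post.

    cpl         -- the player's average centipawn loss in the game
    rating_diff -- rating difference between the players (sign convention
                   not specified in the post, so treat it with care)
    nmoves      -- number of moves in the game
    nblunders   -- number of moves with a loss of 150 centipawns or more
    """
    return (
        1655
        - 0.20 * cpl
        - 0.45 * rating_diff
        + 8.55 * nmoves
        - 22 * nblunders
    )


# Hypothetical game: average CPL of 50, equal ratings, 40 moves, 2 blunders.
print(round(estimate_rating(cpl=50, rating_diff=0, nmoves=40, nblunders=2)))
```

For that hypothetical game the terms add up to 1655 - 10 - 0 + 342 - 44, which is 1943.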
So the model has an acceptable accuracy, but there is a downside: The unexplained variation is so large that the estimate from one game has an uncertainty of +/- 400 rating points. This makes the estimate quite useless for individual games. A larger sample will improve the precision, but in order to reduce the uncertainty to +/- 50 rating points, you need about 40 games. From a statistical point of view, this is not problematic, but from a practical point of view, this would be rather pointless. Over 40 games, your rating would adjust properly, and you'll have a good estimate of playing strength right there.
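As a rough cross-check of how the uncertainty shrinks with more games, here is a back-of-the-envelope sketch. It assumes the errors of the individual games are independent, so that the uncertainty of an average over n games falls as 1/sqrt(n). The function names are made up for the example, and this naive rule is not necessarily the calculation behind the figures above.

```python
import math


def uncertainty_after(single_game_uncertainty, n_games):
    """Uncertainty of the average estimate over n independent games."""
    return single_game_uncertainty / math.sqrt(n_games)


def games_needed(single_game_uncertainty, target_uncertainty):
    """Smallest number of games that brings the uncertainty down to the target."""
    return math.ceil((single_game_uncertainty / target_uncertainty) ** 2)


print(round(uncertainty_after(400, 40)))  # about 63 rating points
print(games_needed(400, 50))              # 64 games
```

Under this simple rule, 40 games bring +/- 400 down to roughly +/- 63, and +/- 50 takes 64 games. That is the same order of magnitude as the figure of about 40 games quoted above, which may rest on a different error estimate. Either way, the practical conclusion is the same.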
So to round off this long and complicated post, I have come to the conclusion that estimating playing strength from game statistics is possible, but not very useful.