Dick Pfander, the collector of NBA box scores whom I wrote about this week, was integral to getting decades of statistics from the league's early years into Sports-Reference.com's database. So, too, was a man with little interest in the NBA but a knack for typing — really, really fast typing.
After Pfander sold his trove of data to Sports Reference, the company had to figure out who would transcribe the box-score images it had acquired into its database. Who better than Sean Wrona, a champion speed typist and sports-data enthusiast who maintains his own auto-racing statistical site?
To find out what it takes to enter historical sports data, and what you can learn by doing so, I interviewed Wrona by email. Wrona's experience is also a reminder that the data we use comes from somewhere, and it's often imperfect. Here's an edited transcript of his responses:
Running my auto-racing statistical archive, race-database.com, gave me abundant archival experience that I could apply to this. I was used to dealing with erroneous data sets and trying to work out the most logical corrections needed to make the data as accurate as possible.
I definitely picked up speed as I went along, too, both with my own archival work and my Sports Reference work. I entered 32 then-NASCAR Grand National (now NASCAR Sprint Cup) races in a single day once.
While my typing speed gives me a big advantage, this has minimal similarity to competitive typing. In competitive typing, you don't have to check or verify anything. On some sites you will occasionally find quotes with incorrect spelling or grammar, so you have to adjust your correct instincts as you go, but that's very rare. At my typing speed, competitive typing is a thoughtless pursuit.
The priority with archival work is to go as quickly as you can while minimizing errors.
Auto racing and college basketball are the only sports I've seriously followed, but I know enough about the core statistics of most sports that I can understand a box score well enough to archive it properly.
I entered the complete box-score results from the 1979-80 to 1984-85 NBA seasons in reverse chronological order as a series of thousands of SQL queries from February 2012 to March 2013.
It took about eight minutes to enter a box score and two minutes to enter the team-level data for the later seasons that had more information. But for the earlier seasons, which had a lot less information available, I could do it as quickly as four minutes per box score and one minute per team-level entry.
Strangely, the earlier box scores seemed to be more legible than the later ones from the early '80s. The later ones tended to have tons of smudges that obscured the data, while the earlier ones read much more clearly.
Primarily I checked to make sure that individual players' field goals made/attempted, free throws made/attempted and point totals added up to the overall team totals. Usually I subtracted each individual's total from the overall team total, and if the result differed from zero, I checked again, usually double-checking by adding all the totals instead of subtracting, to make sure I hadn't made an arithmetic error.
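The check Wrona describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not his actual workflow (he entered the data as SQL queries): the column names, the function names and the sample figures are all invented for the example.

```python
from typing import Dict, List

# Columns cross-checked against the team line. The short names are
# illustrative, not taken from any real database schema.
COLUMNS = ("fgm", "fga", "ftm", "fta", "pts")


def residuals(players: List[Dict[str, int]], team: Dict[str, int]) -> Dict[str, int]:
    """Subtract every player's figure from the team total, column by column.

    A residual of zero means the column balances.
    """
    remaining = {col: team[col] for col in COLUMNS}
    for line in players:
        for col in COLUMNS:
            remaining[col] -= line[col]
    return remaining


def mismatched_columns(players: List[Dict[str, int]], team: Dict[str, int]) -> List[str]:
    """Return the columns that fail to balance.

    A column flagged by subtraction is re-checked by addition, to rule
    out an arithmetic slip in the first pass.
    """
    flagged = [col for col, left in residuals(players, team).items() if left != 0]
    return [col for col in flagged if sum(p[col] for p in players) != team[col]]


# Invented figures for illustration only; not from any real game.
players = [
    {"fgm": 5, "fga": 10, "ftm": 2, "fta": 3, "pts": 12},
    {"fgm": 4, "fga": 9, "ftm": 1, "fta": 2, "pts": 9},
]
team = {"fgm": 9, "fga": 19, "ftm": 3, "fta": 5, "pts": 21}
```

Calling `mismatched_columns(players, team)` on a balanced box score returns an empty list. Any column name it does return points to either a typo in the entry or an error in the source box score, which still has to be resolved by eye.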
The box scores themselves had far more errors than I made (on average, maybe twice a week there was a box score where the data didn't properly add up), but I never tabulated an error rate; it might have taken me nearly as long as entering the games did! This is understandable, because record-keeping was a lot more difficult in the '70s before personal computers became ubiquitous. I'm also very good at feeling when I've made a typo and can correct it on the fly.
I certainly noticed games that went into overtime, mainly because they kept my game-level data from displaying in a well-aligned fashion. I did notice some noteworthy stat lines where players had very high scores, rebound totals, etc.
I came to have a deep respect for Bernard King, a player I had actually never heard of when starting this project. His Basketball Hall of Fame induction was clearly overdue, because he was dominant for a fairly long period despite not being a household name on the level of the Dream Team members. His peak years happened to perfectly coincide with the years I archived, but still …
You'll definitely learn some stuff when you archive, especially if you know very little. I certainly have learned a lot more about auto racing through my own site. However, I think with my applied statistics work at Cornell and my extensive archival experience, it isn't necessary to have a deep knowledge to be a good archivist.