Writing a chess app (2/9): Parsing pgn files
As learned in part 1 of this series a chess game can be represented via the Portable Game Notation. It is the de-facto standard to represent a chess game. So when building a chess application we need to deal with this file format and find a performant way to extract information out of it. One would think that there exists a ton of high-quality open-source packages to deal with this topic. Unfortunately I could find only a handful of really robust, performant solutions and none of them are built in Swift. I probably could have taken a Rust implementation or something similiar, compile a static library and write some shims around the API but this time I felt like it would be a fun challenge to come up with my own solution.
The problem with a lot of the existing implementations is that they either make extensive use of regex which hurts the performance quite a bit or they do not implement the full PGN specification. For example I only found a few implementations which actually dealt with comments in PGN files. Comments however are a pretty big part of my repertoires since they help me not only memorizing moves but also to understand them better. I briefly tried some parser generators based on some PGN grammar I found online but at least the Swift generators were too slow for my needs which is why I ultimately decided to built a custom PGN parser from the ground up.
PGN Specification
We already looked at PGN files on a high-level in part 1 of this blog post series. Now lets examine it a bit closer.
A PGN file typically consists of two main parts:
-
Tags: These are metadata that provide information about the game.
-
Moves: The actual moves made during the game.
Tags
Tags are enclosed in square brackets and provide various details about the game.
Example:
-
[Event "Event Name"]
: Name of the tournament or match. -
[Site "Location"]
: Where the game took place. -
[Date "YYYY.MM.DD"]
: Date of the game. -
[Round "Round Number"]
: Round in the tournament. -
[White "Player Name"]
: Name of the player with the white pieces. -
[Black "Player Name"]
: Name of the player with the black pieces. -
[Result "Result"]
: Outcome of the game (e.g., "1-0", "0-1", "1/2-1/2").
There is a list of common tag names but potentially you can add any tag you like. The seven tags above are standard for a valid PGN file and are called the Seven Tag Roaster.
Moves
The moves of the game follow the tags and are recorded in standard algebraic notation. Each move is separated by a space, and turns are indicated by numbering:
-
Move Number: Each full turn (White and Black) is numbered.
-
Move Notation: Moves are represented using standard algebraic notation
Example:
Results
At the end of the moves, the result is indicated using the Result tag. The possible results are:
-
"1-0"
: White wins -
"0-1"
: Black wins -
"1/2-1/2"
: Draw -
"*"
: Unknown/Ongoing
Comments
Comments are inserted by either a ;
(a comment that continues to the end of the line) or a {
(which continues until a }
). Comments do not nest.
Example:
Variations
A variation is a sequence of movetext containing one or more alternative moves enclosed in parentheses. The alternate move sequence given by a variation is one that may be legally played by first unplaying the move that appears immediately prior to the parantheses. Because a variation is a recursive construct, it may be nested.
Example:
Multiple Games
A PGN file can contain multiple games, separated by an empty line.
Encoding
PGN files are typically encoded in UTF-8, which allows for the inclusion of special characters.
Buffered reading
PGN files can get quite large, or at least large enough to consider parser performance and the overall memory footprint when dealing with these files. This lead me to a rabbit hole researching various buffered reading solutions available for the iOS and macOS platforms. This is one area where I found various options which probably all will work for the given task. The most promising solution I found is to use BufferedReader
from swift-nio API Documentation. However at the end I did not want to include such a big dependency like swift-nio in my package. Therefore I opted for implementing a simple RingBuffer data structure with the help of TPCircularBuffer in combination with FileHandle.read(upToCount:)
API Documentation to solve the task at hand. In the following I'll present you a trimmed down version of my solution:
With this in place we can assure that the reader itself does not allocate (besides a single fixed-size buffer) which keeps the memory footprint of our solution to a minimum.
Visitor pattern
The Visitor Pattern is a design pattern that allows you to separate an algorithm from the object structure on which it operates. This is particularly useful in the context of a parser implementation, where you may want to perform various operations on a complex data structure without modifying the structure itself.
I chose this approach due to the following key points which I think are pretty valuable in terms of performance, flexibility and ease of use:
-
The visitor can decide if and how to represent games in memory.
-
The reader does not validate move legality. This allows implementing support for custom chess variants, or delaying move validation.
For reference here is the protocol I built to represent a visitor for the PGN parser implementation:
Implementing a parser
So now that we have a good understanding about the PGN file format, can read PGN files quite efficiently in terms of memory consumption and have a visitor interface in place to transform the parsed data all that is left is actually implementing the parsing logic to tell a concrete visitor instance about the parsed information.
Benchmarks
One burning question remains unanswered though. How fast is it? To answer this question I ran some benchmarks on my implementation.
The benchmark was run with a PGN file I downloaded from the Lichess database (~ 1,000,000 games, ~ 1 GB uncompressed) on an Apple M1 MacBook Pro with 32 GB RAM.
My simple benchmark visitor counts up statistics on the PGN file (how many games are available in this file, etc.):
I then used the swift-benchmark package to actually execute my test.
Here are the results:
-
Time: 28,19s
-
Throughput: 38,12 MB/s
To my eye it seems like there still is room for improvement comparing against highly optimized solutions build in different languages. However parsing 1 GB of chess games in about half a minute while being able to support the full specification is more than fine for my use case. My typical chess dataset is a couple of megabytes in size. At a later point I might want to revisit this implementation though and see if I can find the culprit slowing it down. For the moment I am more than happy with the results. What do you think?
Posted in swift