Should we sort files before or after merging them?

This weekend I produced a very large database of chess positions. This database is useful to chess engine programmers and researcher. The database is a text file containing one position per line encoded using the Forsyth-Edwards Notation.

At some point in the process of creating this file I had multiple files containing more than 600 million lines with lots of duplicates (positions occurring in more than one game). I needed to use the GNU sort utility to merge those files and remove duplicates. I had two options : I could join the files in one single large file and sort it, or I could sort the files individually and use the sort utility to merge them together and remove duplicates at the same time. After some research on the internet I could not find any advice about which route might be the best. So I decided to test it.

I’ve taken a more manageable subset of my file containing 9.5 million lines, I shuffled it with :

sort --randomized-sort -o testdata testdata

Then I splited this file into roughly 10 equals parts :

split -n l/10 -d testdata testdata

I’ve run the first scenario (join the files and sort the result) four times with the command below. The average time for the four run is 69.4 seconds with the slowest run taking 70.9 seconds.

cat testdata* > testdata; sort -o testdata testdata

I’ve run the second scenario with the small bash script below. The four run average is 70.9 with the slowest run at 71.1.

for f in testdata*
 sort -o $f $f
sort --merge -o testdata testdata*

While the second method is marginally slower it’s not really significant. It kinds of make sense. GNU sort implements a merge sort which is already a recursive sort. By sorting the individual files before hands we are doing exactly what GNU sort is doing internaly, that is merge sorted chunks of the data.


FEN Database

Back in 2003, I collaborated with a friend to generate a collection of chess positions in FEN format. Back then we were creating this collection to test the efficiency of the Zobrist hash function for chess positions. Since we only needed the pieces positions on the board we only kept the first field of the FEN notation.

In 2005 I mentioned on that I had such a database and since then a couple of persons contacted me by email asking me for a copy of the database. Unfortunately, most of them could not make use of it because important information about the positions (such as whose side is to play next) were missing.

This week, someone contacted me again asking for the database and I decided I would build it again, this time keeping all FEN fields.

The database is created using all the PGN files I have collected over time. I also updated my TWIC collection up to last week before I started. I used the pgn2fen.exe tool available on Tim Foden’s site to create huge files containing every FEN positions of every games in my PGN files. I then used the GNU sort utility to sort the files and remove any duplicate lines. That gave me a file with 286,623,962 unique positions, weighting 16.8 gb. I used the GNU split tool to split this file in five more « manageable » files. I used the round robin distribution capabilities of split so that each file would contains position representative of the whole collection even if they are sorted. Lastly, I compressed the files with 7za, because it was the most effective and free compression method I found in my (limited) researches. The files went from 3.4 gb to 0.65 gb each for a total of 3.3 gb.

You can dowload each separate files below. On linux you can expand them using :

7za e uniqueXX.fen.7z

On Windows you will need to use 7zip wich is free and has a graphical interface.