⁂ George Ho

Datasets and Dictionaries for Crosswords

Lately, I’ve become worryingly knowledgeable about datasets for crosswords, so I’ve written up basically everything I know that might be helpful to crossword constructors (and makers of other word puzzles, too). In writing this, though, I realized that it may be helpful to just about anybody who works with words: lyricists, poets, marketers, scholars, etc. Hopefully there’s something for everybody! So, without further ado, let’s start with dictionaries.


I’ll assume you know what a dictionary is — if you’re reading this you may even have a favorite dictionary (or a favorite dictionary edition!), whether it’s Chambers, Merriam-Webster or Google Dictionary (which, fun fact, is mostly sourced from Oxford Languages).

More interesting are dictionaries that allow you to search or query them in more sophisticated ways: the most popular are OneLook and OneLook Thesaurus, where a user can, for example, search bl????rd to find words that start with bl, end with rd, and have four letters in between — so bluebird would be a result.
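To make the idea of a pattern query concrete, here is a minimal sketch of how such a search could work over a plain list of words. It is not how OneLook is implemented; the function names and the tiny word list are made up for illustration, and it assumes `?` stands for exactly one letter and `*` for any run of letters.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regex.

    Assumed syntax: '?' is exactly one letter, '*' is any run of letters.
    """
    parts = []
    for ch in pattern.lower():
        if ch == "?":
            parts.append("[a-z]")
        elif ch == "*":
            parts.append("[a-z]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def search(pattern, words):
    """Return the words that match the whole pattern, in their original order."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w.lower())]

# Tiny illustrative word list (an assumption; swap in any list of words).
WORDS = ["bluebird", "blizzard", "blackbird", "blackboard", "bird", "blurred"]

print(search("bl????rd", WORDS))  # eight-letter words only
print(search("bl*rd", WORDS))     # any length
```

With this list, the first query keeps only the eight-letter matches, while the second also lets in the longer blackbird and blackboard.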

The main asset of these dictionaries is the expressiveness of the query language, and in that regard Qat (which is also available in French) handily beats OneLook: it can match vowels and consonants (bl@@#@rd) and ranges of letters and lengths (8-10:bl*rd). Qat is also able to solve “word equations” (e.g. ABCDE=.....;!=A<B<C<D<E finds five-letter words whose letters are in strictly alphabetical order, such as abhor and first), and even simultaneous word equations (e.g. ACB;ADB;AEB;|ACB|=5;|E|=1;!=C<D<E finds sets of three five-letter words that differ from each other in only a single letter, such as beats, boats, brats, which is useful for finding crossing words!).
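If you don’t have Qat to hand, the two word equations above can be approximated in a few lines of Python. This is only a sketch of the same constraints, not a reimplementation of Qat’s query language; the function names and the short word list are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def strictly_alphabetical(word):
    """True if every letter comes strictly before the next one in the alphabet."""
    return all(a < b for a, b in zip(word, word[1:]))

def one_letter_apart_triples(words, length=5):
    """Find triples of words that are identical except at one shared position."""
    groups = defaultdict(set)
    for w in words:
        if len(w) != length:
            continue
        # Blank out each position in turn; words sharing a key differ only there.
        for i in range(length):
            groups[(i, w[:i] + "_" + w[i + 1:])].add(w)
    triples = []
    for key in sorted(groups):
        members = sorted(groups[key])
        triples.extend(combinations(members, 3))
    return triples

# Tiny illustrative word list (an assumption; use a real dictionary in practice).
WORDS = ["abhor", "first", "beats", "boats", "brats", "crane"]

print([w for w in WORDS if len(w) == 5 and strictly_alphabetical(w)])
print(one_letter_apart_triples(WORDS))
```

Grouping by the “blanked-out” word avoids comparing every pair of words, which matters once the list has hundreds of thousands of entries.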

Augmented Dictionaries

Many tools supplement dictionaries with other data, such as etymology, pronunciation or sets of related words. You might think that your favorite dictionary would already give you all of those things, but the strength here is in the ability to easily write very sophisticated queries, such as “which parts of a car start with the letter T?”, which gives you answers like trunk, throttle, tailfin and third gear.

Here, another shoutout goes to OneLook Thesaurus and Qat, which use several datasets (such as the Princeton WordNet and Wikipedia category lists) to search words based on their meaning. For example, in OneLook, process by which plants eat gives you photosynthesis as the top result; in Qat, {hypo:color} gives you hyponyms of “color” (more specific words that fall under it), such as acrylic apricot blacken blueing; also in Qat, {hyper:agate} gives you hypernyms of “agate” (broader categories that it belongs to), such as entity matter quartz. These searches make it easy to find synonyms, hypernyms, hyponyms and other related words.
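For a feel of what hypernym and hyponym lookups do under the hood, here is a toy sketch. The taxonomy below is hand-written for illustration and is not WordNet data; real tools walk a far larger graph, and the function names are hypothetical.

```python
# Hand-written toy taxonomy (NOT real WordNet data):
# each word maps to its direct hypernym, i.e. the broader term above it.
PARENT = {
    "agate": "quartz",
    "quartz": "mineral",
    "mineral": "matter",
    "matter": "entity",
    "apricot": "color",
    "crimson": "color",
    "color": "attribute",
    "attribute": "entity",
}

def hypernyms(word):
    """Walk upward: every broader term above `word`, nearest first."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Walk downward: every narrower term below `word`, sorted alphabetically."""
    result = []
    for child, parent in PARENT.items():
        if parent == word:
            result.append(child)
            result.extend(hyponyms(child))
    return sorted(result)

print(hypernyms("agate"))
print(hyponyms("color"))
```

Combining a lookup like this with a letter-pattern filter is roughly what makes meaning-based queries so handy for constructors.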

Curated Dictionaries

In the other direction are datasets that don’t augment dictionaries, but rather curate them: their usefulness comes not just in what you can find in them, but equally in what you can’t.

The most prevalent examples are wordlists and their cousins, seedlists. As far as I can tell, these are more useful for American-style crosswords, where there is a hard requirement for fully interlocking grids (and grid-filling consequently is a more difficult and computer-assisted task).

Wordlists tend to be personalized by puzzle constructors, and you can find some wordlists for sale, most notably Jeff Chen’s Personal List. There are also several freely accessible ones such as spread the word(list), The Collaborative Word List, and Peter Broda’s wordlist.
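As a rough illustration of how a wordlist gets used during grid filling, here is a sketch that loads a scored list and looks up candidates for a partially filled slot. It assumes one `WORD;SCORE` entry per line, which is a common plain-text layout but may not match any particular list; the sample entries, scores, default score and function names are all made up.

```python
# Made-up sample in an assumed "WORD;SCORE" layout, one entry per line.
SAMPLE = """\
BLUEBIRD;60
BLIZZARD;55
ESNE;10
OREO;50
"""

def parse_wordlist(text, default_score=50):
    """Parse 'WORD;SCORE' lines into a dict; unscored lines get default_score."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        word, _, score = line.partition(";")
        entries[word.strip().upper()] = int(score) if score.strip() else default_score
    return entries

def candidates(entries, slot, min_score=50):
    """Words that fit `slot` ('?' marks an empty cell), best score first."""
    fits = []
    for word, score in entries.items():
        if score < min_score or len(word) != len(slot):
            continue
        if all(s == "?" or s == c for s, c in zip(slot, word)):
            fits.append((word, score))
    return sorted(fits, key=lambda ws: (-ws[1], ws[0]))

entries = parse_wordlist(SAMPLE)
print(candidates(entries, "BL????RD"))
print(candidates(entries, "????"))
```

The score threshold is where the curation pays off: raising `min_score` keeps low-rated fill out of the grid, at the cost of fewer candidates per slot.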

Other examples of curated dictionaries would just be lists of specific things. One amazing example is the Expanded Crossword Name Database, which contains the names of notable women and non-binary people, with an eye to increasing their representation in crosswords. Aside from that, I’ve found Wikipedia’s “listicles” to be very helpful (e.g. here’s a list of notable Native Americans of the United States).

Datasets of Crosswords

Finally, let’s not neglect the most obvious thing: literal datasets of crosswords! These datasets are significant works of crossword archivism, since acquiring crosswords in bulk and structuring their contents requires effort and cleaning that few are willing to do for such trivial data. (Fun fact: according to this 2004 selection guide, the Library of Congress explicitly does not collect crossword puzzles, suggesting that they’re too trivial for the national library!)
