Datasets and Dictionaries for Crosswords

2022-07-30

Lately, I’ve become worryingly knowledgeable about datasets for crosswords, so I’ve written up basically everything I know that might be helpful to crossword constructors (and makers of other word puzzles, too). In writing this, however, I realized that it may be helpful to just about anybody who works with words: lyricists, poets, marketers, scholars, etc. Hopefully there’s something for everybody! So, without further ado:

Dictionaries

I’ll assume you know what a dictionary is — if you’re reading this you may even have a favorite dictionary (or a favorite dictionary edition!), whether it’s Chambers, Merriam-Webster or Google Dictionary (which, fun fact, is mostly sourced from Oxford Languages).

More interesting are dictionaries that allow you to search or query them in more sophisticated ways: the most popular are OneLook and OneLook Thesaurus, where a user can, for example, search bl????rd to find words that start with bl, end with rd, and have four letters in between — so bluebird would be a result.
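To make the wildcard idea concrete, here is a minimal Python sketch of how a pattern like bl????rd could be matched against a word list. This is only an illustration, not how OneLook works internally: the function names and the tiny SAMPLE_WORDS list are made up for the example, and treating * as "any run of letters" is an assumption layered on top of the ? wildcard described above.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a regex.

    '?' stands for exactly one letter; '*' (an assumption for this sketch)
    stands for any run of letters, including none.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[a-z]")
        elif ch == "*":
            parts.append("[a-z]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole pattern, case-insensitively."""
    rx = pattern_to_regex(pattern.lower())
    return [w for w in words if rx.fullmatch(w.lower())]

# Hypothetical stand-in for a real dictionary file.
SAMPLE_WORDS = ["bluebird", "blizzard", "blackbird", "billiard", "board"]

print(search("bl????rd", SAMPLE_WORDS))  # ['bluebird', 'blizzard']
```

With a real word list loaded from a file, the same search function would work unchanged.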

The main asset of these dictionaries is the expressiveness of their query language, and in that regard Qat (which is also available in French) handily beats OneLook: it can match vowels and consonants (bl@@#@rd) and ranges of letters and lengths (8-10:bl*rd). Qat is also able to solve “word equations” (e.g. ABCDE=.....;!=A<B<C<D<E finds five-letter words whose letters are in strictly alphabetical order, such as abhor and first), and even simultaneous word equations (e.g. ACB;ADB;AEB;|ACB|=5;|E|=1;!=C<D<E finds sets of three five-letter words that are all one letter apart, such as beats, boats, brats, which is useful for finding crossing words!).
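If you don't have Qat at hand, the constraints in these examples are easy to approximate in a few lines of Python. The sketch below is not Qat's implementation and covers only a sliver of its query language. It assumes that @ means a vowel and # a consonant (which is how the bluebird example reads), and it treats y as a consonant, which is a simplification of my own. All function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant

def matches_cv(pattern: str, word: str) -> bool:
    """Check a word against a pattern where '@' is a vowel and '#' a consonant."""
    if len(pattern) != len(word):
        return False
    for p, c in zip(pattern, word):
        if p == "@":
            if c not in VOWELS:
                return False
        elif p == "#":
            if c in VOWELS or not c.isalpha():
                return False
        elif p != c:  # any other character must match literally
            return False
    return True

def strictly_alphabetical(word: str) -> bool:
    """True if each letter comes strictly later in the alphabet than the previous one."""
    return all(a < b for a, b in zip(word, word[1:]))

def one_letter_apart(a: str, b: str) -> bool:
    """True if two same-length words differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

print(matches_cv("bl@@#@rd", "bluebird"))
print([w for w in ["abhor", "first", "beats"] if strictly_alphabetical(w)])
print(one_letter_apart("beats", "boats"), one_letter_apart("boats", "brats"))
```

Running a filter like strictly_alphabetical over a full word list is a brute-force stand-in for the first word equation; the simultaneous equation would need a pairwise or grouped search on top of one_letter_apart.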

Augmented Dictionaries

Many tools supplement dictionaries with other data, such as etymology, pronunciation or sets of related words. You might think that your favorite dictionary would already give you all of those things, but the strength here is in the ability to easily write very sophisticated queries, such as “what comprises a car that starts with the letter T?”, to give you phrases like trunk, throttle, tailfin, third gear.

The Online Etymology Dictionary looks up word etymologies, which is helpful for avoiding “shared roots” in cryptic crosswords.
The Carnegie Mellon University Pronouncing Dictionary looks up word pronunciations, splitting words up into phonemes. This may seem silly (“can’t you just Google to learn the pronunciation of words?”), but with a bit of work, this dataset lets you look up homophones and Spoonerisms, as some crossword construction software (such as Exet) does!
RhymeZone and its Spanish cousin Rimar.io let you look up homophones, rhymes or near rhymes (RhymeZone actually uses the CMU Pronouncing Dictionary, among other datasets!)
Spruce looks up “inspiring sentences” — quotes, lyrics, proverbs and jokes, which are indexed from WikiQuote and Common Crawl.
Nutrimatic looks up words or phrases mined from Wikipedia. This allows you to, for example, find anagrams that form natural-sounding phrases (e.g. <dictionaries> finds anagrams like is a direction or i consider it a, instead of anagrams that technically work but are not natural-sounding, such as ratio incised or tonic dairies).
The Datamuse API is a very expressive search engine that sits on top of OneLook and RhymeZone. Unfortunately, there isn’t a user-friendly frontend, so it’s effectively restricted to people who are able to make use of programmatic access.
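As an example of the "bit of work" needed to get homophones out of a pronouncing dictionary, here is a rough Python sketch. It assumes the plain-text layout used by the CMU Pronouncing Dictionary: one entry per line, the word followed by its phonemes, comment lines starting with ;;;, and alternate pronunciations marked with a suffix such as (1). The handful of entries in SAMPLE_CMUDICT are illustrative lines written in that layout rather than an excerpt of the real file, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative entries in the CMUdict layout, not the real file.
SAMPLE_CMUDICT = """\
;;; comment lines start with three semicolons
CAT  K AE1 T
KNIGHT  N AY1 T
NIGHT  N AY1 T
RIGHT  R AY1 T
RITE  R AY1 T
WRITE  R AY1 T
"""

def parse_cmudict(text: str) -> dict[str, list[tuple[str, ...]]]:
    """Map each word to the list of its pronunciations (tuples of phonemes)."""
    entries: dict[str, list[tuple[str, ...]]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        head, *phonemes = line.split()
        word = head.split("(")[0].lower()  # drop an alternate marker like "(1)"
        entries.setdefault(word, []).append(tuple(phonemes))
    return entries

def homophone_groups(entries: dict[str, list[tuple[str, ...]]]) -> list[list[str]]:
    """Group words that share at least one identical pronunciation."""
    by_sound: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for word, pronunciations in entries.items():
        for phonemes in pronunciations:
            by_sound[phonemes].add(word)
    return [sorted(words) for words in by_sound.values() if len(words) > 1]

print(homophone_groups(parse_cmudict(SAMPLE_CMUDICT)))
```

This compares pronunciations exactly, stress markers included. Stripping the stress digits before grouping would be a looser, more forgiving definition of "homophone".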

Here, another shoutout goes to OneLook Thesaurus and Qat, which use several datasets (such as the Princeton WordNet and Wikipedia category lists) to search words based on their meaning. For example, in OneLook, process by which plants eat gives you photosynthesis as the top result; in Qat, {hypo:color} gives you hyponyms of “color” (more specific words that fall under it), such as acrylic apricot blacken blueing; also in Qat, {hyper:agate} gives you hypernyms of “agate” (broader categories that it belongs to), such as entity matter quartz. These searches make it easy to find synonyms, hypernyms, hyponyms and other related words.
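For those comfortable with programmatic access, the Datamuse API offers a similar meaning-based search. The Python sketch below builds a query against its /words endpoint using the ml ("means like"), sp ("spelled like") and max parameters. The helper names datamuse_url and fetch_words are my own, and fetch_words assumes the response is a JSON list of objects with a "word" field, so check the current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATAMUSE_WORDS_ENDPOINT = "https://api.datamuse.com/words"

def datamuse_url(meaning=None, spelling=None, max_results=10) -> str:
    """Build a /words query URL from an optional meaning and spelling pattern."""
    params = {}
    if meaning:
        params["ml"] = meaning    # "means like" constraint
    if spelling:
        params["sp"] = spelling   # "spelled like" constraint, e.g. "b*" or "t??k"
    params["max"] = max_results
    return f"{DATAMUSE_WORDS_ENDPOINT}?{urlencode(params)}"

def fetch_words(url: str) -> list[str]:
    """Fetch a query URL and return just the words (requires network access)."""
    with urlopen(url) as response:
        return [item["word"] for item in json.load(response)]

# Building the URL needs no network; call fetch_words(url) to actually run the query.
url = datamuse_url(meaning="process by which plants eat", max_results=5)
print(url)
```

Combining both constraints, as in datamuse_url(meaning="duck", spelling="b*"), asks for words related to "duck" that start with b.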

Curated Dictionaries

In the other direction are datasets that don’t augment dictionaries, but rather curate them: their usefulness comes not just in what you can find in them, but equally in what you can’t.

The most prevalent examples are wordlists and their cousins, seedlists. As far as I can tell, these are more useful for American-style crosswords, where there is a hard requirement for fully interlocking grids (and grid-filling consequently is a more difficult and computer-assisted task).

Wordlists tend to be personalized by puzzle constructors, and you can find some wordlists for sale, most notably Jeff Chen’s Personal List. There are also several freely-accessible ones such as spread the word(list), The Collaborative Word List, and Peter Broda’s wordlist.
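To give a feel for how a wordlist gets used during grid filling, here is a small Python sketch. It assumes a plain-text file with one word;score entry per line, where a higher score means a more desirable entry. That format, the default score of 50 for unscored lines, and the sample entries are all assumptions for illustration, so check the actual format of whichever list you download.

```python
# Hypothetical sample in an assumed "word;score" format.
SAMPLE_WORDLIST = """\
FIRST;70
ABHOR;50
BEATS;60
BOATS;60
ESNES;10
BLUEBIRD;55
"""

def parse_wordlist(text: str, default_score: int = 50) -> dict[str, int]:
    """Map each lowercased word to its score; unscored lines get default_score."""
    scores: dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        word, _, score = line.partition(";")
        scores[word.lower()] = int(score) if score else default_score
    return scores

def fill_candidates(scores: dict[str, int], length: int, min_score: int) -> list[str]:
    """Words of the given length scoring at least min_score, best first."""
    keep = [w for w, s in scores.items() if len(w) == length and s >= min_score]
    return sorted(keep, key=lambda w: (-scores[w], w))

scores = parse_wordlist(SAMPLE_WORDLIST)
print(fill_candidates(scores, length=5, min_score=50))
```

The score threshold is where curation shows up: raising min_score is how a constructor keeps weaker entries out of a grid without deleting them from the list.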

Other examples of curated dictionaries would just be lists of specific things. One amazing example is the Expanded Crossword Name Database, which contains the names of notable women and non-binary people, with an eye to increasing their representation in crosswords. Aside from that, I’ve found Wikipedia’s “listicles” to be very helpful (e.g. here’s a list of notable Native Americans of the United States).

Datasets of Crosswords

Finally, let’s not neglect the most obvious thing: literal datasets of crosswords! These datasets are significant works of crossword archivism, since acquiring crosswords in bulk and structuring their contents requires effort and cleaning that few are willing to do for such trivial data. (Fun fact: according to this 2004 selection guide, the Library of Congress explicitly does not collect crossword puzzles, suggesting that they’re too trivial for the national library!)

XWord Info is probably the dataset with the largest following, as it covers The New York Times crossword and is actively maintained.
Among constructors of American-style crosswords, Matt Ginsberg’s clue dataset is the go-to dataset (since it’s free and accessible to download), but it’s unfortunately no longer actively maintained.
xd.saul.pw is an excellent dataset of American-style crosswords and clues from various publications that is also free and accessible to download.
The Cruciverb database is also a dataset of American-style crosswords and clues, but unfortunately requires a membership to access.
Finally, to plug my own dataset, cryptics.georgeho.org is a dataset of cryptic clues, with auxiliary datasets of cryptic indicators and charades.

#Crossword #Dataset #Natural-Language-Processing