Trígrafos, a game for word nerds
My formative internet years, well over a decade ago, were with a social community called "Nerd Paradise." We played all sorts of web-based games: e-mail chess, IRC Countdown, Rubik's Cube races, codes and ciphers of all stripes, and an Oulipo speech-and-debate league (where you're not allowed to use the letter e). Good times. I had completely forgotten about one word game until last week. It doesn't really have a name, this game, but the premise is simple: someone submits an uncommon three-letter sequence, and the first person to find a word containing that sequence wins the round.
Example: Player A suggests the sequence fta. Players B, C, D rack their brains; Player B shouts "chieftain!" and wins the round for being fastest. Player B then suggests a new sequence. She chooses vra. Unfortunately, Player B is from the U.K., and the other players are American, and none of them think of a word with that sequence (like manoeuvrable). Game over.
Immediately, I wanted two things: to play this game again, and to figure out what it's called. I just started calling it trígrafos, since I was planning to use the game as a Spanish vocab-expander. I also added a sliding-scale for challenge—if you go on a winning streak, you'll encounter rarer trigraphs.
Play the Game
Head over to Github and download the game! That page also has the instructions and to-do notes. Play it to your heart's content and let me know what you think.
Making the dictionaries and challenge levels
This was a fun little jaunt into corpus linguistics. For the English, French, and Italian versions, I used the Google N-Grams database, which is based on hundreds of thousands of digitized books going back centuries (for Spanish and Portuguese I used different but similar databases). Each language database consists of several gigabytes of tables, listing every word in the given language, which years that word was seen in, and how many instances of that word appear in that year. Lots of words, lots of years: these tables are BIG.
From these tables, I needed to extract two things: a frequency-sorted list of unigrams (every single word in the language), and a frequency-sorted list of trigraphs (every three-letter sequence in words in the language). The corpus processing is rudimentary, but effective: unigrams or trigraphs that appear less than a certain number of times are removed from the lists. Then the trigraphs are separated into strata, with the most common ones occupying Levels 1, 2, 3...each level filled with increasingly infrequent trigraphs until Level 20, filled with the rarest, most obscure trigraphs.