ispell nouns

2023-05-06 python ispell itertools fileinput re named group regex

I am working on a small Hangman game that allows to guess words. Part of the problem is to find good czech language dictionary that I can use as an input. Searching of the internet for text-based dictionaries I found ispell Czech dictionary that looks good for my purpose.

Since I am trying to get more orientation in python world, I tried to use it to find suitable words. My primary requirements were:

take nouns
longer than 8 characters
keep accents

The ispell format is pretty simple, just plain text in utf-8 encoding. On each line, there can be number of words separated by space, optionally followed by slash and some flags that denote the suffix of the word. In some cases, there is also way to expand number of prefixes for the word. For example

aeroplán/H
aféra/ZQ
afinita/ZQ
{a,in,post,pre,su}fix/H
{a,bezpre,in,post,pre,su}fixový/YKR
aforismus/Q
aforistický/YCRN
afrikáta/ZQ
...
baba/ZQ babi
{,pra,prapra}bába/ZQ bábi
{,pra,prapra}bábin/Y
{,pra,prapra}babiččin/Y
{,pra,prapra}babička/ZQ
babin/Y
babí/Y
babizna/ZQ

In the script I tried to use number of techniques. Core of the functionality is generator function ispell_entries that parses ispell lines and lazily produces entries as tuple of the word and its flags. On top of building the generator with yield command, I also tried regular expressions including the verbose syntax that allows to make the patterns more readable. Another nice thing is named capture groups with (?P<name> ... ) syntax.

The entries are then filtered through longer_noun function that allows only flags I am interested in and longer words. Final part is printing only first few hundreds of words via islice function from itertools

import fileinput
from itertools import islice
import re

def ispell_entries(file):
    for line in file:
        # each line can contain multiple entries - "blána/Z blanou blan blanám blanách blanami"
        for entry in line.rstrip().split(' '):
            word, sep, flags = entry.partition('/')

            # handle format like {a,in,post,pre,su}fix
            match = re.search(r' \{ (?P<prefix> .*? ) \} (?P<rest> .* )$', word, re.VERBOSE)
            if match:
                for prefix in match.group('prefix').split(','):
                    yield prefix + match.group('rest'), flags
            else:
                yield word, flags

def longer_noun(entry):
    word, flags = entry
    if not re.search(r'[HQXZPI]', flags):
        return False
    return len(word) >= 8

input = fileinput.input(encoding="utf-8")
for entry in islice(filter(longer_noun, ispell_entries(input)), 200):
    print(entry[0])

The script accepts .cat files on command-line, here is the output I got from using hlavni.cat input file

abdikace
abnormalita
absentér
absentismus
absolutismus
abstinence
abstrakce
absurdita
acetylén
adaptace
adjektivum
administrativa
administrátor
admiralita
viceadmirál
adresátka
advokacie
...