Status: finished / Type of Theses: Seminar Theses / Location: Dresden
Byte-pair encoding (Gage, 1994; Sennrich et al., 2015) is a subword tokenization
algorithm popular for its ability to handle the unknown-word problem gracefully. It
has seen widespread adoption in large language models such as GPT-2 (Radford et
al., 2019), and the closely related WordPiece algorithm is used by BERT (Devlin et
al., 2018). While byte-pair encoding
has been shown to work well for English, it might not be the best choice for
morphologically rich languages like Czech or Finnish. Our work demonstrates how
to integrate existing morphological analyzers into the tokenization process in order
to bias byte-pair encoding toward linguistic subword boundaries. We
observe significant improvements in evaluation loss and model perplexity, albeit with
slight decreases in accuracy across several downstream tasks.
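
To make the idea concrete, the following is a minimal sketch of byte-pair encoding in which merges are never allowed to cross a morpheme boundary. It is an illustration only, not the method used in this work: the abstract speaks of biasing the merges, whereas the sketch uses a hard constraint as the simplest stand-in. The function names (`learn_bpe`, `presegment`, `tokenize`), the toy corpus, and the dictionary-based `toy_analyze` are all hypothetical; a real setup would call an actual morphological analyzer instead.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(unit_freqs, num_merges):
    """Learn up to `num_merges` merges from a {string: frequency} mapping."""
    vocab = Counter({tuple(unit): freq for unit, freq in unit_freqs.items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by unit frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every unit is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def presegment(word_freqs, analyze):
    """Split words into morphemes so that merges are learned within morphemes only."""
    units = Counter()
    for word, freq in word_freqs.items():
        for morph in analyze(word):
            units[morph] += freq
    return units


def tokenize(word, merges, analyze):
    """Replay the learned merges, in order, inside each morpheme of `word`."""
    tokens = []
    for morph in analyze(word):
        symbols = tuple(morph)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hypothetical toy data: hand-written Finnish segmentations standing in for
# the output of a real morphological analyzer.
TOY_ANALYSES = {
    "talossa": ["talo", "ssa"],
    "talon": ["talo", "n"],
    "autossa": ["auto", "ssa"],
}
WORD_FREQS = {"talossa": 5, "talon": 3, "autossa": 4}


def toy_analyze(word):
    # Unknown words fall back to a single unsegmented unit.
    return TOY_ANALYSES.get(word, [word])


if __name__ == "__main__":
    merges = learn_bpe(presegment(WORD_FREQS, toy_analyze), num_merges=20)
    print(tokenize("talossa", merges, toy_analyze))
```

On this toy corpus the constrained tokenizer splits "talossa" into the stem "talo" and the case ending "ssa", whereas plain byte-pair encoding trained on the unsegmented words with the same merge budget would collapse it into a single token. A soft bias, for example down-weighting the counts of boundary-crossing pairs rather than forbidding them, would be a natural variation on the same structure.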