Supervisor

Author

Jonas Knobloch

Morphologically Biased Byte-Pair Encoding

Status: finished / Type of Thesis: Seminar Thesis / Location: Dresden

Byte-pair encoding (Gage, 1994; Sennrich et al., 2015) is a subword tokenization
algorithm popular for its ability to gracefully handle the unknown word problem. It
and the closely related WordPiece scheme have seen widespread adoption in large
language models: GPT-2 (Radford et al., 2019) uses byte-pair encoding and BERT
(Devlin et al., 2018) uses WordPiece. While byte-pair encoding
has been shown to work well for English, it might not be the best choice for
morphologically rich languages like Czech or Finnish. Our work demonstrates how
to integrate existing morphological analyzers into the tokenization process in order
to bias byte-pair encoding toward better approximating linguistic subword boundaries. We
observe significant improvements in evaluation loss and model perplexity, albeit with
slight decreases in accuracy across several downstream tasks.
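
The abstract does not spell out how the analyzer output enters the tokenizer, so the following is only a minimal sketch of one possible form of biasing: words are pre-segmented into morphemes, and byte-pair merges are learned and applied only inside a morpheme, never across a boundary. The names `toy_segment`, `train_bpe`, and `encode` are hypothetical, and the hand-written lookup table merely stands in for a real morphological analyzer; none of this is taken from the thesis itself.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyzer (assumption, not from the thesis).
TOY_ANALYSES = {
    "talossa": ["talo", "ssa"],
    "autossa": ["auto", "ssa"],
    "talot": ["talo", "t"],
}

def toy_segment(word):
    """Return the morphemes of a word, or the whole word if it is unknown."""
    return TOY_ANALYSES.get(word, [word])

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def pair_counts(corpus):
    """Count adjacent symbol pairs, looking only inside each morpheme segment."""
    counts = Counter()
    for word, freq in corpus.items():
        for seg in word:
            for a, b in zip(seg, seg[1:]):
                counts[(a, b)] += freq
    return counts

def train_bpe(word_freqs, segment, num_merges):
    """Learn up to `num_merges` merges that never cross a morpheme boundary."""
    # Each word becomes a tuple of segments; each segment is a tuple of characters.
    corpus = {tuple(tuple(m) for m in segment(w)): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(corpus)
        if not counts:
            break  # every segment is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        corpus = {
            tuple(merge_pair(seg, best) for seg in word): f
            for word, f in corpus.items()
        }
    return merges

def encode(word, segment, merges):
    """Tokenize a word by applying the learned merges within each morpheme."""
    tokens = []
    for morpheme in segment(word):
        symbols = tuple(morpheme)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens

merges = train_bpe({"talossa": 5, "autossa": 3, "talot": 2}, toy_segment, 10)
print(encode("talossa", toy_segment, merges))  # ['talo', 'ssa']
```

Because pairs are counted per segment, a merge such as `o` + `s` across the `talo|ssa` boundary can never be learned here. A softer bias, for example down-weighting rather than forbidding cross-boundary merges, would be an equally plausible reading of the abstract.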

Funded by the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).

ScaDS.AI Dresden/Leipzig (Center for Scalable Data Analytics and Artificial Intelligence) is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig.