Jonathan Bratt - {morphemepiece}: more meaningful tokenization for NLP
{morphemepiece}: more meaningful tokenization for NLP
by Jonathan Bratt

Visit https://rstats.ai/nyr/ to learn more.

Abstract: Modern language models use tokenizers based on subword-level vocabularies. Words not present in the vocabulary are broken into subword tokens. This subword tokenization is generally unrelated to the morphological structure of the word. It is intuitively appealing to consider a tokenizer that uses a morpheme-level vocabulary to split words into meaningful units. Implementing such a tokenizer, while conceptually straightforward, presents a number of practical challenges. We present an approach to solving these challenges and introduce {morphemepiece}, an R package that implements a new tokenization algorithm for breaking down (most) words into their smallest units of meaning.

Bio: Jonathan gets excited about physics, education, and natural language processing. He is a data scientist on Macmillan Learning’s newly-formed Content Science team, where he tries as much as possible to work at the intersection of these interests.

Twitter: https://twitter.com/

Presented at the 2021 New York R Conference (September 9, 2021)
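
Illustration: the contrast the abstract draws between subword and morpheme-level tokenization can be sketched with a toy example. The Python below is not the {morphemepiece} algorithm or its API. It is a minimal sketch that assumes a WordPiece-style greedy longest-match tokenizer for the subword side and a plain lookup table for the morpheme side. The vocabularies, the `##` continuation prefix, the `[UNK]` token, and the function names are all hypothetical and chosen only for illustration.

```python
# Toy data, invented for illustration; not taken from any real model or package.
SUBWORD_VOCAB = {"un", "unb", "##rea", "##break", "##kable", "##able"}
MORPHEME_LOOKUP = {"unbreakable": ["un", "break", "able"]}


def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split a word by repeatedly taking the longest vocabulary match.

    Pieces after the first carry a '##' prefix, as in WordPiece-style
    vocabularies. If no piece matches at some position, the whole word
    becomes the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        tokens.append(match)
        start = end
    return tokens


def morpheme_tokenize(word, lookup=MORPHEME_LOOKUP, vocab=SUBWORD_VOCAB):
    """Use a known morpheme breakdown if one exists, else fall back to subwords."""
    if word in lookup:
        return list(lookup[word])
    return greedy_subword_tokenize(word, vocab)


print(greedy_subword_tokenize("unbreakable", SUBWORD_VOCAB))
print(morpheme_tokenize("unbreakable"))
```

With this toy vocabulary the greedy tokenizer prefers the longer piece "unb" over "un", so its split cuts across the morpheme boundaries, while the lookup returns units that each carry meaning. The fallback branch also hints at one of the practical challenges the abstract mentions: any fixed morpheme table will miss some words.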