In natural language processing, pre-processing the text our model works on is important for its quality, much like standard data-cleaning practice in the rest of machine learning. We want to avoid garbage in, garbage out when training NLP models, so we pre-process to ensure the basic quality of the data we work with.

A lot of these steps aren't obvious, but they're important: certain words have functionally the same value to our analysis as each other (like test versus Test), and we don't want to treat them differently.

A non-exhaustive list of standard steps (illustrated in the code sketch after the list):

  • Normalisation — convert text to a consistent form, most commonly by lowercasing it.
  • Stop word filtering — we filter out words that don’t provide much value to our analysis (like of, the, to, have). No point in keeping them!
  • Lemmatisation — reduces inflected word forms (plurals, past and present tense) to a single base form, the lemma. Note that this process checks the result against a dictionary, so the output is always a valid word.
    • Stemmers — provide similar functionality to lemmatisers, but blindly strip common suffixes without any dictionary check. This can produce unintended results, including tokens that aren't valid words (studies becomes studi, for example).
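
A minimal sketch of this pipeline using NLTK, one common choice (spaCy offers equivalent functionality). The sample sentence and variable names are just for illustration:

```python
# pip install nltk
import nltk

# One-time downloads of the resources used below.
nltk.download("punkt_tab")   # tokeniser models ("punkt" on older NLTK versions)
nltk.download("stopwords")   # stop word lists
nltk.download("wordnet")     # dictionary backing the lemmatiser

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "The tests were failing, so we studied the logs."

# Normalisation: lowercase everything so 'Tests' and 'tests' match.
tokens = word_tokenize(text.lower())

# Stop word filtering: drop low-value function words like 'the' and 'so',
# keeping only alphabetic tokens (this also discards punctuation).
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Lemmatisation: dictionary-backed reduction to a base form. Without
# part-of-speech tags, WordNetLemmatizer treats every token as a noun,
# so verb forms like 'failing' pass through unchanged here.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# ['test', 'failing', 'studied', 'log']

# Stemming: blind suffix stripping; note the invalid token 'studi'.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# ['test', 'fail', 'studi', 'log']
```

Comparing the two outputs shows the trade-off from the last two bullets: the lemmatiser only ever returns dictionary words, while the stemmer happily emits studi.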