We want to avoid garbage in, garbage out when training NLP models. So we need to do some pre-processing to ensure basic quality of the data we work with.

Stop words are words that are usually filtered out immediately because they don’t provide much value to our analysis (i.e., of, the, to, have). No reason to have them!

We also observe that with word conjugation and forms (plurals, past, present tense), words often have the same value to our analysis. In this case, we want to have a single representation in our corpus. This process is called lemmatisation.