The Zipf distribution is a discrete probability distribution used primarily to model word frequencies: in a large body of text, the frequency of a word is inversely proportional to its rank in the frequency ordering (the most frequent word has rank 1).
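The probability mass function is given by:
$$P(X = k) = \frac{1}{k\,H_n}, \qquad k = 1, 2, \ldots, n,$$
where $n$ is the number of distinct words, $k$ is the frequency rank of a word, and $H_n$ is a normalisation constant (the $n$th harmonic number), given by:
$$H_n = \sum_{i=1}^{n} \frac{1}{i}.$$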
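The following is a minimal Python sketch of the probability mass function. It assumes the exponent-1 form, in which the probability of rank k among n distinct words is 1/(k H_n), with H_n the nth harmonic number. The function names `harmonic_number` and `zipf_pmf` are illustrative and do not come from any library.

```python
from fractions import Fraction


def harmonic_number(n):
    """Return H_n = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


def zipf_pmf(k, n):
    """Probability of rank k among n distinct words (exponent 1 assumed)."""
    if not 1 <= k <= n:
        return Fraction(0)  # ranks outside 1..n have zero probability
    return 1 / (k * harmonic_number(n))


n = 5
probs = [zipf_pmf(k, n) for k in range(1, n + 1)]

# Probabilities fall off as 1/k, so rank 1 is twice as likely as rank 2.
print([round(float(p), 4) for p in probs])

# With exact fractions the probabilities sum to exactly 1.
print(sum(probs) == 1)
```

Exact fractions are used here only so that the normalisation can be checked without floating-point error; for large n a floating-point sum would be the more practical choice.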
A Zipf random variable has the property that a few outcomes (words) occur frequently, while most outcomes occur rarely. It is also used in studies of the Internet and interconnectivity.
Computations
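Writing $n$ for the number of distinct words and $H_n$ for the $n$th harmonic number (the normalisation constant), the expected value is given by:
$$E[X] = \sum_{k=1}^{n} k \cdot \frac{1}{k\,H_n} = \frac{n}{H_n}.$$
The second moment is given by:
$$E[X^2] = \sum_{k=1}^{n} k^2 \cdot \frac{1}{k\,H_n} = \frac{n(n+1)}{2H_n}.$$
And the variance:
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2 = \frac{n(n+1)}{2H_n} - \frac{n^2}{H_n^2}.$$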
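The closed forms for the moments can be checked numerically. The sketch below assumes the exponent-1 Zipf law on ranks 1 to n, for which the mean is n/H_n and the second moment is n(n+1)/(2 H_n), with H_n the nth harmonic number. The helper name `zipf_moments` is hypothetical.

```python
from fractions import Fraction


def zipf_moments(n):
    """Return (mean, second moment, variance, H_n) by direct summation.

    Assumes P(X = k) = 1 / (k * H_n) for k = 1..n (exponent-1 Zipf law).
    """
    h_n = sum(Fraction(1, i) for i in range(1, n + 1))
    pmf = {k: 1 / (k * h_n) for k in range(1, n + 1)}
    mean = sum(k * p for k, p in pmf.items())
    second = sum(k * k * p for k, p in pmf.items())
    variance = second - mean ** 2
    return mean, second, variance, h_n


n = 10
mean, second, variance, h_n = zipf_moments(n)

# Compare the direct sums against the closed-form expressions.
print(mean == n / h_n)
print(second == Fraction(n * (n + 1), 2) / h_n)
print(variance == Fraction(n * (n + 1), 2) / h_n - (n / h_n) ** 2)
print(float(mean), float(variance))
```

Because the sums are carried out with exact fractions, the three comparisons test the identities themselves rather than agreement up to rounding.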