Entropy and Perplexity
Entropy
Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X$ is defined as:

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$

$P(x)$ is the probability distribution of $X$. The self-information of $x$ is defined as $I(x) = -\log_2 P(x)$, which measures how much information is gained when $x$ occurs. The negative sign indicates that as the probability of $x$ occurring increases, its self-information value decreases.
Entropy has several properties, including:
- It is non-negative: $H(X) \geq 0$.
- It is at its minimum when $X$ is entirely predictable (all probability mass on a single outcome).
- It is at its maximum when all outcomes of $X$ are equally likely.
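The following minimal Python sketch (the function names and toy word distributions are made up for illustration) computes self-information and entropy as defined above, and demonstrates the two extreme cases just listed: a fully predictable distribution yields $H(X) = 0$, while a uniform distribution over four outcomes yields the maximum $\log_2 4 = 2$ bits.

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2 P(x) of an outcome with probability p."""
    return -math.log2(p)

def entropy(dist: dict) -> float:
    """Entropy H(X) = -sum_x P(x) * log2 P(x); outcomes with P(x) = 0 contribute nothing."""
    return sum(p * self_information(p) for p in dist.values() if p > 0)

# Entirely predictable: all probability mass on a single outcome -> H(X) = 0
print(entropy({"the": 1.0, "a": 0.0}))                               # 0.0

# All outcomes equally likely -> maximum entropy, log2(4) = 2 bits
print(entropy({"the": 0.25, "a": 0.25, "an": 0.25, "this": 0.25}))   # 2.0

# A skewed distribution falls in between
print(entropy({"the": 0.7, "a": 0.2, "an": 0.1}))                    # ~1.157
```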
Q10: Why is a logarithmic scale used to measure self-information in entropy calculations?
Sequence Entropy
Sequence entropy is a measure of the unpredictability or information content of a word sequence; it quantifies how uncertain or random the sequence is.
Assume a long sequence of words, $W = \{w_1, \ldots, w_n\}$, concatenating the entire text from a language $L$. Let $S = \{s_1, \ldots, s_k\}$ be the set of all possible sequences derived from $W$, where $s_1$ is the shortest sequence (a single word) and $s_k$ is the longest sequence. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{i=1}^{k} P(s_i) \log_2 P(s_i)$$

The entropy rate (per-word entropy), $H_r(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H_r(W) = \frac{H(W)}{n} = -\frac{1}{n} \sum_{i=1}^{k} P(s_i) \log_2 P(s_i)$$
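Enumerating every sequence in $S$ is infeasible for a real corpus, but a minimal sketch over a toy, hand-assigned distribution (the sequences and probabilities below are hypothetical, chosen only for illustration) shows how $H(W)$ and the entropy rate $H_r(W)$ relate:

```python
import math

# W = ["the", "cat", "sat"]; S below lists the sequences derived from W, from the
# shortest (a single word) to the longest (W itself), with hypothetical probabilities.
sequence_probs = {
    ("the",): 0.20,
    ("cat",): 0.15,
    ("sat",): 0.10,
    ("the", "cat"): 0.20,
    ("cat", "sat"): 0.15,
    ("the", "cat", "sat"): 0.20,
}

n = 3  # total number of words in W

# H(W) = -sum_i P(s_i) * log2 P(s_i)
H_W = -sum(p * math.log2(p) for p in sequence_probs.values())

# Entropy rate (per-word entropy): H_r(W) = H(W) / n
H_r = H_W / n

print(f"H(W)   = {H_W:.3f} bits")
print(f"H_r(W) = {H_r:.3f} bits per word")
```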
In theory, there is an infinite number of unobserved word sequences in the language $L$. To estimate the true entropy of $L$, we need to take the limit of $H_r(W)$ as $n$ approaches infinity:

$$H(L) = \lim_{n \to \infty} H_r(W) = -\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{k} P(s_i) \log_2 P(s_i)$$
The Shannon-McMillan-Breiman theorem implies that if the language $L$ is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(L)$, because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its respective probability.
☝️ The bigram model in the previous section is stationary because all probabilities rely on the same condition, the previous word $w_{i-1}$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, $H(L)$ can be approximated:

$$H(L) \approx \lim_{n \to \infty} -\frac{1}{n} \log_2 P(w_1, \ldots, w_n)$$

Consequently, $H_r(W)$ is approximated as follows, where $n$ is sufficiently large:

$$H_r(W) \approx -\frac{1}{n} \log_2 P(w_1, \ldots, w_n)$$
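With this approximation, the entropy rate can be estimated from the probability a model assigns to a single long sequence. Below is a minimal sketch assuming a hypothetical bigram model stored as nested dictionaries of conditional probabilities (the vocabulary, probabilities, and function name are made up for illustration and are not taken from the previous section):

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sequence start.
bigram = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.3, "dog": 0.2, "</s>": 0.5},
    "cat": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
}

def approx_entropy_rate(words: list[str]) -> float:
    """H_r(W) ~= -(1/n) * log2 P(w1, ..., wn), scored by the bigram model."""
    n = len(words)
    log_prob = 0.0
    prev = "<s>"
    for w in words:
        log_prob += math.log2(bigram[prev][w])
        prev = w
    return -log_prob / n

W = ["the", "cat", "sat", "down"]   # P(W) = 0.6 * 0.3 * 0.5 * 1.0 = 0.09
print(f"H_r(W) ~= {approx_entropy_rate(W):.3f} bits per word")
```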
Q11: What indicates high entropy in a text corpus?
Perplexity
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W = \{w_1, \ldots, w_n\}$ is measured as:

$$PP(W) = P(w_1, \ldots, w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1, \ldots, w_n)}}$$
Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.
Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H_r(W)$:

$$PP(W) = 2^{H_r(W)} = 2^{-\frac{1}{n} \log_2 P(w_1, \ldots, w_n)} = P(w_1, \ldots, w_n)^{-\frac{1}{n}}$$
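To make the relationship concrete, here is a minimal sketch (continuing the hypothetical numbers from the earlier bigram example, where a 4-word sequence receives probability 0.09) showing that computing $PP(W)$ directly from $P(w_1, \ldots, w_n)$ and computing it as $2^{H_r(W)}$ give the same value:

```python
import math

def perplexity_direct(prob: float, n: int) -> float:
    """PP(W) = P(w1, ..., wn) ** (-1/n)."""
    return prob ** (-1.0 / n)

def perplexity_from_entropy_rate(h_r: float) -> float:
    """PP(W) = 2 ** H_r(W), where H_r(W) ~= -(1/n) * log2 P(w1, ..., wn)."""
    return 2.0 ** h_r

# Hypothetical values: a 4-word sequence W with P(W) = 0.09 (from the sketch above).
prob, n = 0.09, 4
h_r = -math.log2(prob) / n

print(perplexity_direct(prob, n))           # ~1.826
print(perplexity_from_entropy_rate(h_r))    # ~1.826 (identical)
```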
Q12: What is the relationship between corpus entropy and language model perplexity?