I first learned about entropy in physics, where, in thermodynamics, it measures how energy disperses at a given temperature. Later I encountered the term information entropy in communication theory, and in machine learning it is widely used as a representation of uncertainty. Here we are talking about Shannon entropy, defined as $-\sum_{i=1}^K p_i \log_2 p_i$. There are other measures of uncertainty, but people usually choose Shannon entropy for its good properties [1].
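As a quick sketch of the definition, here is a minimal Python function (the `0 log 0 = 0` convention is handled by skipping zero-probability terms):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin → 1.0 bit
print(shannon_entropy([0.25] * 4))   # uniform over 4 outcomes → 2.0 bits
print(shannon_entropy([0.9, 0.1]))   # skewed coin → ≈ 0.469 bits
```

A fair coin carries exactly one bit of uncertainty, while a skewed coin carries less, matching the intuition that entropy measures how unpredictable the outcome is.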
The following are mainly summarized and extended from [1].
This can be proved using the weighted AM–GM inequality [2]: $$\frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w} \geq \sqrt[w]{x_1^{w_1} x_2^{w_2} \cdots x_n^{w_n}},$$ by letting $\frac{w_i}{w} = p_i$ and $x_i = \frac{1}{p_i}$.
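Spelling out the substitution (assuming all $p_i > 0$): the left-hand side becomes $$\sum_{i=1}^K \frac{w_i}{w} x_i = \sum_{i=1}^K p_i \cdot \frac{1}{p_i} = K,$$ while the right-hand side becomes $$\prod_{i=1}^K x_i^{w_i / w} = \prod_{i=1}^K p_i^{-p_i} = 2^{-\sum_{i=1}^K p_i \log_2 p_i} = 2^{H(p)}.$$ So $K \geq 2^{H(p)}$, i.e. $H(p) \leq \log_2 K$, with equality exactly when all $x_i$ are equal, that is, when $p_i = \frac{1}{K}$ for every $i$: the uniform distribution maximizes entropy.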
To formulate this property in math equations, we have $H(X, Y) = H(X) + H(Y)$ if $X \perp Y$.
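A quick numerical sanity check of this additivity property: for two independent variables, the joint distribution is the outer product of the marginals, and the joint entropy equals the sum of the marginal entropies. The distributions below are arbitrary examples.

```python
import math
from itertools import product

def shannon_entropy(probs):
    # Terms with p == 0 contribute 0 by convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two independent distributions; the joint is the outer product of the marginals.
px = [0.5, 0.3, 0.2]
py = [0.7, 0.3]
joint = [p * q for p, q in product(px, py)]

print(shannon_entropy(joint))                     # H(X, Y)
print(shannon_entropy(px) + shannon_entropy(py))  # H(X) + H(Y): same value
```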
Another function, $-\sum_{i=1}^K p_i^2$, satisfies the first property but not this one. That is why the trace of a covariance matrix, as a representation of uncertainty, may not be as good as entropy.
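We can check this failure numerically. For two independent fair coins, $-\sum_i p_i^2$ over the joint distribution does not equal the sum of the two marginal values:

```python
from itertools import product

def neg_sum_squares(probs):
    # Candidate uncertainty measure: -sum(p_i^2); larger means more uncertain.
    return -sum(p * p for p in probs)

px = [0.5, 0.5]
py = [0.5, 0.5]
joint = [p * q for p, q in product(px, py)]  # four outcomes of 0.25 each

print(neg_sum_squares(joint))                     # -0.25
print(neg_sum_squares(px) + neg_sum_squares(py))  # -1.0: additivity fails
```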
$$H(p_1, p_2, \dots, p_n) = H(p_1, p_2, \dots, p_n, p_{n+1} = 0)$$
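This expansibility property is easy to verify: appending a zero-probability outcome leaves the entropy unchanged, because $p \log p \to 0$ as $p \to 0$.

```python
import math

def shannon_entropy(probs):
    # Skipping p == 0 terms implements the lim p->0 of p*log(p) = 0 convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.6, 0.4]))       # ≈ 0.971
print(shannon_entropy([0.6, 0.4, 0.0]))  # same: the impossible outcome adds nothing
```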
Some other measures also satisfy this property, such as the trace or the determinant of a covariance matrix.
Note: there is a Uniqueness Theorem [1]
Khinchin (1957) showed that the only family of functions satisfying the four basic properties described above has the form $$H(p_1, p_2, \dots, p_K) = -\lambda \sum_{i=1}^K p_i \log p_i,$$ where $\lambda$ is a positive constant. Khinchin referred to this as the Uniqueness Theorem. Setting $\lambda = 1$ and using the binary logarithm gives us the Shannon entropy. To reiterate, entropy is used because it has desirable properties and is the natural choice among the family of functions that satisfy all items on the basic wish list (properties 1–4).
Beyond the four basic properties discussed above, there are some other interesting facts about entropy that I will explore later.
[1] Entropy is a measure of uncertainty, Sebastian Kwiatkowski

[2] Inequality of arithmetic and geometric means, Wikipedia