# Statistical analysis of finite mixture distributions - Smith A.F.M

ISBN 0-470-90763-4


$p(x,c) = p(x \mid c, \theta)\,p(c \mid \pi).$  (5.7.3)

Thus the prior probabilities of class membership and the conditional sampling densities of x are modelled. The marginal density for x is of the mixture form and the probabilities of interest, from the point of view of discriminant analysis, are obtained, by Bayes’ theorem, as

$p(c \mid x) = p(x \mid c, \theta)\,p(c \mid \pi) \Big/ \sum_{c'} p(x \mid c', \theta)\,p(c' \mid \pi), \qquad c = 1, \ldots, k.$
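As a concrete illustration, the posterior class probabilities can be computed directly from the priors and the class-conditional densities. The sketch below uses univariate normal class densities with hypothetical parameter values; only the standard library is assumed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(x, priors, mus, sigmas):
    """p(c | x): prior times class-conditional density, normalized by the
    mixture density p(x), exactly as in the Bayes formula above."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, mus, sigmas)]
    total = sum(joint)  # the mixture density p(x)
    return [j / total for j in joint]

# Hypothetical two-class example: class 2 has the larger prior
post = posterior(0.9, priors=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 1.0])
```

The denominator is precisely the mixture density for x, which is why modelling the mixture well is equivalent to knowing the discriminant rule.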

(b) Diagnostic Paradigm

In contrast to (a), we model the factors of (5.7.2):

$p(x,c) = p(c \mid x, \beta)\,p(x \mid \gamma).$  (5.7.4)

A popular example is the linear logistic model, in which

$p(c \mid x, \beta)/p(k \mid x, \beta) = \exp(\beta_c^{\mathsf{T}} x), \qquad c = 1, \ldots, k-1.$
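With class k taken as the reference class (its coefficient vector set to zero), the linear logistic model amounts to a softmax over linear scores. A minimal sketch with invented coefficients:

```python
import math

def logistic_posteriors(x, betas):
    """Class probabilities under the linear logistic model: class k is the
    reference class, with p(c | x) / p(k | x) = exp(beta_c . x) for
    c = 1, ..., k-1 and beta_k = 0."""
    scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in betas]
    scores.append(0.0)  # reference class k
    z = sum(math.exp(s) for s in scores)  # normalizing constant
    return [math.exp(s) / z for s in scores]

# Hypothetical two-class example (k = 2, one coefficient vector)
probs = logistic_posteriors([1.0, 0.5], betas=[[0.4, -0.2]])
```

Note that only the posteriors p(c | x) are parameterized; the marginal density of x plays no role, which is the crux of the contrast drawn below.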

Of course, given a parametric model for (5.7.3), there is an equivalent one for (5.7.4) with $\beta$ and $\gamma$ as functions of $(\theta, \pi)$. Generally, however, the two approaches produce different models for $p(x,c)$. Typically, in the sampling paradigm, $\theta$ and $\pi$ are chosen to be distinct parameters and, in the diagnostic paradigm, $\beta$ and $\gamma$ are distinct. Sometimes $\gamma$ is not considered at all. It is undoubtedly true that, so far, the sampling paradigm has been extremely popular. However, Dawid (1976) argues that, in some contexts, it may be more realistic to model the factors in the diagnostic paradigm. Often the results achieved by corresponding versions of the two paradigms are quite similar (Efron, 1975), but there is a disturbing qualitative difference between the two paradigms so far as the usefulness of uncategorized cases in discrimination is concerned. We shall examine this problem using a likelihood-based approach.

If the diagnostic paradigm is used, with $\beta$ and $\gamma$ distinct, then the uncategorized data $\{x_t\}$ contribute factors $\{p(x_t \mid \gamma)\}$. They do not give us any information at all about $\beta$, and therefore about $\{p(c \mid x, \beta)\}$. On the other hand, if the sampling paradigm is used, and is the correct model, the uncategorized data provide information about the mixture model for $p(x)$. If the mixture is identifiable, and we have a very large amount of data available, then we are close to knowing the true mixture density and, therefore, the true ‘optimal’ likelihood ratio discriminant rules! This happy eventuality occurs even in the extreme case of M0 data, in which there is no fully categorized observation!
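The point that uncategorized data alone can recover an identifiable mixture can be illustrated with a toy EM fit. The sketch below fits a two-component univariate normal mixture to purely unlabelled data, using a common variance for brevity; the simulated sample and all starting values are hypothetical, not from the text.

```python
import math, random

def em_two_normals(xs, iters=200):
    """EM for a two-component normal mixture fitted to unlabeled data only.

    Returns (pi, mu1, mu2, sigma).  With a common sigma, the normal
    normalizing constants cancel in the responsibilities, so only the
    exponents are needed."""
    mu1, mu2 = min(xs), max(xs)  # crude but adequate starting values
    pi, sigma = 0.5, 1.0
    for _ in range(iters):
        # E-step: responsibility of component 2 for each point
        r = []
        for x in xs:
            a = pi * math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
            b = (1.0 - pi) * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            r.append(a / (a + b))
        # M-step: responsibility-weighted parameter estimates
        n2 = sum(r)
        pi = n2 / len(xs)
        mu2 = sum(ri * x for ri, x in zip(r, xs)) / n2
        mu1 = sum((1.0 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n2)
        var = sum(ri * (x - mu2) ** 2 + (1.0 - ri) * (x - mu1) ** 2
                  for ri, x in zip(r, xs)) / len(xs)
        sigma = math.sqrt(var)
    return pi, mu1, mu2, sigma

random.seed(0)
xs = ([random.gauss(0.0, 1.0) for _ in range(300)]
      + [random.gauss(4.0, 1.0) for _ in range(300)])
pi, mu1, mu2, sigma = em_two_normals(xs)
```

Although no observation carries a class label, the fitted parameters approach the true ones, and with them the optimal likelihood-ratio rule, just as the argument above claims.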

We are left in practice with a real dilemma. If we adopt the diagnostic paradigm when it is false, we shall lose valuable information; if we wrongly use the sampling paradigm, we are unnecessarily incorporating useless data.


Usually, the two sets of ‘distinct’ parameterizations are not equivalent. An exception is the case where the sample space for the feature variable is multinomial. In this case, the complete-data sample space is that of a $k \times l$ contingency table, where $k$ is the number of classes and $l$ is the number of cells in the feature variable’s multinomial distribution. The cell probabilities $\{\theta_{jl}\}$ have the two equivalent parameterizations defined as in Example 4.3.8, one representing each paradigm. Whichever paradigm is used for maximum likelihood estimation (the diagnostic paradigm is by far the easier, as indicated earlier), the discriminatory information is concentrated in the fully categorized data. Although we have talked in terms of ‘parameterizations’ for this example, it is worth bearing in mind the fact that this model can be regarded as the non-parametric model for discrete data and this is the real reason for the equivalence.
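The equivalence of the two factorizations of the joint cell probabilities can be checked numerically for a small contingency table. The 2 x 3 table below is hypothetical, invented purely for illustration:

```python
# Hypothetical joint cell probabilities theta[j][l] for a k x l table
# (k = 2 classes, l = 3 feature cells); entries sum to 1.
theta = [[0.10, 0.25, 0.15],
         [0.20, 0.05, 0.25]]

# Sampling-paradigm factors: p(c) and p(x | c)
p_c = [sum(row) for row in theta]
p_x_given_c = [[t / pc for t in row] for row, pc in zip(theta, p_c)]

# Diagnostic-paradigm factors: p(x) and p(c | x)
p_x = [sum(theta[j][l] for j in range(2)) for l in range(3)]
p_c_given_x = [[theta[j][l] / p_x[l] for l in range(3)] for j in range(2)]

# Both factorizations recover the same joint cell probabilities
for j in range(2):
    for l in range(3):
        assert abs(p_c[j] * p_x_given_c[j][l] - theta[j][l]) < 1e-12
        assert abs(p_x[l] * p_c_given_x[j][l] - theta[j][l]) < 1e-12
```

Because the saturated multinomial places no constraints on the $\theta_{jl}$, either factorization is merely a relabelling of the same cell probabilities, which is the non-parametric equivalence the text refers to.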

If the sampling paradigm is appropriate, so that the underlying model for the uncategorized cases is a parametric mixture model, then, given identifiability of the mixture, it does appear to be worth while to use a discriminant rule which includes them. We shall consider in detail the familiar problem in which there are two classes and, within class j,

$x \sim N_d(\mu_j, \Sigma_j), \qquad j = 1, 2,$

where d denotes the dimensionality of the feature vector, x. The marginal distribution of x is therefore that of a mixture of the two multivariate normal distributions.

With the assumption of equal covariance matrices and given mixing weights, or incidence rates, $\pi_1, \pi_2$, the rule for assigning an unconfirmed feature vector $y$ is based on the linear discriminant function (LDF)

$L(y) = \delta^{\mathsf{T}} y + \beta,$  (5.7.5)

where

$\delta = \Sigma^{-1}(\mu_2 - \mu_1), \qquad \beta = \log(\pi_2/\pi_1) - \tfrac{1}{2}\,\delta^{\mathsf{T}}(\mu_2 + \mu_1)$

(cf. Example 4.3.4).

If L(y)>0, y is assigned to the second category and, otherwise, to the first.
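A minimal sketch of (5.7.5), computing delta and beta from hypothetical parameter values and evaluating the assignment rule; plain lists stand in for vectors and matrices to keep it dependency-free:

```python
import math

def ldf(y, mu1, mu2, cov_inv, pi1, pi2):
    """Linear discriminant function L(y) = delta'y + beta of (5.7.5), with
    delta = Sigma^{-1}(mu2 - mu1) and
    beta  = log(pi2/pi1) - 0.5 * delta'(mu2 + mu1)."""
    d = len(y)
    diff = [mu2[i] - mu1[i] for i in range(d)]
    delta = [sum(cov_inv[i][j] * diff[j] for j in range(d)) for i in range(d)]
    beta = (math.log(pi2 / pi1)
            - 0.5 * sum(delta[i] * (mu2[i] + mu1[i]) for i in range(d)))
    return sum(delta[i] * y[i] for i in range(d)) + beta

# Hypothetical bivariate example with identity covariance and equal priors;
# assign y to class 2 iff L(y) > 0.
L = ldf([1.5, 1.5], mu1=[0.0, 0.0], mu2=[2.0, 2.0],
        cov_inv=[[1.0, 0.0], [0.0, 1.0]], pi1=0.5, pi2=0.5)
```

Here the point (1.5, 1.5) lies closer to the second class mean, so L is positive and the rule assigns it to class 2.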

In practice, of course, the parameters are unknown. Usually, estimates are substituted for $\pi_1, \pi_2, \mu_1, \mu_2$ and $\Sigma$ and the resulting discriminant function

$\hat{L}(y) = \hat{\delta}^{\mathsf{T}} y + \hat{\beta}$  (5.7.6)

is used in the same manner as $L(y)$.

For any assignment rule, its efficiency may be judged on the basis of the consequent expected error rate, that is, the expected value of the probability of misclassifying a random uncategorized case. If all parameters are known and the LDF (5.7.5) is used, this probability is easy to evaluate and is given by

$\pi_1 \Phi(-\tfrac{1}{2}\Delta + \lambda/\Delta) + \pi_2 \Phi(-\tfrac{1}{2}\Delta - \lambda/\Delta),$  (5.7.7)

where $\lambda = \log(\pi_2/\pi_1)$ and $\Delta^2 = (\mu_1 - \mu_2)^{\mathsf{T}} \Sigma^{-1}(\mu_1 - \mu_2)$, the squared Mahalanobis distance.
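Expression (5.7.7) is easy to evaluate once the standard normal CDF is written in terms of the error function. The sketch below uses hypothetical values for the incidence rates and the Mahalanobis distance:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ldf_error_rate(pi1, pi2, delta2):
    """Expected error rate (5.7.7) of the known-parameter LDF.

    delta2 is the squared Mahalanobis distance; lam = log(pi2/pi1)."""
    d = math.sqrt(delta2)
    lam = math.log(pi2 / pi1)
    return pi1 * phi(-0.5 * d + lam / d) + pi2 * phi(-0.5 * d - lam / d)

# Equal priors, Delta = 2: the rate reduces to Phi(-Delta/2) = Phi(-1)
rate = ldf_error_rate(0.5, 0.5, delta2=4.0)
```

With equal priors the two terms coincide, and the error rate falls monotonically as the Mahalanobis distance between the class means grows.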
