# Statistical analysis of mixture distribution - Smith A.F.M

ISBN 0-470-90763-4


$$L_0(\psi) = \prod_{i=1}^{n} p(x_i \mid \psi) = \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j f_j(x_i \mid \theta_j). \tag{1.2.1}$$

In some direct applications of finite mixture models, in addition to the random sample from the mixture distribution there may also be random samples available of observations known to derive from individual underlying categories. For example, in studies of fish populations (Example 2.1.2) we may have samples of fish lengths from fish of known age, in addition to a sample of lengths from the mixed population.

If we denote such additional data by

$$\{x_{jh};\ j = 1, \ldots, k,\ h = 1, \ldots, n_j\},$$

where at least one of the $n_j$ is non-zero, then the overall likelihood provided by both the uncategorized and categorized observations is given by

$$L_1(\psi) = L_0(\psi) \prod_{j=1}^{k} \prod_{h=1}^{n_j} f_j(x_{jh} \mid \theta_j). \tag{1.2.2}$$

Moreover, if the categorized observations can be assumed to arise independently, with incidence rates $\pi_1, \ldots, \pi_k$ for the individual categories, then this provides further information about the mixing weights and the appropriate likelihood is

$$L_2(\psi) = L_1(\psi) \prod_{j=1}^{k} \pi_j^{n_j}. \tag{1.2.3}$$


As we shall see later, it is important to clarify which of (1.2.1), (1.2.2), or (1.2.3) is appropriate in any particular application. In particular, if information is available about categorized observations it is important to use (1.2.2) rather than (1.2.1), since the additional information in the former may be substantial. On the other hand, information about categorized observations is often obtained by selecting $n_1, \ldots, n_k$, in which case (1.2.3) is not applicable, since counts fixed by design carry no information about the mixing weights.

We shall adopt Hosmer's (1973a) notation M0, M1 and M2 for the three data structures giving rise to $L_0$, $L_1$ and $L_2$, respectively.
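The three likelihoods differ only in which factors are multiplied together, which is easiest to see on the log scale. The following is a minimal sketch, assuming univariate normal component densities with parameters $\theta_j = (\mu_j, \sigma_j)$; the function names `log_L0`, `log_L1` and `log_L2` are hypothetical and chosen here only to mirror (1.2.1) to (1.2.3).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x (assumed component form)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_L0(x, weights, params):
    """Log of (1.2.1): sum_i log sum_j pi_j f_j(x_i | theta_j)."""
    return sum(
        math.log(sum(w * normal_pdf(xi, mu, s)
                     for w, (mu, s) in zip(weights, params)))
        for xi in x
    )

def log_L1(x, categorized, weights, params):
    """Log of (1.2.2): log L0 plus log densities of the categorized data.

    categorized[j] holds the n_j observations known to come from component j
    (an empty list when n_j = 0).
    """
    extra = sum(
        math.log(normal_pdf(xjh, mu, s))
        for sample, (mu, s) in zip(categorized, params)
        for xjh in sample
    )
    return log_L0(x, weights, params) + extra

def log_L2(x, categorized, weights, params):
    """Log of (1.2.3): log L1 plus sum_j n_j log pi_j."""
    counts = sum(len(sample) * math.log(w)
                 for sample, w in zip(categorized, weights) if sample)
    return log_L1(x, categorized, weights, params) + counts

# Illustrative (made-up) data: three uncategorized and three categorized values.
x = [0.1, 2.3, -0.4]
weights = [0.3, 0.7]
params = [(0.0, 1.0), (2.0, 0.5)]
categorized = [[0.2, -0.1], [1.9]]
print(log_L0(x, weights, params),
      log_L1(x, categorized, weights, params),
      log_L2(x, categorized, weights, params))
```

With no categorized observations, `log_L1` reduces to `log_L0`, and the gap between `log_L2` and `log_L1` is exactly the term $\sum_j n_j \log \pi_j$, which is the only place the categorized counts inform the mixing weights.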

In the remaining sections, we shall briefly indicate some of the general statistical problems that arise in the context of finite mixture models. These, of course, are determined in any particular context by the form of application and our degrees of ignorance about the various features of the model.

1.2.2 The number of components

In many contexts, uncertainty about the number of components leads to statistical problems closely related to cluster analysis, possibly with strong assumptions about parametric structure. For example, it is often assumed that the densities are normal (univariate or multivariate, as appropriate). In other cases, a considerable amount of data is available from the mixture, so that, in effect, we ‘know’ the form of p(x) in (1.1.1). Given parametric assumptions about the underlying components, the problem then becomes one of curve fitting.

Sometimes, we are concerned to find the mixture with the fewest components that still provides a satisfactory fit to the data. In particular, we commonly wish to know whether to assume that there are two underlying components or just a single component. Given a particular, proposed mixture model with assumed parametric forms for the component densities, a hypothesized mixture having fewer components may be regarded as the imposition of a ‘null hypothesis’ on the original model framework. The problem of comparing the two mixture models may therefore seem to be that of testing between nested hypotheses.

However, it soon becomes clear that traditional testing recipes are not so easily applied in this context. Consider, for example, the apparently straightforward problem of testing between the following hypotheses:

$$H_0: p(x) = \phi(x \mid \mu, \sigma),$$

$$H_1: p(x) = \pi\,\phi(x \mid \mu_1, \sigma_1) + (1 - \pi)\,\phi(x \mid \mu_2, \sigma_2),$$

where $x \in \mathbb{R}$, and the parameters under both $H_0$ and $H_1$ are assumed unknown. In other words, we are simply asking whether there is a single normal component, or two.

If we were thinking in traditional terms, it would be natural, particularly if a large sample were available, to consider applying the generalized likelihood ratio test, referring, for significance, to a table of the $\chi^2_r$ distribution, where $r$ is equal to the number of constraints imposed on $H_1$ in order to produce $H_0$. This leads to immediate difficulties, since there is not a unique way of obtaining $H_0$ from $H_1$! We could impose


$$\pi = 0 \quad \text{(1 constraint)}$$

or

$$\mu_1 = \mu_2,\ \sigma_1 = \sigma_2 \quad \text{(2 constraints)}.$$

What are the appropriate degrees of freedom for the chi-squared distribution? Or perhaps—as we shall see in Section 5.4—we should not be trying to apply the traditional procedure at all.
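The statistic itself is easy to compute; it is only the reference distribution that is in doubt. As a rough sketch, the code below maximizes the likelihood under $H_0$ in closed form and under $H_1$ by a basic EM iteration, then returns twice the difference in maximized log-likelihoods. The EM scheme, the starting values, the variance floor `min_sigma`, and all function names are assumptions made for illustration, not a procedure taken from the text.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_mix(xi, pi, mu1, s1, mu2, s2):
    """Log of pi*phi(x|mu1,s1) + (1-pi)*phi(x|mu2,s2), via log-sum-exp."""
    a = math.log(pi) + normal_logpdf(xi, mu1, s1)
    b = math.log(1.0 - pi) + normal_logpdf(xi, mu2, s2)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def loglik_h0(x):
    """Maximized log-likelihood under H0 (single normal, closed-form MLE)."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / n)
    return sum(normal_logpdf(xi, mu, sigma) for xi in x)

def fit_h1(x, iters=200, min_sigma=1e-3, eps=1e-9):
    """Approximate MLE under H1 by EM (assumed scheme; may find a local maximum)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)
    # Assumed starting values: equal weights, means one sd either side.
    pi, mu1, s1, mu2, s2 = 0.5, mean - sd, sd, mean + sd, sd
    for _ in range(iters):
        # E-step: posterior probability that x_i came from component 1.
        r = [math.exp(math.log(pi) + normal_logpdf(xi, mu1, s1)
                      - log_mix(xi, pi, mu1, s1, mu2, s2)) for xi in x]
        # M-step: weighted means, standard deviations and mixing weight.
        n1 = max(sum(r), eps)
        n2 = max(n - n1, eps)
        mu1 = sum(ri * xi for ri, xi in zip(r, x)) / n1
        mu2 = sum((1.0 - ri) * xi for ri, xi in zip(r, x)) / n2
        s1 = max(math.sqrt(sum(ri * (xi - mu1) ** 2 for ri, xi in zip(r, x)) / n1), min_sigma)
        s2 = max(math.sqrt(sum((1.0 - ri) * (xi - mu2) ** 2 for ri, xi in zip(r, x)) / n2), min_sigma)
        pi = min(max(n1 / n, eps), 1.0 - eps)
    return pi, mu1, s1, mu2, s2

def glr_statistic(x):
    """Twice the log-likelihood ratio of the fitted H1 against the fitted H0."""
    params = fit_h1(x)
    l1 = sum(log_mix(xi, *params) for xi in x)
    return 2.0 * (l1 - loglik_h0(x))

# Illustrative (made-up) sample with two well-separated groups.
x = [-3.2, -2.9, -3.1, -2.8, -3.0, -3.3, -2.7,
     2.9, 3.1, 3.0, 2.8, 3.2, 3.3, 2.7]
print(glr_statistic(x))
```

The function deliberately stops at the statistic and returns no p-value: choosing between one and two degrees of freedom is exactly the ambiguity raised above, so any significance level would rest on an unsupported choice.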

1.2.3 Unknown parameters

Assuming a given value of k (even if only as a provisional step in an analysis pertaining to questions raised in Section 1.2.2), we shall need, in a parametric formulation, to make inferences about the unknown parameters of the mixture model. Several different cases arise.

In some direct applications, it is possible to carry out detailed studies of the individual component distributions separately from the mixture problem. Thus, for example, the forms of grain size distributions for various individual minerals can be established in the laboratory as a preliminary to the analysis of an ash deposit composed of a mixture of minerals (Example 2.1.1). In the context of
