# Statistical analysis of finite mixture distributions - Smith A.F.M

ISBN 0-470-90763-4


$$A_1 = [\,\cdots\,], \qquad C_1 = \sigma^2 I_n, \qquad \theta_1^{\mathsf T} = (\mu, \lambda), \qquad \theta_2 = [\,\cdots\,], \qquad \ldots$$

(c) Inflated-variance model

With the same prior specification for $\pi$ as in the previous models, the inflated-variance model of Box and Tiao (1968) corresponds, in the notation of (4.4.6), to

$$A_1 = \mathbf{1}_n, \qquad C_1 = \sigma^2 \begin{pmatrix} I_{n-r} & 0 \\ 0 & k^2 I_r \end{pmatrix}, \qquad A_2 = 1, \qquad C_2 = k'^2 \sigma^2, \qquad \theta_2 = \mu_0.$$

If we impose no constraints on the possible number of outliers in models such as these, the number of possible identifications of subsets of $x_1, \ldots, x_n$ as outliers may prove computationally prohibitive. In practice, however, it might be more realistic to allow for up to only 10 or 20 per cent of the sample to be outliers and therefore to limit the number of possible identifications.
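To give a sense of scale, the short Python sketch below counts the candidate identifications when at most a fixed number of the $n$ observations may be outliers, and compares the total with the unrestricted $2^n$. The function name `count_identifications` is illustrative and not from the text; $n = 15$ is chosen to match the plant-height data of Example 4.4.1.

```python
from math import comb

def count_identifications(n: int, max_outliers: int) -> int:
    """Number of subsets of n observations containing at most max_outliers points."""
    return sum(comb(n, r) for r in range(max_outliers + 1))

n = 15
print(2 ** n)                        # unrestricted: 32768
print(count_identifications(n, 3))   # at most three outliers (20 per cent): 576
print(count_identifications(n, 2))   # at most two outliers: 121
```

Capping the number of outliers at 20 per cent of this small sample already cuts the number of identifications by a factor of more than fifty.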

If we denote the class of possible identifications of subsets of observations as outliers that we wish to consider by $J_s$, $s = 1, \ldots, M < 2^n$, then overall inference for unknown parameters $\psi$ is based on

$$p(\psi \mid x) = \sum_{s=1}^{M} p(\psi \mid x, J_s)\, p(J_s \mid x). \tag{4.4.7}$$

In this mixture posterior form, $p(\psi \mid x, J_s)$ represents the inference to be made about $\psi$, were it to be assumed that a particular subset of the observations (as defined by $J_s$) are outliers; the weight factor $p(J_s \mid x)$ represents the posterior plausibility of the assumed identification $J_s$. If $k_r$ denotes the set of $J_s$ which specify a total of $r$ outliers, then

$$p(r \text{ outliers} \mid x) = \sum_{J_s \in k_r} p(J_s \mid x), \qquad r = 0, 1, \ldots, R, \tag{4.4.8}$$

forms the basis for a Bayesian assessment of the existence and number of outliers (assuming an upper bound of $R$). Plots of joint or marginal densities based on $p(\psi \mid x, J_s)$ for various $s$ provide insights into the sensitivity of conclusions to assumptions about numbers of outliers.
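The sum in (4.4.8) is simple bookkeeping once the weights $p(J_s \mid x)$ are available. The following is a minimal sketch, assuming each identification is stored as a `frozenset` of outlier indices; the function name and the numerical weights are made up purely for illustration.

```python
from collections import defaultdict

def outlier_count_posterior(weights):
    """Sum p(J_s | x) over identifications with the same number of outliers, as in (4.4.8).

    weights maps each identification (a frozenset of outlier indices)
    to its posterior probability p(J_s | x).
    """
    totals = defaultdict(float)
    for subset, prob in weights.items():
        totals[len(subset)] += prob
    return dict(totals)

# Hypothetical posterior weights for three observations, at most one outlier (R = 1).
weights = {
    frozenset(): 0.70,
    frozenset({0}): 0.20,
    frozenset({1}): 0.06,
    frozenset({2}): 0.04,
}
for r, prob in sorted(outlier_count_posterior(weights).items()):
    print(f"p({r} outliers | x) = {prob:.2f}")
```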

The detailed forms of $p(\psi \mid x, J_s)$ and $p(J_s \mid x)$ are easily derived using general results for (4.4.6) given by Lindley and Smith (1972). If $C_1$, $C_2$ are known, the posterior probability for a $J_s$ specified in terms of appropriate choices of $A_i$, $C_i$, $i = 1, 2$, is proportional to the prior probability multiplied by

$$|C_1 + A_1 C_2 A_1^{\mathsf T}|^{-1/2} \exp\left[-\tfrac{1}{2}(x - A_1 A_2 \theta_2)^{\mathsf T}(C_1 + A_1 C_2 A_1^{\mathsf T})^{-1}(x - A_1 A_2 \theta_2)\right]. \tag{4.4.9}$$
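The factor in (4.4.9) is, up to a constant, the marginal normal density of $x$ with mean $A_1 A_2 \theta_2$ and covariance $C_1 + A_1 C_2 A_1^{\mathsf T}$. A minimal NumPy sketch of its logarithm follows. The function name `log_weight_known_cov`, the toy data, and the values of `sigma`, `k`, `k_prime` and `mu0` are all assumptions made for illustration, as is the choice of $A_1$ as a column of ones for an inflated-variance setup.

```python
import numpy as np

def log_weight_known_cov(x, A1, A2, theta2, C1, C2):
    """Log of the factor (4.4.9); additive constants common to all J_s are dropped."""
    resid = np.asarray(x, dtype=float) - A1 @ A2 @ theta2   # x - A1 A2 theta2
    S = C1 + A1 @ C2 @ A1.T                                 # C1 + A1 C2 A1^T
    _, logdet = np.linalg.slogdet(S)
    quad = resid @ np.linalg.solve(S, resid)
    return -0.5 * logdet - 0.5 * quad

# Illustrative inflated-variance setup; all numerical values are made up.
x = np.array([0.1, -0.2, 0.3, 6.0])
sigma, k, k_prime, mu0 = 1.0, 3.0, 10.0, 0.0
A1 = np.ones((x.size, 1))                 # common location for every observation (assumed)
A2 = np.array([[1.0]])
theta2 = np.array([mu0])
C2 = np.array([[(k_prime * sigma) ** 2]])

def C1_for(outlier_index):
    """Covariance C1 with the variance of one flagged observation inflated by k^2."""
    scale = np.ones(x.size)
    scale[outlier_index] = k ** 2
    return sigma ** 2 * np.diag(scale)

lw_extreme = log_weight_known_cov(x, A1, A2, theta2, C1_for(3), C2)
lw_central = log_weight_known_cov(x, A1, A2, theta2, C1_for(0), C2)
print(lw_extreme, lw_central)   # flagging the extreme point gives the larger log weight
```

Working on the log scale with `slogdet` and `solve` avoids forming an explicit inverse and keeps the weights comparable when the quadratic form is large.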

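Alternatively, if, conditional on $\sigma^2$, $C_1 = \sigma^2 I_n$, $C_2 = \sigma^2 V$, with $V$ known, and the prior for the unknown $\sigma^2$ is specified in the form $\nu\lambda/\sigma^2 \sim \chi^2_\nu$, (4.4.9) is replaced by

$$|I_n + A_1 V A_1^{\mathsf T}|^{-1/2} \left[\nu\lambda + (x - A_1 A_2 \theta_2)^{\mathsf T}(I_n + A_1 V A_1^{\mathsf T})^{-1}(x - A_1 A_2 \theta_2)\right]^{-(n+\nu)/2}. \tag{4.4.10}$$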

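If interest centres on inference for $\theta_1$, we note (see, for example, Lindley and Smith, 1972) that the distribution of $\theta_1$, given $x$, $C_1$, $C_2$, is $N(Bb, B)$, where

$$B^{-1} = A_1^{\mathsf T} C_1^{-1} A_1 + C_2^{-1}, \qquad b = A_1^{\mathsf T} C_1^{-1} x + C_2^{-1} A_2 \theta_2.$$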
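As a small check on these expressions, the sketch below computes the posterior mean $Bb$ and dispersion $B$ with NumPy for a one-parameter case: three observations with unit variance and a $N(0, 4)$ prior on a common mean. The function name `posterior_theta1` and all numerical values are illustrative assumptions.

```python
import numpy as np

def posterior_theta1(x, A1, A2, theta2, C1, C2):
    """Return (B b, B): posterior mean and dispersion of theta_1 given x, C1, C2."""
    C1_inv = np.linalg.inv(C1)
    C2_inv = np.linalg.inv(C2)
    B = np.linalg.inv(A1.T @ C1_inv @ A1 + C2_inv)
    b = A1.T @ C1_inv @ x + C2_inv @ A2 @ theta2
    return B @ b, B

x = np.array([1.0, 2.0, 3.0])
A1 = np.ones((3, 1))              # every observation has the same mean
A2 = np.array([[1.0]])
theta2 = np.array([0.0])          # prior mean
C1 = np.eye(3)                    # unit observation variance
C2 = np.array([[4.0]])            # prior variance

mean, B = posterior_theta1(x, A1, A2, theta2, C1, C2)
print(mean, B)
```

Here $B^{-1} = 3 + 1/4$ and $b = 6$, so the posterior mean is $6/3.25$, slightly shrunk from the sample mean of 2 towards the prior mean of 0.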

Learning about the parameters of a mixture

Under the alternative specification for unknown $\sigma^2$ given above, the posterior distribution for $\theta_1$ is Student-$t$ with degrees of freedom $n + \nu$ and mean and dispersion matrix given by

$$(A_1^{\mathsf T} A_1 + V^{-1})^{-1}(A_1^{\mathsf T} x + V^{-1} A_2 \theta_2)$$

and

$$(A_1^{\mathsf T} A_1 + V^{-1})^{-1},$$

respectively.

Overall inference for $\theta_1$ thus has the form of a weighted average of such normal or $t$-densities, with weights given by (4.4.9) or (4.4.10).
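A weighted average of this kind can be evaluated directly once the conditional posteriors and their weights are in hand. The sketch below uses normal components only and the standard library; the three `(weight, mean, sd)` triples are hypothetical, and the function names are not from the text.

```python
from math import exp, pi, sqrt

def normal_pdf(t, mean, sd):
    """Density of N(mean, sd^2) at t."""
    return exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def overall_density(t, components):
    """Weighted average of conditional normal posteriors at t.

    components is a list of (weight, mean, sd) triples, one per identification J_s;
    the weights are renormalised so that they sum to one.
    """
    total = sum(w for w, _, _ in components)
    return sum(w / total * normal_pdf(t, m, s) for w, m, s in components)

def overall_mean(components):
    """Mean of the mixture: the weighted average of the component means."""
    total = sum(w for w, _, _ in components)
    return sum(w / total * m for w, m, _ in components)

# Hypothetical unnormalised weights and conditional posteriors for three identifications.
components = [(6.0, 20.0, 9.0), (3.0, 27.0, 7.0), (1.0, 33.0, 6.0)]
print(overall_mean(components))
print(overall_density(25.0, components))
```

Because the weights may be known only up to proportionality, as in (4.4.9) and (4.4.10), the sketch renormalises them before averaging.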

Example 4.4.1 Outlier analysis of Darwin's data

Abraham and Box (1978) present an analysis of Darwin’s data on the differences in heights between fifteen pairs of self-fertilized and cross-fertilized plants grown in the same conditions. The ordered values are:

-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75.

Using a simple location-shift model and assuming that $\pi = 0.95$ in (4.4.5), Abraham and Box calculate $p(J_s \mid x)$ for the set of $n$ identifications which specify for each observation in turn that it is the only outlier. They thus obtain the set of posterior weights

$$w_i = p(x_i \text{ is the outlier} \mid x, 1 \text{ outlier}),$$

as shown in Figure 4.4.1.
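The mechanics of such a calculation can be sketched with NumPy, using the unknown-variance factor (4.4.10) and a location-shift design in which $\theta_1 = (\mu, \text{shift})$ and observation $i$ alone receives the shift. This is not Abraham and Box's computation and is not expected to reproduce Figure 4.4.1: the prior settings `V`, `theta2`, `nu` and `lam`, the choice $A_2 = I_2$, and the $\chi^2_\nu$ form assumed for the prior on $\sigma^2$ are all hypothetical choices made for illustration.

```python
import numpy as np

# Darwin's ordered height differences, as listed above.
x = np.array([-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75], dtype=float)
n = x.size

# Hypothetical prior settings (not taken from Abraham and Box).
V = np.diag([100.0, 100.0])      # prior dispersion of (mu, shift), in units of sigma^2
theta2 = np.array([0.0, 0.0])    # prior means of (mu, shift)
A2 = np.eye(2)
nu, lam = 1.0, 1.0               # assumed prior: nu * lam / sigma^2 ~ chi^2 with nu d.f.

def log_weight(i):
    """Log of (4.4.10) when observation i is the single shifted observation."""
    A1 = np.zeros((n, 2))
    A1[:, 0] = 1.0               # every observation shares the location mu
    A1[i, 1] = 1.0               # observation i also receives the shift
    S = np.eye(n) + A1 @ V @ A1.T
    resid = x - A1 @ A2 @ theta2
    _, logdet = np.linalg.slogdet(S)
    quad = resid @ np.linalg.solve(S, resid)
    return -0.5 * logdet - 0.5 * (n + nu) * np.log(nu * lam + quad)

log_w = np.array([log_weight(i) for i in range(n)])
w = np.exp(log_w - log_w.max())  # subtract the maximum for numerical stability
w /= w.sum()                     # equal prior probabilities cancel on normalising
for xi, wi in zip(x, w):
    print(f"{xi:6.0f}  {wi:.3f}")
```

All single-outlier identifications have the same prior probability, so conditioning on exactly one outlier leaves weights proportional to (4.4.10) alone. Under these made-up settings the largest weight falls on the most extreme observation, -67.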

Figure 4.4.2 shows plots of various posterior densities for $\mu$ (the location parameter for the differenced data) corresponding to different assumptions about values of $\pi$ and numbers of outliers. Curve A is the density obtained when all the observations are assumed to be from the normal component $\phi(x \mid \mu, \sigma)$; curve B is the density obtained by forming a suitable weighted combination of the $p(\mu \mid x, J_s)$ densities assuming exactly two outliers, so that

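$$p(\mu \mid x) = \sum_{J_s \in k_2} p(\mu \mid x, J_s)\, p(J_s \mid x) \Big/ \sum_{J_s \in k_2} p(J_s \mid x).$$

[Figure 4.4.1: weights $w$ (vertical axis, ticks at 0.05, 0.10, 0.15) plotted against observation (horizontal axis, ticks at -50, 0, 50).]

Figure 4.4.1 Weights $w_i$ for the plant heights data corresponding to $\pi = 0.95$. Reproduced by permission of the Royal Statistical Society from Abraham and Box (1978)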


[Figure 4.4.2: posterior densities $p(\mu \mid x)$, with curves labelled: no outlier assumed; two outliers assumed; $\pi = 0.99$; $\pi = 0.95$; $\pi = 0.85$.]

Figure 4.4.2 Fitted densities for plant height data. Reproduced by permission of the Royal Statistical Society from Abraham and Box (1978)

The remaining curves are the overall posterior densities obtained for the indicated values of $\pi$, assuming up to three outliers.

Curve A and the overall density corresponding to $\pi = 0.99$ are based on rather strong prior assumptions about the implausibility of outliers in the data. The other three curves, however, correspond to more open-minded prior assumptions about outliers and tend to convey similar messages (all having modes, for example, around 34). Taken in conjunction with Figure 4.4.1 (and more extensive calculations given in Abraham and Box), the Bayesian analysis of such mixture models for outliers provides a rich range of summary posterior inferences, both overall and conditional on any particular assumptions of interest.
