3.2.4. Discrete Models

Description

We often need to analyze discrete data rather than continuous data. Examples include yield (good/bad), speed bins (slow/fast/faster/fastest), and survey results (favor/oppose). We then try to explain the discrete outcomes with some combination of discrete and/or continuous explanatory variables. In this situation the modeling techniques we have learned so far (CLM and ANOVA) are no longer appropriate.

Contingency table analysis and log-linear model

There are two primary methods available for the analysis of discrete response data. The first one applies to situations in which we have discrete explanatory variables and discrete responses and is known as Contingency Table Analysis. The model for this is covered in detail in this section. The second model applies when we have both discrete and continuous explanatory variables and is referred to as a Log-Linear Model. That model is beyond the scope of this Handbook, but interested readers should refer to the reference section of this chapter for a list of useful books on the topic.

Model

Suppose we have N individuals that we classify according to two criteria, A and B. Suppose there are r levels of criterion A and s levels of criterion B. These responses can be displayed in an r x s table. For example, suppose we have a box of manufactured parts that we classify as good or bad and by whether they came from supplier 1, 2 or 3, which gives a 2 x 3 table.

Each cell of this table holds a count of the individuals who fall into its particular combination of classification levels. Call this count \( N_{ij} \). The sum of all of these counts equals the total number of individuals, \( N \). Each row of the table sums to \( N_{i\cdot} \) and each column sums to \( N_{\cdot j} \).

Under the assumption that there is no interaction between the two classifying variables (for example, the number of good or bad parts does not depend on which supplier they came from), we can calculate the counts we would expect to see in each cell. Call the expected count for any cell \( E_{ij} \). Then the expected value for a cell is \( E_{ij} = N_{i\cdot} N_{\cdot j} / N \). All we need to do then is compare the expected counts to the observed counts. If there is a considerable difference between the observed counts and the expected values, then the two variables interact in some way.
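The expected-count formula can be sketched in a few lines of Python. This is a minimal illustration rather than part of the Handbook: the function name `expected_counts` and the 2 x 3 supplier counts are hypothetical, chosen only to show the arithmetic.

```python
def expected_counts(observed):
    """Return the table of expected counts E_ij = N_i. * N_.j / N.

    `observed` is a list of rows, each a list of cell counts N_ij.
    """
    row_totals = [sum(row) for row in observed]          # N_i.
    col_totals = [sum(col) for col in zip(*observed)]    # N_.j
    total = sum(row_totals)                              # N
    return [[r * c / total for c in col_totals] for r in row_totals]


# Hypothetical counts (not from the Handbook): rows are good/bad,
# columns are suppliers 1, 2 and 3.
observed = [
    [45, 40, 35],
    [5, 10, 15],
]
expected = expected_counts(observed)
for row in expected:
    print(row)
```

Because every supplier contributes the same number of parts in this made-up table, each column receives the same expected count within a row.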

Estimation

The estimation is very simple. All we do is make a table of the observed counts and then calculate the expected counts as described above.
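Building the table of observed counts from raw classifications can be sketched as follows. The helper `build_table`, the level labels, and the six records are all hypothetical; only the idea of tallying each pair of classification levels into a cell comes from the text.

```python
from collections import Counter


def build_table(records, row_levels, col_levels):
    """Tally (row_level, col_level) pairs into an r x s table of counts."""
    counts = Counter(records)
    # Counter returns 0 for pairs that never occur, so empty cells are kept.
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


# Hypothetical raw data: one (process, outcome) pair per inspected part.
records = [
    ("A", "good"), ("A", "good"), ("A", "bad"),
    ("B", "good"), ("B", "bad"), ("B", "bad"),
]
table = build_table(records, row_levels=["A", "B"], col_levels=["good", "bad"])
print(table)
```

Passing the level lists explicitly fixes the row and column order, so the table layout does not depend on the order in which records arrive.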

Testing

The test is performed using a Chi-Square goodness-of-fit test according to the following formula:

\( \chi^2 = \sum{\sum{\frac{(\mbox{observed} - \mbox{expected})^2} {\mbox{expected}}}} \)

where the summation is across all of the cells in the table.

Given the assumptions stated below, this statistic has approximately a chi-square distribution and is therefore compared against a chi-square table with (r-1)(s-1) degrees of freedom, with r and s as previously defined. If the value of the test statistic is less than the chi-square value for a given level of confidence, then the classifying variables are declared independent; otherwise they are judged to be dependent.
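A minimal sketch of the test, assuming the observed and expected tables are already available as lists of rows. The function names are hypothetical, and the critical value is supplied by the caller (looked up in a chi-square table) rather than computed. The demonstration reuses the counts and the 90% table value of 2.71 from the yield example in this section (Tables 1 and 2).

```python
def chi_square_statistic(observed, expected):
    """Sum (observed - expected)^2 / expected over every cell."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(r - 1)(s - 1) for an r x s table."""
    r = len(observed)
    s = len(observed[0])
    return (r - 1) * (s - 1)


def judged_independent(statistic, critical_value):
    """Decision rule: independent if the statistic is below the table value."""
    return statistic < critical_value


# Counts from the two-process yield example (Tables 1 and 2).
observed = [[86, 14], [80, 20]]
expected = [[83, 17], [83, 17]]

stat = chi_square_statistic(observed, expected)
df = degrees_of_freedom(observed)
print(round(stat, 3), df, judged_independent(stat, critical_value=2.71))
```

Keeping the critical value as an argument avoids any dependence on a statistics library; a caller with such a library could compute it instead.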

Assumptions

The estimation and testing results above hold regardless of whether the sampling model is Poisson, multinomial, or product-multinomial. The chi-square approximation starts to break down if the counts in any cell are small, say fewer than 5.
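One way to screen a table for the small-count condition is sketched below. The function name `small_cells` is hypothetical, the default threshold of 5 is the rule of thumb quoted above, and the second table is made up for illustration.

```python
def small_cells(table, threshold=5):
    """Return (row, column, count) for every cell whose count is below threshold."""
    return [
        (i, j, count)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
        if count < threshold
    ]


# Counts from the yield example in this section: no cell is below 5.
print(small_cells([[86, 14], [80, 20]]))

# Hypothetical table with one sparse cell.
print(small_cells([[3, 97], [10, 90]]))
```

An empty result suggests the chi-square approximation is reasonable for the table; any flagged cell is a signal to treat the test result with caution.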

Uses

The contingency table method is really just a test of interaction between discrete explanatory variables for discrete responses. The example given below is for two factors. The methods are equally applicable to more factors, but as with any interaction, as you add more factors the interpretation of the results becomes more difficult.

Example

Suppose we are comparing the yield from two manufacturing processes. We want to know if one process has a higher yield.

Make table of counts

Table 1. Yields for two production processes
	Good	Bad	Totals
Process A	86	14	100
Process B	80	20	100
Totals	166	34	200

We obtain the expected values by the formula given above. This gives the table below.

Calculate expected counts

Table 2. Expected values for two production processes
	Good	Bad	Totals
Process A	83	17	100
Process B	83	17	100
Totals	166	34	200

Calculate chi-square statistic and compare to table value

The chi-square statistic is 1.276. This is below the chi-square table value of 2.71 for 1 degree of freedom and 90% confidence, so the observed counts are consistent with yield being independent of process.
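For reference, the statistic follows from Tables 1 and 2 by summing over the four cells:

\( \chi^2 = \frac{(86-83)^2}{83} + \frac{(14-17)^2}{17} + \frac{(80-83)^2}{83} + \frac{(20-17)^2}{17} = \frac{9}{83} + \frac{9}{17} + \frac{9}{83} + \frac{9}{17} \approx 1.276 \)

The table has two rows and two columns, so the degrees of freedom are \( (r-1)(s-1) = (2-1)(2-1) = 1 \).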

Conclusion

Therefore, we conclude that there is no statistically significant difference between the two processes.