7.2.6.2. Percentiles

Definitions of order statistics and ranks

For a series of measurements $Y_1, \, \ldots, \, Y_N$, denote the data ordered in increasing order of magnitude by $Y_{[1]}, \, \ldots, \, Y_{[N]}$. These ordered data are called order statistics. If $Y_{[j]}$ is the order statistic that corresponds to the measurement $Y_i$, then the rank for $Y_i$ is $j$; i.e., $$ Y_{[j]} \sim Y_i \,\, \Longrightarrow \,\, r_i = j \, .$$

Definition of percentiles

Order statistics provide a way of estimating proportions of the data that should fall above and below a given value, called a percentile. The $p$th percentile is a value, $Y_{(p)}$, such that at most $(100 p)$ % of the measurements are less than this value and at most $100(1-p)$ % are greater. The 50th percentile is called the median.

Percentiles split a set of ordered data into hundredths; deciles split it into tenths. For example, about 70 % of the data should fall below the 70th percentile.

Given $n$ points, the percentile corresponding to the $i$-th ordered point is

$$ \frac{i}{n+1} \, . $$

More typically, we start with a desired percentile, which may not correspond to a specific data point. In this case, interpolation between points is required. There is no universally accepted standard way to perform this interpolation. We describe our default method first and then give several alternative methods. All of the methods discussed here are used in practice.
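
The correspondence $i/(n+1)$ can be tabulated with a few lines of Python. This is only a sketch: the sample size of 12 is an illustrative choice, and the variable names are not taken from any particular package.

```python
# Percentile position i/(n+1) for each ordered point.
# n = 12 is an illustrative sample size, not a requirement of the method.
n = 12
positions = [i / (n + 1) for i in range(1, n + 1)]

for i, pos in enumerate(positions, start=1):
    print(f"ordered point {i:2d} -> {100 * pos:5.1f}th percentile")
```

With 12 points, the 11th ordered point sits at about the 84.6th percentile and the 12th at about the 92.3rd, so a request for the 90th percentile falls between two data points and calls for interpolation.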

Estimation of percentiles

Percentiles can be estimated from $N$ measurements as follows: for the $p$th percentile, set $p(N+1)$ equal to $k + d$, where $k$ is an integer and $d$ is a fraction with $0 \le d \lt 1$.

(1) For $0 \lt k \lt N, \,\,\,\,\, Y_{(p)} = Y_{[k]} + d \left( Y_{[k+1]} - Y_{[k]} \right)$
(2) For $k = 0, \,\,\,\,\, Y_{(p)} = Y_{[1]}$
Note that for any $p \le 1/(N+1)$ the estimate is simply the minimum value.
(3) For $k \ge N, \,\,\,\,\, Y_{(p)} = Y_{[N]}$
Note that for any $p \ge N/(N+1)$ the estimate is simply the maximum value.
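
The three cases above can be sketched as a small Python function. The function name `percentile_r6` and the four-value sample are hypothetical and chosen only for illustration; the code assumes $p$ is given as a proportion between 0 and 1.

```python
import math

def percentile_r6(data, p):
    """Estimate the p-th percentile (p as a proportion) from p(N+1) = k + d."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    y = sorted(data)              # order statistics; y[0] is Y[1], y[-1] is Y[N]
    n = len(y)
    if n == 0:
        raise ValueError("data must not be empty")
    pos = p * (n + 1)
    k = math.floor(pos)           # integer part
    d = pos - k                   # fractional part, 0 <= d < 1
    if k <= 0:                    # case k = 0: use the minimum
        return y[0]
    if k >= n:                    # case k >= N: use the maximum
        return y[-1]
    # case 0 < k < N: interpolate between Y[k] and Y[k+1]
    return y[k - 1] + d * (y[k] - y[k - 1])

sample = [3.0, 1.0, 4.0, 2.0]     # hypothetical values
print(percentile_r6(sample, 0.5)) # p(N+1) = 2.5, so k = 2 and d = 0.5
```

For the hypothetical sample, the median falls halfway between the second and third order statistics, while $p = 0.1$ and $p = 0.9$ fall in the second and third cases and return the minimum and maximum.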

Example and interpretation

For the purpose of illustration, twelve measurements from a gage study are shown below. The measurements are resistivities of silicon wafers measured in ohm·cm.

       i  Measurements  Order stats   Ranks

       1     95.1772     95.0610        9
       2     95.1567     95.0925        6
       3     95.1937     95.1065       10
       4     95.1959     95.1195       11
       5     95.1442     95.1442        5
       6     95.0610     95.1567        1
       7     95.1591     95.1591        7
       8     95.1195     95.1682        4
       9     95.1065     95.1772        3
      10     95.0925     95.1937        2
      11     95.1990     95.1959       12
      12     95.1682     95.1990        8
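
The order statistics and ranks in the table can be reproduced with a short Python sketch. The variable names are illustrative, and the dictionary lookup assumes there are no tied measurements, which holds for these twelve values.

```python
# Measurements in the order they were taken (from the table above).
measurements = [95.1772, 95.1567, 95.1937, 95.1959, 95.1442, 95.0610,
                95.1591, 95.1195, 95.1065, 95.0925, 95.1990, 95.1682]

# Order statistics Y[1], ..., Y[N]: the data sorted in increasing order.
order_stats = sorted(measurements)

# Rank r_i = j when Y[j] is the order statistic matching Y_i (1-based).
# Assumption: no ties, so each value maps to exactly one position.
rank_of = {value: j for j, value in enumerate(order_stats, start=1)}
ranks = [rank_of[y] for y in measurements]

for i, (y, ys, r) in enumerate(zip(measurements, order_stats, ranks), start=1):
    print(f"{i:2d}  {y:.4f}  {ys:.4f}  {r:2d}")
```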

To find the 90th percentile, $p(N+1) = 0.9(13) = 11.7$, so $k = 11$ and $d = 0.7$. From condition (1) above (the case $0 \lt k \lt N$), $Y_{(0.90)}$ is estimated to be 95.1981 ohm·cm. Although this percentile is an estimate from a small sample of resistivity measurements, it gives an indication of the corresponding percentile for the population of resistivity measurements.
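
Written out, the interpolation between the 11th and 12th order statistics in the table is

$$ Y_{[11]} + 0.7 \left( Y_{[12]} - Y_{[11]} \right) = 95.1959 + 0.7 \, (95.1990 - 95.1959) = 95.1959 + 0.00217 \approx 95.1981 \, . $$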

Note that there are other ways of calculating percentiles in common use

Hyndman and Fan (1996), in an American Statistician article, evaluated nine different methods (referred to here as R1 through R9) for computing percentiles against six desirable properties. Their goal was to advocate a "standard" definition for percentiles that would be implemented in statistical software. Although this standardization has not in fact happened, the article does provide a useful summary and evaluation of various methods for computing percentiles. Most statistical and spreadsheet software uses one of the methods described in Hyndman and Fan.

The method described above corresponds to method R6 of Hyndman and Fan. This is the default method used by Dataplot.

The method advocated by Hyndman and Fan is R8. For the R8 method, set $p(N + 1/3) + 1/3$ equal to $k + d$ and proceed as above. Note that for any $p \le (2/3)/(N + 1/3)$ the estimate is the minimum value, and for any $p \ge (N - 1/3)/(N + 1/3)$ it is the maximum value. Both R and Dataplot can optionally use this method. For the example given above, R8 gives 95.1972 (compared to 95.1981 for R6) for the 90th percentile.

Some software packages set $1 + p(N-1)$ equal to $k + d$ and then proceed as above. This is method R7 of Hyndman and Fan. This is the method used by Excel and is the default method for R (the R quantile function can optionally use any of the nine methods discussed in Hyndman and Fan). For the example given above, R7 gives 95.1957.

The R6, R7, and R8 methods give similar but not identical results, and the differences are most noticeable for small samples. For most purposes, any of these three methods should be acceptable.
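
The three methods differ only in the quantity that is set equal to $k + d$, so they can be compared with one hedged Python sketch. The function name `percentile` and its `method` argument are hypothetical. The cross-check at the end relies on the standard-library `statistics.quantiles` function, whose `"exclusive"` and `"inclusive"` options appear to correspond to R6 and R7 respectively; treat that mapping as an assumption to verify for your Python version.

```python
import math
import statistics

def percentile(data, p, method="R6"):
    """Percentile estimate for p in [0, 1]; method is 'R6', 'R7', or 'R8'."""
    y = sorted(data)
    n = len(y)
    if method == "R6":
        pos = p * (n + 1)
    elif method == "R7":
        pos = 1 + p * (n - 1)
    elif method == "R8":
        pos = p * (n + 1 / 3) + 1 / 3
    else:
        raise ValueError("method must be 'R6', 'R7', or 'R8'")
    k = math.floor(pos)
    d = pos - k
    if k <= 0:
        return y[0]               # at or below the first order statistic
    if k >= n:
        return y[-1]              # at or above the last order statistic
    return y[k - 1] + d * (y[k] - y[k - 1])

# Resistivity measurements from the example (ohm·cm).
resistivity = [95.1772, 95.1567, 95.1937, 95.1959, 95.1442, 95.0610,
               95.1591, 95.1195, 95.1065, 95.0925, 95.1990, 95.1682]

for m in ("R6", "R7", "R8"):
    print(m, f"{percentile(resistivity, 0.90, m):.4f}")

# Cross-check (assumed mapping): the last decile cut point is the 90th percentile.
r6_check = statistics.quantiles(resistivity, n=10, method="exclusive")[-1]
r7_check = statistics.quantiles(resistivity, n=10, method="inclusive")[-1]
print(f"{r6_check:.4f} {r7_check:.4f}")
```

Run on the twelve resistivities, the sketch should reproduce the 90th percentile values quoted above: 95.1981 for R6, 95.1957 for R7, and 95.1972 for R8.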

Another method of calculating percentiles (given in some elementary textbooks) starts by calculating $pN$. If $pN$ is not an integer, round it up to the next integer $k$ and use $Y_{[k]}$ as the percentile estimate. If $pN$ is an integer $k$, use $ 0.5 \left( Y_{[k]} + Y_{[k+1]} \right) $. One of R6, R7, or R8 would typically be preferred to this method.
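
For comparison, the textbook rule can be sketched in Python as follows. The function name `percentile_textbook` is hypothetical. Two details are assumptions that the description above does not settle: the tolerance used to decide whether $pN$ is an integer in floating-point arithmetic, and the handling of $p = 0$ and $p = 1$, which here return the minimum and maximum.

```python
import math

def percentile_textbook(data, p):
    """Textbook rule: round pN up, or average two neighbors if pN is an integer."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    y = sorted(data)
    n = len(y)
    pn = p * n
    k = round(pn)
    # Assumption: treat pN as an integer if it is within a small tolerance.
    if math.isclose(pn, k, abs_tol=1e-9):
        if k <= 0:                # assumption: p = 0 gives the minimum
            return y[0]
        if k >= n:                # assumption: p = 1 gives the maximum
            return y[-1]
        return 0.5 * (y[k - 1] + y[k])   # average of Y[k] and Y[k+1]
    k = math.ceil(pn)             # round up to the next integer
    return y[k - 1]               # Y[k]

# Resistivity measurements from the example (ohm·cm).
resistivity = [95.1772, 95.1567, 95.1937, 95.1959, 95.1442, 95.0610,
               95.1591, 95.1195, 95.1065, 95.0925, 95.1990, 95.1682]

print(percentile_textbook(resistivity, 0.90))  # pN = 10.8, so k = 11
print(percentile_textbook(resistivity, 0.50))  # pN = 6, so average Y[6] and Y[7]
```

For the 90th percentile of the example data this rule returns the 11th order statistic, 95.1959, with no interpolation, which shows why it can differ from the R6, R7, and R8 estimates.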

Definition of Tolerance Interval

An interval covering population percentiles can be interpreted as "covering a proportion $p$ of the population with a level of confidence, say, 90 %." This is known as a tolerance interval.