
# PROBABILITY PLOT

Name:
... PROBABILITY PLOT
Type:
Graphics Command
Purpose:
Generates a probability plot for one of 90+ distributions.
Description:
A probability plot is a graphical data analysis technique for determining how well the specified distribution fits the data set. Linearity in the probability plot is indicative of a good distributional fit.

The probability plot consists of:

Vertical axis = ordered observations
Horizontal axis = percent point function of the order statistic medians

This is essentially a plot of the data percentiles versus the percentiles of the theoretical distribution. Dataplot computes the percent point function of the uniform order statistic medians to compute the percentiles of the theoretical distribution.
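
As a rough illustration of the construction above, the following Python sketch computes the plot coordinates for the normal case. It is not Dataplot's source code: the function names are hypothetical, and the use of Filliben's approximation for the uniform order statistic medians is an assumption about the exact formula.

```python
from statistics import NormalDist


def uniform_order_statistic_medians(n):
    """Approximate medians of the order statistics of n uniform(0,1) draws.

    Uses Filliben's approximation (an assumption about the exact formula).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    last = 0.5 ** (1.0 / n)
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[0] = 1.0 - last   # first order statistic
    medians[-1] = last        # last order statistic
    return medians


def normal_probability_plot_points(y):
    """Return (horizontal, vertical) coordinates of a normal probability plot."""
    ordered = sorted(y)                      # vertical axis: ordered observations
    medians = uniform_order_statistic_medians(len(ordered))
    ppf = NormalDist().inv_cdf               # percent point function (inverse CDF)
    theoretical = [ppf(m) for m in medians]  # horizontal axis
    return theoretical, ordered


xs, ys = normal_probability_plot_points([4.1, 5.3, 3.8, 6.0, 5.1])
for xi, yi in zip(xs, ys):
    print(f"{xi:8.4f}  {yi:5.1f}")
```

For another distribution, only the percent point function changes; the medians and the ordering of the data stay the same.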

Dataplot has extensive probability plot capabilities (90+ distributions and distributional families are available). When a distributional family is specified, the LET command is used before the PROBABILITY PLOT command to set the shape parameters that select the desired member of the family. For example,

LET GAMMA = 5.3
WEIBULL PROBABILITY PLOT Y

The parameter names for each distributional family are given in parentheses in the list under Syntax 1.

Probability plots serve two primary uses.

1. Distributional Modeling

The slope and intercept of the line fit to the probability plot are estimates for the location and scale parameters of the distribution.

The following provides one possible approach to distributional modeling.

• If the distribution has one or two shape parameters, use the PPCC PLOT or KS PLOT to obtain estimates for the shape parameters (HELP PPCC PLOT or HELP KS PLOT for details).

• Once the shape parameters (if any) have been estimated, generate the probability plot to obtain estimates for the location and scale parameters.

• The bootstrap can be used to obtain confidence intervals for the distribution parameters and selected quantiles. Enter HELP DISTRIBUTIONAL BOOTSTRAP for details.

2. Goodness of Fit

The probability plot provides a graphical assessment of goodness of fit. The straighter the probability plot, the better the fit. One advantage of the graphical approach over quantitative measures (e.g., the Kolmogorov-Smirnov test) is that it indicates how the distribution fails to fit the data. This can provide guidance toward a better distributional model.

The correlation coefficient of the points on the probability plot provides a numerical measure of the straightness of the probability plot. Dataplot automatically saves this value in the internal parameter PPCC. The PPCC values provide a useful ranking measure when comparing different distributional models.
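
The two uses above can be sketched together in Python. This is only an illustration under stated assumptions, not Dataplot's implementation: the function names are hypothetical, Filliben's approximation is assumed for the order statistic medians, and the sample is simulated. The intercept and slope play the role of the location and scale estimates, and the correlation plays the role of PPCC.

```python
import math
import random
from statistics import NormalDist


def order_statistic_medians(n):
    """Filliben's approximation to the uniform order statistic medians (assumed)."""
    last = 0.5 ** (1.0 / n)
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[0], m[-1] = 1.0 - last, last
    return m


def line_fit_and_ppcc(y, ppf):
    """Fit ordered data against theoretical quantiles.

    Returns (intercept, slope, correlation): rough analogues of the
    location estimate, the scale estimate, and PPCC.
    """
    ordered = sorted(y)
    x = [ppf(p) for p in order_statistic_medians(len(ordered))]
    n = len(x)
    mx, my = sum(x) / n, sum(ordered) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in ordered)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, ordered))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / math.sqrt(sxx * syy)


# Percent point functions of two candidate models (standard forms).
candidates = {
    "NORMAL": NormalDist().inv_cdf,
    "EXPONENTIAL": lambda p: -math.log(1.0 - p),
}

rng = random.Random(12345)
sample = [rng.expovariate(1.0 / 3.0) for _ in range(200)]  # exponential, scale 3

results = {name: line_fit_and_ppcc(sample, ppf) for name, ppf in candidates.items()}
# Rank the candidate models by the straightness of their probability plots.
for name, (a0, a1, ppcc) in sorted(results.items(), key=lambda kv: -kv[1][2]):
    print(f"{name:12s} ppcc={ppcc:.4f} location={a0:.3f} scale={a1:.3f}")
```

The model with the larger correlation has the straighter plot, which is the sense in which PPCC values rank competing distributional models.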

Syntax 1:
<dist> PROBABILITY PLOT <y>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the variable of raw data values under analysis;
<dist> is one of the following distributions:

1. NORMAL
2. HALFNORMAL
3. SLASH
4. COSINE
5. LOGISTIC
6. HALF LOGISTIC
7. HYPERBOLIC SECANT
8. CAUCHY
9. HALF CAUCHY
10. DOUBLE EXPONENTIAL
11. EXPONENTIAL
12. EXTREME VALUE TYPE 1 (or GUMBEL)
13. UNIFORM
14. SEMI-CIRCULAR
15. ANGLIT
16. ARCSIN
17. RAYLEIGH
18. MAXWELL

19. WEIBULL    (GAMMA)
20. DOUBLE WEIBULL    (GAMMA)
21. INVERTED WEIBULL    (GAMMA)
22. GAMMA    (GAMMA)
23. LOG GAMMA    (GAMMA)
24. DOUBLE GAMMA    (GAMMA)
25. INVERTED GAMMA    (GAMMA)
26. WALD    (GAMMA)
27. FATIGUE LIFE    (GAMMA)
28. EXTREME VALUE TYPE 2    (GAMMA)
29. GENERALIZED EXTREME VALUE    (GAMMA)
30. PARETO    (GAMMA)
31. PARETO SECOND KIND    (GAMMA)
32. GENERALIZED PARETO    (GAMMA)
33. GENERALIZED HALF LOGISTIC    (GAMMA)
34. TUKEY LAMBDA    (LAMBDA)
35. SKEWED NORMAL    (LAMBDA)
36. SKEW DOUBLE EXPONENTIAL    (LAMBDA)
37. POISSON    (LAMBDA)
38. T    (NU)
39. FOLDED T    (NU)
40. CHI-SQUARED    (NU)
41. CHI    (NU)
42. LOGNORMAL    (SD)
43. LOG DOUBLE EXPONENTIAL    (ALPHA)
44. ERROR    (ALPHA)
45. GENERALIZED LOGISTIC    (ALPHA)
46. WRAPPED CAUCHY    (C)
47. POWER FUNCTION    (C)
48. TRIANGULAR    (C)
49. LOG LOGISTIC    (DELTA)
50. VON-MISES    (B)
51. DISCRETE UNIFORM    (N)
52. GEOMETRIC    (P)
53. YULE    (P)
54. LOGARITHMIC SERIES    (THETA)
55. RECIPROCAL    (B)
57. ASYMMETRIC DOUBLE EXPO    (K)

58. POWER-NORMAL    (P, SD)
59. POWER-LOGNORMAL    (P, SD)
60. FOLDED NORMAL    (M, SD)
61. FOLDED CAUCHY    (M, SD)
62. SKEWED T    (LAMBDA, NU)
63. NONCENTRAL T    (NU, LAMBDA)
64. NONCENTRAL CHISQUARE    (NU, LAMBDA)
65. LOG SKEWED NORMAL    (LAMBDA)
66. BETA    (ALPHA, BETA)
67. INVERTED BETA    (ALPHA, BETA)
68. BETA BINOMIAL    (ALPHA, BETA)
69. HERMITE    (ALPHA, BETA)
70. EXPONENTIAL POWER    (ALPHA, BETA)
71. ALPHA    (ALPHA, BETA)
72. G AND H    (G, H)
73. JOHNSON SB    (ALPHA1, ALPHA2)
74. JOHNSON SU    (ALPHA1, ALPHA2)
75. EXPONENTIATED WEIBULL    (GAMMA, THETA)
76. GENERALIZED GAMMA    (GAMMA, C)
77. INVERSE GAUSSIAN    (GAMMA, MU)
78. RECIPROCAL INVERSE GAUSSIAN (GAMMA, MU)
79. F    (NU1, NU2)
80. TWO-SIDED POWER    (THETA, N)
81. BINOMIAL    (N, P)
82. GOMPERTZ    (C, B)
83. GENERALIZED MCLEISH    (ALPHA, A)
84. NEGATIVE BINOMIAL    (K, P)

85. LOG SKEWED T    (LAMBDA, NU, SD)
86. DOUBLY NONCENTRAL T    (NU, LAMBDA1, LAMBDA2)
87. NONCENTRAL F    (NU1, NU2, LAMBDA)
88. NONCENTRAL BETA    (ALPHA, BETA, LAMBDA)
89. TRUNCATED EXPONENTIAL    (X0, M, SD)
90. GENERALIZED EXPONENTIAL    (LAMBDA1, LAMBDA2, S)
91. GOMPERTZ-MAKEHAM    (XI, LAMBDA, THETA)
92. MIELKE BETA-KAPPA    (BETA, THETA, K)
93. HYPERGEOMETRIC    (K, N, M)
94. GENERALIZED INVERSE GAUSS    (CHI, LAMBDA, THETA)
95. BESSEL I    (SIGMA1SQ, SIGMA2SQ, NU)

(B, C, M)

96. DOUBLY NONCENTRAL F    (NU1, NU2, LAMBDA1, LAMBDA2)
97. TRUNCATED NORMAL    (A, B, M, SD)
98. TRAPEZOID    (A, B, C, D)

99. NORMAL MIXTURE    (MU1, SD1, MU2, SD2, P)
100. BI-WEIBULL    (SCALE1, GAMMA1, LOC2, SCALE2, GAMMA2)
101. GENERALIZED TRAPEZOID    (A, B, C, D, NU1, NU3, ALPHA)

and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where we have raw data.

Syntax 2:
<dist> CENSORED PROBABILITY PLOT <y> <x>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the variable of raw data values under analysis;
<x> is the censoring variable;
<dist> is one of the distributions listed above;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where we have censored data. In the censoring variable <x>, a value of 1 indicates a failure time and a value of 0 indicates a censoring time.

Censoring is not supported for discrete distributions or grouped data.

Syntax 3:
<dist> PROBABILITY PLOT <y> <x>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the variable of pre-computed frequencies;
<x> is the variable of distinct values for the variable under analysis;
<dist> is one of the distributions listed above;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where we have frequency (binned) data. The bins are defined by their mid-points.

Syntax 4:
<dist> PROBABILITY PLOT <y> <xlow> <xhigh>
<SUBSET/EXCEPT/FOR qualification>
where <y> is the variable of pre-computed frequencies;
<xlow> is the variable of lower bin limits;
<xhigh> is the variable of upper bin limits;
<dist> is one of the distributions listed above;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where we have frequency (binned) data. The bins are defined by their lower and upper limits. This syntax allows bins with unequal widths.

Examples:
NORMAL PROBABILITY PLOT X
CAUCHY PROBABILITY PLOT X
TUKEY LAMBDA PROBABILITY PLOT X
LOGNORMAL PROBABILITY PLOT X
WEIBULL PROBABILITY PLOT X
EXTREME VALUE TYPE 1 PROBABILITY PLOT X
POISSON PROBABILITY PLOT X
NORMAL PROBABILITY PLOT F X
CAUCHY PROBABILITY PLOT F X
TUKEY LAMBDA PROBABILITY PLOT F X
LOGNORMAL PROBABILITY PLOT F X
WEIBULL PROBABILITY PLOT F X
EXTREME VALUE TYPE 1 PROBABILITY PLOT F X
POISSON PROBABILITY PLOT F X
Note:
The PROBABILITY PLOT command automatically saves the following parameters:

• PPCC - the correlation coefficient of the points on the probability plot
• PPA0 - the intercept of the line fitted to the probability plot (estimate of the location parameter)
• PPA1 - the slope of the line fitted to the probability plot (estimate of the scale parameter)
• SDPPA0 - the standard deviation of PPA0
• SDPPA1 - the standard deviation of PPA1
• PPRESSD - the residual standard deviation of the line fitted to the probability plot
• PPRESDF - the residual degrees of freedom of the line fitted to the probability plot
• PPA0BW - the intercept of the line fitted to the probability plot with biweight weighting of the residuals
• PPA1BW - the slope of the line fitted to the probability plot with biweight weighting of the residuals

The PPCC value provides a measure of the linearity of the probability plot.

PPA0 and PPA1 provide estimates of the location and scale parameters, respectively.

For some distributions with heavy tails (e.g., Cauchy, slash), there can be extreme variability in the first few and last few points in the probability plot. This can distort the estimates of location and scale. Two iterations of biweight weighting of the residuals are applied to obtain PPA0BW and PPA1BW. In most cases, PPA0 and PPA1 are the preferred estimates. However, when there is extreme non-linearity in the tails, PPA0BW and PPA1BW may be preferred as the location and scale estimates.
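
The idea of biweight reweighting can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the tuning constant (6) and the scale estimate (the median absolute residual) are assumptions, since the choices Dataplot makes are not stated here.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit y ~ a0 + a1 * x; returns (a0, a1)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    a1 = sxy / sxx
    return my - a1 * mx, a1


def biweight_line_fit(x, y, iterations=2, c=6.0):
    """Ordinary fit followed by `iterations` rounds of biweight reweighting.

    Assumptions: tuning constant c = 6 and the median absolute residual
    as the scale estimate.
    """
    a0, a1 = weighted_line_fit(x, y, [1.0] * len(x))
    for _ in range(iterations):
        resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
        scale = median(abs(r) for r in resid)
        if scale == 0.0:
            break  # perfect fit: nothing to reweight
        u = [r / (c * scale) for r in resid]
        # Tukey biweight: weights shrink smoothly to zero for large residuals.
        w = [(1.0 - ui * ui) ** 2 if abs(ui) < 1.0 else 0.0 for ui in u]
        a0, a1 = weighted_line_fit(x, y, w)
    return a0, a1


# A straight line (intercept 2, slope 3) with one wild point in the tail.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[-1] = 100.0

ols = weighted_line_fit(x, y, [1.0] * len(x))
bw = biweight_line_fit(x, y)
print("ordinary fit:", ols)
print("biweight fit:", bw)
```

In this toy case the single wild point pulls the ordinary slope well away from 3, while the reweighted fit stays closer to the bulk of the points.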

Note:
For uncensored data, Dataplot uses the uniform order statistic medians to determine the plotting positions. This needs to be modified somewhat for censored data.

For singly censored data (i.e., all the censored data have the same censoring time), we can use the N from the full sample to compute the uniform order statistics. However, we only plot the failure times.

An alternative that works with both singly and multiply censored data (multiply censored means the censoring times are not necessarily the same) is to base the plotting positions on the Kaplan-Meier statistic. That is,

$$p_{i} = \frac{n + 0.7}{n + 0.4} \prod_{k=1}^{i}{\frac{n - k + 0.7}{n - k + 1.7}}$$

with n denoting the full sample size. Again, only plotting positions corresponding to failure times are plotted. The percent point function is computed on the $p_{i}$ values.

This method for censored probability plots is discussed in more detail on pp. 43-46 of the Bury book (see the References section below).
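
The formula can be evaluated directly. The Python sketch below is a hedged illustration, not Dataplot's implementation: the function name is hypothetical, the observations are assumed to be sorted by time already, and restricting the product to failure ranks (the usual convention for the modified Kaplan-Meier estimate) is an assumption, since the formula above is printed with the product over all k.

```python
def modified_km_values(censor_flags):
    """Evaluate p_i at each failure, for observations already sorted by time.

    censor_flags[k-1] is 1 for a failure and 0 for a censoring time.
    Assumption: only failure ranks k contribute factors to the product.
    Returns a list of (rank, p_i) pairs for the failure times only.
    """
    n = len(censor_flags)
    lead = (n + 0.7) / (n + 0.4)
    product = 1.0
    out = []
    for k, flag in enumerate(censor_flags, start=1):
        if flag == 1:
            product *= (n - k + 0.7) / (n - k + 1.7)
            out.append((k, lead * product))
    return out


# Six ordered observations; the 2nd and 5th are censoring times.
for rank, p in modified_km_values([1, 0, 1, 1, 0, 1]):
    print(f"rank {rank}: p = {p:.4f}")
```

With no censoring the product telescopes, so that $p_{i} = (n - i + 0.7)/(n + 0.4)$, which is the complement of the median-rank approximation $(i - 0.3)/(n + 0.4)$.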

To specify which method to use, enter the command

SET CENSORED PROBABILITY PLOT
<KAPLAN-MEIER/UNIFORM ORDER STATISTIC MEDIANS>
Note:
An alternative to binning data is to use the command

SET PROBABILITY PLOT DATA POINTS <value>

When this command is entered, Dataplot will compute <value> equally spaced percentiles and compute the probability plot on these percentiles. This option can be useful when generating probability plots on large data sets for distributions with expensive percent point functions.
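
The data reduction described here can be sketched in Python. The function name is hypothetical, and the percentile grid, i / (points + 1) for i = 1, ..., points, is an assumption; the grid Dataplot actually uses is not specified above.

```python
import random
from statistics import quantiles


def reduce_to_percentiles(data, points):
    """Summarize data by `points` equally spaced sample percentiles.

    Assumption: the percentiles sit at i / (points + 1), i = 1..points.
    """
    return quantiles(data, n=points + 1, method="inclusive")


rng = random.Random(0)
big = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # simulated large data set
reduced = reduce_to_percentiles(big, 1000)
print(len(reduced), reduced[0], reduced[-1])
```

The probability plot would then be generated from the reduced values, so the percent point function is evaluated 1,000 times rather than 100,000 times.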

Note:
For discrete distributions, the data will typically consist of integers. In this case, it is helpful to group the data based on these integer values. The following code shows the recommended way of doing this:

LET YLOW = MINIMUM Y
LET YUPP = MAXIMUM Y
LET YLOW = YLOW - 0.5
CLASS LOWER YLOW
LET YUPP = YUPP + 0.5
CLASS UPPER YUPP
CLASS WIDTH = 1
LET Y2 X2 = BINNED Y
LET LAMBDA = 4.2
POISSON PROBABILITY PLOT Y2 X2

This will center the bins around the integer values and will cover the first and last class.
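
The effect of these class settings can be mimicked in Python. This is a sketch only, with a hypothetical function name; it keeps zero-count classes, and whether the BINNED command does the same is not assumed.

```python
from collections import Counter


def bin_integers(y):
    """Group integer data into width-1 classes centered on the integers.

    Returns (frequencies, midpoints), mirroring the roles of Y2 and X2
    in the commands above. Class k spans k - 0.5 to k + 0.5.
    """
    lo, hi = min(y), max(y)
    midpoints = list(range(lo, hi + 1))  # first class starts at lo - 0.5
    counts = Counter(y)
    frequencies = [counts[k] for k in midpoints]  # Counter gives 0 if absent
    return frequencies, midpoints


freq, mid = bin_integers([3, 5, 4, 4, 2, 7, 4, 5, 3, 4])
for f, m in zip(freq, mid):
    print(f"value {m}: frequency {f}")
```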

Default:
None
Synonyms:
EV2 and FRECHET are synonyms for EXTREME VALUE TYPE 2.
EV1 and GUMBEL are synonyms for EXTREME VALUE TYPE 1.
FATIGUE LIFE is a synonym for FL.
RECIPROCAL INVERSE GAUSSIAN is a synonym for RIG.
IG is a synonym for INVERSE GAUSSIAN.
SKEW LAPLACE is a synonym for SKEW DOUBLE EXPONENTIAL.
ASYMMETRIC LAPLACE is a synonym for ASYMMETRIC DOUBLE EXPONENTIAL.
Related Commands:
FREQUENCY PLOT = Generates a frequency plot.
HISTOGRAM = Generates a histogram.
PIE CHART = Generates a pie chart.
PERCENT POINT PLOT = Generates a percent point plot.
PPCC PLOT = Generates a probability plot correlation coefficient plot.
PLOT = Generates a data or function plot.
References:
James J. Filliben (1975), "The Probability Plot Correlation Coefficient Test for Normality", Technometrics, Vol. 17, No. 1.

Chambers, Cleveland, Kleiner, and Tukey (1983), "Graphical Methods of Data Analysis", Wadsworth.

Karl Bury (1999), "Statistical Distributions in Engineering", Cambridge University Press.

Applications:
Distributional Analysis
Implementation Date:
Pre-1987: Original implementation
1990/5: WALD, FL, RIG, INVERSE GAUSSIAN
1993/12: GENERALIZED PARETO
1994/9: DISCRETE UNIFORM, NON-CENTRAL T, NON-CENTRAL F,
NON-CENTRAL CHI-SQUARE, NON-CENTRAL BETA, DOUBLY NON-CENTRAL T, DOUBLY NON-CENTRAL F, HYPERGEOMETRIC
1994/10: VON-MISES
1995/5: POWER LOGNORMAL, POWER NORMAL, COSINE,
ALPHA, POWER FUNCTION, CHI, LOGARITHMIC SERIES, LOG LOGISTIC, GENERALIZED GAMMA, WARING
1995/9: ANGLIT, ARCSIN, FOLDED NORMAL, TRUNCATED NORMAL
1995/10: LOG GAMMA, HYPERBOLIC SECANT, GOMPERTZ
1995/12: PARETO SECOND KIND, DOUBLE WEIBULL,
WRAPPED UP CAUCHY, EXPONENTIATED WEIBULL, TRUNCATED EXPONENTIAL, GENERALIZED LOGISTIC, EXPONENTIAL POWER
1996/1: DOUBLE GAMMA, BETA-KAPPA, FOLDED CAUCHY
1996/5: BETA BINOMIAL, GENERALIZED EXPONENTIAL
1998/5: RECIPROCAL, NORMAL MIXTURE, INVERTED GAMMA
2001/10: GENERALIZED LAMBDA, JOHNSON SU,
JOHNSON SB, INVERTED WEIBULL, LOG DOUBLE EXPONENTIAL
2002/5: TWO-SIDED POWER, BI-WEIBULL
2003/5: ERROR
2004/1: TRAPEZOID, GENERALIZED TRAPEZOID, FOLDED T,
SKEWED T, SKEWED NORMAL, SLASH, INVERTED BETA, G AND H
2004/5: Implemented the automatic computation of the
biweight fit (PPA0BW and PPA1BW)
2004/5: LOG SKEW NORMAL, LOG SKEW T, HERMITE, YULE
2004/5: Fixed a number of bugs for various distributions
2004/6: SKEW DOUBLE EXPONENTIAL, RAYLEIGH,
ASYMMETRIC DOUBLE EXPONENTIAL, MAXWELL
2004/8: Meeker reparameterization of GOMPERTZ MAKEHAM,
GENERALIZED ASYMMETRIC LAPLACE, GENERALIZED INVERSE GAUSSIAN
2004/9: MCLEISH, GENERALIZED MCLEISH, BESSEL I FUNCTION,
BESSEL K FUNCTION
2004/10: SET PROBABILITY PLOT DATA POINTS
2004/10: Support for censored data
2005/5: Support unequal bin widths for frequency data
Program:
MULTIPLOT 2 2
MULTIPLOT CORNER COORDINATES 0 0 100 100
MULTIPLOT SCALE FACTOR 1.5
TITLE AUTOMATIC
X1LABEL THEORETICAL VALUE
Y1LABEL DATA VALUE
TITLE OFFSET 2
X1LABEL DISPLACEMENT 10
Y1LABEL DISPLACEMENT 14
CHAR X
LINE BLANK
.
LET Y = NORMAL RANDOM NUMBERS FOR I = 1 1 100
NORMAL PROBABILITY PLOT Y
.
LET NU = 5
LET Y = CHI-SQUARE RANDOM NUMBERS FOR I = 1 1 100
CHI-SQUARE PROBABILITY PLOT Y
.
LET Y = EXPONENTIAL RANDOM NUMBERS FOR I = 1 1 100
EXPONENTIAL PROBABILITY PLOT Y
.
LET Y = CAUCHY RANDOM NUMBERS FOR I = 1 1 1000
CAUCHY PROBABILITY PLOT Y
END OF MULTIPLOT


Date created: 08/30/2005
Last updated: 12/04/2023