BEST DISTRIBUTIONAL FIT

Name:
    BEST DISTRIBUTIONAL FIT
Type:
    Analysis Command
Purpose:
    Generate a ranked list of the best distributional fits for univariate data.
Description:
    A common task is to find a good distributional fit to a set of univariate data. This command can be used as a screening tool to identify good candidate models.

    There are two steps in this process:

    1. Fitting
    2. Ranking by a goodness of fit criterion

    You can specify the fit method with the command

      SET BEST FIT METHOD <value>

    where <value> is one of the following

      MAXIMUM LIKELIHOOD: maximum likelihood
      PPCC: PPCC goodness of fit
      ANDERSON DARLING: Anderson-Darling goodness of fit
      KOLMOGOROV SMIRNOV: Kolmogorov-Smirnov goodness of fit

    The default method is maximum likelihood.

    You can specify the goodness of fit criterion with the command

      SET BEST FIT CRITERION <value>

    where <value> is one of the following

      ANDERSON DARLING: Anderson-Darling
      KOLMOGOROV SMIRNOV: Kolmogorov-Smirnov
      PPCC: PPCC
      AIC: Akaike Information Criterion
      AICc: Akaike Information Criterion corrected for sample size
      BIC: Bayesian Information Criterion

    The default goodness of fit criterion is Anderson-Darling.
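
    As a minimal sketch of how the two settings combine, the following commands request maximum likelihood fitting and then rank the resulting fits by BIC rather than by the default Anderson-Darling statistic. The response variable Y is assumed to have been read already.

```dataplot
.  Fit by maximum likelihood, rank by BIC
.  (Y is assumed to exist already)
set best fit method     maximum likelihood
set best fit criterion  bic
best distributional fit y
```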

    Note that this command is intended strictly as a screening tool to identify good candidate distributions. You should perform a more complete analysis once you identify appropriate candidate distributions. Also, you may be able to improve the fit for certain distributions by fine-tuning the starting values.

    We do not recommend simply selecting the "best" distribution from the list. Rather, this command is meant to identify good candidate models that should be examined more carefully. For example, a simpler distribution that provides nearly as good a fit as a more complicated distribution may be preferred. In some cases, a distribution that has a more meaningful physical interpretation or has established usage in a given area of work may be preferred.

    For performance reasons, not all possible distributions are included.

Syntax 1:
    BEST DISTRIBUTIONAL FIT <y>             <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    For this syntax, the response variable can be a matrix.

Syntax 2:
    MULTIPLE BEST DISTRIBUTIONAL FIT <y1> ... <yk>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> ... <yk> is a list of 1 to 30 response variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax generates a best distributional analysis for each listed response variable. These response variables can be matrices.

Syntax 3:
    REPLICATED BEST DISTRIBUTIONAL FIT <y> <x1> ... <xk>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y> is a response variable;
    <x1> ... <xk> is a list of 1 to 6 group-id variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax performs a cross-tabulation of <x1> ... <xk> and performs the best distributional fit analysis for each unique combination of cross-tabulated values. For example, if X1 has 3 levels and X2 has 2 levels, there will be a total of 6 best distributional fit analyses performed.
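
    The two-factor case described above can be sketched as follows. The variable names Y, X1, and X2 are hypothetical and are assumed to have been read already, with X1 taking 3 distinct values and X2 taking 2.

```dataplot
.  X1 has 3 levels and X2 has 2 levels, so 6 analyses are generated
replicated best distributional fit y x1 x2
.
.  Restricting to the rows with X2 = 1 leaves only the 3 levels of X1
replicated best distributional fit y x1 x2 subset x2 = 1
```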

Examples:
    BEST DISTRIBUTIONAL FIT Y
    BEST DISTRIBUTIONAL FIT Y SUBSET TAG = 1
    MULTIPLE BEST DISTRIBUTIONAL FIT Y1 TO Y5
    REPLICATED BEST DISTRIBUTIONAL FIT Y X
Note:
    This command is currently limited to raw data (i.e., not binned) and continuous distributions.
Note:
    If fitting is performed using maximum likelihood, the generalized Pareto and generalized extreme value distributions will be fit using the elemental percentiles method. This is done since the maximum likelihood, moment, and L-moment estimates may not be valid for certain ranges of the distribution parameters.

    Also, distributions that expect all positive (or negative) numbers will be shifted appropriately before performing the maximum likelihood estimation.

    Since this command is intended as a quick screening method, not all distributions for which Dataplot supports maximum likelihood estimation are included.

Note:
    If fitting is performed using one of the goodness of fit statistics (i.e., PPCC, Anderson-Darling, Kolmogorov-Smirnov), then the distributions are limited to location-scale distributions or distributions with a single shape parameter. The one exception is the G and H distribution (which has two shape parameters).

    This restriction is primarily for performance reasons.

Note:
    The AIC is computed as

      AIC = 2*k - 2*LN(L)

    where k is the number of parameters being fit and L is the maximized value of the likelihood function.

    The AICc is computed as

      AICc = AIC + 2*k*(k+1)/(n-k-1)

    where n is the sample size. The AICc is recommended over the AIC when the sample size is small or k is large. Since AICc converges to AIC for large n, some analysts prefer to use AICc rather than AIC for all cases.

    The BIC is computed as

      BIC = -2*LN(L) + k*LN(n)

    The penalty term for extra parameters, k*LN(n), is larger in the BIC than the corresponding 2*k term in the AIC whenever LN(n) > 2, that is, for n >= 8.
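
    To make the size of these penalties concrete, the sketch below evaluates them for n = 31 (the sample size of the data in the Program section of this page) and k = 2. The parameter names are illustrative, and LOG is assumed here to be Dataplot's natural logarithm function.

```dataplot
.  Penalty terms for n = 31 observations and k = 2 parameters
let n = 31
let k = 2
.  AIC penalty: 2*k = 4
let aicpen = 2*k
.  AICc small-sample correction: 2*2*3/28 = 0.4286 (approx.)
let aiccadd = 2*k*(k+1)/(n-k-1)
.  BIC penalty: 2*LN(31) = 6.868 (approx.)
let bicpen = k*log(n)
print aicpen
print aiccadd
print bicpen
```

    For this sample size, the total AICc penalty is about 4.43 and the BIC penalty is about 6.87, so the BIC penalizes additional parameters most heavily of the three.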

Note:
    The PPCC ranking method is based on the "most linear" probability plot, where linearity is measured by the correlation coefficient of the points on the probability plot. The probability plot has the property that it is invariant to location and scale. In practical terms, this means that the linearity of the probability plot only depends on the shape parameters, not the location and scale parameters.

    So if we use a non-PPCC method to estimate the parameters and use the PPCC as a ranking method, there is an additional implicit estimate for the location and scale parameters. For this reason, the PPCC ranking method is only supported when a PPCC fitting method is used.
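
    The implicit location and scale estimate can be seen from a single probability plot. The sketch below assumes that Y has been read already and that the PROBABILITY PLOT command saves the correlation coefficient, intercept, and slope of the fitted line in the internal parameters PPCC, PPA0, and PPA1. These parameter names are an assumption and should be checked against the PROBABILITY PLOT documentation.

```dataplot
.  Normal probability plot of Y (Y is assumed to exist already)
normal probability plot y
.  Assumed internal parameters saved by the probability plot:
.    PPCC = correlation coefficient of the plotted points
.    PPA0 = intercept of the fitted line (location estimate)
.    PPA1 = slope of the fitted line (scale estimate)
print ppcc
print ppa0
print ppa1
```

    In the output for the Program example on this page, the NORMAL row reports a scale estimate of 7.382224 under PPCC fitting but 7.253381 under maximum likelihood fitting, which illustrates that the two methods estimate location and scale differently.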

Note:
    Some distributions are bounded and are more commonly defined by the lower and upper limits rather than the location and scale parameters (for these distributions, the lower limit is the location parameter and the upper limit minus the lower limit is the scale parameter). Examples include the uniform, beta, power, and Topp and Leone distributions. In the output table, these distributions are marked with an asterisk to indicate that the estimated values for the location and scale parameters are actually the estimates for the lower and upper limit, respectively.
Note:
    If the following command is given

      SET BEST FIT FONG ON

    then two additional columns are given.

    The first additional column gives the PDF value at 0. The second additional column is set to 1 if the distribution has an infinite lower tail and it is set to 0 if the distribution has a bounded lower tail.

    For the first additional column, the following commands can also be used

      SET BEST FIT FONG TYPE <PDF/CDF>
      SET BEST FIT FONG XVALUE <VALUE>

    The SET BEST FIT FONG TYPE command specifies whether the CDF or the PDF value is printed (the default is the PDF value). The SET BEST FIT FONG XVALUE command specifies the value (the default is 0) at which the PDF (or CDF) is evaluated.

    These options were added at the request of Jeffrey Fong. For lifetime data (and many other types of measurement data), we will only have positive measurements. So it is of interest whether we have non-negligible probability below some threshold value. It may be that some distributions indicate good fit, but they are not appropriate distributional models due to non-negligible probability for inadmissible values.

    Note that unbounded distributions may provide adequate distributional models for these cases if the cumulative probability below some threshold is practically, even if not theoretically, zero.
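
    A minimal sketch of these options used together follows. It requests the two additional columns and reports the CDF, rather than the PDF, at a threshold. The threshold value 10 is hypothetical, and Y is assumed to have been read already.

```dataplot
.  Add the two extra columns, reporting the CDF at a threshold of 10
.  (the threshold value 10 is hypothetical)
set best fit fong on
set best fit fong type cdf
set best fit fong xvalue 10
best distributional fit y
```

    Distributions whose CDF value at the threshold is practically zero remain plausible candidates even if their lower tail is unbounded.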

Note:
    For the 2-parameter Beta, power, reflected power, Topp and Leone, reflected generalized Topp and Leone, and two-sided power distributions, you can fix the lower and upper limits with the commands

      LET LOWLIMIT = <value>
      LET UPPLIMIT = <value>

    These limits will apply to all of these distributions. There is currently no way to specify different lower and upper limits for the different distributions.

    If these values are not set, then default values based on the data will be used (typically based on the minimum and maximum values of the data).
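
    As a sketch, the commands below fix the limits before running the analysis. The values 15 and 50 are hypothetical, chosen only so that they bracket the example data in the Program section (minimum 18.83, maximum 45.381). Y is assumed to have been read already.

```dataplot
.  Fix the lower and upper limits for the bounded distributions
.  (15 and 50 are hypothetical values)
let lowlimit = 15
let upplimit = 50
best distributional fit y
```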

Default:
    None
Synonyms:
    ML is a synonym for MAXIMUM LIKELIHOOD
    AD is a synonym for ANDERSON DARLING
    KS is a synonym for KOLMOGOROV SMIRNOV
Related Commands:
Applications:
    Distributional Modeling
Implementation Date:
    2011/03
    2012/10: Added AIC, AICc, and BIC ranking methods
Program:
     
    .  Step 1: Read the data
    .
    .          Following data from Jeffrey Fong of the NIST
    .          Applied and Computational Mathematics Division.
    .          This is strength data in ksi units.
    .
    read y
    18.830
    20.800
    21.657
    23.030
    23.230
    24.050
    24.321
    25.500
    25.520
    25.800
    26.690
    26.770
    26.780
    27.050
    27.670
    29.900
    31.110
    33.200
    33.730
    33.760
    33.890
    34.760
    35.750
    35.910
    36.980
    37.080
    37.090
    39.580
    44.045
    45.290
    45.381
    end of data
    .
    set write decimals 5
    .
    .  Step 2: Generate ranked lists of best distributional fits
    .          using ML and PPCC fitting
    .
    .  Maximum likelihood method
    .
    set best fit method     ml
    set best fit criterion  anderson darling
    best distributional fit  y
    .
    set best fit method     ml
    set best fit criterion  kolm smir
    best distributional fit  y
    .
    .  PPCC method
    .
    set best fit method     ppcc
    set best fit criterion  ppcc
    best distributional fit  y
        
    The following output is generated.
                Best Distributional Fit
     
    Response Variable: Y
     
    Fit Method: Maximum Likelihood
    Ranking Criterion: Anderson Darling
     
    Summary Statistics:
    Number of Observations:                  31
    Sample Minimum:                            18.83000
    Sample Maximum:                            45.38100
    Sample Mean:                               30.81142
    Sample SD:                                 7.253381
     
    Ranked List of Best Fit
    ----------------------------------------------------------------------------------------------------
                                    Goodness       Estimate       Estimate       Estimate       Estimate
                                      of Fit             of             of       of Shape       of Shape
    Distribution                   Statistic       Location          Scale    Parameter 1    Parameter 2
    ----------------------------------------------------------------------------------------------------
    *TRIANGULAR                0.3332130       17.33848       49.26861       25.50000                 **
    3-PAR WEIBULL (MINIMUM)    0.3380554       17.64420       14.83507       1.913580                 **
    3-PAR INVERSE GAUSSIAN     0.3874339       6.764274       1.000000       255.1458       24.04715
    2-PAR LOGNORMAL            0.3888329                 **   30.00134      0.2349026                 **
    GUMBEL (MAXIMUM)           0.3980371       27.39966       5.986812                 **             **
    3-PAR LOGNORMAL            0.3984710       6.066865       23.72709      0.2917821                 **
    2-PAR INVERTED GAMMA       0.4062290                 **   555.0562       18.99880                 **
    2-PAR INVERSE GAUSSIAN     0.4067194       0.000000       1.000000       563.9835       30.81142
    *4-PAR BETA (MOMENTS)      0.4108797       18.80345       50.45846       1.307856       2.199731
    2-PAR GAMMA                0.4386866                 **   1.627518       18.93154                 **
    2-PAR BURR TYPE 10         0.4464637                 **   19.47685       7.276202                 **
    2-PAR FRECHET (MAX)        0.4681346                 **   26.74577       4.659726                 **
    2-PAR INVERTED WEIBULL     0.4681357                 **   26.74577       4.659730                 **
    LOGISTIC EXPONENTIAL       0.4897890                 **   43.44812       5.187883                 **
    NORMAL                     0.5321921       30.81142       7.253381                 **             **
    FOLDED NORMAL              0.5559204       30.81142       7.135432                 **             **
    LOGISTIC                   0.5728510       30.44662       4.224463                 **             **
    2-PAR WEIBULL (MINIMUM)    0.5973435                 **   33.67424       4.635390                 **
    *REFLECTED POWER           0.7151471       18.56449       45.64651       1.110959                 **
    *2-PAR BETA                0.7213293       18.56449       45.64651       1.021440       1.126894
    *REFL GENE TOPP AND LEONE  0.8370549       18.80345       45.40755      0.5000000      0.7780750
    BIRNBAUM SAUNDERS          0.8477387                 **   30.00279      0.3283453                 **
    SLASH                      0.8526063       30.46421       3.523827                 **             **
    *POWER                     0.8635646       18.56449       45.64651      0.9463008                 **
    DOUBLE EXPONENTIAL         0.8691080       29.90000       6.124452                 **             **
    RAYLEIGH                   0.9356298       18.79377       9.882772                 **             **
    GUMBEL (MININUM)           0.9867376       34.50269       7.278262                 **             **
    CAUCHY                      1.200882       29.25895       5.093631                 **             **
    *TOPP AND LEONE             1.264385       18.56449       45.64651       1.573167                 **
    1-PAR MAXWELL               2.618183                 **   18.25977                 **             **
    PARETO                      3.437629       0.000000       1.000000       2.077101       18.53761
    2-PAR WEIBULL (MAXIMUM)     3.657144                 **   14.76277       1.088025                 **
    2-PAR EXPONENTIAL           4.013110       18.83000       11.98142                 **             **
    *UNIFORM                    5.244683       18.83000       45.38100                 **             **
    2-PAR FRECHET (MIN)         8.330173                 **  0.9246548      0.1772251                 **
    2-COMP NORMAL MIXTURE       23.99764                 **             **   24.57355       35.74121
     
     
    * denotes lower/upper limit rather than location/scale
     
     
                Best Distributional Fit
     
    Response Variable: Y
     
    Fit Method: Maximum Likelihood
    Ranking Criterion: Kolmogorov Smirn
     
    Summary Statistics:
    Number of Observations:                  31
    Sample Minimum:                            18.83000
    Sample Maximum:                            45.38100
    Sample Mean:                               30.81142
    Sample SD:                                 7.253381
     
     
    Ranked List of Best Fit
    ----------------------------------------------------------------------------------------------------
                                    Goodness       Estimate       Estimate       Estimate       Estimate
                                      of Fit             of             of       of Shape       of Shape
    Distribution                   Statistic       Location          Scale    Parameter 1    Parameter 2
    ----------------------------------------------------------------------------------------------------
    *4-PAR BETA (MOMENTS)      0.1014130       18.80345       50.45846       1.307856       2.199731
    *TRIANGULAR                0.1113989       17.33848       49.26861       25.50000                 **
    3-PAR WEIBULL (MINIMUM)    0.1170822       17.64420       14.83507       1.913580                 **
    2-PAR LOGNORMAL            0.1219492                 **   30.00134      0.2349026                 **
    2-PAR INVERSE GAUSSIAN     0.1236868       0.000000       1.000000       563.9835       30.81142
    3-PAR LOGNORMAL            0.1287544       6.066865       23.72709      0.2917821                 **
    BIRNBAUM SAUNDERS          0.1297689                 **   30.00279      0.3283453                 **
    3-PAR INVERSE GAUSSIAN     0.1298348       6.764274       1.000000       255.1458       24.04715
    2-PAR INVERTED GAMMA       0.1318310                 **   555.0562       18.99880                 **
    2-PAR BURR TYPE 10         0.1326033                 **   19.47685       7.276202                 **
    LOGISTIC EXPONENTIAL       0.1329623                 **   43.44812       5.187883                 **
    SLASH                      0.1342469       30.46421       3.523827                 **             **
    2-PAR GAMMA                0.1349165                 **   1.627518       18.93154                 **
    GUMBEL (MAXIMUM)           0.1358038       27.39966       5.986812                 **             **
    *TOPP AND LEONE            0.1400991       18.56449       45.64651       1.573167                 **
    LOGISTIC                   0.1425181       30.44662       4.224463                 **             **
    2-PAR FRECHET (MAX)        0.1456706                 **   26.74577       4.659726                 **
    2-PAR INVERTED WEIBULL     0.1456709                 **   26.74577       4.659730                 **
    *REFLECTED POWER           0.1489988       18.56449       45.64651       1.110959                 **
    *2-PAR BETA                0.1491331       18.56449       45.64651       1.021440       1.126894
    NORMAL                     0.1513989       30.81142       7.253381                 **             **
    2-PAR WEIBULL (MINIMUM)    0.1525868                 **   33.67424       4.635390                 **
    FOLDED NORMAL              0.1539952       30.81142       7.135432                 **             **
    RAYLEIGH                   0.1570339       18.79377       9.882772                 **             **
    DOUBLE EXPONENTIAL         0.1598958       29.90000       6.124452                 **             **
    GUMBEL (MININUM)           0.1601804       34.50269       7.278262                 **             **
    CAUCHY                     0.1612234       29.25895       5.093631                 **             **
    *REFL GENE TOPP AND LEONE  0.1625837       18.80345       45.40755      0.5000000      0.7780750
    *POWER                     0.1728242       18.56449       45.64651      0.9463008                 **
    *UNIFORM                   0.1832347       18.83000       45.38100                 **             **
    2-PAR EXPONENTIAL          0.2010937       18.83000       11.98142                 **             **
    1-PAR MAXWELL              0.2417328                 **   18.25977                 **             **
    PARETO                     0.2660583       0.000000       1.000000       2.077101       18.53761
    2-PAR WEIBULL (MAXIMUM)    0.2845987                 **   14.76277       1.088025                 **
    2-PAR FRECHET (MIN)        0.4239459                 **  0.9246548      0.1772251                 **
    2-COMP NORMAL MIXTURE      0.7554719                 **             **   24.57355       35.74121
     
     
    * denotes lower/upper limit rather than location/scale
     
     
                Best Distributional Fit
     
    Response Variable: Y
     
    Fit Method: PPCC
    Ranking Criterion: PPCC
     
    Summary Statistics:
    Number of Observations:                  31
    Sample Minimum:                            18.83000
    Sample Maximum:                            45.38100
    Sample Mean:                               30.81142
    Sample SD:                                 7.253381
     
     
    Ranked List of Best Fit
    ----------------------------------------------------------------------------------------------------
                                    Goodness       Estimate       Estimate       Estimate       Estimate
                                      of Fit             of             of       of Shape       of Shape
    Distribution                   Statistic       Location          Scale    Parameter 1    Parameter 2
    ----------------------------------------------------------------------------------------------------
    *TRIANGULAR                0.9903866       33.78641       50.83147       24.96409                 **
    *TOPP AND LEONE            0.9898597       19.21407       64.69333       1.256061                 **
    GENE TUKEY LAMBDA          0.9898393       29.50906       7.489485      0.7000000      0.3000000
    GENE PARETO (MAX)          0.9892061       20.18056       17.07700     -0.6000000                 **
    *REFLECTED POWER           0.9892025       20.24849       28.72833       1.708333                 **
    3-PAR WEIBULL (MINIMUM)    0.9882787       16.08011       16.72055       2.104016                 **
    GENE EXT VAL (MIN)         0.9882166       32.88338       7.894681      0.4509018                 **
    RAYLEIGH                   0.9881071       16.73097       11.30290                 **             **
    3-PAR BURR TYPE 10         0.9881069       16.71816       15.98673       1.001807                 **
    2-PAR MAXWELL              0.9877088       13.32804       10.99731                 **             **
    BRADFORD                   0.9871834       20.53644       24.96087       1.907258                 **
    3-PAR WEIBULL (MAXIMUM)    0.9869599       74.55574       46.73858       6.913655                 **
    GENE EXT VAL (MAX)         0.9869491       27.83139       6.788558      0.1503006                 **
    3-PAR GAMMA                0.9867522       5.208567       2.171618       11.82349                 **
    WALD                       0.9865174      -8.570647       39.45313       27.85562                 **
    BIRNBAUM SAUNDERS          0.9865080      -6.302986       36.45857      0.2002008                 **
    G                          0.9863602       30.19909       7.290451      0.1859296                 **
    G AND H                    0.9863597       30.21816       7.299022      0.1800000       0.000000
    DOUBLE GAMMA               0.9863375       30.81142       2.345620       2.705221                 **
    3-PAR LOGNORMAL            0.9863089      -6.151920       36.30533      0.2002008                 **
    3-PAR INVERTED GAMMA       0.9862202      -19.79904       2367.047       47.70000                 **
    DOUBLE WEIBULL             0.9859972       30.81142       7.044318       1.703213                 **
    3-PAR GEOM EXTREME EXPO    0.9856844       17.88092       4.973479       10.82149                 **
    *POWER                     0.9851532       21.03185       23.74664      0.7032828                 **
    GENE PARETO (MIN)          0.9851244       44.69703       33.26714      -1.400000                 **
    ERROR                      0.9830244       30.81142       12.18327       3.400000                 **
    TUKEY-LAMBDA               0.9827152       30.81142       6.945565      0.4000000                 **
    ANGLIT                     0.9826880       30.81142       21.03343                 **             **
    COSINE                     0.9823948       30.81142       6.379693                 **             **
    HALF-NORMAL                0.9822081       21.15793       12.25805                 **             **
    LOGISTIC EXPONENTIAL       0.9813604       2.311842       40.26116       4.823990                 **
    GUMBEL (MAXIMUM)           0.9806594       27.52850       5.921301                 **             **
    LOG LOGISTIC               0.9804096      -18.27834       48.59806       11.74677                 **
    NORMAL                     0.9801995       30.81142       7.382224                 **             **
    *UNIFORM                   0.9788557       18.56336       24.49611                 **             **
    3-PAR FRECHET (MAX)        0.9787403      -261.9942       289.4954       50.00000                 **
    3-PAR INVERTED WEIBULL     0.9787403      -261.9942       289.4954       50.00000                 **
    LOGISTIC                   0.9750343       30.81142       4.173573                 **             **
    LOG GAMMA                  0.9748592      -85.60675       36.38708       25.00000                 **
    HYPERBOLIC SECANT          0.9700100       30.81142       4.870712                 **             **
    ASYMM DOUBLE EXPO          0.9686984       27.98396       7.188828      0.7532995                 **
    ARCSINE                    0.9638854       21.02687       19.56909                 **             **
    LOG DOUBLE EXPONENTIAL     0.9635815      -23.48322       53.86625       10.00000                 **
    DOUBLE EXPONENTIAL         0.9590572       30.81142       5.438864                 **             **
    2-PAR EXPONENTIAL          0.9479883       23.50395       7.529710                 **             **
    GUMBEL (MININUM)           0.9350657       33.94171       5.646002                 **             **
    3-PAR FRECHET (MIN)        0.9300918       309.0630       275.1060       50.00000                 **
    SLASH                      0.8225079       30.81142       1.106939                 **             **
    CAUCHY                     0.8092121       30.81142       1.378849                 **             **
     
     
    * denotes lower/upper limit rather than location/scale
     
        

Date created: 09/22/2011
Last updated: 05/11/2016

Please email comments on this WWW page to alan.heckert@nist.gov.