**comment This is file PRINCIPL.TEX
General Principles of
Experiment Design & Data Analysis
James J. Filliben
1. Classical Data Analysis versus EDA (Exploratory Data Analysis):
Classical: data --> model* --> analysis --> conclusions
EDA: data* --> analysis --> model --> conclusions
2. Primary goal of EDA: insight
3. #1 tool of EDA: graphics
4. Plot the raw data
5. EDA is sequential:
data
simple statistics
univariate
multivariate
6. Essence of analysis (EDA): comparison
Essence of comparison: juxtaposition
therefore ==> multiplotting
7. When in doubt: test underlying assumptions
8. A number without an uncertainty is useless
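A minimal sketch of item 8, assuming independent readings and assuming that the standard error of the mean is an acceptable uncertainty statement. The readings and the helper name mean_with_uncertainty are hypothetical, for illustration only.

```python
import math
import statistics

def mean_with_uncertainty(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values to estimate an uncertainty")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 divisor) divided by sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err

# Hypothetical readings, for illustration only.
readings = [9.8, 10.1, 10.0, 9.9, 10.2]
mean, std_err = mean_with_uncertainty(readings)
print(f"{mean:.3f} +/- {std_err:.3f} (standard error, n={len(readings)})")
```

Other uncertainty statements (a confidence interval, a resampling interval) would serve the same purpose; the point is only that the number is never reported alone.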
9. Check outliers
10. EDA:
1. check global structure
2. check fine structure
11. To look at fine structure,
must subtract out global structure first
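A sketch of item 11, assuming the global structure is a straight-line drift in run order (an assumption made here for illustration; the global structure could equally be a factor effect or a curve). The series and the helper name detrend_linear are hypothetical. The residuals are what would be plotted to look at fine structure.

```python
import statistics

def detrend_linear(y):
    """Fit y = a + b*i by least squares (i = 0, 1, ...).

    Returns (intercept, slope, residuals).
    """
    x = list(range(len(y)))
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, residuals

# Hypothetical series: a linear drift plus a small alternating pattern.
y = [2.0 * i + (0.5 if i % 2 == 0 else -0.5) for i in range(10)]
intercept, slope, residuals = detrend_linear(y)
print(f"slope = {slope:.3f}")
print("residuals:", [round(r, 2) for r in residuals])
```

In the raw series the drift dominates; in the residuals the alternating pattern is the only thing left to see.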
12. Optimal estimator of location depends
on the underlying distribution
13. Therefore, should "estimate" distribution first
before choosing location estimator
14. Estimate distribution via probability plots
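A sketch of items 12 to 14 for one candidate distribution (the normal), assuming the usual construction of a probability plot: ordered data against theoretical quantiles, with the straightness of the plot summarized by a correlation coefficient. The plotting positions use a common approximation to the uniform order statistic medians; the data and the helper names are hypothetical.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def normal_probability_plot(values):
    """Return (theoretical quantiles, ordered data, their correlation)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    ordered = sorted(values)
    # Approximate medians of the uniform order statistics.
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1.0 / n)
    medians[0] = 1.0 - medians[-1]
    std_normal = statistics.NormalDist(0.0, 1.0)
    quantiles = [std_normal.inv_cdf(m) for m in medians]
    return quantiles, ordered, pearson(quantiles, ordered)

# Hypothetical samples: one roughly normal, one strongly skewed.
rng = random.Random(12345)
normal_like = [rng.gauss(50.0, 5.0) for _ in range(200)]
skewed = [rng.expovariate(1.0) for _ in range(200)]
_, _, r_normal = normal_probability_plot(normal_like)
_, _, r_skewed = normal_probability_plot(skewed)
print(f"normal-like sample: r = {r_normal:.4f}")
print(f"skewed sample:      r = {r_skewed:.4f}")
```

Plotting ordered against quantiles gives the probability plot itself; a near-linear plot (correlation near 1) supports the candidate distribution. Repeating this over several candidate distributions is one way to "estimate" the distribution before choosing a location estimator.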
15. Plot simple statistics:
ybar(i) vs i
sd(i) vs i
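A sketch of item 15, assuming the data are already split into subsets i = 1, 2, ... (batches, days, labs). The numbers and the helper name subset_statistics are hypothetical. The two columns of output are the coordinates for the ybar(i) vs i and sd(i) vs i plots.

```python
import statistics

def subset_statistics(groups):
    """Return a list of (i, mean, sd) tuples, one per subset, i = 1, 2, ..."""
    return [
        (i, statistics.mean(g), statistics.stdev(g))
        for i, g in enumerate(groups, start=1)
    ]

# Hypothetical subsets, for illustration only.
groups = [
    [10.1, 9.9, 10.0],
    [10.4, 10.6, 10.5],
    [9.0, 11.0, 10.0],
]
stats = subset_statistics(groups)
for i, ybar, sd in stats:
    print(f"i={i}  ybar={ybar:.2f}  sd={sd:.2f}")
```

Here subset 2 shifts in location and subset 3 shifts in variation; neither would be visible from a single overall mean and standard deviation.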
16. If multidimensional/multi-factors:
then use
1. multitrace
2. line types
3. character types
4. color
5. dynamics
6. multiplotting
17. Global truth <===> subset truth
If looking to deduce a global truth,
then must be true over all/most subsets
18. If multivariable,
then look at
1. within (absolute)
2. between (relative)
19. For graphics:
keep data density up
keep chart junk down
keep white space down
20. In graphics, don't waste white space.
therefore ==> multitracing & multiplotting
21. As the sample size n increases,
and the number of factors k increases,
then the importance of graphics increases.
22. Goal of quantitative analysis: rigor
Goal of graphical analysis: sufficiency
save time
insight
structure
23. Conjecture:
If one can "prove it" quantitatively,
then there exists some graph that can
show/demonstrate it better.
24. Value of current computers
is not fast number-crunching,
it is fast graphics.
25. Local conclusions need not be true globally, but
global conclusions must be true locally;
therefore ==> subset plots
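A sketch of item 25, assuming a two-level factor (levels 'A' and 'B') observed in several subsets. The records, the subset labels and the helper names are hypothetical. The pooled effect is compared with the effect inside each subset, which is the numerical counterpart of a set of subset plots.

```python
import statistics

def effect_by_subset(records):
    """Return {subset: mean(response at 'B') - mean(response at 'A')}.

    records is a list of (subset, level, response) tuples.
    """
    by_subset = {}
    for subset, level, response in records:
        by_subset.setdefault(subset, {"A": [], "B": []})[level].append(response)
    return {
        subset: statistics.mean(levels["B"]) - statistics.mean(levels["A"])
        for subset, levels in by_subset.items()
    }

def pooled_effect(records):
    """Return mean(B) - mean(A) over all records, ignoring subsets."""
    a = [r for _, level, r in records if level == "A"]
    b = [r for _, level, r in records if level == "B"]
    return statistics.mean(b) - statistics.mean(a)

# Hypothetical data, for illustration only.
records = [
    ("lab1", "A", 10.0), ("lab1", "A", 11.0), ("lab1", "B", 13.0), ("lab1", "B", 14.0),
    ("lab2", "A", 20.0), ("lab2", "A", 21.0), ("lab2", "B", 22.0), ("lab2", "B", 25.0),
    ("lab3", "A", 30.0), ("lab3", "A", 31.0), ("lab3", "B", 29.0), ("lab3", "B", 30.0),
]
pooled = pooled_effect(records)
effects = effect_by_subset(records)
holds_everywhere = all(e > 0 for e in effects.values())
print(f"pooled effect: {pooled:.2f}")
print("effect by subset:", effects)
print("B > A in every subset:", holds_everywhere)
```

The pooled effect is positive, but lab3 goes the other way, so "B is better" is not yet a global conclusion.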
26. We can understand univariate structure
without understanding multivariate structure,
but
we cannot understand multivariate structure
without understanding univariate structure;
therefore ==> do univariate first.
27. Analysis is a comparison operation--
not
a memory operation;
therefore ==> multiple plots per page.
28. Ideal number of points/page: 1000 to 100,000;
but typical number is 10 to 100;
therefore ==> multiplotting
29. Which variable to put on horizontal axis:
1. the one with the greatest number of levels
2. the nuisance factor (as in block plot)
30. Impediments to structure-extraction: chart junk:
1. tics
2. tic labels
3. dead space between plots
4. grids
Bare necessity: data & frames
31. Connecting lines or not?
Do both! (fast graphics)
Lines: +: emphasize autocorrelation structure
-: impede distributional structure
1. randomness*
2. fixed distribution*
3. fixed location
4. fixed variation
32. If have a factor with 2 levels, then use
1. bihistogram
2. empirical Q-Q plot
3. Youden plot
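A sketch of the coordinates behind an empirical Q-Q plot for a factor with 2 levels (item 32), assuming quantiles are taken at positions (i - 0.5)/m with linear interpolation, where m is the smaller sample size. That choice of positions is an assumption made here, not the only convention. The samples and helper names are hypothetical.

```python
import math

def quantile(sorted_values, p):
    """Quantile by linear interpolation; p = (i - 0.5)/n gives the i-th value."""
    n = len(sorted_values)
    h = min(max(n * p - 0.5, 0.0), n - 1.0)
    lo = int(math.floor(h))
    hi = min(lo + 1, n - 1)
    frac = h - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def empirical_qq(sample_1, sample_2):
    """Return a list of (quantile of sample_1, quantile of sample_2) pairs."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    m = min(len(s1), len(s2))
    probs = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile(s1, p), quantile(s2, p)) for p in probs]

# Hypothetical responses at the two levels; level 2 is level 1 shifted by 2.
level_1 = [4.1, 5.0, 3.2, 6.3, 4.8]
level_2 = [v + 2.0 for v in level_1]
pairs = empirical_qq(level_1, level_2)
for q1, q2 in pairs:
    print(f"{q1:.2f}  {q2:.2f}")
```

Points parallel to the line y = x but offset from it suggest a pure shift in location; a slope different from 1 suggests a difference in variation.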
33. For multivariate, think
within: fine structure/univariate
as well as
between: global structure/multivariate
34. DEX sampling plan
is more important than
data analysis (GIGO: garbage in, garbage out)
35. Design of Experiment / Data Analysis Bridge:
DEX DAN
Problem =========> Data =========> Conclusions
36. Assumptions:
1. Most stat procedures have assumptions,
validity of engineering conclusions
depends on
validity of underlying assumptions
therefore ==> test assumptions
2. identify assumptions:
1. randomness
2. fixed distribution
3. fixed location
4. fixed variation
3. if given a choice (always!),
use those stat procedures
with fewer (or no) assumptions
e.g., use block plot rather than ANOVA F test
e.g., use binomial test rather than normal test
e.g., plot raw data (no assumptions!)
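A sketch of simple numerical checks on three of the assumptions in item 36 (randomness, fixed location, fixed variation), assuming the data are in run order. The lag-1 autocorrelation and the half-split comparison are swapped in here as rough stand-ins for the corresponding plots; the series and helper names are hypothetical.

```python
import random
import statistics

def lag1_autocorrelation(y):
    """Lag-1 autocorrelation; values near 0 are consistent with randomness."""
    y_bar = statistics.mean(y)
    num = sum((y[i] - y_bar) * (y[i + 1] - y_bar) for i in range(len(y) - 1))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

def half_split_summary(y):
    """Compare location and variation between the first and second halves."""
    mid = len(y) // 2
    first, second = y[:mid], y[mid:]
    return {
        "mean_first": statistics.mean(first),
        "mean_second": statistics.mean(second),
        "sd_first": statistics.stdev(first),
        "sd_second": statistics.stdev(second),
    }

# Hypothetical series: one random, one drifting in location.
rng = random.Random(7)
random_series = [rng.gauss(0.0, 1.0) for _ in range(200)]
drifting_series = [float(i) for i in range(50)]

r_random = lag1_autocorrelation(random_series)
r_drift = lag1_autocorrelation(drifting_series)
drift_summary = half_split_summary(drifting_series)
print(f"lag-1 autocorrelation, random series:   {r_random:.3f}")
print(f"lag-1 autocorrelation, drifting series: {r_drift:.3f}")
print("half-split summary, drifting series:", drift_summary)
```

The fixed-distribution assumption is not covered by this sketch; a histogram or probability plot is the natural check for it.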
37. Data Analysis Approaches:
How do we ensure that our conclusions are
not dependent
on the statistical approach we choose?
1. use data-based procedures only,
rather than stat-based procedures
(e.g., average & median)
2. use graphical procedures involving only data
3. use procedures with few assumptions
4. use multiple procedures (at least 2)
5. use assumption-robust / data-resistive procedures
6. use distributionally-correct procedures
e.g., average for normal & median for Cauchy
7. do in proper order:
check distribution first
check outliers first & body of data later
check univariate first & multivariate later
check subsets first & global later
(data analysis itself is an ordered operation)
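A small simulation sketch of point 6 of item 37 (average for normal, median for Cauchy). The sample size, number of replicates and seed are arbitrary assumptions, and the interquartile range is used as the spread measure because the mean of Cauchy samples has no finite variance. Helper names are hypothetical.

```python
import math
import random
import statistics

def cauchy_sample(rng, n):
    """Standard Cauchy draws via the inverse CDF."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def normal_sample(rng, n):
    """Standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def iqr(values):
    """Interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def estimator_spread(sampler, rng, n=51, replicates=500):
    """Return (IQR of sample means, IQR of sample medians) over replicates."""
    means, medians = [], []
    for _ in range(replicates):
        sample = sampler(rng, n)
        means.append(statistics.mean(sample))
        medians.append(statistics.median(sample))
    return iqr(means), iqr(medians)

rng = random.Random(2024)
cauchy_mean_iqr, cauchy_median_iqr = estimator_spread(cauchy_sample, rng)
normal_mean_iqr, normal_median_iqr = estimator_spread(normal_sample, rng)
print(f"Cauchy: IQR of means = {cauchy_mean_iqr:.3f}, IQR of medians = {cauchy_median_iqr:.3f}")
print(f"normal: IQR of means = {normal_mean_iqr:.3f}, IQR of medians = {normal_median_iqr:.3f}")
```

For Cauchy data the median is far more stable than the average; for normal data, theory says the average is the more efficient of the two. Hence: check the distribution first.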
38. Valid conclusions:
Bridge:
DEX Data analysis
problem & goal ===========> data ==========> conclusion
construct exp.
39. Valid Conclusions:
= f(problem & goal,
crisply stated
well-defined scope
DEX/goal pigeonhole
DEX (= design of experiment),
choice of DEX: orthogonal
randomization
replication
blocking
use of a control
frequent calibration/SRM
hardware/equipment
materials
measuring device
software/experimentation method
environment
factory/site
how to conduct the experiment
DEX pigeonhole
data,
outliers
DAN (= data analysis technique),
power of statistical procedures
correctness of statistical approach
validity of statistical assumptions
assumption-robust / outlier-resistive
software
statistician )
40. What DEX (design of experiment) to use
depends on
goal pigeonhole
41. What DAN (data analysis technique) to use
depends on
goal pigeonhole
42. If starting with a data set,
then must define goals first
(how do we know when we're done?)
before getting bogged down
by the details of the data analysis technique
43. To do goal pigeonholing--
make a multiple-choice
"walk away with what" list (of nouns/objects):
1. a typical value (location parameter) (a number)
2. a distribution & its parameters
3. a yes/no statement (a conclusion) as to whether
an engineering modification (different settings
of a factor) made a difference/improvement
4. a ranked list of factors (k factors)
5. a ranked list of best settings (k numbers)
6. a function & its parameters
7. a time series function (in time) & its parameters
44. If a list/plot is good,
then a sorted list/plot is better
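A sketch of a ranked list of factors (item 43, choice 4) presented as a sorted list (item 44), assuming a 2-level full factorial design coded as -1/+1 and assuming main effects are estimated as mean(response at +1) minus mean(response at -1). The responses are generated from a made-up formula purely for illustration; the factor names and the helper name main_effects are hypothetical.

```python
import itertools
import statistics

def main_effects(design, response):
    """Return one main effect per factor: mean(y at +1) - mean(y at -1)."""
    k = len(design[0])
    effects = []
    for j in range(k):
        high = [y for row, y in zip(design, response) if row[j] == +1]
        low = [y for row, y in zip(design, response) if row[j] == -1]
        effects.append(statistics.mean(high) - statistics.mean(low))
    return effects

# 2^3 full factorial; hypothetical noise-free responses.
design = list(itertools.product([-1, +1], repeat=3))
response = [50.0 + 5.0 * x1 - 1.0 * x2 + 10.0 * x3 for x1, x2, x3 in design]

names = ["X1", "X2", "X3"]
effects = main_effects(design, response)
# Sort by absolute effect, largest first.
ranked = sorted(zip(names, effects), key=lambda pair: abs(pair[1]), reverse=True)
for name, effect in ranked:
    print(f"{name}: {effect:+.1f}")
```

The unsorted list and the sorted list carry the same numbers; the sorted one answers "which factor matters most?" at a glance.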
45. DEX & DAN are dictated by goals & pigeonholing;
goals & pigeonholing are dictated by the experimentalist.
What is the specific goal of the experiment?
46. Setting project goals is
not
a statistics question
47. The primary question to distinguish
between a comparative problem
versus a screening problem:
1. Are all factors equally important a priori? yes/no
2. Do you care if:
you find differences in factor ...? yes/no
is factor ... out of control?
is factor ... a characteristic of the
(unchangeable) population?
48. Goals come first, techniques come later.
49. If standing on the desert shore of West Africa,
must decide first as to goal:
1. where you want to be at the end
before you decide whether you need (general techniques) a
1. desert guide
2. safari leader
3. boat captain
& well before you decide whether you need (detailed techniques) a
1. camel
2. machete
3. boat
50. Primary project goal question:
How do you know when you are done?