
Name:
STREAM READ
Type:
Support Command
Purpose:
Read data and compute selected statistics without reading the full data set into memory.
Description:
Dataplot is designed to be an "in memory" program. That is, when using the READ command, all of the data is read into memory. Although this is useful for running Dataplot interactively, large data sets are becoming more common. For these large data sets, it may not be possible to read all of the data into memory.

The STREAM READ command was added to allow some of these large data sets to be read, and certain statistics to be computed, without reading the entire data set into memory. Although only a limited number of analyses can be performed with this command, it may allow some useful initial exploratory analysis of these large data sets.

There are several variations of this command that will be described separately.

Syntax 1:
STREAM READ WRITE FILE.DAT <x1> <x2> ... <xk>
where <x1>, <x2>, ... <xk> is a list of variables to be read.

This version of the command is used to read the input file and to write a new version of the data using a specified Fortran-like format.

This command is useful in the following way. Large data files can take a long time to read. If you can use the SET READ FORMAT command to read the data, this can significantly speed up the read. For example, reading the data set used by the example programs below took 24.7 CPU seconds on a Linux machine running CentOS. Performing the same read on the same platform with a SET READ FORMAT required 0.6 CPU seconds. CPU times will vary depending on the hardware and operating system, but this is indicative of the relative performance improvement that can be obtained by using the SET READ FORMAT command. This example file is not particularly large (361,920 rows). The speed improvement becomes even more important for files with many millions of rows.

Often a large data set will not initially be in a format that the SET READ FORMAT command can handle. In that case, this command can be used once, together with the SET WRITE FORMAT command, to create a new version of the file in a format that the SET READ FORMAT command can read. This new file is then used for subsequent Dataplot sessions that use this data.
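
The idea behind this one-time conversion can be sketched outside Dataplot. The following Python sketch is illustrative only and is not how Dataplot implements STREAM READ WRITE. It assumes a whitespace-delimited numeric input file, and the function name rewrite_fixed_width and its parameters are hypothetical. It copies the file one line at a time, so memory use does not grow with the file size, and writes every value in a fixed 15-character field with 7 decimal digits, which is similar in spirit to a 10E15.7 format (the digit placement differs from what a Fortran E descriptor would print).

```python
def rewrite_fixed_width(src_path, dst_path, width=15, digits=7):
    """Copy a whitespace-delimited numeric file to fixed-width columns.

    Only one line is held in memory at a time.
    Returns the number of data rows written.
    """
    # Build a format such as "{:15.7E}" (hypothetical default layout).
    field = "{:%d.%dE}" % (width, digits)
    rows = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            dst.write("".join(field.format(float(t)) for t in tokens) + "\n")
            rows += 1
    return rows
```

Because every field has the same width, a reader that knows the layout can slice each line at fixed offsets instead of parsing free-format text, which is the kind of saving that a formatted read exploits.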

Syntax 2:
STREAM READ GROUP STATISTICS <stat> FILE.DAT <x1> <x2> ... <xk>
where <stat> is one of Dataplot's supported univariate statistics;
and where <x1>, <x2>, ... <xk> is a list of variables to be read.

This syntax will read the file a user-specified number of rows at a time. It will then replace those rows with the specified statistic. That is, the original data will be replaced with the specified statistic for fixed intervals of the data.

For example, you can read 1,000 rows, compute (and save) the mean for those 1,000 rows for each variable, then repeat for the next 1,000 rows. That is, the original data will be replaced with the means of fixed intervals of the data.
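
As a rough illustration of this fixed-interval reduction (not Dataplot's implementation), the Python sketch below consumes rows one at a time and emits the column means for each block of a given size. The name chunk_means is hypothetical, and reporting a final partial block is an assumption of the sketch; Dataplot's handling of a trailing partial block may differ.

```python
def chunk_means(rows, size):
    """Yield per-column means for each consecutive block of `size` rows.

    `rows` is any iterable of equal-length numeric sequences, so only the
    running sums for the current block are kept in memory.
    Assumption: a final partial block, if any, is also reported.
    """
    sums, count = None, 0
    for row in rows:
        if sums is None:
            sums = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[j] += value
        count += 1
        if count == size:
            yield [s / count for s in sums]
            sums, count = None, 0
    if count:
        yield [s / count for s in sums]
```

For instance, list(chunk_means(rows, 1000)) applied to a million-row stream would hold only 1,000 rows of means per column at the end.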

To specify the number of rows to read at a time, enter

SET STREAM READ SIZE <value>

Alternatively, you can specify one of the variables to define the group (i.e., when the value of the specified variable changes, this denotes the start of a new group). For this option, enter

SET STREAM READ GROUP VARIABLE <var-name>

This capability is motivated by the desire to handle large data sets that may exceed Dataplot's storage limits. This command allows you to compute some basic statistics (mean, minimum, maximum, standard deviation, and so on) for slices of the data. Often, some useful exploratory analysis can be performed on this compressed data.
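
The group-variable option can be pictured with the following Python sketch, which is a conceptual illustration rather than Dataplot's code. The name means_by_group and the row layout are hypothetical. A new group starts whenever the value in the group column differs from the previous row, which is the behavior of itertools.groupby on unsorted input.

```python
from itertools import groupby


def means_by_group(rows, group_col):
    """Yield (group_value, column_means) for each run of equal group values.

    A new group begins whenever the value in column `group_col` changes,
    so the same value can start more than one group if it reappears later.
    """
    for key, block in groupby(rows, key=lambda r: r[group_col]):
        sums, count = None, 0
        for row in block:
            if sums is None:
                sums = [0.0] * len(row)
            for j, value in enumerate(row):
                sums[j] += value
            count += 1
        yield key, [s / count for s in sums]
```

Only one group's running sums are held at a time, so the memory needed depends on the number of columns and not on the number of rows.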

Syntax 3:
STREAM READ DEFAULT STATISTICS FILE.DAT <x1> <x2> ... <xk>
where <x1>, <x2>, ... <xk> is a list of variables to be read.

This is a variant of Syntax 2 that computes a default set of statistics for each group in a single pass of the data. The following statistics are computed:

1. VALUE OF LAST ROW OF GROUP
2. GROUP-ID
3. SIZE
4. MINIMUM
5. MAXIMUM
6. MEAN
7. STANDARD DEVIATION
8. SKEWNESS
9. KURTOSIS
10. MEDIAN
11. INTERQUARTILE RANGE
12. RANGE
13. AUTOCORRELATION
14. LOWER QUARTILE
15. UPPER QUARTILE
16. 0.01 QUANTILE
17. 0.05 QUANTILE
18. 0.10 QUANTILE
19. 0.90 QUANTILE
20. 0.95 QUANTILE
21. 0.99 QUANTILE

For this syntax, a tag variable (TAGSTAT) will be created that identifies the statistic (i.e., each row of TAGSTAT contains a value from 1 to 21). TAGSTAT can be used to extract the desired statistic for each group.
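
The role of the tag variable can be shown with a small Python sketch. This is only an analogy for the extraction step; the values below are made up, the name select_statistic is hypothetical, and no particular row ordering of the Dataplot output is assumed beyond each value being paired with its tag.

```python
def select_statistic(values, tagstat, code):
    """Return the entries of `values` whose tag equals `code`."""
    return [v for v, t in zip(values, tagstat) if t == code]


# Made-up output for two groups, keeping only codes 4 (MINIMUM),
# 5 (MAXIMUM), and 6 (MEAN) for brevity.
red = [140.0, 900.0, 300.0, 150.0, 800.0, 310.0]
tagstat = [4, 5, 6, 4, 5, 6]

red_mean = select_statistic(red, tagstat, 6)  # one mean per group
```

Selecting on a single code yields one value per group, which can then be plotted or analyzed as an ordinary variable.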

Syntax 4:
STREAM READ FULL STATISTICS FILE.DAT <x1> <x2> ... <xk>
where <x1>, <x2>, ... <xk> is a list of variables to be read.

This syntax will compute the following statistics using a 1-pass algorithm for all of the data:

1. SIZE
2. MINIMUM
3. MAXIMUM
4. MEAN
5. STANDARD DEVIATION
6. SKEWNESS
7. KURTOSIS
8. RANGE

On output, each of <x1> ... <xk> will contain eight rows holding the statistics above, in the order listed, for the corresponding column read.
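
To show that these eight statistics can be obtained in one pass, the Python sketch below uses the standard streaming update formulas for the second, third, and fourth central moments. It is not taken from the Dataplot source, and the name one_pass_stats is hypothetical. The conventions used here are assumptions: the standard deviation uses N - 1 in the denominator, skewness and kurtosis use moments with N in the denominator, and kurtosis is not reduced by 3. Dataplot's exact definitions may differ.

```python
import math


def one_pass_stats(values):
    """Return [size, min, max, mean, sd, skewness, kurtosis, range].

    Each value is visited exactly once; only a few running totals are kept.
    """
    n = 0
    mean = m2 = m3 = m4 = 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update m4 and m3 before m2: they depend on the previous m2 and m3.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
        lo, hi = min(lo, x), max(hi, x)
    if n < 2 or m2 == 0.0:
        raise ValueError("need at least two values that are not all equal")
    sd = math.sqrt(m2 / (n - 1))         # assumption: N - 1 denominator
    skew = math.sqrt(n) * m3 / m2 ** 1.5
    kurt = n * m4 / (m2 * m2)            # assumption: not excess kurtosis
    return [n, lo, hi, mean, sd, skew, kurt, hi - lo]
```

Statistics such as the median and quantiles are absent from this list, which is consistent with a one-pass design: they generally cannot be computed exactly without storing or re-reading the data.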

Examples:
SET WRITE FORMAT 10E15.7
STREAM READ WRITE BIG.DAT X1 TO X10

SET STREAM READ SIZE 100
STREAM READ GROUP STATISTIC MEAN BIG.DAT X1 TO X10
STREAM READ GROUP STATISTIC STANDARD DEVIATION BIG.DAT X1 TO X10
STREAM READ DEFAULT STATISTICS BIG.DAT X1 TO X10

STREAM READ FULL STATISTICS BIG.DAT X1 TO X10

Note:
Note that the STREAM READ command has a number of limitations compared to the standard READ command.

1. Functions/strings, parameters, matrices, character data, and images are not supported.

2. Reading from the clipboard is not supported.

3. Automatic name detection is not supported.

4. The STREAM READ command is restricted to files (i.e., reading from the terminal is not supported).
Default:
None
Synonyms:
None
Related Commands:
Applications:
Data Input
Implementation Date:
2016/07
Program 1:

. Step 1:   Demonstrate the group statistic option of stream read
.
skip 25
set stream read group variable rowid
stream read group statistics mean elliottr.dat redcolme rowid colid
.
. Step 2:   Generate plot of column means
.
title offset 2
title case asis
label case asis
.
title Column Means for Red Pixels for ELLIOTTR.DAT
y1label Column Mean
x1label Row
.
plot redcolme vs rowid
.
. Step 3:   Reset read settings
.
skip 0

Program 2:

. Step 1:   Demonstrate the default statistic option of stream read
.
skip 25
set stream read group variable rowid
stream read default statistics elliottr.dat red rowid colid
.
let redmean = red
retain redmean subset tagstat = 6
let redsd = red
retain redsd subset tagstat = 7
let redmin = red
retain redmin subset tagstat = 4
let redmax = red
retain redmax subset tagstat = 5
.
. Step 2:   Plot some of the statistics
.
multiplot corner coordinates 5 5 95 95
multiplot scale factor 2
multiplot 2 2
.
label case asis
title case asis
case asis
title offset 2
.
title Mean of Columns
plot redmean
.
title SD of Columns
plot redsd
.
title Minimum of Columns
plot redmin
.
title Maximum of Columns
plot redmax
.
end of multiplot
.
justification center
move 50 97
text Statistics for Columns of Red Pixels in ELLIOTTR.DAT
.
. Step 3:   Reset read settings
.
skip 0

Program 3:

. Step 1:   Demonstrate the full statistics option of stream read
.
skip 25
stream read full statistics elliottr.dat red rowid colid
.
. Step 2:   Print statistics for red variable
.
feedback off
set write decimals 2
print "Statistics for variable RED:"
print " "
print " "
let aval = red(1)
print "Size:      ^aval"
let aval = red(2)
print "Minimum:   ^aval"
let aval = red(3)
print "Maximum:   ^aval"
let aval = red(4)
let aval = round(aval,2)
print "Mean:      ^aval"
let aval = red(5)
let aval = round(aval,2)
print "SD:        ^aval"
let aval = red(6)
let aval = round(aval,2)
print "Skewness:  ^aval"
let aval = red(7)
let aval = round(aval,2)
print "Kurtosis:  ^aval"
let aval = red(8)
print "Range:     ^aval"
feedback on
.
. Step 3:   Reset read settings
.
skip 0

Statistics for variable RED:

Size:      361920
Minimum:   140
Maximum:   4095
Mean:      369.38
SD:        745.5
Skewness:  4.23
Kurtosis:  20
Range:     3955


NIST is an agency of the U.S. Commerce Department.

Date created: 07/24/2017
Last updated: 07/24/2017