
STREAM READ

Name:
    STREAM READ
Type:
    Support Command
Purpose:
    Read data and compute selected statistics without reading the full data set into memory.
Description:
    Dataplot is designed to be an "in memory" program. That is, when using the READ command, all of the data is read into memory. Although this is useful for running Dataplot interactively, large data sets are becoming more common. For these large data sets, it may not be possible to read all of the data into memory.

    The STREAM READ command was added to allow some of these large data sets to be read, and certain statistics to be computed, without reading the entire data set into memory. Although only a limited number of analyses can be performed with this command, it may allow some useful initial exploratory analysis of these large data sets.

    There are several variations of this command that will be described separately.

Syntax 1:
    STREAM READ WRITE FILE.DAT <x1> <x2> ... <xk>
    where <x1>, <x2>, ... <xk> is a list of variables to be read.

    This version of the command is used to read the input file and to write a new version of the data using a specified Fortran-like format.

    This command is useful because large data files can take a long time to read. If the data can be read with the SET READ FORMAT command, reading is significantly faster. For example, reading the data set used by the example programs below took 24.7 cpu seconds on a Linux machine running CentOS. The same read on the same platform with SET READ FORMAT took 0.6 cpu seconds. Cpu times will vary with the hardware and operating system, but this indicates the relative improvement that SET READ FORMAT can provide. This example file is not particularly large (361,920 rows). The speed improvement becomes even more important for files with many millions of rows.

    Large data sets are often not initially in a layout that SET READ FORMAT can read. In that case, this command can be used once, together with the SET WRITE FORMAT command, to create a new version of the file in a fixed format that SET READ FORMAT can read. The new file is then used in subsequent Dataplot sessions that work with this data.
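
    The idea of a one-time conversion to fixed-width fields can be sketched outside Dataplot. The following Python fragment is only an illustration, not Dataplot's implementation: the function name rewrite_fixed_width and its parameters are hypothetical, and Python's E notation places one digit before the decimal point, so the digit layout may differ from what a Fortran 10E15.7 format would write. It reads one line at a time, so memory use does not grow with the size of the file.

```python
import io

def rewrite_fixed_width(src, dst, width=15, decimals=7):
    """Copy whitespace-delimited numeric rows from src to dst in fixed-width fields.

    Hypothetical helper: reads one line at a time and writes every value in a
    field of `width` characters with `decimals` digits after the decimal point.
    Returns the number of data rows written.
    """
    field = "{:" + str(width) + "." + str(decimals) + "E}"
    rows = 0
    for line in src:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        values = [float(t) for t in tokens]
        dst.write("".join(field.format(v) for v in values) + "\n")
        rows += 1
    return rows

# Small in-memory demonstration; a real run would pass open file objects.
source = io.StringIO("1 2.5 -3\n\n40 0.001 6e3\n")
target = io.StringIO()
n_rows = rewrite_fixed_width(source, target)
lines = target.getvalue().splitlines()
```

    Because every field has the same width, a later reader can slice each line at fixed offsets instead of parsing free-format text, which is where the speed gain of a formatted read comes from.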

Syntax 2:
    STREAM READ GROUP STATISTICS <stat> FILE.DAT <x1> <x2> ... <xk>
    where <stat> is one of Dataplot's supported univariate statistics;
    and where <x1>, <x2>, ... <xk> is a list of variables to be read.

    This syntax will read the file a user-specified number of rows at a time. It will then replace those rows with the specified statistic. That is, the original data will be replaced with the specified statistic for fixed intervals of the data.

    For example, you can read 1,000 rows, compute (and save) the mean for those 1,000 rows for each variable, then repeat for the next 1,000 rows. That is, the original data will be replaced with the means of fixed intervals of the data.
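
    The fixed-interval example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Dataplot's code: the function name chunk_means is hypothetical, and the handling of a final block shorter than the requested size (here it is averaged as is) is an assumption, since this page does not say what Dataplot does in that case.

```python
from itertools import islice

def chunk_means(rows, size):
    """Yield the column-wise mean of each consecutive block of `size` rows.

    `rows` is any iterator of equal-length numeric sequences. Only one block
    is held in memory at a time. The final block may be shorter than `size`.
    """
    rows = iter(rows)
    while True:
        block = list(islice(rows, size))
        if not block:
            return
        ncol = len(block[0])
        yield [sum(r[j] for r in block) / len(block) for j in range(ncol)]

# Seven rows of two columns, reduced in blocks of three rows.
data = ([float(i), float(2 * i)] for i in range(1, 8))
means = list(chunk_means(data, 3))
```

    Seven input rows are reduced to three rows of means, which is the kind of compression that lets a very large file fit within memory limits.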

    To specify the number of rows to read at a time, enter

      SET STREAM READ SIZE <value>

    Alternatively, you can specify one of the variables to define the groups (i.e., a change in the value of the specified variable marks the start of a new group). For this option, enter

      SET STREAM READ GROUP VARIABLE <var-name>
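
    Grouping on a change in a variable's value can be sketched in Python with itertools.groupby, which also starts a new group whenever the key changes. The function name group_means and the tuple layout of the rows are hypothetical choices for this illustration and say nothing about how Dataplot implements the option.

```python
from itertools import groupby
from statistics import mean

def group_means(rows, key_col, value_col):
    """Yield (group_value, mean) for each run of consecutive rows sharing a key.

    A new group starts whenever the key column changes, so the same key value
    can produce more than one group if its rows are not contiguous.
    """
    for key, run in groupby(rows, key=lambda r: r[key_col]):
        yield key, mean(r[value_col] for r in run)

# Column 0 plays the role of the group variable, column 1 is the data.
rows = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (2, 9.0), (1, 4.0)]
result = list(group_means(rows, key_col=0, value_col=1))
```

    In this sketch the key value 1 yields two separate groups because its rows are not adjacent, so a file should be ordered by the group variable if one group per value is wanted.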

    This capability is motivated by the desire to handle large data sets that may exceed Dataplot's storage limits. This command allows you to compute some basic statistics (mean, minimum, maximum, standard deviation, and so on) for slices of the data. Often, some useful exploratory analysis can be performed on this compressed data.

Syntax 3:
    STREAM READ DEFAULT STATISTICS FILE.DAT <x1> <x2> ... <xk>
    where <x1>, <x2>, ... <xk> is a list of variables to be read.

    This is a variant of Syntax 2 that allows a default set of statistics to be computed on a single pass of the data. The following statistics are computed:

    1. VALUE OF LAST ROW OF GROUP
    2. GROUP-ID
    3. SIZE
    4. MINIMUM
    5. MAXIMUM
    6. MEAN
    7. STANDARD DEVIATION
    8. SKEWNESS
    9. KURTOSIS
    10. MEDIAN
    11. INTERQUARTILE RANGE
    12. RANGE
    13. AUTOCORRELATION
    14. LOWER QUARTILE
    15. UPPER QUARTILE
    16. 0.01 QUANTILE
    17. 0.05 QUANTILE
    18. 0.10 QUANTILE
    19. 0.90 QUANTILE
    20. 0.95 QUANTILE
    21. 0.99 QUANTILE

    For this syntax, a tag variable (TAGSTAT) will be created that identifies the statistic (i.e., each row of TAGSTAT contains a value from 1 to 21). TAGSTAT can be used to extract the desired statistic for each group.
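
    The role of the tag variable can be sketched in Python. The stacked layout below (one tag and one value per statistic per group) is an assumption made for illustration, only three of the 21 statistics are computed, and the helper names stacked_stats and extract are hypothetical. The tag numbers 4, 5 and 6 are taken from the list above.

```python
# Tag numbers from the list of default statistics.
MINIMUM, MAXIMUM, MEAN = 4, 5, 6

def stacked_stats(groups):
    """Return parallel lists (tagstat, values) with three statistics per group."""
    tagstat, values = [], []
    for g in groups:
        for tag, stat in ((MINIMUM, min(g)), (MAXIMUM, max(g)), (MEAN, sum(g) / len(g))):
            tagstat.append(tag)
            values.append(stat)
    return tagstat, values

def extract(tagstat, values, tag):
    """Keep only the values whose tag matches, one per group."""
    return [v for t, v in zip(tagstat, values) if t == tag]

tagstat, values = stacked_stats([[1.0, 2.0, 3.0], [10.0, 20.0]])
group_means = extract(tagstat, values, MEAN)
group_mins = extract(tagstat, values, MINIMUM)
```

    The extract step plays the same role as the RETAIN ... SUBSET TAGSTAT = 6 commands in Program 2 below.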

Syntax 4:
    STREAM READ FULL STATISTICS FILE.DAT <x1> <x2> ... <xk>
    where <x1>, <x2>, ... <xk> is a list of variables to be read.

    This syntax will compute the following statistics using a 1-pass algorithm for all of the data:

    1. SIZE
    2. MINIMUM
    3. MAXIMUM
    4. MEAN
    5. STANDARD DEVIATION
    6. SKEWNESS
    7. KURTOSIS
    8. RANGE

    Each of <x1> ... <xk> will contain eight rows holding the above statistics, in the order listed, for the corresponding column read.
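
    A one-pass computation of these eight statistics can be sketched in Python using the standard running-moment update formulas. This is not Dataplot's code, and the class name RunningStats is hypothetical. The definitions used here are assumptions: the standard deviation divides by n - 1, while the skewness and kurtosis use moments that divide by n, and the kurtosis is not reduced by 3. This page does not state which definitions Dataplot uses.

```python
import math

class RunningStats:
    """Accumulate size, min, max and the first four central moments in one pass."""

    def __init__(self):
        self.n = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self.m2 = 0.0  # running sum of (x - mean)**2
        self.m3 = 0.0  # running sum of (x - mean)**3
        self.m4 = 0.0  # running sum of (x - mean)**4

    def push(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before the lower ones they depend on.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def summary(self):
        """Return the eight statistics; needs n >= 2 and non-constant data."""
        n = self.n
        sd = math.sqrt(self.m2 / (n - 1))
        skewness = math.sqrt(n) * self.m3 / self.m2 ** 1.5
        kurtosis = n * self.m4 / (self.m2 * self.m2)
        return [n, self.minimum, self.maximum, self.mean,
                sd, skewness, kurtosis, self.maximum - self.minimum]

stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.push(value)
summary = stats.summary()
```

    Each value is seen once and then discarded, so the memory needed is constant regardless of the number of rows. Statistics such as the median or quantiles are not in this list because they cannot be computed exactly this way.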

Examples:
    SET WRITE FORMAT 10E15.7
    STREAM READ WRITE BIG.DAT X1 TO X10

    SET STREAM READ SIZE 100
    STREAM READ GROUP STATISTIC MEAN BIG.DAT X1 TO X10
    STREAM READ GROUP STATISTIC STANDARD DEVIATION BIG.DAT X1 TO X10
    STREAM READ DEFAULT STATISTICS BIG.DAT X1 TO X10

    STREAM READ FULL STATISTICS BIG.DAT X1 TO X10

Note:
    The STREAM READ command has a number of limitations compared to the standard READ command.

    1. Functions/strings, parameters, matrices, character data, and images are not supported.

    2. Reading from the clipboard is not supported.

    3. Automatic name detection is not supported.

    4. The STREAM READ command is restricted to files (i.e., reading from the terminal is not supported).
Default:
    None
Synonyms:
    None
Related Commands:
Applications:
    Data Input
Implementation Date:
    2016/07
Program 1:
     
    . Step 1:   Demonstrate the group statistic option of stream read
    .
    skip 25
    set read format 3F7.0
    set stream read group variable rowid
    stream read group statistics mean elliottr.dat redcolme rowid colid
    .
    . Step 2:   Generate plot of column means
    .
    title offset 2
    title case asis
    label case asis
    .
    title Column Means for Red Pixels for ELLIOTTR.DAT
    y1label Column Mean
    x1label Row
    .
    plot redcolme vs rowid
    .
    . Step 3:   Reset read settings
    .
    skip 0
    set read format
        
    [Plot generated by sample program]
Program 2:
     
    . Step 1:   Demonstrate the default statistic option of stream read
    .
    skip 25
    set read format 3F7.0
    set stream read group variable rowid
    stream read default statistics elliottr.dat red rowid colid
    .
    let redmean = red
    retain redmean subset tagstat = 6
    let redsd = red
    retain redsd subset tagstat = 7
    let redmin = red
    retain redmin subset tagstat = 4
    let redmax = red
    retain redmax subset tagstat = 5
    .
    . Step 2:   Plot some of the statistics
    .
    multiplot corner coordinates 5 5 95 95
    multiplot scale factor 2
    multiplot 2 2
    .
    label case asis
    title case asis
    case asis
    title offset 2
    .
    title Mean of Columns
    plot redmean
    .
    title SD of Columns
    plot redsd
    .
    title Minimum of Columns
    plot redmin
    .
    title Maximum of Columns
    plot redmax
    .
    end of multiplot
    .
    justification center
    move 50 97
    text Statistics for Columns of Red Pixels in ELLIOTTR.DAT
    .
    . Step 3:   Reset read settings
    .
    skip 0
    set read format
        
    [Plot generated by sample program]
Program 3:
     
    . Step 1:   Demonstrate the full statistics option of stream read
    .
    skip 25
    set read format 3F7.0
    stream read full statistics elliottr.dat red rowid colid
    .
    . Step 2:   Print statistics for red variable
    .
    feedback off
    set write decimals 2
    print "Statistics for variable RED:"
    print " "
    print " "
    let aval = red(1)
    print "Size:      ^aval"
    let aval = red(2)
    print "Minimum:   ^aval"
    let aval = red(3)
    print "Maximum:   ^aval"
    let aval = red(4)
    let aval = round(aval,2)
    print "Mean:      ^aval"
    let aval = red(5)
    let aval = round(aval,2)
    print "SD:        ^aval"
    let aval = red(6)
    let aval = round(aval,2)
    print "Skewness:  ^aval"
    let aval = red(7)
    let aval = round(aval,2)
    print "Kurtosis:  ^aval"
    let aval = red(8)
    print "Range:     ^aval"
    feedback on
    .
    . Step 3:   Reset read settings
    .
    skip 0
    set read format
        
    Statistics for variable RED:
     
     
    Size:      361920
    Minimum:   140
    Maximum:   4095
    Mean:      369.38
    SD:        745.5
    Skewness:  4.23
    Kurtosis:  20
    Range:     3955
        


Date created: 07/24/2017
Last updated: 07/24/2017

Please email comments on this WWW page to alan.heckert.gov.