ASCII FILES

This section provides guidance on reading ASCII data files in Dataplot. This includes discussion of some commands added to the 1/2004 version of Dataplot. In particular, discussion is included for ASCII files created by the Excel program.

Dataplot has limited support for binary data files. Currently, only binary files created using Fortran unformatted WRITE are supported. Enter HELP SET READ FORMAT for details.

Also, Dataplot does not currently support directly reading files from other statistical/spreadsheet programs or database files. Some support may be provided in future releases, but for now you need to save the data from these programs in an ASCII file in order to read them into Dataplot. XML based data files are becoming increasingly popular as well. At this time, Dataplot does not support XML based data files, although we anticipate looking at this issue for subsequent releases.

IDEAL CASE

By default, Dataplot assumes rectangular data files containing numeric data where the data columns are separated by one or more spaces, commas, or tabs.

In this case, you can read the file with a command like the following:

READ FILE.DAT Y X1 X2

The first argument after the READ is the name of the ASCII file. The remaining arguments identify the variable names. Variable names can be up to eight characters long and should be limited to alphabetic (A-Z) and numeric (0-9) characters. Although other characters can in fact be used, this is discouraged since their use can cause problems in some contexts. Variable names are not case sensitive (Dataplot converts all alphabetic characters to upper case). Variable names are separated with one or more spaces (commas are not allowed as delimiters in this context).

Dataplot recognizes the first argument as a file name if it finds a "." in the name. If no "." is found, Dataplot assumes the first argument is a variable name and it tries to read from the keyboard rather than the file.

The remainder of this section discusses various issues that may cause problems when reading ASCII files and provides suggestions on how to deal with these issues. The following topics are discussed:

Viewing ASCII files within Dataplot
Header lines/restricted rows or columns
Long data records
Automatic variable names
Reading fixed columns
Reading variables with unequal lengths
Reading character data
Reading row oriented data
Comment lines in data files
Reading Excel files
File name restrictions
Comma as decimal point
Missing values and undefined numbers
Reading date and time fields
Reading IP addresses
Reading monetary data (e.g., $23,461.58)
Reading numeric values with trailing "+" or "-" or "*" or "%"
Commas within character fields
Reading binary data
Reading image data
What if all the data will not fit into memory?

If you create the ASCII file yourself, it is recommended that you create it with variables of equal length (pick some numeric value to signify missing data) and with data items separated by one or more spaces. Inclusion of a header giving a description of the data file is optional, but we find it helpful (Dataplot can skip over the header lines). When the ASCII files are created by another program (e.g., Excel), then you may have less control over the format of the file. Hopefully, most ASCII files you encounter can be handled using the commands discussed below.

VIEWING THE ASCII FILE WITHIN DATAPLOT

In order to identify some of the issues discussed below, it is often helpful to view the ASCII file before trying to read it into Dataplot. You can do this with the command

LIST FILE.DAT

This lists the file 20 lines at a time (you can change the number of lines with the SET LIST LINES command). Enter a carriage return to view the next 20 lines or "no" to stop viewing the file.

For some of the commands given below, you need to know the appropriate line numbers or column numbers.

To view the file with line numbers, enter the command

NLIST FILE.DAT

To identify appropriate columns, enter the command

RULER

This will identify the first 80 columns.

HEADER LINES/RESTRICTED ROWS OR COLUMNS

Many data files contain header lines at the beginning of the file that provide a description of the file. In order to skip over these lines, enter the command

SKIP N

where N identifies how many lines to skip.

Most of the sample data files that are distributed with Dataplot contain a line starting with hyphens ("---"). You can use the command

SKIP AUTOMATIC

for these files. Dataplot will skip all lines until a line starting with three or more hyphens is encountered.

In a related issue, if you want to restrict the read to certain rows in the file, you can enter the command

ROW LIMITS N1 N2

with N1 and N2 denoting the first and last rows to read, respectively.

You can also restrict the read to certain columns of the file using the command

COLUMN LIMITS C1 C2

with C1 denoting the first column to read and C2 the last column to read.
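
The effect of SKIP AUTOMATIC, ROW LIMITS, and COLUMN LIMITS can be sketched in Python. This is only an illustration of the behavior described above, not Dataplot's implementation. The function names are hypothetical, and the sketch assumes that row and column numbers are 1-based and inclusive and that nothing is skipped when no hyphen line is present.

```python
def skip_automatic(lines):
    """Drop everything up to and including the first line starting with '---'."""
    for i, line in enumerate(lines):
        if line.startswith("---"):
            return lines[i + 1:]
    return lines  # assumption: no hyphen line means nothing is skipped


def row_limits(lines, n1, n2):
    """Keep rows n1..n2 (assumed 1-based and inclusive)."""
    return lines[n1 - 1:n2]


def column_limits(line, c1, c2):
    """Keep character columns c1..c2 (assumed 1-based and inclusive)."""
    return line[c1 - 1:c2]


raw = [
    "Sample header text",
    " Y     X1   X2",
    "----------------",
    " 1.0   2.0  3.0",
    " 4.0   5.0  6.0",
]
print(skip_automatic(raw))           # the two data rows
print(row_limits(raw, 4, 4))         # only the first data row
print(column_limits(raw[3], 7, 10))  # the X1 field of that row
```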

LONG DATA RECORDS

When reading from the keyboard, Dataplot restricts a single record to a maximum of 80 columns.

When reading from a file, Dataplot previously restricted a single record to a maximum of 132 columns. The March, 2003 version raised the default limit to 255 characters. In addition, the following command was added:

MAXIMUM RECORD LENGTH N

with N denoting the size of the largest record to be read.

Dataplot accepts values of N up to 9999. However, be aware that some Fortran compilers may impose their own limit. These limits tend not to be well documented, but with modern compilers they should be sufficiently large that this should not be a problem in practice.

If you specify a SET READ FORMAT command (discussed below), you do not need to specify the maximum record length.

AUTOMATIC VARIABLE NAMES

Dataplot normally reads variable names on the READ command. However, many ASCII files give the variable names directly in the file. Alternatively, Dataplot can assign the variable names automatically.

Specific methods include the following.

Many of the sample files provided in the Dataplot installation use a syntax like
```
 Y     X1   X2
 ----------------
 <data values>
```
For these files, you can enter the commands
In this case, Dataplot will skip all lines until a line starting with three or more hyphens is encountered. It will then backspace to the previous line and read the variable names from that line.
Many ASCII data files will have the variable names on the first line of the file. For these files, you can enter the commands
If you would like Dataplot to simply assign the variable names, enter the command
Dataplot will read the first line of the file to determine the number of variables. It will then assign the names X1, X2, and so on to the variables.

Note that Dataplot's usual rules for variable names still apply. That is, a maximum of eight characters will be used and spaces will delimit variable names. The use of special characters (i.e., anything that is neither a digit nor an alphabetic character) is discouraged. You may need to edit the file if the variable names do not follow these rules (characters beyond the eighth are simply ignored, so the issue is more one of duplicate variable names in the first eight characters).

Note 2020/08: The following tweaks were made to the reading of variable names.

Previously, only the first 255 characters of the line were read. This has been extended to support the number of characters specified by the MAXIMUM RECORD LENGTH command (if this command is not given, the default remains 255).
Dataplot will now automatically strip spaces and other special characters out of the variable names. Specifically, only alphabetic characters (A-Z), numbers (0-9), and underscores are retained.
Dataplot only supports eight characters for variable names. This can lead to duplicate variable names. To reduce the possibility of duplicate names, Dataplot does the following if a duplicate name is found.
- If the name has fewer than eight characters, a "Z" is appended to the end of one of the names. The right most name will be modified.
- If the name has eight characters exactly, the right most name will change the last character to a Z (or if that character is already a Z, then to an X).
- If blank names are encountered, these will be changed to Zxxx where "xxx" is a sequence number (i.e., if there are three blank names encountered, they will be set to Z1, Z2, and Z3).
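
The name clean-up rules above can be sketched in Python as follows. This is an illustration of the documented rules only, not Dataplot's code. The function name clean_names is hypothetical, the sketch takes the names as an already-split list, and it makes a single pass, so it does not try to resolve a third occurrence of the same name.

```python
import re


def clean_names(raw_names):
    """Apply the documented clean-up rules to a list of raw header names."""
    names = []
    blank_count = 0
    for raw in raw_names:
        # Keep only letters, digits, and underscores; upper-case; keep 8 characters.
        name = re.sub(r"[^A-Za-z0-9_]", "", raw).upper()[:8]
        if not name:
            # Blank names become Z1, Z2, ...
            blank_count += 1
            name = f"Z{blank_count}"
        elif name in names:
            # The right most duplicate is the one that is modified.
            if len(name) < 8:
                name += "Z"
            else:
                name = name[:7] + ("X" if name[7] == "Z" else "Z")
        names.append(name)
    return names


print(clean_names(["Temp (C)", "temp(c)", "", "Pressure1", "Pressure2"]))
# ['TEMPC', 'TEMPCZ', 'Z1', 'PRESSURE', 'PRESSURZ']
```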

READING FIXED COLUMNS

By default, Dataplot performs free format reads. That is, you do not need to line up the columns neatly. You do need to provide one or more spaces (tabs, commas, colons, semi-colons, parentheses, or brackets can be used as well) between data fields.

Many data files will contain fixed fields. There are several reasons you may want or need to take advantage of these fixed fields rather than using a free format read.

If your data fields do not contain spaces (or some other delimiter) between data columns, you need to tell Dataplot how to interpret the columns.
In some cases, you may only want to read selected variables in the data file.
Using a formatted read can significantly speed up the reading of the data. If you have small or moderate size data files (say 500 rows or fewer), this is really not an issue. However, if you are reading 50,000 rows, you can significantly speed up the read by specifying the format.
If the data fields have unequal lengths, Dataplot will not interpret the data file correctly with a free format read. It assigns the data items in the order they are encountered to the variable names in the order they are given. Dataplot does not try to guess if a data item is missing based on the columns.
The issue of unequal lengths is discussed in detail in the next section.

There are two basic cases for fixed fields.

The data fields are justified by the decimal point.
In this case, you can use the SET READ FORMAT command to specify a Fortran-like format to read the file. Enter HELP READ FORMAT for details.
Using a formatted read is significantly faster than a free format read.
Many programs will write ASCII files with fixed columns, but the data fields will be either left or right justified rather than lined up by the decimal point.
In this case, you can use a special form of the COLUMN LIMITS command that was introduced with the January, 2004 version. Normally, the first and last columns to read are specified. However, you can now enter variables for the lower and upper limits as in the following example:
That is, if variables rather than parameters are specified, separate column limits are specified for each data field. In this case, the first data field is between columns 1 and 10, the second field is between columns 21 and 30, and the third field is between 41 and 50.
When this syntax is used, only one variable is read for each specified field. If the field is blank, then this is interpreted as a missing value.
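
The per-field form of the column limits can be illustrated with a short Python sketch. It is not Dataplot's implementation. The function name read_fixed_fields and the lower/upper lists are hypothetical, and columns are assumed to be 1-based and inclusive. The limits match the case described above (columns 1 to 10, 21 to 30, and 41 to 50), and a blank field is returned as the missing value.

```python
def read_fixed_fields(line, lower, upper, missing=0.0):
    """Read one value per (lower, upper) column pair; blank fields become `missing`."""
    values = []
    for lo, hi in zip(lower, upper):
        field = line[lo - 1:hi].strip()  # 1-based, inclusive columns
        values.append(float(field) if field else missing)
    return values


lower = [1, 21, 41]
upper = [10, 30, 50]

# Field 1 is left justified, field 2 is blank, field 3 is right justified.
line = f"{'12.5':<10}{'':10}{'':10}{'':10}{'7':>10}"
print(read_fixed_fields(line, lower, upper))                 # [12.5, 0.0, 7.0]
print(read_fixed_fields(line, lower, upper, missing=-99.0))  # [12.5, -99.0, 7.0]
```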

READING VARIABLES OF UNEQUAL LENGTH (EMPTY FIELDS)

Dataplot typically expects all variables to be of equal length. That is, the data is rectangular with no empty fields.

Performing free format reads with space delimited data files when there are empty fields is problematic. Dataplot reads the file one row at a time. When reading a row, Dataplot will assign the first value read to the first variable name, the second value to the second variable name, and so on. By default, the row with the smallest number of values defines the number of variables that will be read. For example, if you requested four variables be read, but one row of the data file only has two values, then only two variables will be read into Dataplot.

If you have a data file where the columns have unequal lengths (i.e., empty fields), you can try one of the following things.

Pick some value to represent a missing value and fill in missing data points with that value. After reading the data, you can use a RETAIN command to remove them. For example, if you use -99 to signify a missing value, you can enter
Alternatively, you can use a SUBSET clause on subsequent plot and analysis commands.
There are two SET commands that pertain to missing values.
- SET DATA MISSING VALUE <value> specifies a character string that will be interpreted as a missing value in the data file (this character string can be a numeric value).
- SET READ MISSING VALUE <value> specifies the numeric value that will be saved to the Dataplot variable when a missing value (as defined by the SET DATA MISSING VALUE) is encountered.
When feasible, this is the recommended solution.
If your data file has consistent formats for the rows, then there are two possible solutions.
If the fields are justified by the decimal point so that a Fortran format statement can be applied, then you can use the SET READ FORMAT command. In this case, empty fields are read as zero. If zero can be a valid data value for one or more of your variables, then it can be ambiguous whether a zero in your variable denotes a valid data point or a missing value. The SET READ MISSING VALUE setting does not apply when the SET READ FORMAT is used.
Many spreadsheets have an option for saving data to a "fixed width" ASCII text file. In these cases, the fields are typically either right or left justified. However, the column for the decimal point will not be consistent, so the SET READ FORMAT command cannot be used. In this case, you can use the variable form of the COLUMN LIMITS command as described above. By default, when a blank field is encountered, it is set to zero. You can specify the value to use with the SET READ MISSING VALUE command described above.
If your data has both columns of unequal length and inconsistent columns for given data fields, an alternative is to use a comma delimited data file. If there is no data between successive commas, this is treated as a missing value. The default is to assign a value of zero. Alternatively, you can use the SET READ MISSING VALUE command described above.
You can specify a delimiter other than a comma with the SET READ DELIMITER command.
You can use the SET READ PAD MISSING COLUMNS ON command.
When this command is used, if the number of values read on a row is less than the number of variables specified, then the values from the row are padded with missing values (as specified by the SET READ MISSING VALUE). For example, if you requested the four variables X1, X2, X3, and X4 on the READ command and a particular row only had two values, then the first value will be assigned to X1 and the second value to X2. X3 and X4 will be assigned the missing value for that row.
This works if the empty fields are at the end. However, if the empty fields are not at the end, then the assignment of the data to the variables will not be what is expected. In this case, it is recommended that empty fields be coded with a missing value code.
NOTE: The default (SET READ PAD MISSING COLUMNS OFF) action was modified 2019/04. Previously, if this was off, the number of variables read was truncated to the number of values on the row(s) with the smallest number of values. This was changed so that the behavior of the OFF setting is similar to the ON setting. The difference is that for OFF, a warning message will be printed for rows that have fewer than the expected number of values.
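
The treatment of empty comma-delimited fields and of short rows can be sketched in Python. This is a simplified illustration of the behavior described above, not Dataplot's code. The function name parse_row and its parameters are hypothetical.

```python
def parse_row(row, n_vars, missing=0.0, delimiter=","):
    """Parse one delimited row into n_vars numeric values."""
    fields = [field.strip() for field in row.split(delimiter)]
    # Nothing between successive delimiters is treated as a missing value.
    values = [float(field) if field else missing for field in fields]
    # Short rows are padded at the end with the missing value.
    values += [missing] * (n_vars - len(values))
    return values[:n_vars]


print(parse_row("1.5,,3.0,4.0", 4, missing=-99.0))  # [1.5, -99.0, 3.0, 4.0]
print(parse_row("1.5,2.5", 4, missing=-99.0))       # [1.5, 2.5, -99.0, -99.0]
```

Note that padding can only fill in values at the end of a row. In the second call there is no way to tell whether the two values belong to the first two variables or to some other pair, which is why coding empty fields explicitly is the safer choice.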

The variable form of the COLUMN LIMITS, the SET READ MISSING VALUE, and the SET READ DELIMITER commands were introduced in the January, 2004 version. The interpretation of successive commas as a missing value was also introduced in the January, 2004 version.

READING DATA WITH CHARACTER FIELDS

Dataplot has not previously supported character data. If character data was encountered, Dataplot would generate an error message and not read the data file correctly. The one exception is that you could read row labels with the READ ROW LABEL command (enter HELP READ ROW LABEL for details).

With the January 2004 version, we have introduced some limited support for character data. Specifically, we have added the command

SET CONVERT CHARACTER <ON/IGNORE/ERROR>

Setting this to ERROR will continue the current Dataplot action of reporting character data as an error. This is recommended for the case when a file is supposed to contain only numeric data and the presence of character data is in fact indicative of an error in the data file.

Setting this to IGNORE will instruct Dataplot to simply ignore any fields containing character data. This can be useful if you simply want to extract the numeric data fields in the file without entering COLUMN LIMITS or SET READ FORMAT commands.

Setting this to ON will read character fields and write them to the file "dpzchf.dat". Note that Dataplot saves numeric data "in memory" for fast access. Since character data has limited use in Dataplot, we have decided to save character data externally to minimize memory requirements. Dataplot keeps a separate name table for the character data fields (the names for character variables are stored in the file "dpzchf.dat").

NOTE 2018/10: The CATEGORICAL option was added. This option works similarly to ON. However, in addition to creating the character variable in "dpzchf.dat", it also creates numerical variables automatically from the character data.

There are some restrictions on when Dataplot will try to read character data:

This only applies to the variable read case. That is, READ PARAMETER and READ MATRIX will ignore character fields or treat them as an error.
Dataplot will only try to read character data from a file. When reading from the keyboard (i.e., when READ is specified with no file name), character data will be ignored even when SET CONVERT CHARACTER ON is specified.
This capability is not supported for the SERIAL READ case.
The SET READ FORMAT command does not accept the "A" format specification for reading character fields.
A maximum of 20 character variables will be saved.
A maximum of 24 characters for each character variable will be saved.
The character variables from at most one data file will be saved in a given session.

Some of these restrictions may be addressed in subsequent releases of Dataplot.

Currently, Dataplot has limited support for character variables. Specifically,

The row label can be used for the plot character by entering the command
You can convert a character variable to a coded numeric variable with the command
with IX denoting the name of the character variable. These commands assign a numeric value to each unique name in the character variable.
For the CHARACTER CODE case, the coding is from 1 to K where K is the number of unique values. The order is based on the order these values are found in the file.
For the ALPHABETIC CHARACTER CODE case, the coding is from 1 to K where K is the number of unique values. The codes are assigned in alphabetical order.
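
The difference between the two codings can be illustrated with a small Python sketch. This shows the ordering rules described above and is not Dataplot's implementation. The function names are hypothetical.

```python
def character_code(values):
    """Code 1..K in the order the unique values are first found."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    return [codes[value] for value in values]


def alphabetic_character_code(values):
    """Code 1..K in alphabetical order of the unique values."""
    codes = {value: i for i, value in enumerate(sorted(set(values)), start=1)}
    return [codes[value] for value in values]


labels = ["medium", "low", "high", "low"]
print(character_code(labels))             # [1, 2, 3, 2]
print(alphabetic_character_code(labels))  # [3, 2, 1, 2]
```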

We anticipate additional use of character variables in subsequent releases of Dataplot.

If your character fields contain non-numeric/non-alphabetic characters, then it is recommended that the character fields be enclosed in quotes. When Dataplot encounters a quote (either a single or double quote), it interprets everything until a matching quote is found as part of that character field. If the quotes are not used, then spaces, tabs, parentheses, brackets, colons, and semi-colons are interpreted as delimiters that signify the end of that data item.

READING ROW ORIENTED DATA

Dataplot assumes a column oriented format. That is, a row of data represents a single record (or case) and a column of data represents a variable. If a data file has a row orientation, then this is reversed. A row of data represents a variable and a column of data represents a record (or case).

The following example shows one way of correctly reading the data into Dataplot. Suppose that your data file contains five rows with each row corresponding to a single variable. You can do the following:

NOTE 2018/10: Dataplot added a READ ROW command that will read each row into a separate column. This command assumes all of the data in a given row are numeric. It does not assume that all rows must contain the same number of elements.

COMMENT LINES IN DATA FILES

It is sometimes convenient to include comments in data files. If these comments are contained at the beginning of the file, then the SKIP command can be used. To have Dataplot check for comment lines in the data file, enter the command

COMMENT CHECK ON

The default comment character is a ".". That is, any line starting with a ". " is treated as a comment line and ignored. To specify a different comment character, enter the command

COMMENT CHARACTER <char>

with <char> denoting the desired comment character.
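
A minimal Python sketch of the comment check is given below. It is an illustration only and not Dataplot's code. The function name drop_comment_lines is hypothetical, and the sketch assumes (based on the ". " wording above) that the comment character must be followed by a space.

```python
def drop_comment_lines(lines, comment_char="."):
    """Remove lines that start with the comment character followed by a space."""
    prefix = comment_char + " "  # assumption: a space must follow the character
    return [line for line in lines if not line.startswith(prefix)]


lines = [". calibration run 3", "1.0 2.0", ".5 1.5", "# not a comment by default"]
print(drop_comment_lines(lines))
print(drop_comment_lines(lines, comment_char="#"))
```

Under this assumption the data line ".5 1.5" is kept, since its leading period is part of a number rather than a comment marker.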

EXCEL FILES

At the current time (1/2004), Dataplot does not support the direct reading of Excel data files. We are planning to add this capability in a future release of Dataplot. Until that time, you need to save the data in Excel to an ASCII file and read that ASCII file into Dataplot.

Excel provides the following options for writing ASCII data files:

Formatted text (space delimited) (.PRN extension)
This format will use consistent columns for the data fields. The variable form of the COLUMN LIMITS command can be used when the data columns have unequal length.
Character fields will often not have the separating space. The variable form of the COLUMN LIMITS command can be used in this case as well.
CSV (Comma delimited) (.CSV extension)
This format will separate data fields with a single comma. Missing data is represented with successive commas. Dataplot can now (as of the January 2004 version) handle this correctly.
Text (Tab delimited) (.TXT extension)
Text (MS-DOS) (.TXT extension)
These files will separate data fields with a tab character. Note that Dataplot converts all non-printing characters (including tabs) to a single space character.
This format is not appropriate for data containing variables with unequal lengths since it will not generate consistent columns for the data fields. Use either the space delimited or comma delimited file for that case.

The 2014/12 version of Dataplot added the capability of reading and writing to the system clipboard under Windows. Using the "copy" function in Excel and then using the READ CLIPBOARD command in Dataplot will in many cases be the easiest way to retrieve data from Excel files. Enter HELP CLIPBOARD for details.

Note 2020/02: Dataplot added a READ EXCEL command. This command utilizes Python (and specifically the Pandas package) to read Excel files. Enter HELP READ EXCEL for details.

FILE NAME RESTRICTIONS

A few comments on file names.

File names are limited to 80 characters or less (this includes the path name if given).
If the file name contains either spaces or hyphens, it should be enclosed in double quotes. For example,
The file name should be a valid file name on the local operating system.
The file name must contain a period "." in the file name itself or as a trailing character. Dataplot strips off trailing periods on those systems where it is appropriate to do so. On systems where trailing periods can be a valid file name (e.g., Unix), Dataplot first tries to open the file with the trailing period. If this fails, it will try to open the file name without the trailing period.
On systems where file names are case sensitive (e.g., Unix), Dataplot first tries to open the file name as given. If the file is not found, it then tries to match the file name after converting the name to all upper case characters. If it is still not found, it will convert the file name to all lower case characters and try again.
If your file name contains a mixture of upper and lower case characters, then you need to enter the case for the file name correctly on the READ command.

COMMA AS DECIMAL POINT

Dataplot follows the United States convention where the decimal point is the period ".". Some locales may use a different character to denote the decimal point. In particular, some countries use the comma ",".

To allow Dataplot to read files that use a character other than the "." for the decimal point, enter the command

SET DECIMAL POINT <value>

where <value> denotes the character that specifies the decimal point.

Note this support is fairly limited. Specifically, it applies to free-format reads (i.e., no SET READ FORMAT command has been entered). In addition,

This option is not supported for the WRITE command. WRITE will always use a period for the decimal point.
Dataplot alphanumeric output (e.g., the output from the FIT command) is generated with the period as the decimal point.
As mentioned above, if you read your data with a SET READ FORMAT command, the data must use the period for the decimal point.
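
The effect of specifying a different decimal point on a free-format read can be sketched in Python. This is an illustration only and not Dataplot's implementation. The function name parse_number is hypothetical, and the sketch assumes the fields are separated by spaces so that the comma is free to act as the decimal point.

```python
def parse_number(field, decimal_point=","):
    """Convert a field that uses `decimal_point` instead of the period."""
    return float(field.replace(decimal_point, "."))


row = "3,14 2,5 10"
print([parse_number(field) for field in row.split()])  # [3.14, 2.5, 10.0]
```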

MISSING VALUES AND UNDEFINED NUMBERS

Some software programs will have special characters to denote missing values or undefined values (e.g., the result of trying to divide by 0).

In particular, Unix/Linux software often uses "nan" to denote an undefined number. If Dataplot encounters a "nan" in a numeric field, it will convert it to the Dataplot "missing value". The "nan" search is not case sensitive (i.e., it will check for "NAN", "NaN", etc.). You can specify what Dataplot will use for the missing value by entering the command

SET READ MISSING VALUE <value>

where <value> is a numeric value.

Missing value flags are specific to individual programs. You can specify a character string that denotes a missing value with the command

SET DATA MISSING VALUE <value>

where <value> is a string with 1 to 4 characters. If Dataplot encounters <value> in a numeric field, it will convert it to the Dataplot "missing value". The missing value string is not case sensitive. You can specify what Dataplot will use for the missing value by entering the command

SET READ MISSING VALUE <value>

where <value> is a numeric value.
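
The interaction between the missing value string and the numeric missing value can be sketched in Python. This is an illustration of the behavior described above and not Dataplot's code. The function name parse_field and its parameters are hypothetical, and -99.0 is simply an example value.

```python
def parse_field(field, missing_string=None, missing_value=-99.0):
    """Return the numeric value of a field, mapping missing flags to missing_value."""
    text = field.strip()
    # "nan" is recognized regardless of case.
    if text.lower() == "nan":
        return missing_value
    # A user-specified missing value string is also matched without regard to case.
    if missing_string is not None and text.lower() == missing_string.lower():
        return missing_value
    return float(text)


row = "1.5 NaN na 4"
print([parse_field(field, missing_string="NA") for field in row.split()])
# [1.5, -99.0, -99.0, 4.0]
```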

READING DATE AND TIME FIELDS

Date and time fields will typically have syntax like

Dataplot treats the "/" and ":" as indicating character fields. Depending on the SET CONVERT CHARACTER setting, this will cause an error, result in the field being ignored, or result in the field being read as a character variable.

The following commands were added (2016/06) to help deal with date and time fields.

Although Dataplot does not have explicit date or time variables, these commands allow the components of date and time fields to be read as separate numeric variables. For example,

READING IP ADDRESSES

IP addresses typically have a syntax like

129.6.37.209

By default, Dataplot will generate an error when trying to read a field of this type. To address this, you can enter the command

SET READ IP ADDRESSES ON

If this switch is ON, Dataplot will scan the line and if a field is encountered that contains more than one period ".", Dataplot will convert these periods to spaces before parsing the line.

The default is OFF since this adds additional processing time to the READ and most data sets do not contain IP addresses.
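
The period-to-space conversion can be sketched in Python as follows. This is an illustration of the rule described above and not Dataplot's implementation. The function name split_ip_fields is hypothetical, and the sketch assumes space-delimited fields.

```python
def split_ip_fields(line):
    """Replace periods with spaces in any field containing more than one period."""
    converted = []
    for field in line.split():
        if field.count(".") > 1:
            field = field.replace(".", " ")
        converted.append(field)
    # Re-split so each component of the address becomes its own field.
    return " ".join(converted).split()


print(split_ip_fields("129.6.37.209 0.75"))
# ['129', '6', '37', '209', '0.75']
```

The ordinary number 0.75 has only one period and is left alone, while the four components of the address become four separate numeric fields.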

READING MONETARY DATA

Monetary data may sometimes be given as

$11,456.12 $1,021,111.10

The "$" and "," in these numeric fields will cause problems. The "$" will be treated as a non-numeric value (depending on other SET commands, this will be treated as an error or the numeric field will be read as a character field). The comma is typically treated as a field delimiter. If you have this kind of data, enter the commands

To reset the defaults, enter

Note that if you enter the SET READ COMMA IGNORE ON command, the comma will no longer be treated as the delimiter. Dataplot cannot currently handle the case where the comma is used both for monetary data and also as a field delimiter.
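
What it means to ignore the "$" and "," in a monetary field can be sketched in Python. This is an illustration only and not Dataplot's code. The function name parse_money is hypothetical, and the sketch assumes the fields are separated by spaces, since the comma can no longer serve as a delimiter.

```python
def parse_money(field):
    """Strip the dollar sign and thousands separators, then convert to a number."""
    return float(field.replace("$", "").replace(",", ""))


line = "$11,456.12 $1,021,111.10"
print([parse_money(field) for field in line.split()])
# [11456.12, 1021111.1]
```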

READING NUMERIC VALUES WITH TRAILING "+" OR "-" OR "*" OR "%"

On occasion, numeric fields may have a trailing "+", a trailing "-", a trailing "*" or a trailing "%".

The "+" is typically used to indicate that the value is greater than or equal to the entered value. Likewise, the "-" is used to indicate that the value is less than or equal to the entered value. This may be used when data is truncated at a high or low value. If you have data that uses this convention, enter

set read trailing plus minus ignore on

Dataplot does not have any convention for indicating that a number in fact means "greater than" or "less than", so it will read the numeric value and simply ignore the "+" or "-".

To reset the default, enter

set read trailing plus minus ignore off

Trailing asterisks ("*") are sometimes used to indicate statistical significance. To ignore these asterisks, enter

set read asterisk ignore on

If this command is not given, the field will be treated as a character field. To reset the default, enter

set read asterisk ignore off

Percentage data will sometimes include a trailing percent sign ("%"). To ignore the percent sign, enter

set read percent sign ignore on

If this command is not given, the field will be treated as a character field. To reset the default, enter

set read percent sign ignore off
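
The effect of ignoring these trailing characters can be sketched in Python. This is an illustration only and not Dataplot's implementation. The function name strip_trailing_markers is hypothetical, and for brevity the sketch handles all four characters at once, whereas Dataplot controls them with the separate switches shown above.

```python
def strip_trailing_markers(field, markers="+-*%"):
    """Drop trailing marker characters and convert what remains to a number."""
    return float(field.strip().rstrip(markers))


fields = ["100+", "-3-", "0.001***", "12.5%", "7"]
print([strip_trailing_markers(field) for field in fields])
# [100.0, -3.0, 0.001, 12.5, 7.0]
```

Only trailing characters are removed, so a leading minus sign is preserved.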

COMMAS WITHIN CHARACTER FIELDS

If you are reading data that may contain character fields, you can specify whether you want commas in the character fields to be treated as part of the character field or as a delimiter.

To have the comma treated as a delimiter, enter

set character field comma delimiter on

To have the comma not be interpreted as a delimiter (i.e., it will simply be another character in the character field), enter

set character field comma delimiter off

The default is OFF.

READING BINARY DATA

Dataplot currently supports only the following types of binary data:

A few types of image files can be read on some platforms. This is discussed in the next section.

Dataplot may be able to read some files created using Fortran unformatted data files. Dataplot is most likely to have success reading unformatted Fortran files that contain only numeric data and use a consistent record structure. Unformatted Fortran files that contain a mixture of character and numeric data will not be read successfully.

Support for other types of binary files may be added in future releases. However, this support will probably be for specific types of binary files as opposed to arbitrary binary files.

The advantage of using unformatted Fortran files is that file sizes may be significantly smaller and reading the data can be significantly faster. One potential use of unformatted Fortran files is to save a large data file that you will read many times in Dataplot.

The disadvantages of using unformatted Fortran files are that they are not human readable, they cannot be edited or modified using an ASCII editor, and, most importantly, they are not portable between operating systems and compilers. That is, unformatted Fortran files typically need to be read using the same operating system and compiler that was used to create them.

For details on using unformatted Fortran files, enter

HELP SET READ FORMAT

READING IMAGE DATA

If Dataplot was built with support for the GD library, Dataplot can read image data in PNG, JPEG, or GIF format. If you have image data in another format, you may be able to use an image conversion program (e.g., NetPBM or ImageMagick) to convert it to one of the supported formats.

For further information, enter

HELP READ IMAGE

WHAT IF ALL THE DATA WILL NOT FIT INTO MEMORY?

Dataplot was designed primarily for interactive usage. For this reason, it reads all data into memory. The current default is to have a workspace that accommodates 10 columns with 1,500,000 rows. You can re-dimension to obtain more columns at the expense of fewer rows, but you cannot increase the maximum number of rows.

With the advent of "big data", there are more data files that cannot be read into Dataplot's available memory. For these data files, there are several things that can potentially be done:

For some platforms, if you have a large amount of memory you may be able to build a version of Dataplot that raises the maximum number of rows. For example, on a Linux system with 64MB of RAM, we were able to build a version that supports a maximum of 10,000,000 rows. Contact Alan Heckert if you need assistance with this.
The STREAM READ command was added. This command uses one pass algorithms to do a number of things.
- You can create a new file that uses SET WRITE FORMAT. This is typically done once so that you can use SET READ FORMAT on subsequent reading of the data file (this can substantially speed up processing of these large files).
- You can generate various summary statistics either for the full data set or for groups in the data.
- You can generate cross tabulation statistics (up to 4 cross tabulation variables may be specified).
- You can create various types of distance (e.g., Euclidean distances, correlation distances) matrices either for the full data set or for cross tabulations of the data.
  Distance matrices are often used for various types of multivariate analysis.
- You can generate approximate percentiles either for the full data set or for cross tabulations of the data. Based on this, you can perform distributional modeling for a single variable or distributional comparisons between variables (e.g., quantile quantile plots, bihistograms, two sample KS tests, and so on).
The STREAM READ command can allow you to do a fair bit of exploratory analyses on these large data sets.
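
To give a sense of what a one pass algorithm looks like, the following Python sketch computes the mean and standard deviation of the first column of a file while holding only one line in memory at a time. It uses Welford's method, which is a common choice for this purpose. This section does not say which algorithms STREAM READ actually uses, so the sketch is an illustration of the general idea only, and the function name streaming_mean_sd is hypothetical.

```python
import io
import math


def streaming_mean_sd(lines):
    """One pass mean and sample standard deviation of the first column."""
    n, mean, m2 = 0, 0.0, 0.0
    for line in lines:
        x = float(line.split()[0])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, sd


# An in-memory stand-in for a large data file.
data = io.StringIO("2.0\n4.0\n4.0\n4.0\n5.0\n5.0\n7.0\n9.0\n")
n, mean, sd = streaming_mean_sd(data)
print(n, mean, round(sd, 4))
```

Because the function only iterates over its input, an open file object can be passed in place of the StringIO object, and the file never needs to fit into memory.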