ASCII FILES

This section provides guidance on reading ASCII data files in Dataplot. This includes discussion of some commands added to the 1/2004 version of Dataplot. In particular, discussion is included for ASCII files created by the Excel program.

Dataplot has limited support for binary data files. Currently, only binary files created using Fortran unformatted WRITE are supported. Enter HELP SET READ FORMAT for details. Also, Dataplot does not currently support directly reading files from other statistical/spreadsheet programs or database files. Some support may be provided in future releases, but for now you need to save the data from these programs in an ASCII file in order to read them into Dataplot.

XML based data files are becoming increasingly popular as well. At this time, Dataplot does not support XML based data files, although we anticipate looking at this issue for subsequent releases.

IDEAL CASE

By default, Dataplot assumes rectangular data files containing numeric data where the data columns are separated by one or more spaces, commas, or tabs. In this case, you can read the file with a command like the following:
The first argument after the READ is the name of the ASCII file. The remaining arguments identify the variable names. Variable names can be up to eight characters long and should be limited to alphabetic (A-Z) and numeric (0-9) characters. Although other characters can in fact be used, this is discouraged since their use can cause problems in some contexts. Variable names are not case sensitive (Dataplot converts all alphabetic characters to upper case). Variable names are separated with one or more spaces (commas are not allowed as delimiters in this context).

Dataplot recognizes the first argument as a file name if it finds a "." in the name. If no "." is found, Dataplot assumes the first argument is a variable name and it tries to read from the keyboard rather than the file.

The remainder of this section discusses various issues that may cause problems when reading ASCII files and provides suggestions on how to deal with these issues. The following topics are discussed:
If you create the ASCII file yourself, it is recommended that you create it with variables of equal length (pick some numeric value to signify missing data) and with data items separated by one or more spaces. Inclusion of a header giving a description of the data file is optional, but we find it helpful (Dataplot can skip over the header lines).

When the ASCII files are created by another program (e.g., Excel), then you may have less control over the format of the file. Hopefully, most ASCII files you encounter can be handled using the commands discussed below.

VIEWING THE ASCII FILE WITHIN DATAPLOT

In order to identify some of the issues discussed below, it is often helpful to view the ASCII file before trying to read it into Dataplot. You can do this with the command
This will list the file 20 lines at a time (you can change the number of lines with the SET LIST LINES command). You can then enter a carriage return to view the next 20 lines or a "no" to stop viewing the file. For some of the commands given below, you need to know either the appropriate line numbers or the appropriate column numbers. To view the file with line numbers, enter the command
To identify appropriate columns, enter the command
This will identify the first 80 columns.

HEADER LINES/RESTRICTED ROWS OR COLUMNS

Many data files contain header lines at the beginning of the file that provide a description of the file. In order to skip over these lines, enter the command
where N identifies how many lines to skip. Most of the sample data files that are distributed with Dataplot contain a line starting with hyphens ("---"). You can use the command
for these files. Dataplot will skip all lines until a line starting with three or more hyphens is encountered. In a related issue, if you want to restrict the read to certain rows in the file, you can enter the command
with N1 and N2 denoting the first and last rows to read, respectively. You can also restrict the read to certain columns of the file using the command
with C1 denoting the first column to read and C2 the last column to read. When reading from the keyboard, Dataplot restricts a single record to a maximum of 80 columns. When reading from a file, Dataplot previously restricted a single record to a maximum of 132 columns. The March, 2003 version raised the default limit to 255 characters. In addition, the following command was added:
with N denoting the size of the largest record to be read. Dataplot accepts values of N up to 9999. However, be aware that some Fortran compilers may impose their own limit. These limits tend not to be well documented, but with modern compilers they should be sufficiently large that this should not be a problem in practice. If you specify a SET READ FORMAT command (discussed below), you do not need to specify the maximum record length.

Dataplot normally reads variable names on the READ command. However, many ASCII files will have the name of the variables given directly in the file or Dataplot can assign the variable names automatically. Specific methods include the following.
Note that Dataplot's usual rules for variable names still apply. That is, a maximum of eight characters will be used and spaces will delimit variable names. The use of special (i.e., not a number and not an alphabetic character) characters is discouraged. You may need to edit the file if the variable names do not follow these rules (more than eight characters will simply be ignored, so the issue is more one of duplicate variable names in the first eight characters).

Note 2020/08: The following tweaks were made to the reading of variable names.
By default, Dataplot performs free format reads. That is, you do not need to line up the columns neatly. You do need to provide one or more spaces (tabs, commas, colons, semi-colons, parentheses, or brackets can be used as well) between data fields. Many data files will contain fixed fields. There are several reasons you may want or need to take advantage of these fixed fields rather than using a free format read.
There are two basic cases for fixed fields.
READING VARIABLES OF UNEQUAL LENGTH (EMPTY FIELDS)

Dataplot typically expects all variables to be of equal length. That is, the data is rectangular with no empty fields. Performing free format reads with space delimited data files when there are empty fields is problematic. Dataplot reads the file one row at a time. When reading a row, Dataplot will assign the first value read to the first variable name, the second value to the second variable, and so on. By default, the row with the smallest number of values defines the number of variables that will be read. For example, if you requested four variables be read, but one row of the data file only has two values, then only two variables will be read into Dataplot.

If you have a data file where the columns have unequal lengths (i.e., empty fields), you can try one of the following things.
The variable form of the COLUMN LIMITS, the SET READ MISSING VALUE, and the SET READ DELIMITER commands were introduced in the January, 2004 version. The interpretation of successive commas as a missing value was also introduced in the January, 2004 version.

READING DATA WITH CHARACTER FIELDS

Dataplot has not previously supported character data. The one exception is that you could read row labels with the READ ROW LABEL command (enter HELP READ ROW LABEL for details). If character data was encountered, Dataplot would generate an error message and not read the data file correctly. With the January 2004 version, we have introduced some limited support for character data. Specifically, we have added the command
Setting this to ERROR will continue the current Dataplot action of reporting character data as an error. This is recommended for the case when a file is supposed to contain only numeric data and the presence of character data is in fact indicative of an error in the data file.

Setting this to IGNORE will instruct Dataplot to simply ignore any fields containing character data. This can be useful if you simply want to extract the numeric data fields in the file without entering COLUMN LIMITS or SET READ FORMAT commands.

Setting this to ON will read character fields and write them to the file "dpzchf.dat". Note that Dataplot saves numeric data "in memory" for fast access. Since character data has limited use in Dataplot, we have decided to save character data externally to minimize memory requirements. Dataplot keeps a separate name table for the character data fields (the names for character variables are stored in the file "dpzchf.dat").

NOTE 2018/10: The CATEGORICAL option was added. This option works similarly to ON. However, in addition to creating the character variable in "dpzchf.dat", it also creates numerical variables automatically from the character data.

There are some restrictions on when Dataplot will try to read character data:
Some of these restrictions may be addressed in subsequent releases of Dataplot. Currently, Dataplot has limited support for character variables. Specifically,
We anticipate additional use of character variables in subsequent releases of Dataplot.

If your character fields contain non-numeric/non-alphabetic characters, then it is recommended that the character fields be enclosed in quotes. When Dataplot encounters a quote (either a single or double quote), it interprets everything until a matching quote is found as part of that character field. If the quotes are not used, then spaces, tabs, parentheses, brackets, colons, and semi-colons are interpreted as delimiters that signify the end of that data item.

Dataplot assumes a column oriented format. That is, a row of data represents a single record (or case) and a column of data represents a variable. If a data file has a row orientation, then this is reversed: a row of data represents a variable and a column of data represents a record (or case). The following example shows one way of correctly reading the data into Dataplot. Suppose that your data file contains five rows with each row corresponding to a single variable. You can do the following:
SERIAL READ FILE.DAT X^K

NOTE 2018/10: Dataplot added a READ ROW command that will read each row into a separate column. This command assumes all of the data in a given row are numeric. It does not assume that all rows must contain the same number of elements.

It is sometimes convenient to include comments in data files. If these comments are contained at the beginning of the file, then the SKIP command can be used. To have Dataplot check for comment lines in the data file, enter the command
The default comment character is a ".". That is, any line starting with a ". " is treated as a comment line and ignored. To specify a different comment character, enter the command
with
At the current time (1/2004), Dataplot does not support the
direct reading of Excel data files. We are planning to add
this capability in a future release of Dataplot. Until that
time, you need to save the data in Excel to an ASCII file and
read that ASCII file into Dataplot.
Excel provides the following options for writing ASCII data
files:
The space delimited format will use consistent columns for the
data fields. The variable form of the COLUMN LIMITS command can
be used when the data columns have unequal length.
Character fields will often not have the separating space. The
variable form of the COLUMN LIMITS command can be used in this
case as well.
The comma delimited format will separate data fields with a
single comma. Missing data is represented with successive commas.
Dataplot can now (as of the January 2004 version) handle this
correctly.
Tab delimited files will separate data fields with a tab character.
Note that Dataplot converts all non-printing characters
(including tabs) to a single space character.
The tab delimited format is not appropriate for data containing
variables with unequal lengths since it will not generate
consistent columns for the data fields. Use either the space
delimited or comma delimited file for that case.
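The difference between the comma delimited and tab delimited cases can be
sketched in Python. This is only an illustration of the behavior described
above, not Dataplot code: the function names and the missing value code
-999.0 are hypothetical, and the real parsing is done inside Dataplot.

```python
from typing import List

MISSING = -999.0  # hypothetical missing-value code, chosen for illustration


def parse_comma_row(line: str, missing: float = MISSING) -> List[float]:
    """Split on single commas; an empty field (successive commas) is missing."""
    values = []
    for field in line.split(","):
        field = field.strip()
        values.append(float(field) if field else missing)
    return values


def parse_tab_row(line: str) -> List[float]:
    """Turn non-printing characters (including tabs) into spaces, then split."""
    cleaned = "".join(ch if ch.isprintable() else " " for ch in line)
    return [float(field) for field in cleaned.split()]


comma_row = parse_comma_row("4.0,,6.0")   # empty middle field is kept
tab_row = parse_tab_row("4.0\t\t6.0")     # empty middle field is lost
```

In this sketch the comma delimited row still has three values, while the tab
delimited row collapses to two, which is why the columns no longer line up
when a tab delimited file has empty fields.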
The 2014/12 version of Dataplot added the capability of reading
and writing to the system clipboard under Windows. Using the
"copy" function in Excel and then using the READ CLIPBOARD command
in Dataplot will in many cases be the easiest way to retrieve
data from Excel files. Enter HELP CLIPBOARD
for details.
Note 2020/02: Dataplot added a READ EXCEL command. This command
utilizes Python (and specifically the Pandas package) to read Excel
files. Enter HELP READ EXCEL for details.
A few comments on file names.
If your file name contains a mixture of upper and lower case
characters, then you need to enter the case for the file name
correctly on the READ command.
Dataplot follows the United States convention where the decimal
point is the period ".". Some locales may use a different
character to denote the decimal point. In particular, some
countries use the comma ",".
To allow Dataplot to read files that use a character other than
the "." for the decimal point, enter the command
where <value> denotes the character that specifies the decimal
point.
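As a rough Python illustration of a free-format read with a non-default
decimal point (not Dataplot's implementation): the function name and the
delimiter set are assumptions, and so is the choice to stop treating the
comma as a delimiter when it serves as the decimal point.

```python
from typing import List


def read_row(line: str, decimal_point: str = ".") -> List[float]:
    """Parse one free-format row, honoring a configurable decimal point."""
    # Assumed delimiter set; the decimal point character is removed from it.
    delimiters = " \t,;".replace(decimal_point, "")
    for d in delimiters:
        line = line.replace(d, " ")
    # Convert each field to the "." convention before numeric conversion.
    return [float(field.replace(decimal_point, ".")) for field in line.split()]


us_style = read_row("3.14,2.5")                       # comma is a delimiter
euro_style = read_row("3,14 2,5", decimal_point=",")  # comma is the decimal point
```

Both calls yield the same two numbers; only the role of the comma changes.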
Note this support is fairly limited. Specifically, it applies
to free-format reads (i.e., no SET READ FORMAT command has been
entered). In addition,
MISSING VALUES AND UNDEFINED NUMBERS
Some software programs will have special characters to denote
missing values or undefined values (e.g., the result of trying
to divide by 0).
In particular, Unix/Linux software often uses "nan" to denote an
undefined number. If Dataplot encounters an "nan" in a numeric
field, it will convert it to the Dataplot "missing value". The "nan"
search is not case sensitive (i.e., it will check for "NAN", "NaN",
etc.). You can specify what Dataplot will use for the missing value
by entering the command
where <value> is a numeric value.
Missing value flags are specific to individual programs. You can
specify a character string that denotes a missing value with the
command
where <value> is a string with 1 to 4 characters. If Dataplot
encounters <value> in a numeric field, it will convert it to the
Dataplot "missing value". The missing value string is not case
sensitive. You can specify what Dataplot will use for the missing
value by entering the command
where <value> is a numeric value.
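The two conversions described above ("nan" and a user-specified missing
value string) can be sketched in Python as follows. This is an illustrative
model only: the function name, the parameter names, and the missing value
code -999.0 are hypothetical and are not Dataplot defaults taken from this
page.

```python
from typing import Optional


def to_number(field: str,
              missing_value: float = -999.0,
              missing_flag: Optional[str] = None) -> float:
    """Convert one field, mapping "nan" or the flag string to missing_value."""
    if missing_flag is not None and not 1 <= len(missing_flag) <= 4:
        raise ValueError("missing value string must have 1 to 4 characters")
    token = field.strip().lower()  # both checks are case insensitive
    if token == "nan":
        return missing_value
    if missing_flag is not None and token == missing_flag.lower():
        return missing_value
    return float(field)


row = [to_number(f, missing_flag="NA") for f in ["2.5", "NaN", "na", "7"]]
```

The check for "nan" happens before the numeric conversion because Python's
own `float("nan")` would otherwise succeed and return a floating-point NaN.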
Date and time fields will typically have syntax like
The following commands were added (2016/06) to help deal with date and
time fields.
Although Dataplot does not have explicit date or time variables,
these commands allow the components of date and time fields to
be read as separate numeric variables. For example,
IP addresses typically have a syntax like
By default, Dataplot will generate an error when trying to read a
field of this type. To address this, you can enter the command
If this switch is ON, Dataplot will scan the line and if a field is
encountered that contains more than one period ".", Dataplot will
convert these periods to spaces before parsing the line.
The default is OFF since this adds additional processing time to
the READ and most data sets do not contain IP addresses.
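A minimal Python sketch of the conversion described above, assuming fields
are separated by whitespace (the function name is hypothetical and this is
not Dataplot's code):

```python
from typing import List


def expand_ip_fields(line: str) -> List[float]:
    """Replace periods with spaces in any field holding more than one period."""
    fields = []
    for field in line.split():
        if field.count(".") > 1:
            field = field.replace(".", " ")  # 192.168.1.10 -> 192 168 1 10
        fields.append(field)
    # Re-split so each part of the address becomes its own numeric value.
    return [float(value) for value in " ".join(fields).split()]


values = expand_ip_fields("192.168.1.10 3.5 42")
```

Here the address contributes four numeric values, while the ordinary decimal
number 3.5 (a single period) is left untouched.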
Monetary data may sometimes be given as
The "$" and "," in these numeric fields will cause problems. The
"$" will be treated as a non-numeric value (depending on other
SET commands, this will be treated as an error or the numeric field
will be read as a character field). The comma is typically treated
as a field delimiter. If you have this kind of data, enter the
commands
To reset the defaults, enter
Note that if you enter the SET READ COMMA IGNORE ON command, the
comma will no longer be treated as the delimiter. Dataplot cannot
currently handle the case where the comma is used both for monetary
data and also as a field delimiter.
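The effect of ignoring the "$" and "," can be illustrated with a short Python
sketch (a model of the behavior described above with a hypothetical function
name, not Dataplot code). It also shows why a comma cannot serve as both a
thousands separator and a field delimiter.

```python
from typing import List


def parse_money_row(line: str) -> List[float]:
    """Drop "$" and "," from the line, then split on whitespace."""
    cleaned = line.replace("$", "").replace(",", "")
    return [float(field) for field in cleaned.split()]


space_delimited = parse_money_row("$1,250.75 $3,000 12")
# With commas as the only delimiter, the two amounts run together.
comma_delimited = parse_money_row("$1,250,$3,000")
```

The space delimited line yields three values, whereas the comma delimited
line collapses into the single number 12503000.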
READING NUMERIC VALUES WITH TRAILING "+" OR "-" OR "*" OR "%"
On occasion, numeric fields may have a trailing "+", a trailing "-",
a trailing "*" or a trailing "%".
The "+" is typically used to indicate that the value
is greater than or equal to the entered value. Likewise, the "-" is
used to indicate that the value is less than or equal to the entered
value. This may be used when data is truncated at a high or low value.
If you have data that uses this convention, enter
Dataplot does not have any convention for indicating that a number
in fact means "greater than" or "less than", so it will read the
numeric value and simply ignore the "+" or "-".
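As an illustration only (the function name is hypothetical and this is not
Dataplot's code), ignoring a trailing "+" or "-" amounts to something like
the following Python sketch:

```python
def read_truncated_value(field: str) -> float:
    """Read a numeric field, dropping one trailing "+" or "-" if present."""
    field = field.strip()
    if len(field) > 1 and field[-1] in "+-":
        field = field[:-1]  # the censoring marker is discarded, not recorded
    return float(field)


values = [read_truncated_value(f) for f in ["100+", "0.5-", "-3.2", "-5-"]]
```

A leading sign is unaffected; only the final character is examined, and the
information that a value was "greater than" or "less than" is lost.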
To reset the default, enter
Trailing asterisks ("*") are sometimes used to indicate statistical
significance. To ignore these asterisks, enter
If this command is not given, the field will be treated as a
character field. To reset the default, enter
Percentage data will sometimes include a trailing percent sign
("%"). To ignore the percent sign, enter
If this command is not given, the field will be treated as a
character field. To reset the default, enter
COMMAS WITHIN CHARACTER FIELDS
If you are reading data that may contain character fields, you can
specify whether you want commas in the character fields to be
treated as part of the character field or as a delimiter.
To have the comma treated as a delimiter, enter
To have the comma not be interpreted as a delimiter (i.e., it
will simply be another character in the character field), enter
The default is OFF.
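The two settings can be modeled with a small Python sketch. The function and
parameter names are hypothetical, whitespace is assumed to be the other
delimiter, and the mapping of the default (OFF) to "comma is part of the
character field" is an assumption based on the description above.

```python
from typing import List


def split_fields(line: str, comma_is_delimiter: bool = False) -> List[str]:
    """Split a line into fields, optionally treating commas as delimiters."""
    if comma_is_delimiter:
        line = line.replace(",", " ")
    return line.split()


kept = split_fields("Smith,John 42")                            # comma stays in the field
split = split_fields("Smith,John 42", comma_is_delimiter=True)  # comma ends the field
```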
Currently, the only types of binary data that Dataplot
supports are:
Support for other types of binary files may be added in future
releases. However, this support will probably be for specific
types of binary files as opposed to arbitrary binary files.
The advantage of using unformatted Fortran files is that file sizes
may be significantly smaller and reading the data can be significantly
faster. One potential use of unformatted Fortran files is to save
a large data file that you will read many times in Dataplot.
The disadvantages of using unformatted Fortran files are that they
are not human readable, they cannot be edited or modified using an
ASCII editor, and, most importantly, they are not portable between
operating systems and compilers. That is, unformatted Fortran files
typically need to be read using the same operating system and compiler
that was used to create them.
For details on using unformatted Fortran files, enter
If Dataplot was built with support for the GD library, Dataplot
can read image data in PNG, JPEG, or GIF format. If you have
image data in another format, you may be able to use an image
conversion program (e.g., NetPBM or ImageMagick) to convert it
to one of the supported formats.
For further information, enter
Dataplot was designed primarily for interactive usage. For this reason,
it reads all data into memory. The current default is to have a
workspace that accommodates 10 columns with 1,500,000 rows (you can
re-dimension to obtain more columns at the expense of fewer rows; however,
you cannot increase the maximum number of rows).
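The trade-off between columns and rows can be made concrete with a little
arithmetic. The Python sketch below assumes that the workspace holds a fixed
total number of values (10 x 1,500,000) and that this total is simply divided
among the requested columns; that model and the function name are
assumptions, not something stated on this page.

```python
WORKSPACE_VALUES = 10 * 1_500_000  # assumed fixed total workspace size
MAX_ROWS = 1_500_000               # stated maximum number of rows


def rows_available(columns: int) -> int:
    """Rows per column for a requested number of columns (assumed model)."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return min(MAX_ROWS, WORKSPACE_VALUES // columns)


default_rows = rows_available(10)   # the default layout
wide_rows = rows_available(100)     # more columns, fewer rows
narrow_rows = rows_available(5)     # fewer columns, rows capped at the maximum
```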
With the advent of "big data", there are more data files that cannot be
read into Dataplot's available memory. For these data files, there are
several things that can potentially be done
Distance matrices are often used for various types of
multivariate analysis.
The STREAM READ command can allow you to do a fair bit of
exploratory analyses on these large data sets.
NIST is an agency of the U.S.
Commerce Department.
Date created: 07/07/2004