David Banks, Mark Levenson
Stephen Stein, Robert Schweitzer
The Statistical Engineering Division (SED) has a long history of collaboration on chemical applications at NIST. As the technology of chemistry advances, so must the technology of statistics. The instrumentation that enables chemists to advance their field produces larger and more complex datasets than ever before, and traditional statistical methods must be adapted and expanded to analyze them. SED is now focusing on applying new statistical and data mining tools to a range of problems in chemistry and biotechnology. We hope to carry the experience gained on individual projects over to other projects across these fields.
Presently, we are focusing on the spectrometry modalities used in chemical and material identification. Mass spectrometry is a prominent example: the NIST/EPA/NIH Mass Spectral Database is the most commercially successful Standard Reference Database. This library contains the spectra of over 100,000 compounds and is used by industry and government to help identify compounds. We have identified several technical problems whose solutions would add value to this database product.
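To make the identification task concrete, the following is a minimal sketch of how a measured spectrum might be compared against a reference library. It is not the search method used with the NIST/EPA/NIH database. It assumes each spectrum is a set of integer m/z peaks with intensities and uses plain cosine similarity as the match score. The function names, the toy library, and all peak values are hypothetical and chosen only for illustration.

```python
import math


def to_vector(peaks, max_mz):
    """Turn a {m/z: intensity} peak list into a dense, unit-length vector."""
    vec = [0.0] * (max_mz + 1)
    for mz, intensity in peaks.items():
        vec[mz] = float(intensity)
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero spectrum unchanged to avoid dividing by zero.
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors (1.0 = same peak pattern)."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query, library, max_mz=200):
    """Score the query against every library entry and return the top hit."""
    q = to_vector(query, max_mz)
    scores = {
        name: cosine_similarity(q, to_vector(peaks, max_mz))
        for name, peaks in library.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical three-entry library; the peak values are made up.
TOY_LIBRARY = {
    "compound_A": {41: 30, 43: 100, 57: 80, 71: 40},
    "compound_B": {51: 20, 77: 100, 105: 60},
    "compound_C": {41: 25, 43: 90, 58: 100, 71: 10},
}

# Hypothetical measured spectrum, a noisy version of compound_A.
query = {41: 28, 43: 100, 57: 75, 71: 45}

name, scores = best_match(query, TOY_LIBRARY)
print(name)
for entry, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{entry}: {score:.3f}")
```

With a library of over 100,000 spectra, an exhaustive comparison like this one becomes costly and small differences between near-identical spectra start to matter. That is the kind of setting in which more careful statistical treatment could add value.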
Mass spectrometry is only one of the many emerging spectrometry modalities used in chemical and material identification. Others in which NIST is heavily involved include Gas Chromatography Mass Spectrometry (GCMS) and Fourier Transform-Near Infrared (FT-NIR) spectroscopy. Each modality has its own particulars and applications. From a data-analytic point of view, however, the modalities share common aspects that statisticians may be able to exploit.
Beyond spectrometry, we are discussing the application of data mining techniques to the carbon dating of ice cores and to genomic data on the interaction of drugs with cancer cell lines.
Figure 20: The results of a cluster analysis of mass spectra.
Date created: 7/20/2001