
3.3.6 Data Mining in Chemistry and Biotechnology

David Banks, Mark Levenson
Statistical Engineering Division, ITL

Stephen Stein, Robert Schweitzer
Physical and Chemical Properties Division, CSTL

Lloyd Currie
Surface and Microanalysis Science Division, CSTL

The Statistical Engineering Division (SED) has a long history of collaboration in chemical applications at NIST. As the technology of chemistry advances, so must the technology of statistics. Instrumentation that enables chemists to advance their field produces larger and more complex datasets than ever, and traditional statistics must be adapted and expanded to analyze them. SED is now focusing on applying new statistical and data mining tools to a range of problems in chemistry and biotechnology. We hope to carry the experience gained on individual projects over to other projects across these fields.

At present, we are focusing on the spectrometry modalities used in chemical and material identification. For example, the most commercially successful Standard Reference Database (SRD) is the NIST/EPA/NIH Mass Spectral Database. This library contains the spectra of over 100,000 compounds and is used by industry and government to help identify compounds. We have identified several technical problems whose solutions would add value to the database product.

  • From the mass spectrum of an unknown compound, determine the rigorous probability that the compound matches a given reference compound in the SRD MS database. The methodology would be developed to accommodate the differing assumptions and inputs of various settings.

  • From the mass spectrum of a mixture of compounds, identify compounds that are likely constituents.

  • From a compound with a known chemical structure, predict the resulting mass spectrum.
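
As a rough illustration of the first problem above, the following minimal sketch ranks reference spectra against a query spectrum by cosine similarity. It is an assumption-laden toy, not the method used for the SRD MS database: the function names, the dictionary representation of a spectrum (m/z mapped to intensity), and the library entries are all hypothetical. Note also that a similarity score of this kind is not the rigorous match probability the first problem calls for; it only orders candidates.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    # Peaks absent from one spectrum contribute zero to the dot product.
    dot = sum(intensity * b.get(mz, 0.0) for mz, intensity in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_library(query, library):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = [(name, cosine_similarity(query, ref)) for name, ref in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy library: peak positions and intensities are invented
# for illustration and are not taken from the NIST/EPA/NIH database.
toy_library = {
    "compound_A": {41: 30.0, 43: 100.0, 58: 25.0},
    "compound_B": {51: 20.0, 77: 100.0, 78: 60.0},
    "compound_C": {43: 100.0, 58: 80.0, 71: 10.0},
}
query_spectrum = {41: 28.0, 43: 100.0, 58: 30.0}

ranking = rank_library(query_spectrum, toy_library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Turning such scores into probabilities would require additional modeling, for example of how often unrelated compounds produce similar scores, which is the kind of setting-dependent assumption the first problem mentions.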

Mass spectrometry is only one example of the many emerging spectrometry modalities used in chemical and material identification. Other examples in which NIST is heavily involved include Gas Chromatography-Mass Spectrometry (GC-MS) and Fourier Transform Near-Infrared (FT-NIR) spectroscopy. Each modality has its own particulars and applications. From a data-analytic point of view, however, the spectrometry modalities share common aspects that statisticians may be able to exploit.

Beyond spectrometry, we are discussing the application of data mining techniques to the carbon dating of ice core data and to genomic data on the interaction of drugs and cancer cell lines.


Figure 20: The results of a cluster analysis of mass spectra.


Date created: 7/20/2001
Last updated: 7/20/2001