IV Winter Symposium on Chemometrics, 15-18 February 2005, Moscow (Chernogolovka), Russia
The H-principle (the Heisenberg principle of mathematical modeling) is a recommendation on how mathematical models should be solved when data are uncertain. It recommends carrying out the modeling task in steps, where at each step the data (variables) are weighted according to their contribution to the aim of the modeling task. The primary interest in science and industry is the prediction aspect of the model. There are many ways to evaluate the prediction ability of a model. In the present work cross-validation is used: 10% of the samples are selected at random and set aside, the unknown parameters are estimated from the remaining 90% of the samples, and the response values of the 10% left out are then predicted using the estimated parameters. This is repeated 30 times. The procedure gives a fair indication of the prediction quality of the model.
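The validation scheme described above (a random 10% left out, repeated 30 times) can be sketched as follows. This is a minimal illustration, not the H-method itself: a simple straight-line least-squares fit stands in for the model, and the function names and toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def repeated_holdout_rmse(xs, ys, n_repeats=30, test_fraction=0.10, seed=0):
    """Leave out a random fraction of the samples, fit on the rest, predict
    the left-out responses; repeat and pool the squared prediction errors."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, round(test_fraction * n))
    squared_errors = []
    for _ in range(n_repeats):
        test_idx = set(rng.sample(range(n), n_test))
        train_idx = [i for i in range(n) if i not in test_idx]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Hypothetical toy data: y = 2 + 3x plus bounded noise (illustration only).
_rng = random.Random(1)
x_demo = [i / 10 for i in range(50)]
y_demo = [2 + 3 * x + _rng.uniform(-0.5, 0.5) for x in x_demo]
rmse_demo = repeated_holdout_rmse(x_demo, y_demo)
```

Pooling the squared errors over all 30 repetitions is one of several reasonable choices; averaging the per-repetition errors would serve equally well.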
A large collection of algorithms has been developed in association with the H-principle under the common name 'H-method'. The H-method looks at the weight vectors and chooses the ones that contribute to the modeling task. The remaining weights are set to zero. In the case of linear regression the H-method gives results that have been optimized with respect to the prediction aspect of the model. Thus it gives better prediction than other traditional methods (PLS regression, Principal Component Regression, least squares regression and others).
Many methods in applied sciences use a positive definite matrix as a starting point for finding the solution. Examples are Ridge regression, variance components in experimental design, Kalman filter methods and others. The H-method has been extended to these areas. It will be shown by examples that the H-method is superior to the traditional methods that use the full rank solution. Also, it is better to work with the solution obtained by the H-method when carrying out significance testing.
Estimation in non-linear models is traditionally carried out by the Gauss-Newton method, supplemented by Marquardt regularization. It will be shown by an example that the H-method gives better results than the traditional method. This holds both for the predictions derived from the model and for the interpretation of parameters.
In many areas the classification of samples is important. A new type of weight vector has been defined such that the H-method can be used for classification of data. Applications of this method to spectral data have given superior results compared to traditional methods.
Differentials and differential equations are traditionally used to describe changes of systems in time. The H-method has been extended to Path Modeling, where at each 'knot' (time point, status etc.) there is a data block, a matrix, that describes the situation. The H-method gives regression coefficients between the data blocks in a similar way as in linear regression, where there are only two data blocks. It is more efficient and more informative to use Path Modeling than the traditional approach by differential equations. An example of the usage of the H-method will be shown.
An important aspect of the H-method is that the vectors computed at each step have some special interpretations. Thus, using graphic analysis the user can get important insight into the latent variation of the data. These graphic features are available for all areas, where the H-method has been applied, both linear and non-linear models.
The H-method has been extended to many areas of applied sciences. For orientation visit the author's personal website.
Keywords: multivariate calibration, interval estimations, SIC-method, linear programming, object status classification.
Simple Interval Calculation (SIC) is a method for linear modeling and for prediction interval estimation in the multivariate calibration (MVC) problem.
It is shown that SIC leads to results that are in a convenient interval form and that account for all uncertainties present (X measurement errors, y measurement errors, bilinear modeling errors). The SIC approach also provides wide possibilities for leverage-type object status classification. The method is based on the single assumption that all errors involved in MVC are finite. In this respect SIC differs from the traditional chemometric methods used for multivariate data analysis and is therefore not readily accepted by analysts.
In the presentation we discuss the finiteness of error. The assumption of a normal error distribution is commonplace in conventional data analysis. Sometimes it is stated, but often it is simply assumed by default. However, researchers do not connect the normality of the error distribution with its unboundedness. Does anybody take into account the data points that lie beyond four standard deviations (4σ)? On the other hand, the amount of data in modern data mining is often greater than 10^6. Therefore, from the statistical point of view, there should be 20-30 values that lie beyond 4σ. Where are they? The answer is that if such values occur, they are excluded before data processing. We consider that in a real case study it should be assumed that all error distributions are truncated at 4σ, or maybe even at 3σ. Just this simple idea leads to drastic consequences that throw new light on the old MVC problem.
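The order of magnitude of this argument can be checked with the normal tail probability. The sketch below is only an illustration under the assumption of independent, exactly normal errors; the function name is hypothetical. For 10^6 values it gives roughly 32 values beyond 4σ in a single tail, the same order as the figure quoted above, and roughly 63 when both tails are counted.

```python
import math

def expected_count_beyond(k_sigma, n_points, two_sided=True):
    """Expected number of values farther than k_sigma standard deviations
    from the mean among n_points independent normal values."""
    one_tail = 0.5 * math.erfc(k_sigma / math.sqrt(2.0))
    tail_probability = 2.0 * one_tail if two_sided else one_tail
    return n_points * tail_probability

n_values = 10 ** 6
beyond_4_one_sided = expected_count_beyond(4, n_values, two_sided=False)
beyond_4_two_sided = expected_count_beyond(4, n_values, two_sided=True)
```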
We explain the SIC approach using the simplest and most familiar examples. Our goal is to introduce the SIC method in parallel with the traditional regression approach in order to emphasize both the common and the distinctive features of SIC. In the description we use a simulated data set that illustrates the SIC technique and outlines the main SIC concepts and postulates. We also demonstrate the application of SIC to a well-known real-world data set, the octane rating procedure based on near-infrared spectroscopy.
An example from archaeometry deals with fatty acid concentration data measured on tissue samples from human mummies, including the 5000-year-old Tyrolean Iceman, as well as 2000-year-old corpses found in a permafrost area in Siberia. In another example, PLS mapping of concentration profiles of triterpenoids has been applied to obtain information about the botanical origin of a black glue material found on Neolithic weapons and tools.
Since March 2004 the Rosetta mission of the European Space Agency has been cruising towards a comet. One of the instruments on board is a secondary-ion time-of-flight mass spectrometer (TOF-SIMS) dedicated to investigating comet dust particles. Currently available laboratory mass spectra from organic and inorganic reference compounds - presumably relevant for the analysis of comet material - have been evaluated by chemometric methods.
In the univariate context, prediction uncertainty is quantified by a sample-specific standard error of prediction. Unfortunately, multivariate models are inherently much more complex than their univariate analogues. Monte Carlo simulation techniques such as the bootstrap and the noise addition method can give an estimate of this uncertainty but also some approximate mathematical expressions have been proposed in the literature. One of these proposals is the correction made by De Vries and Ter Braak (1995) on the expression derived by Martens (1989) and used in the Unscrambler® software package. Another proposal is the simplification of Faber and Bro (2002) of an expression derived earlier under the errors-in-variables (EIV) model (1996).
The purpose of this study is to show how these two proposals work and to compare their results with those obtained using the bootstrap and the noise addition methods. The study has been performed using embedded near-infrared (NIR) data sets. This kind of data is produced by a NIR instrument installed on a forage plot harvester. Thanks to this instrument, the collection, compression and scanning of forage samples are performed during harvesting. In this way, properties such as the dry matter content of forages can be measured without having to handle the samples or transport them to a laboratory.
Analytical chemistry deals with accurate and precise determination of salient analyte concentrations etc. Within this science there is complete control of the attendant analytical uncertainty, but this is only a reflection of the minuscule analytical mass/volume. What is the relationship to the lot from which the samples were taken? The lot is typically 10^3 to 10^6 times larger than the analytical mass, so the issue of how to conduct proper representative mass reduction comes to the fore, but there is no answer from within traditional analytical chemistry.
The Theory of Sampling (TOS) is the only complete scientific approach to representative sampling of heterogeneous materials. TOS provides a fundamental understanding of all the factors involved in producing sampling errors (there are seven fundamental types), which are by far the largest contributor to the total uncertainty budget of any analytical method. Typically the total sampling error is some 100 times larger than the analytical error (sic). In this context the specific analytical errors (which are often taken to be the same as the elusive "measurement errors") are reduced into oblivion! What is the impact of such overwhelming sampling errors on chemometrics and data analytical modeling and prediction etc.?
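A small arithmetic sketch shows why the analytical error becomes negligible at such a ratio. It rests on two assumptions that are not stated in the abstract: that the factor of 100 refers to standard deviations, and that the two error contributions are independent and therefore add in quadrature. The names and the unit value are hypothetical.

```python
import math

def total_error(sampling_sd, analytical_sd):
    """Combine two independent error contributions in quadrature."""
    return math.sqrt(sampling_sd ** 2 + analytical_sd ** 2)

analytical_sd = 1.0                    # arbitrary units (hypothetical value)
sampling_sd = 100.0 * analytical_sd    # the 100:1 ratio quoted in the text
total_sd = total_error(sampling_sd, analytical_sd)

# Fraction of the total variance that comes from the analytical step.
analytical_share_of_variance = analytical_sd ** 2 / total_sd ** 2
```

Under these assumptions the analytical step contributes about one part in ten thousand of the total variance, so improving the analysis alone barely changes the total uncertainty.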
It is actually necessary to invoke the science of metrology (the science of measurement) for a complete, systematic understanding of these complex interrelationships, but for a first-order introduction, this lecture presents a new multivariate perspective on the Theory of Sampling.
Despite its simplicity, the algorithm successfully eliminates the background even in complex cases. The approach can be effectively applied to hyphenated chromatographic data such as HPLC-DAD or, more generally, to any hyphenated technique that applies a spectroscopic detector to monitor processes evolving in time. In the present work the performance of the method as a preprocessing stage for resolving mixture components is shown using several experimental datasets from two hyphenated techniques: HPLC-DAD and TGA-IR.
The automation of direct AEA is a complex and insufficiently formalized problem. Computerization of the spectrograph allows us to use 30-100 spectral lines. However, there is no software that handles such extensive data by means of a knowledge base, as the analyst does. Analyzing the properties of visual interpretation, we can conclude that this procedure can be automated using the expert system paradigm. Indeed, direct AEA has all the characteristics that an expert system requires, i.e.:
Our report discusses a general scheme of the expert system, and we give mathematical models and algorithms for some of its components. In particular, we describe a methodology for fitting the optimal analysis parameters based on multi-objective optimization. We also consider questions of calibration using multivariate analysis methods such as principal component analysis and neural networks. The report focuses on a formal mathematical description of the methods, illustrated with practical examples.
This software contains a knowledge base and a database, which are connected with one another by calculation modules. Registered spectra of calibration samples, CRMs and unknown samples are collected in the database for every analytical technique. The knowledge base and calculation modules rapidly carry out the primary methodological studies, such as:
The processing of spectral information is optimized using performance criteria for the analytical results.
The OES-AD software package is used to acquire and process data in techniques for determining impurities in quartz, crystalline silicon and silanes; Au, Ag and noble metals in insoluble carbonaceous matter from black shales; lanthanum and yttrium in barium fluoride crystals; and boron and tin in granitoids.
The responses of a multisensor system exposed to nitromethane vapours over a wide range of concentrations (0.005 - 0.075 g/m3) have been investigated; this interval includes the maximum permissible concentration of nitromethane in the working area (0.03 g/m3).
Nitromethane concentrations in the air can be determined precisely using a set of piezosensors. The responses of the sensor set are applied to the input layer of an artificial neural network. In order to reduce the determination error, the structure of the artificial neural network and its parameters have been optimized. The following conditions for the determination of nitromethane with an artificial neural network have been found: 7 neurons in the hidden layer; momentum 0.9; learning rate 0.1; preprocessing algorithm lg(|Fgas/Fmod|). To minimize the error of nitromethane determination, the network was trained with the error backpropagation algorithm.
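Two of the settings listed above can be illustrated in a few lines: the logarithmic preprocessing and a gradient update with a learning rate of 0.1. The sketch assumes that the value 0.9 quoted in the text is the momentum coefficient of the backpropagation update; the symbols Fgas and Fmod are taken from the text as given, and the function names and the single-weight toy problem are hypothetical. It is not the authors' network.

```python
import math

def preprocess(f_gas, f_mod):
    """Preprocessing quoted in the text: lg(|Fgas / Fmod|), base-10 logarithm."""
    return math.log10(abs(f_gas / f_mod))

def momentum_step(weight, velocity, gradient, learning_rate=0.1, momentum=0.9):
    """One gradient-descent update with momentum, the update rule commonly
    used when training a network by error backpropagation."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

# Toy check on a single weight with loss 0.5 * (w - 3)^2, gradient (w - 3).
w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, gradient=w - 3.0)
```

In a real network the same update is applied to every weight, with the gradient supplied by backpropagation through the hidden layer.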
The sensor set is quite portable and can be used in intelligent systems capable of independently assessing the environment, detecting and recognizing certain components, and issuing instructions to the corresponding execution units.
Promising analytical tools for such applications are the 'Electronic Tongue' (ET) multisensor systems. The idea of the electronic tongue is based on the use of an array of cross-sensitive chemical sensors combined with multivariate data analysis. Such an approach makes it possible to perform simultaneous quantitative determination of various substances in multicomponent media as well as integral monitoring (follow-up) of industrial processes as a whole. Since cross-sensitive sensors produce complex non-selective responses in multicomponent media, the use of proper data processing techniques is of paramount importance for successful application of the system. Different pattern recognition and multivariate calibration methods can be employed for this purpose, Principal Component Analysis (PCA) and Partial Least Squares regression (PLS) being probably the most common. In the present paper two multivariate calibration methods, PLS regression and a back-propagation artificial neural network (ANN), were applied to electronic tongue data from monitoring of fermentation growth media, and their performance was compared.
Measurements with the electronic tongue were made in simulated fermentation media closely resembling real-world samples typical of the production process involving Aspergillus niger, with the aim of simultaneous determination of ammonium, oxalate and citrate content. Two multivariate calibration techniques, PLS regression and a back-propagation neural network (ANN), were employed to produce calibration models. The ANN configuration was optimized for the given task, which included optimizing the number of input signals and hidden neurons and choosing the data preprocessing method. It was found that the ANN produced somewhat better results than PLS in the data fitting, most likely because it better accounts for the significant non-linearity in the sensor responses, particularly at low concentration levels of the detected components. The average prediction errors for independent test set solutions were in the range from 5 to 8% for the three target components. The electronic tongue shows promise for fermentation monitoring and industrial applications, based on the good precision and reproducible behavior of the sensor system and adequate data processing.
We will explain the application of Fourier transform infrared-attenuated total reflection (FTIR-ATR) imaging in tablet research and show a number of applications. The emphasis will be on the coupling of FTIR-ATR imaging with the conventional dissolution test, the behaviour of soluble and poorly soluble drugs in tablets, the application of different ATR crystals, and of course the multivariate techniques used to process the large quantities of data obtained.
The changes in the UV-Vis absorption spectra of the zirconocene-based catalytic systems rac-Me2Si(2-Me,4-PhInd)2ZrCl2/polymethylalumoxane (MAO) and Ph2CCpFluZrCl2/MAO in toluene, as observed upon variation of the AlMAO/Zr molar ratio over the range 0-3000 mol/mol, were studied using statistical methods. Application of principal component analysis made it possible to determine the number of light-absorbing reaction products in each system as well as to ascertain the general trends in the chemical processes taking place in these systems. The UV-Vis absorption spectra of the intermediate reaction products were estimated using the parameterized matrix self-modelling method on the basis of the supposed reaction model.
Image processing and analysis are used in a wide range of scientific and industrial applications. It is often possible to obtain information about different properties of materials, liquids, mixtures etc. by analyzing their images, bypassing instrumental measurements. The development of image classification methods is therefore highly relevant. The image classification methods widely used at present rely on different approaches: statistics, morphology, fractal analysis, Fourier and wavelet transforms and so on. However, the choice of a specific classification method depends strongly on the type of image.
In this work, the problem of using multivariate data analysis methods for image classification is considered. The results of classification of different types of images (heterogeneous, homogeneous non-periodic and homogeneous quasi-periodic) using Fourier and wavelet transforms and Angle Measure Technique are compared.
Whenever samples are extracted from a large set of samples, the subset should be representative. Representativity is a vague term, which can be interpreted in different ways. The results for representativity depend not only on the data themselves but also on the working variable subspace, which in its turn depends on the calibration model, i.e. the number of PLS components. We will consider two different situations. First, a set of samples should be split into training and test sets and we want to verify that these subsets are representative of each other. Different statistical tools are used for this purpose.
On the other hand, the test set is intensively used for calculating the Root Mean Square Error of Prediction, which serves to evaluate the prediction ability of the model. However, this quality measure should be used with care, as it greatly depends on the "quality" of the test set.
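For reference, the Root Mean Square Error of Prediction is the square root of the mean squared difference between reference and predicted values over the test set. A minimal sketch follows; the function name and the toy numbers are hypothetical.

```python
import math

def rmsep(y_reference, y_predicted):
    """Root Mean Square Error of Prediction over a test set."""
    if len(y_reference) != len(y_predicted) or not y_reference:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(r - p) ** 2 for r, p in zip(y_reference, y_predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Toy test set: two exact predictions and one that is off by 2 units.
rmsep_demo = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The toy case shows the sensitivity mentioned above: a single poorly predicted test object dominates the value.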
In the second case we want to select the most important objects in the training set and use this subset for model construction without significantly compromising the prediction ability of the model. Such a subset must satisfy two opposing requirements: 1) it should be of maximal representativity with respect to the entire set, but 2) it should at the same time be noticeably smaller than the total set. Here, the SIC object classification is used as the main tool. The presented results are compared with the Kennard-Stone algorithm.
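The Kennard-Stone algorithm used for comparison can be sketched as follows: start with the two most distant objects, then repeatedly add the object whose distance to its nearest already selected object is largest. This is a minimal sketch of the standard algorithm with Euclidean distances, not of the SIC-based selection; the function names and the one-dimensional toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    # Start with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: squared_distance(points[ij[0]], points[ij[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Distance from each remaining point to its nearest selected point.
    nearest = {
        i: min(squared_distance(points[i], points[s]) for s in selected)
        for i in remaining
    }
    while len(selected) < n_select:
        chosen = max(remaining, key=lambda i: nearest[i])
        selected.append(chosen)
        remaining.remove(chosen)
        for i in remaining:
            d = squared_distance(points[i], points[chosen])
            nearest[i] = min(nearest[i], d)
    return selected

demo_points = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,)]
demo_selection = kennard_stone(demo_points, 3)  # indices into demo_points
```

The selection depends only on the X-space distances, which is why the choice of working subspace matters for any comparison.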
The analysis of experimental errors shows that for both direct and indirect measurements the resultant error should be considered a bounded value. The model of interval-bounded errors has become a subject of intensive study in model building and design of experiments during the last decades.
Using the "black box" model we assume that the measurement error for the output variable is interval-bounded and the errors for the input variables are negligible.
Each record in the table of measurements with a bounded error in the output variable allows us to formulate a constraint on the model parameters. All the values of the model parameters that are consistent with all the constraints form the set of possible parameter values, also called the uncertainty set or information set. An empty uncertainty set means that the collected empirical information is inconsistent. The presence of outliers in the processed data is one possible reason for contradictions in the dataset.
The core idea of the proposed outlier detection method is as follows. An outlier caused by a blunder may be treated as a value measured with an underestimated error, i.e. the actual measurement error is greater than the declared error. In order to correct the outlier, it is necessary to find the lowest bound of its possible actual error that makes the corrected observation consistent with the others. Comparing the lowest bound of the possible actual error with the declared observation error allows us to make some inferences about the degree of inconsistency of the outlier with the whole dataset.
In the case of a linearly parameterized model, the problem of finding the lowest bound of the possible actual error may be stated as a linear programming problem.
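For a model with a single parameter the linear program reduces to an intersection of intervals, which makes the idea easy to follow. The sketch below assumes the hypothetical model y = a*x with x > 0 and observations given as (x, y, declared error bound); it is a one-parameter special case for illustration, not the general method. With several parameters the same minimization would be passed to a linear programming solver.

```python
def parameter_interval(observations):
    """Uncertainty set for a in y = a*x: the intersection of the intervals
    [(y - e)/x, (y + e)/x] over all observations (x > 0). None if empty."""
    lower = max((y - e) / x for x, y, e in observations)
    upper = min((y + e) / x for x, y, e in observations)
    return (lower, upper) if lower <= upper else None

def lowest_error_bound(suspect, others):
    """Smallest error bound for the suspect observation that makes it
    consistent with the uncertainty set built from the other observations."""
    x, y, _declared = suspect
    interval = parameter_interval(others)
    if interval is None:
        raise ValueError("the other observations are already inconsistent")
    lower, upper = interval
    # The admissible parameter value closest to the suspect's own ratio y/x.
    a_best = min(max(y / x, lower), upper)
    return abs(y - a_best * x)

# Hypothetical data: the last record is a blunder.
data = [(1.0, 2.0, 0.2), (2.0, 4.1, 0.2), (3.0, 5.9, 0.2), (4.0, 10.0, 0.2)]
full_set = parameter_interval(data)               # None: the set is empty
needed_error = lowest_error_bound(data[3], data[:3])
```

Here the required bound is far above the declared 0.2, which flags the last record as strongly inconsistent with the rest.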
The proposed technique is applied to solving the problem of geometric correction of satellite images.
A multilevel approach to the development of integrated waste utilization technologies has been suggested. It includes informational, physicochemical, technological, and integral blocks. Two basic principles are used for the creation of integrated utilization methods in the most complicated cases, such as multicomponent heterogeneous wastes: the principle of phase redistribution and that of composite component correspondence. The principle of phase redistribution implies phase transformation processes that yield products whose composition is satisfactory for further utilization or serves for neutralization of dangerous components. The principle of composite component correspondence requires that the compounds of all material flows, whether they play a role in the waste transformation process or act as ballast, fully correspond in their properties to the general transformation route and to the quality of the final product.
The multilevel approach to the problem of analysis and processing of wastes makes it possible to convert multicomponent heterogeneous toxic wastes into ecologically safe and utilizable products.
At present, this problem is avoided by calibrating each analyzer individually. At the same time, the development of methods that enable the transfer of calibration models between instruments of the same type substantially reduces labor costs and thus widens the range of application of NIR analyzers. Currently only one group of calibration transfer methods is in use, namely those in which the spectra measured on a secondary instrument are corrected so that they conform to spectra measured on the master instrument. It then becomes possible to use the calibration model of the master instrument on the secondary one. In all calibration transfer methods the spectra correction is carried out over a partial set of samples that is significantly smaller than the set used for a usual calibration.
However, with such a transfer, when samples of new types of organic matter (for example, wheat of a new crop or a new variety) appear in the laboratory where the secondary instruments are installed, the calibration has to be repeated on the master instrument, which as a rule is unfortunately located very far from the laboratory (usually at the vendor's site, hundreds of kilometers away).
Therefore we propose not to correct the spectra of the secondary instrument to adapt them to the spectra of the master instrument but, on the contrary, to correct the spectra of the master instrument, adjusting them to the spectra of the same samples measured on the secondary one. Then, when new samples appear, their spectra simply have to be measured on the secondary instrument and added to the corrected spectra of the master instrument, which are delivered by the vendor together with the calibration model. The improved calibration model is then calculated by PLS or PCA over this aggregated set of spectra. The results of a study of the metrological characteristics of the considered algorithm for adapting calibration models are discussed.
The recognition of the composition and microphysical characteristics of aerosol impurities is one of the most urgent tasks in atmosphere monitoring. The physical principles that make aerosol identification by remote sensing possible and some methods of solving this problem are considered. Specific recognition results are given, together with the dependence of recognition efficiency on the number of frequency channels and on the signal-to-noise ratio of the received multifrequency lidar signals.
A large number of properties were modeled for various organic compounds based on their structural formulas, e.g. density, boiling point, viscosity, surface tension, magnetic susceptibility, lipophilicity (octanol-water distribution coefficient), critical temperature, flash point, polarizability, enthalpy of evaporation, etc.
Prediction of toxicity is a challenging problem that has not yet been completely solved. However, the toxicity of many industrially important compounds can be predicted on the basis of computed lipophilicity, taking into account the presence of toxophores and using fragmental descriptors.
The paper deals with the development of domestic chemometrics over two decades (the 1960s and 1970s). In these years the infrastructure of such new directions was supported by the Scientific Council on the Problem of Cybernetics of the USSR Academy of Sciences.
A large informal collective, working successfully in different directions, was created: application of new methods in laboratory and industrial research, extensive publishing and educational activity, transformation of the education system in chemical higher schools, etc. Applied statistical methods were regularly used in analytical, organic, physical and other areas of chemistry. In effect, this was what is called chemometrics today.
Five varieties of apples were studied using three different analytical techniques: HPLC, an electronic tongue multisensor system based on potentiometric chemical sensors, and FTIR spectroscopy. Twenty samples (apples) of each variety were measured. Juice was pressed from each fruit, clarified and deep-frozen before measurements. Juice samples were stored at -80 °C and thawed before measurements. Concentrations of organic acids (malic, citric, galacturonic etc.) and sugars were measured by HPLC, the conventional method for fruit analysis in such cases. HPLC data were used as the reference for calibration of the electronic tongue and FTIR.
Different aspects of data processing were addressed. Recognition of the apples according to variety using data from the three analytical instruments was performed by PCA and PLS discriminant analysis. Quantitative calibration of the electronic tongue and FTIR with respect to organic acid and sugar content was done using PLS regression. Issues of obtaining complementary information from the electronic tongue and FTIR spectroscopy, as well as merging data of different nature, were considered.
Acknowledgements: This work was partly supported by the NATO Linkage Grant.
Chemists have accumulated a vast amount of data on the rate constants of free radical reactions. The bond dissociation energy (BDE) of organic molecules is one of the important characteristics of compounds in radical reactions. These data can be structured and assembled in electronic data collections.
The Data Warehouse concept can be used to represent experimental data on radical kinetics and the thermochemistry of organic compounds. The basic purpose of a Data Warehouse of scientific data is to organize and maintain the data for processing, with the aim of extracting new data or generalizing the available data.
A data warehouse on the kinetics of radical reactions and the thermochemistry of organic compounds was developed (its total volume is estimated at 40000 records). It makes it possible to pose and solve problems of statistical and comparative analysis of the kinetic data on various groups of such reactions (Knowledge Discovery).
Knowledge Discovery in a data warehouse is a process of non-trivial extraction of implicit, previously unknown and potentially useful information about the BDE of organic compounds and their reactivity. Data mining is one step of this process.
The kinetic data were analyzed using a pattern, the empirical intersecting-parabolas model of Prof. E. T. Denisov. Within the framework of this model, the enthalpy of a radical hydrogen-atom abstraction reaction is related to the activation energy of the reaction by a parabolic dependence.
On the basis of this model an expert system (ES) for estimating the BDE of organic molecules from the rate constants of radical abstraction reactions was developed. BDE values were calculated for more than 500 organic compounds.
A second ES predicts the reactivity of compounds in elementary radical abstraction reactions. Testing of the ES on known experimental data has shown satisfactory results.
Both expert systems work on one knowledge base and use a small set of production rules. Both the knowledge base and the data warehouse are implemented as a set of databases of special structure.
Thus, the created database infrastructure makes it possible to undertake the development of an information portal to support knowledge management in the thermochemistry of radical reactions.
Most environmental objects under study are complex systems. One of the research problems for such systems is determining the relationship between the concentrations of the system components and the analytical signal. The traditional solutions to this problem are not always possible because of the insufficient selectivity of the sensors. The development of chemometric multivariate data processing allows non-selective sensor arrays to be applied to the analysis of complex solutions. In this work a multisensor system for voltammetric analysis of multicomponent mixtures of aromatic nitro compounds was developed. The algorithm for processing the sensor array responses was tested with the help of the "Maple" package. It includes the following steps: construction of a polynomial model of the voltammetric behavior of mixtures of nitroaromatic compounds, testing of the models, and simultaneous determination of the mixture components. The multisensor system allows quantitative determination of three-component mixtures with concentration ratios (one component to the sum of the two others) from 1:1 up to 1:5. The advantages of the suggested method are fast analysis, low cost of analysis and data processing, versatility, and the possibility of use for classifying the analytical objects.
Investigation of the degree of swelling of rubbers containing 5 to 20 wt% of dibutyl phthalate revealed an extremal dependence of the diffusion coefficients on the amount of plasticizer introduced, which made it possible to optimize the material composition so as to reduce the negative impact of hydrocarbon media on the properties of the materials. It should be noted that this approach (investigation of rubber swelling in a hydrocarbon medium at room temperature and determination of diffusion coefficients by means of the "FITTER" software) can be very useful in developing new rubber formulations with a controlled diffusion rate and degree of swelling.
The 154 biomass samples available for this study have very different origins, such as wood, grass, reed, brewery waste, or poultry litter. Each sample has been characterized by the contents of carbon, hydrogen, nitrogen, oxygen, sulfur, chlorine, and ash. PCA of these data shows a good clustering according to the origin of the samples.
A subset of 122 samples, all consisting of plant materials, has been used to develop regression models for predicting the higher heating value (HHV) from the elemental composition. OLS and PLS models with the best predictive ability have been obtained using the contents of carbon (C), hydrogen (H), and nitrogen (N), with the variables defined as C, C*C, H, C*H, and N. The standard errors of prediction of the new models are considerably smaller than those obtained with the many models reported in the literature.
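The five regressors named above can be built from the three element contents as in the sketch below. It shows only the construction of the model variables and the form of a linear prediction; the regression coefficients are not given in the abstract, so they are left as inputs. The function names and the example composition are hypothetical.

```python
def hhv_design_row(carbon, hydrogen, nitrogen):
    """Regressors named in the text, in order: C, C*C, H, C*H, N."""
    return [carbon, carbon * carbon, hydrogen, carbon * hydrogen, nitrogen]

def predict_hhv(intercept, coefficients, row):
    """Linear prediction from a fitted model; coefficients must be supplied."""
    if len(coefficients) != len(row):
        raise ValueError("one coefficient per regressor is required")
    return intercept + sum(c * x for c, x in zip(coefficients, row))

# Hypothetical composition in wt%, for illustration only.
row_demo = hhv_design_row(50.0, 6.0, 0.5)
```

The squared and cross terms let a model that is linear in its coefficients capture curvature in the dependence on carbon and hydrogen.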
This work is supported by the Russian Foundation for Basic Research, grant 04-03-32177.
Two models of portable two-channel high-precision FTIR spectrometers for the near and middle spectral ranges were developed and tested by the Institute of Spectroscopy in 2003. Their main parameters are given on the website www.isan.troitsk.ru
Mid-range FTIR-ATR spectrometry with a fiber-optic sensor and a ZnSe cuvette was used in a preliminary study of a procedure for the simultaneous quantification of fat, protein and lactose in raw milk and of a procedure for the identification of fat, soya and palm oils. The calibration was based on a series of standard samples produced by the Institute of Dairy Industry, RAAS, and on adapted statistical regression analysis. It is concluded that the FTIR-ATR method offers a simple and efficient means of quality control in milk analysis.