GENERATE and LINEAR - MS-DOS utilities for processing of experimental
series with systematic errors

E.B. Rudnyi
Department of Chemistry
Moscow State University
119899 Moscow, Russia
e-mail RUDNYI@MCH.CHEM.MSU.SU

(C) 1994, All rights reserved

Purpose: estimation of unknown parameters of the models

    yij = a + b xij + eij
    yij = a + eij
    yij = b + eij

using values from several experiments containing systematic errors.

Feature: estimation of unknown variance components by the maximum
likelihood method.

Requirements: IBM-compatible computer with MS-DOS (version 3.3 or
higher). The code itself needs about 80 Kb of memory; some memory is
also needed for your data. The better the computer, the faster LINEAR
works, although it should work even in the worst configuration.

Files in the archives:

LINEAR.EXE    - the utility to process data
GENERATE.EXE  - the utility to generate pseudo-experimental values
                to test LINEAR
ONEWAY.CFG    - configuration file of GENERATE to imitate one-way
                classification results
LINE.CFG      - configuration file of GENERATE to imitate linear
                regression results
ONEWAY1.DAT   - examples of data files for one-way classification.
ONEWAY2.DAT     They were used to draw fig. 1 in ref. [1]
ONEWAY3.DAT
ONEWAY4.DAT
LINE1.DAT     - examples of data files for linear regression.
LINE2.DAT       They were used to draw fig. 2 in ref. [1]
LINE3.DAT
LINE4.DAT
README.TXT    - this file

[1] Rudnyi, E.B. "Combined processing of experimental series with
    systematic errors - non-linear physico-chemical model with linear
    error model". Presented at InCINC'94, The First International
    Chemometrics InterNet Conference, 1994.
    See file EBRDOC.PS (PostScript format) or EBRDOC.TXT (plain ASCII
    format). If you cannot find them, contact me and I will tell you
    the site from which they are available by anonymous ftp.

LICENSE

This program is freeware (public domain). Feel free to use and
distribute it, provided no charge is made. I will be glad if you like
this program. Let me know if you find any bugs.
I would also appreciate your comments.

Disclaimer of warranty: This program is supplied as is. I disclaim
all warranties, express or implied, including, without limitation,
the warranties of merchantability and of fitness of this program for
any purpose. I assume no liability for damages, direct or
consequential, which may result from the use of this program.

CONTENTS

1. Introduction
2. Reference for the utility GENERATE
3. Reference for the utility LINEAR
   3.1. Command line
   3.2. Data file
   3.3. Configuration file
   3.4. Output file
4. Limits within the realization

LIST OF SYMBOLS

Each entry gives the plain-text name used in this file and, where it
differs, the typeset symbol it stands for. If not mentioned
otherwise, greek means a lowercase Greek letter.

Sumj = capital_greek_sigma_j    summation over j
sqrt                            square root
(+/-)                           plus-minus
yij = y_ij                      experimental observation
i = 1, ..., M                   index enumerating series
M                               number of all the series
j = 1, ..., Ni                  index enumerating points in the i-th
                                series
Ni = N_i                        number of points in the i-th series
SS                              general sum of squared deviations
L                               function, the maximum of which
                                coincides with the maximum of the
                                likelihood function
eij = greek_epsilon_ij          difference between the experimental
                                and calculated values
er_ij = greek_epsilon_r,ij      reproducibility error in the yij
ea_i = greek_epsilon_a,i        shift systematic error in the i-th
                                series
eb_i = greek_epsilon_b,i        tilt systematic error in the i-th
                                series
sr_i2 = greek_sigma_r,i^2       variance of reproducibility errors in
                                the i-th series
sa_i2 = greek_sigma_a,i^2       variance of shift systematic error in
                                the i-th series
sb_i2 = greek_sigma_b,i^2       variance of tilt systematic error in
                                the i-th series
ga_i = greek_gamma_a,i          = sa_i2/sr_i2
gb_i = greek_gamma_b,i          = sb_i2/sr_i2
Qi = Q_i                        set of series for which sr_i2 = sr_2
Qa = Q_a                        set of series for which ga_i = ga
Qb = Q_b                        set of series for which gb_i = gb
xij = x_ij
xi = x_i = (Sumj xij)/Ni        mean of x in the i-th series
Pi = P_i = Sumj (xij - xi)^2

1. INTRODUCTION

This software was written to accompany my paper presented at
InCINC'94 (see ref. 1 above).
Its purpose is to demonstrate the opportunities of the approach
described in the paper. However, this is not merely demonstration
software; it can be used for solving real problems.

It is best to start by reading the paper [1] (at least sections 1
and 2). Then you will be familiar with the notation and the main
idea.

The utility LINEAR is designed to estimate unknown parameters in
three models

    yij = a + b xij + eij
    yij = a + eij
    yij = b + eij

from experimental values of several experiments (note that different
experiments can be treated under different models). It is assumed
that the error behaviour can be described as

    eij = er_ij + ea_i + eb_i (xij - xi)

which I call the linear error model. Such an error model describes
not only the reproducibility errors er_ij but also the shift ea_i and
tilt eb_i systematic errors. Details can be found in ref. 1.

You need nothing but LINEAR.EXE to process your data. However, to
make things more fun, the utility GENERATE was also created. It uses
a random generator to obtain pseudo-experimental values of several
experiments with systematic errors. These values are written to a
file which can be processed by LINEAR. Thus, it allows you to check
the approach by applying it to a case where the answer is known
(figs. 1 and 2 in ref. 1 were created this way).

Enjoy and have fun!

2. REFERENCE FOR THE UTILITY GENERATE

The utility generates pseudo-experimental values of several
experiments containing systematic errors, with the structure of
one-way classification or linear regression. The values are scattered
according to the normal distribution.

The format of a command line is

    GENERATE conf_file [out_file sga sgb]

conf_file - a name of the file (default extension .CFG) in which the
    description of the task resides. This is a plain-ASCII file and
    it can be created and edited with any editor (save results as
    text only). The files ONEWAY.CFG and LINE.CFG give enough
    information about what such a file should look like.
out_file - a name of the file (extension .DAT is assigned, default
    name NUL) for the results of the generation. This is a
    plain-ASCII file, written in a format that can be processed by
    LINEAR.
sga - a value of sqrt(ga) (default zero)
sgb - a value of sqrt(gb) (default zero)

A value from the file GENERATE.RND is used as a seed for the random
generator. It changes after each call of GENERATE. If this file does
not exist, the BIOS timer is used to make the first seed value.

GENERATE also writes output to the screen: first the instances of
the systematic errors, then the pseudo-experimental values rounded to
integers.

3. REFERENCE FOR THE UTILITY LINEAR

The utility works in batch mode only - it takes experimental values
from a file and creates another file with the results.

3.1. Command line

The format of a command line is

    LINEAR [-fl1 -fl2 ...] data_file [conf_file]

The flags are as follows

-o    to set a name for the output file,
-h    to obtain short help,
-l    to obtain license information,
-p    output file contains final results only (default),
-p1   output file also contains results of main iterations,
-p2   output file contains results of all the iterations (make sure
      there is enough free space on the disk),
-s    automatic choice for displaying parameters (default),
-s1   output only free parameters,
-s2   output all the parameters.

By default the data file (data_file) has the extension .DAT. If no
name is given for the configuration file (conf_file is absent),
LINEAR tries to find a configuration file with the name of the data
file and the extension .SET. If the flag -o is absent, LINEAR takes
the name of the configuration file or the data file, changes the
extension to .LST and uses the result as the name of the output file.

3.2. Data file

A data file is written in free format. White space is recognised as
the word delimiter. The file consists of series separated by
semicolons.
Each series comprises the following fields separated by commas

    series_name, equation_name, variables_names, point1, point2, ...;

series_name - any identifier (a string of symbols without space,
    comma or semicolon).
equation_name - one of three words - line, justa, justb - denoting
    the following models, respectively
        yij = a + b xij + eij
        yij = a + eij
        yij = b + eij
variables_names - one or a few identifiers separated by space. Their
    number determines the total number of values in each point. The
    identifiers themselves are used only in the output.
point1 - one or a few values separated by space. The number of
    values to read is equal to the number of names of variables. For
    the equation line, the first two values of the point are used,
    the first as y, the second as x. For the equations justa and
    justb, the first value is taken from each point.

If there are several words in the field series_name or
equation_name, only the first is taken and all the others are
ignored. If there are more values in the field pointN than there are
names of variables, the extra values are ignored. If there are fewer
values than that, the missing ones are initialized to zero. These
rules permit you to write comments in the fields series_name,
equation_name and pointN.

While reading, the utility LINEAR sorts the series alphabetically.

You can easily exclude either a series or a point from processing. To
this end, the symbol * can be placed in the data file. To exclude a
series, put the symbol * before the name of the series; to exclude a
point, put the symbol * at the beginning of the field pointN. You can
also exclude a series in the configuration file. Although points and
series marked with * take no part in the calculations, they are
presented in the output file, and their deviates and variance
components are estimated as well.

The files ONEWAY?.DAT and LINE?.DAT are examples of data files to be
processed by LINEAR.
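The reading rules of this section can be illustrated with a short
Python sketch. It is not the code of LINEAR: the names strip_star,
parse_data and SAMPLE are hypothetical, the sample data are invented,
and two details are assumptions (a * may either touch the following
word or stand apart, and series are sorted by name without the *).

```python
def strip_star(words):
    """Split off a leading '*' marker; return (excluded, remaining_words)."""
    if words and words[0].startswith("*"):
        rest = words[0][1:]
        return True, ([rest] if rest else []) + words[1:]
    return False, words


def parse_data(text):
    """Parse data-file text into a list of series dictionaries (sketch)."""
    series = []
    for chunk in text.split(";"):            # series end with a semicolon
        fields = [field.split() for field in chunk.split(",")]
        if len(fields) < 3:
            continue                         # assumption: skip incomplete chunks
        excluded, name_words = strip_star(fields[0])
        if not (name_words and fields[1] and fields[2]):
            continue
        var_names = fields[2]
        points = []
        for words in fields[3:]:
            skipped, words = strip_star(words)
            # Only as many values as there are variable names are read;
            # extra words are treated as comments, missing values become zero.
            values = [float(w) for w in words[:len(var_names)]]
            values += [0.0] * (len(var_names) - len(values))
            points.append({"values": values, "excluded": skipped})
        series.append({
            "name": name_words[0],           # only the first word is the name
            "equation": fields[1][0],        # line, justa or justb
            "variables": var_names,
            "excluded": excluded,
            "points": points,
        })
    series.sort(key=lambda s: s["name"])     # series are sorted alphabetically
    return series


# Invented sample: one regression series with a comment and an excluded
# point, and one excluded one-way series.
SAMPLE = """
s2 second series, line, y x, 1.5 10 first point, * 2.5 20, 3.5;
*s1, justa, y, 4 5, 6;
"""

if __name__ == "__main__":
    for item in parse_data(SAMPLE):
        print(item["name"], item["equation"], item["points"])
```

In the sample, the point "3.5" of series s2 lacks an x value, so x is
read as zero, and the second value of each point of s1 is ignored
because only one variable name is given.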
Each output file of the utility GENERATE is also an example of a
data file.

3.3. Configuration file

The configuration file is optional. It is for experienced users; you
should have read ref. [1] before you start creating your own
configuration file. If the configuration file is absent, the utility
LINEAR makes one calculation with the default hypotheses - all the
series are assumed to have the same reproducibility variance
sr_i2 = sr_2 and the same quantities ga_i = ga and gb_i = gb. This is
a good start for many applications.

The configuration file is written in free format and contains the
descriptions of the series and the description START. Each
description must end with a semicolon. LINEAR reads a series
description, modifies the hypotheses accordingly and continues
reading with the next series. When the description START appears, the
utility starts the calculation. After that, reading of the
configuration file continues.

The format of the description START is

    START [[*] par_name init_val [, ...]];

* - this symbol, if present, means that the parameter is kept
    constant during the maximisation of the likelihood in this and
    subsequent calculations (until it is redefined).
par_name - a name of the parameter (a or b only).
init_val - a value to be used as the initial one. If absent, the
    value from the previous calculation is taken, or zero in the
    first calculation. All the symbols after the initial value up to
    the comma (semicolon) are ignored. You can put a comment there.

The format of a series description is

    [*] ser_name, hyp_fl sri, hyp_fl sga, hyp_fl sgb;

* - this symbol, if present, means that the series is ignored in
    this and subsequent calculations (until it is redefined).
ser_name - a name of the series. Again, all the words after the
    first are ignored up to the comma.
hyp_fl - a hypothesis flag - one of three characters #, % or *.
    The character # means that this variance component belongs to
    the set with the same variance (the default hypothesis), the
    character % means that this variance component is allowed to
    drift apart, i.e. it is estimated individually for this series,
    and the character * makes the variance component constant (it
    does not change in the maximization procedure).
sri - an initial value of the standard deviation of reproducibility.
sga - an initial value of sqrt(ga_i).
sgb - an initial value of sqrt(gb_i).

If a hypothesis flag is absent, LINEAR takes the one from the
previous calculation (in the first calculation - the default
hypothesis). If an initial value (sri, sga or sgb) is absent, LINEAR
takes the one from the previous calculation (in the first
calculation - the default value sri = 1, sga = 0 or sgb = 0).

3.4. Output file

If the flag -p1 or -p2 is put on the command line, the output file
starts with intermediate iteration results. Please read ref. 1 to
understand them.

The final results are divided into sections.

a) The convergence condition.

The number of big iterations.
difL - the relative difference between the values of L in the last
    two iterations.
difv - the maximum relative difference of variance components in the
    last two iterations.
L - the value of the function L.
SS - the value of the generalized sum of squares.

b) The parameter estimates and their dispersion matrix.

After (+/-) the standard deviation is given. Standard deviations and
the dispersion matrix are obtained for free parameters only.

c) The estimates of the variance components.

ID - the series name.
eq - the equation name.
sr - the standard deviation of reproducibility.
sga - a value of sqrt(ga_i).
sgb - a value of sqrt(gb_i).

Before the values of sr, sga and sgb, there are symbols displaying
the hypotheses used (see above). If a series took no part in the
processing, a symbol * is put before its name.

d) The analysis of deviates (see section 4.1 in ref. 1)

av_dev = sqrt{(Sumj eij^2)/Ni}
    the average total deviation.
sri = sqrt{(Sumj er_ij^2)_min /Ni}
    the average reproducibility deviation.
err_a = ea_i
    the shift relative to the fitting equation.
err_b = eb_i
    the tilt relative to the fitting equation at the mean value of x.
err1_b = eb_i sqrt(Pi/Ni).

e) Some series values.

Ntot - the total number of the points.
Ns - the number of points that took part in the processing.
xav - the mean value of x.
Ps = sqrt(Pi).

f) Experimental values by series and their deviates.

err_full - the total deviation from the fitting equation.
err_a - the deviation from the equation shifted by ea_i relative to
    the fitting equation.
err_ab - the deviation from the equation shifted and tilted relative
    to the fitting equation.

If a point took no part in the processing, a symbol * is put before
it.

4. LIMITS WITHIN THE REALIZATION

1) The models cannot be changed. If you like the approach and want
   to apply it to non-linear models, please contact me.
2) An upper limit is set for the values of ga and gb. ga cannot be
   more than 10000, gb cannot be more than 10000000.
3) The convergence conditions cannot be changed.
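
As an illustration of the linear error model from section 1 and of
the kind of values GENERATE produces (section 2), here is a minimal
Python sketch. It is not the code of GENERATE: the function name
simulate_series and its arguments are hypothetical. It assumes that
ea_i and eb_i are drawn once per series from normal distributions
with standard deviations sga*sr and sgb*sr, which follows from
ga = sa_i2/sr_i2 and gb = sb_i2/sr_i2.

```python
import random


def simulate_series(a, b, xs, sr, sga=0.0, sgb=0.0, rng=random):
    """Return pseudo-experimental y values for one series (sketch).

    Model:       yij = a + b xij + eij
    Error model: eij = er_ij + ea_i + eb_i (xij - xi)
    """
    x_mean = sum(xs) / len(xs)        # xi, the mean of x in the series
    ea = rng.gauss(0.0, sga * sr)     # shift error, one draw per series
    eb = rng.gauss(0.0, sgb * sr)     # tilt error, one draw per series
    ys = []
    for x in xs:
        er = rng.gauss(0.0, sr)       # reproducibility error, one per point
        ys.append(a + b * x + er + ea + eb * (x - x_mean))
    return ys


if __name__ == "__main__":
    rng = random.Random(1994)         # fixed seed for a repeatable run
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    for i in range(3):                # three series sharing the same a and b
        ys = simulate_series(10.0, 2.0, xs, sr=1.0, sga=2.0, sgb=0.5, rng=rng)
        print("series", i + 1, [round(y, 2) for y in ys])
```

With sga = sgb = 0 only the reproducibility scatter remains; with
sr = 0 the values fall exactly on the line a + b x.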