MSCalcParamLDA - Calculate parameters for linear discriminant analysis of cancer/normal samples using MS data.

The Softberry SMS program package predicts patient disease outcome (for example cancer/normal) based on MS data of the samples. To perform this analysis, the peak intensities of the training set of MS data must first be presented in table format. The table can be obtained using the CreateTable program. An example of the resulting table is shown in Figure 1. Each row contains the prediction parameters for one sample. The first row of the table should contain the parameter names: sample ID, time of sampling in months, patient ID, sample case (0 for control, 1 for cancer or disease), and the level of an additional marker (for example logCA125). The remaining columns contain the log of the MS peak intensity for each sample. These columns are named MZ_NNNN, where NNNN is the average m/z value of the peak group detected by the MSPeakAlign program (see help for the CreateTable program).

The CalculateLDAParameters program searches for the linear combination of features (the level of the additional marker and the peak intensities) that characterizes or separates the two classes of samples. The sample class is specified in the 4th column of the input table (‘case’). For samples from patients with detected cancer it should be set to 1 (class 1); for control samples it should be set to 0 (class 0).

sample  time  patient ID  case  logCA125  MZ_4055  MZ_4211  MZ_4094  MZ_4283
1       6     123         1     2.5494    4.9672   5.6334   3.4828   4.9764
2       3     123         1     2.5014    4.1851   5.0689   3.5898   5.7517
3       1     123         1     2.2513    4.5871   5.4129   3.9384   5.687
4       3     345         0     2.3609    4.3681   6.2476   4.3789   3.4294

Figure 1. Example of the output of the CreateTable program. The 1st column of the table is the sample ID, the 2nd is the time of sampling in months, the 3rd is the patient ID, and the 4th is the case (0 for control, 1 for cancer or disease). logCA125 is the level of the additional marker; MZ_NNNN is the log of the intensity of the peak located at m/z = NNNN.
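The table layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the table is whitespace-separated, that the first four columns are always sample ID, time, patient ID and case, and that every remaining column is a numeric feature. The function name parse_table and the dictionary keys are hypothetical and are not part of the SMS package.

```python
from typing import Dict, List

# The example table from Figure 1, copied verbatim.
FIG1 = """\
sample time patient ID case logCA125 MZ_4055 MZ_4211 MZ_4094 MZ_4283
1 6 123 1 2.5494 4.9672 5.6334 3.4828 4.9764
2 3 123 1 2.5014 4.1851 5.0689 3.5898 5.7517
3 1 123 1 2.2513 4.5871 5.4129 3.9384 5.687
4 3 345 0 2.3609 4.3681 6.2476 4.3789 3.4294
"""


def parse_table(text: str) -> List[Dict[str, object]]:
    """Parse a CreateTable-style table into one dictionary per sample."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        # The header name "patient ID" contains a space, so the header has
        # more tokens than a data row. The feature names (marker followed
        # by the MZ_NNNN columns) are therefore taken from the END of the
        # header, and the four fixed columns are read by position.
        n_features = len(fields) - 4
        names = header[-n_features:]
        rows.append({
            "sample": fields[0],
            "time": float(fields[1]),
            "patient": fields[2],
            "case": int(fields[3]),
            "features": dict(zip(names, map(float, fields[4:]))),
        })
    return rows


rows = parse_table(FIG1)
```

Each entry of rows then holds the class label under "case" and the marker and peak values under "features", keyed by column name.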

The user can limit the time of sampling (in months before diagnosis) for cancer patients included in the cancer training set. This limit is set by the ‘Maximal time value of sampling’ parameter. For example, if this parameter is set to 6, only samples taken from cancer patients no more than 6 months prior to cancer diagnosis will be included and considered as ‘cancer’ samples.
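The effect of this limit can be sketched as a simple filter. The sketch assumes that the limit is inclusive and that control samples are always kept, since the description applies the limit only to cancer patients. The function name select_training and the record layout are hypothetical; the sample, time and case values are taken from Figure 1.

```python
def select_training(rows, max_time):
    """Keep every control sample (case 0) and only those cancer samples
    (case 1) whose sampling time does not exceed max_time months."""
    return [r for r in rows if r["case"] == 0 or r["time"] <= max_time]


# Sample ID, time of sampling and case from the rows of Figure 1.
rows = [
    {"sample": "1", "time": 6, "case": 1},
    {"sample": "2", "time": 3, "case": 1},
    {"sample": "3", "time": 1, "case": 1},
    {"sample": "4", "time": 3, "case": 0},
]

kept = select_training(rows, max_time=3)
kept_ids = [r["sample"] for r in kept]
```

With a limit of 3 months, sample 1 (taken 6 months before diagnosis) is dropped from the cancer training set, while samples 2, 3 and the control sample 4 remain.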

The program uses two features to build the classifier: the logarithm of the CA125 tumor marker level (5th column of the input table) and the logarithm of the MS signal intensity for one of the peak groups. The program combines the additional marker level with each of the MS peaks in turn, calculates the linear discriminant function (LDF) for each combination, and selects the peak group that gives the best performance of the LDA classifier. The program output is the LDF for the best combination of additional marker and MS peak intensity data. It can then be used by the MSPredictLDA program for sample classification. This output is produced when the ‘Output data format’ parameter is set to ‘LDF’. When this parameter is set to ‘STAT’, the program outputs the classification performance for each combination of the additional marker level and MS peak data.
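The search described above can be illustrated with a small pure-Python sketch. It is not the program's actual implementation: it assumes a textbook two-class Fisher discriminant with equal class priors, and it scores each combination by accuracy on the training set, whereas the performance measure used by the program is not stated here. The names fit_ldf, accuracy and best_peak, the peak names MZ_A and MZ_B, and all numeric values are made up for illustration.

```python
def mean2(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_ldf(class0, class1):
    """Return (w0, w1, c) such that w0*x + w1*y - c > 0 predicts class 1."""
    m0, m1 = mean2(class0), mean2(class1)
    # Pooled within-class scatter matrix [[a, b], [b, d]].
    a = b = d = 0.0
    for pts, m in ((class0, m0), (class1, m1)):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            a += dx * dx
            b += dx * dy
            d += dy * dy
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("singular scatter matrix")
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    ex, ey = m1[0] - m0[0], m1[1] - m0[1]
    w0 = (d * ex - b * ey) / det
    w1 = (-b * ex + a * ey) / det
    # Threshold at the midpoint between the class means (equal priors).
    c = w0 * (m0[0] + m1[0]) / 2 + w1 * (m0[1] + m1[1]) / 2
    return w0, w1, c


def accuracy(ldf, class0, class1):
    """Fraction of training samples that the LDF assigns to the right class."""
    w0, w1, c = ldf
    hits = sum(w0 * x + w1 * y - c <= 0 for x, y in class0)
    hits += sum(w0 * x + w1 * y - c > 0 for x, y in class1)
    return hits / (len(class0) + len(class1))


def best_peak(marker, peaks, labels):
    """Pair the marker with each peak in turn and keep the best LDF."""
    best = None
    for name, values in peaks.items():
        pts = list(zip(marker, values))
        c0 = [p for p, lab in zip(pts, labels) if lab == 0]
        c1 = [p for p, lab in zip(pts, labels) if lab == 1]
        ldf = fit_ldf(c0, c1)
        acc = accuracy(ldf, c0, c1)
        if best is None or acc > best[1]:
            best = (name, acc, ldf)
    return best


# Made-up toy data: six samples, three controls followed by three cancers.
labels = [0, 0, 0, 1, 1, 1]
marker = [2.0, 2.4, 2.2, 2.3, 2.1, 2.5]
peaks = {
    "MZ_B": [4.5, 5.0, 4.0, 4.6, 4.1, 5.1],  # overlaps between classes
    "MZ_A": [4.0, 4.2, 4.1, 5.0, 5.1, 5.2],  # separates the classes
}

name, acc, ldf = best_peak(marker, peaks, labels)
```

On this toy data the combination of the marker with MZ_A classifies all six training samples correctly and is selected. The triple stored in ldf plays the role of the ‘LDF’ output, and the per-peak accuracies computed inside the loop correspond loosely to the ‘STAT’ output.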

Input: Sample parameters in a special table format produced by the CreateTable program.
Output: LDF parameters that can be used to classify a sample based on its MS data.

Parameter(s):
Data file - Text file containing the output of the MSCreateTable program.
Peak group data - Text file containing the output of the MSPeakAlign program for the same sample set as in the table data.
Number of best peaks to consider - Specifies the number of peaks (those most represented in the sample set) to evaluate in the LDA calculation.
Output data format - Specifies the type of output. ‘LDF’ outputs the LDF parameters for the best combination; ‘STAT’ outputs the classification performance for each combination of the additional marker level and MS peak data.
Maximal time value of sampling - Defines the maximal time of sampling (in months before diagnosis) for cancer patients, limiting the samples included in the cancer training set. The time value for each sample is given in the "time" column of the table data.