SelByExpr

Gene selection by query (logical expression).

Data specification

The expression data for a set of genes is represented as a table. Each row usually corresponds to a gene and holds the expression measurements for that gene; each column usually corresponds to a sample, tissue, or experiment. The table may also include other gene-related information, for example gene names, classifiers, etc. For this reason the table columns are called 'fields' in the general case. A field can be of one of four basic types:
IVALUE signed integer value;
FVALUE floating point value;
WORD text without spaces inside (single word);
STRING text with spaces inside allowed.

Fields are completely defined by their basic types and names.

SelTag Input file basic format

The basic input file format is as follows:


; May contain comment starting from the semicolon in any line of the file
NAME<tab>WORD
GENEID<tab>IVALUE
TISSUECANCER0<tab>FVALUE
TISSUECANCER1<tab>FVALUE
TISSUENORMAL0<tab>FVALUE
TISSUENORMAL1<tab>FVALUE
TISSUENORMAL2<tab>FVALUE
#GROUP<tab>Cancer tissues
TISSUECANCER0
TISSUECANCER1
#ENDGROUP
#GROUP<tab>Arbitrary group
TISSUECANCER1
TISSUECANCER2
TISSUENORMAL0
TISSUENORMAL1
#ENDGROUP
END
DATA
GENE04675<tab>402<tab>6.00<tab>5.60<tab>5.97<tab>6.00<tab>6.00
GENE46890<tab>794<tab>2.77<tab>3.22<tab>5.65<tab>5.68<tab>5.68
GENE23794<tab>404<tab>5.97<tab>5.97<tab>6.00<tab>5.60<tab>5.97

In this example, <tab> denotes the tab character.

The first lines (up to the "DATA" line) contain the data format description. In this part of the file each line describes one field: the field name and the field basic type.

The lines after the "DATA" line contain the data for each gene. Each line corresponds to a single gene record. Field values are separated by the tab character. Two consecutive tab characters are interpreted as missing data.

The SelTag program assumes that the expression data in the file are normalized and that the expression levels of genes are comparable across experiments.
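The format described above can be read with a short script. The following is a minimal sketch in Python, not the parser used by SelTag itself. The function name parse_seltag is hypothetical, and two choices are assumptions rather than documented behaviour: comments are stripped only in the header part (so that a semicolon inside a STRING value is not cut off), and a data line that begins with a semicolon is skipped as a comment.

```python
# Converters for the four basic field types described above.
BASIC_TYPES = {"IVALUE": int, "FVALUE": float, "WORD": str, "STRING": str}


def parse_seltag(text):
    """Parse SelTag-style text into (fields, groups, rows).

    Hypothetical helper written for illustration only.
    """
    fields = []   # (name, basic type) pairs in column order
    groups = {}   # group title -> list of field names
    rows = []     # one list of converted values per gene
    current_group = None
    in_data = False
    for raw in text.splitlines():
        if not in_data:
            # Assumption: a semicolon starts a comment in the header part.
            line = raw.split(";", 1)[0].strip()
            if not line or line == "END":
                continue
            if line == "DATA":
                in_data = True
            elif line.startswith("#GROUP"):
                current_group = line.split("\t", 1)[1]
                groups[current_group] = []
            elif line == "#ENDGROUP":
                current_group = None
            elif current_group is not None:
                groups[current_group].append(line)
            else:
                name, basic_type = line.split("\t")
                fields.append((name, basic_type))
        elif raw.strip() and not raw.startswith(";"):
            cells = raw.split("\t")
            row = []
            for (name, basic_type), cell in zip(fields, cells):
                # An empty cell (two consecutive tabs) means missing data.
                row.append(None if cell == "" else BASIC_TYPES[basic_type](cell))
            rows.append(row)
    return fields, groups, rows
```

Missing values are returned as None, so later processing can distinguish them from a measured value of zero.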

Selection files

The MolQuest version of the SelTag program can also operate with another type of file, namely selection files. These files contain information about selected genes or samples from a large data file in the SelTag format. A selection file contains: the name of the data file from which the selection was obtained; the type of the selected data (genes or samples); and the list of selected objects (their indices in the large data file). Selection files are in XML format. Two examples are given below.

Selection for some genes.


<?xml version="1.0" encoding="ISO-8859-5"?>
<SELECTION>
	<HEADER name="cc_Selection5">
		<DATA source="c:/data/cc.txt"/>
		<COMMENT><![CDATA["$F1 == "GEN14263" | $F12 >= 300"]]></COMMENT>
	</HEADER>
	<ELEMENTS type="GENES" count="9">
	<![CDATA[0;1;2;10;14;15;17;26;30]]>
	</ELEMENTS>
</SELECTION>

Selection for some fields (samples).


<?xml version="1.0" encoding="ISO-8859-5"?>
<SELECTION>
	<HEADER name="notterman2001_set1">
		<DATA source="c:/data/notterman2001_set1.txt"/>
		<COMMENT><![CDATA["From cc.txt data file."]]></COMMENT>
	</HEADER>
	<ELEMENTS type="FIELDS" count="10">
	  <![CDATA[0;1;2;3;5;6;7;18;19;30]]>
	</ELEMENTS>
</SELECTION>

Selection files may be selected during SelTag execution and used by SelTag for calculation and/or visualization. Note that each selection file is linked to its large data file by the file name, so a selection cannot be applied to another data file.
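A selection file of the kind shown above can be read with the standard Python XML library. The following is a minimal sketch; the function name read_selection is hypothetical, and the check that the count attribute equals the number of listed indices is an assumption drawn from the two examples, not a documented rule.

```python
import xml.etree.ElementTree as ET


def read_selection(xml_text):
    """Return (name, source, kind, indices) from a selection document.

    Hypothetical helper written for illustration only.
    """
    root = ET.fromstring(xml_text)
    header = root.find("HEADER")
    elements = root.find("ELEMENTS")
    # The CDATA section is exposed as ordinary element text.
    indices = [int(tok) for tok in elements.text.strip().split(";") if tok]
    declared = int(elements.get("count"))
    if declared != len(indices):
        raise ValueError(f"count={declared} but {len(indices)} indices listed")
    source = header.find("DATA").get("source")
    return header.get("name"), source, elements.get("type"), indices
```

Because the selection is tied to its data file by name, a caller would normally compare the returned source with the data file it has loaded before applying the indices.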

Expression syntax

The logical expression contains field (experiment) indices denoted as $FX (where X is the field index) and relationships between the values of the fields. For example, the string
$F24 < 100
means that genes should be selected whose expression level in field 24 is lower than 100. The following operations can be used to compare field values:

==    equal to
<     less than
<=    less than or equal to
>     greater than
>=    greater than or equal to
!=    not equal to

Complex queries can be built from simple queries using the logical operations AND (&), OR (|), NOT (!) and parentheses. For example, the query
($F10 < 100) & ($F23 >= 0)
returns all genes whose expression level in experiment #10 is lower than 100 and whose expression level in experiment #23 is greater than or equal to zero.

Some additional operations can also be used:

+, -       sum and difference
*, /       multiplication and division
ABS(x)     absolute value of x
x^y        x raised to the power y
SQRT(x)    square root of x

For example, the query
ABS($F10-$F11) < 100
selects genes for which the absolute difference between the expression levels in experiments 10 and 11 is lower than 100. Arithmetic operations are allowed on numerical fields only.

Text comparison is also possible if the compared field is of the STRING or WORD type. For example, to select the gene with the name "Gene2356" in the field $F1, use the query
$F1=="Gene2356"
Note that it is better to put textual values in quotation marks; this allows processing of strings that contain spaces and special characters (the arithmetic or logical operators described above).

Genes can also be selected by their numbers in the data file. For example, the query
$N <= 400
returns all genes with indices from 1 to 400.
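The query syntax described so far ($FX fields, $N, comparisons, logical and arithmetic operations, quoted text) can be illustrated by translating a query into a Python expression. The following is a minimal sketch and not the evaluator used by the program. The names to_python and matches are hypothetical, operator precedence is that of Python (which may differ from the program, so parentheses are advisable), and eval should only be used on trusted queries.

```python
import math
import re


def to_python(expr):
    """Translate a query into a Python expression string (illustrative only)."""
    out = []
    # Split on quoted strings so that their contents are never rewritten.
    for part in re.split(r'("[^"]*")', expr):
        if part.startswith('"'):
            out.append(part)
            continue
        part = re.sub(r"\$F(\d+)", r"row[\1]", part)   # $F10 -> row[10]
        part = part.replace("$N", "n")                  # gene number
        part = part.replace("^", "**")                  # power
        part = part.replace("&", " and ").replace("|", " or ")
        part = re.sub(r"!(?!=)", " not ", part)         # NOT, but keep !=
        part = re.sub(r"\bABS\b", "abs", part)
        part = re.sub(r"\bSQRT\b", "sqrt", part)
        out.append(part)
    return "".join(out)


def matches(expr, row, n=None):
    """Evaluate a query for one gene.

    row maps a field index to its value; n is the gene number.
    Assumption: precedence follows Python rules, not necessarily the program's.
    """
    env = {"__builtins__": {}, "abs": abs, "sqrt": math.sqrt, "row": row, "n": n}
    return bool(eval(to_python(expr), env))
```

For instance, with row = {10: 50.0, 23: 0.0} the call matches('($F10 < 100) & ($F23 >= 0)', row) evaluates the translated expression (row[10] < 100) and (row[23] >= 0).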

Genes can be selected by their expression level in a group of fields (experiments). For example, to select genes with an expression level greater than 100 in any of the experiments from group 1, the following query can be used:
$G1 > 100

A condition level can be applied to a group selection: the user can specify the number of fields from the group that must satisfy the condition. To select genes for which the expression level is greater than 100 in at least 10 experiments of group 1, the previous query can be modified:
$G1:10 > 100

The condition level can also be specified as a percentage of the group size:
$G1:50% > 100
This query selects genes for which at least 50% of the experiments in group 1 have an expression level greater than 100.
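The three forms of the group condition ($G1, $G1:10 and $G1:50%) differ only in how many fields of the group must satisfy the comparison. The following is a minimal sketch of that counting logic. The function name group_matches is hypothetical, and several details are assumptions not stated in this document: missing values never count as satisfying the condition, the percentage is taken of the full group size and rounded up, and groups are addressed by number.

```python
import operator
import re

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# $G<group>[:<level>[%]] <op> <number>
GROUP_QUERY = re.compile(
    r"^\$G(\d+)(?::(\d+)(%?))?\s*(==|!=|<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)$"
)


def group_matches(query, groups, row):
    """Evaluate a single group condition for one gene (illustrative only).

    groups maps a group number to its list of field names;
    row maps a field name to a value (None for missing data).
    """
    m = GROUP_QUERY.match(query.strip())
    if m is None:
        raise ValueError(f"not a group condition: {query!r}")
    group_no, level, percent, op, threshold = m.groups()
    members = groups[int(group_no)]
    # Assumption: missing values are ignored when counting hits.
    hits = sum(1 for f in members
               if row[f] is not None and OPS[op](row[f], float(threshold)))
    if level is None:
        needed = 1                                    # "any" experiment
    elif percent:
        # Assumption: percentage of group size, rounded up.
        needed = (len(members) * int(level) + 99) // 100
    else:
        needed = int(level)
    return hits >= needed
```

With a group of four fields, two of which exceed 100, the query $G1:50% > 100 is satisfied while $G1:75% > 100 is not.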

A score can be assigned to each gene when the query is evaluated. For example, if the query is $F3 > 100 and two genes satisfy this condition with $F3 expression levels of 105 and 800, the gene with expression level 800 receives the higher score.

Example of the output data


List of selected genes and their scores [12 total]:
No	Index	Name	Score
1	1	GEN30482	0.5167
2	2	GEN03437	0.7767
3	3	GEN03687	0.9467
4	4	GEN24649	0.9600
5	5	GEN09108	0.2333
6	6	GEN09514	0.9933
7	7	GEN24589	0.7067
8	8	GEN02291	1.0233
9	9	GEN24534	0.9300
10	10	GEN14489	0.8000
11	11	GEN33519	0.8000
12	13	GEN35755	0.8633

The first line is the header. It contains the number of selected genes in square brackets. The second line contains the tab-separated column names: No - number of the gene in the list; Index - index of the gene in the large data file; Name - gene name (by default, to determine the name field the program searches the list of field names for a field called 'Name'); Score - query score (the better the gene fits the query expression, the higher the score). The following lines list the tab-separated data for the selected genes.
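The output layout described above can be read back into a program for further processing. The following is a minimal sketch; the function name parse_output is hypothetical, and it assumes exactly the four columns shown in the example.

```python
import re


def parse_output(text):
    """Parse a selection result listing into (total, records).

    Hypothetical helper; assumes the columns No, Index, Name, Score.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line, e.g. "List of selected genes and their scores [12 total]:"
    total = int(re.search(r"\[(\d+) total\]", lines[0]).group(1))
    records = []
    for ln in lines[2:]:                    # lines[1] holds the column names
        no, index, name, score = ln.split("\t")
        records.append({"No": int(no), "Index": int(index),
                        "Name": name, "Score": float(score)})
    return total, records
```

Sorting the returned records by the Score key in descending order would list the genes that best fit the query first.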