Benchmark

originally extracted from HampMap chr1 data , the initial graph has been filtered with r2 threshold = 0.5 and connected component have been extract with FESTA** (using DFS) Tagsnp selection occurs in genetics and polymorphism analysis. Single nucleotide polymorphisms, or SNPs, are DNA sequence variations that occur when a single nucleotide (A,T,C,or G) in the genome sequence of an individual is altered. For example a SNP might change the DNA sequence "AAGGCTAA" to "A{T}GGCTAA". For a variation to be considered a SNP, it must occur in at least 1% of the population. There are several millions SNPs in the 3 billions nucleotides long human genome, explaining up to 90% of all human genetic variation. SNPs may explain a portion of the heritable risk of common diseases and can affect respond to pathogens, chemicals, drugs, vaccines, and other agents. The TagSNP problem is a sort of lossy compression problem which consists in selecting a small subset of SNPs such that the selected SNPs, called tag SNPs, will capture most of the genetic information. The goal is to capture a maximally informative subset of SNPs to make screening of large populations feasible. The correlation measure r^2 between a pair of SNPs can be computed on a first small population. A tag SNP is considered as representative of another SNP if the two SNPs are sufficiently linked. The simplest TagSNP problem is to select a minimum number of SNPs (primary criteria) such that all SNPs are represented. This is captured by the fact that the r^2 measure between the two SNPs is larger than a threshold theta (often set to theta = 0.8~cite{Carlson04}). We therefore consider a graph where each vertex is a SNP and where edges are labelled by the r^2 measure between pairs of nodes. Edges are filtered if their label is lower than the threshold heta. The btained may have different connected components. The TagSNP problem then reduces to a set covering problem (NP-hard) on these components. In practice the number of optimal solutions may still be extremely large and secondary criteria are considered by state-of-the-art tools such as FESTA** (relying on two incomplete algorithms). Between tag SNPs, a low r^2 is preferred, to maximize tag SNP dispersion. Between a non tag SNP and its representative tag SNP, a high r^2 is preferred to maximize the representativity. Model: For a given connected graph G=(V,E), we build a binary WCSP with integer costs capturing the TagSNP problem with the above secondary criteria. For each SNP i, two variables i_s and i_r are used. i_s is a boolean variable that indicates if the SNP is selected as a tag SNP or not. The domain of i_r is the set of neighbors of i together with i itself. It indicates the representative tag SNP which covers i. For a SNP i, hard binary cost functions (with 0 or infinite costs) enforces the fact that i_s Rightarrow (i_r = i). Similar hard cost functions enforce (i_r = j) Rightarrow j_s with neighbor SNPs j in G. A unary cost function on every variable i_s generates an elementary cost U if the variable is {true}. The resulting weighted CSP captures the set covering problem defined by TagSNP. To account for the representativity, a unary cost function is associated with every variable i_r that generates cost when i_r eq i. In this case, the cost generated is int (100.frac{1-r^2_{i,i_r}}{1-theta}). For dispersion between SNPs i and j, a binary cost function between the boolean i_s and j_s is created which generates a cost of lfloor 100.frac{r^2_{ij}- theta}{1-theta} floor when i_s=j_s = {true} The resulting WCSP captures both dispersion and representativity. In order to keep these criteria as secondary, we just use a large enough value for U (the elementary cost used for tag selection). This problem is similar to a set covering problem with additional binary costs. Such secondary criteria are ignored by~cite{ACNZ+08}. Here, c2d yields a compact compiled representation of the set of solutions of the pure set covering problem, but the number of solutions is so huge (typically more than billions) that applying the second criteria on solutions generated by {c2d} would be too expensive. A direct compilation of the criteria in the d-DNNF does not seem straightforward and would probably necessitate a Max-SAT formulation as the authors acknowledge in their conclusion. The instances considered have been derived from human chromosome 1 data provided by courtesy of Steve Qin~cite{Qin05}. Two values, theta= 0.8 and 0.5 have been tried. For theta=0.8, a usual value in tag SNP selection, 43,251 connected components are identified among which we selected the 82 largest ones. These problems, with 33 to 464 SNPs, define WCSP with domain sizes ranging from 15 to 224 and are relatively easy. Solving to optimality selects 359 tag SNPs in 2h37' instead of 487 in 3' for FESTA-greedy (21% improvement) or 370 in 39h17' for FESTA-hybrid (3% improvement, 15-fold speedup). To get more challenging problems, we lowered theta to 0.5. This defined 19,750 connected components, among which 516 are not solved to optimality by FESTA. We selected the 25 largest one. These problems, with 171 to 777 SNPs have graph densities between 6% and 37%. They define WCSP with max domain size ranging from 30 to 266 and include between 8000 to 250,000 cost functions. The decomposability of these problems, estimated by the ratio between the treewidth of the MCS tree-decomposition and the number of variables varies from 14% to 23%. data used in paper Sanchez & AL ijcaii 09 **Z. S. Qin, S. Gopalakrishnan, and G. R. Abecasis. An . Efficient Genome Wide Tagging efficient comprehensive search algorithm for tagsnp selection using linkage disequilibrium criteria. Bioinformatics, 22(2) :220â€“225, 2006. benchmark organization: ./*.wcsp ==> 25 instances in toulbar2 format *.ub ==> optimal solution obtained by PLNE using cyplex 11.0 txt/*.txt ==> original data beware the connected component are not filtred ( edges with r2 >= 0.5 need to be remove before instances building) cplex/*.cplex ==> cplex format available for PLNE experimentation. question : allouche@toulouse.inra.fr

Filename	#var	Max dom	#const	ub	Arity max	Connectivity	Max Degree	Min Degree
9319.wcsp	562	58	14811	6477229	2	0.11	114	2
6389.wcsp	756	107	32004	14298409	2	0.19	213	4
14420.wcsp	626	70	13278	7360706	2	0.09	139	3
10442.wcsp	908	76	28554	21591913	2	0.09	151	2
6858.wcsp	992	260	103056	20162249	2	0.35	518	2
17034.wcsp	1142	123	47967	38318224	2	0.10	242	2
14226.wcsp	1058	95	36801	25665437	2	0.08	189	2
13931.wcsp	820	145	38775	21106731	2	0.23	289	2
11739.wcsp	506	50	11049	6130539	2	0.14	99	2
14359.wcsp	670	113	26739	11936390	2	0.19	225	2
16547.wcsp	624	77	13815	7254976	2	0.09	152	3
15757.wcsp	342	47	5091	2278611	2	0.10	92	2
9150.wcsp	1352	121	44217	43301891	2	0.06	241	2
4449.wcsp	464	64	12540	5094256	2	0.17	126	2
3792.wcsp	528	59	12084	6359805	2	0.13	117	2
6835.wcsp	496	90	18003	4571108	2	0.23	175	3
14007.wcsp	1554	100	54753	50290563	2	0.07	195	2
8956.wcsp	486	106	20832	6660308	2	0.23	209	3
13306.wcsp	594	93	17580	7737538	2	0.16	185	3
16421.wcsp	404	75	12138	3436849	2	0.20	149	4
16706.wcsp	438	30	6321	2632310	2	0.07	55	5
13964.wcsp	1040	266	149361	19542957	2	0.50	531	18
12976.wcsp	920	128	44823	21604644	2	0.27	255	2
4804.wcsp	716	123	40314	9089101	2	0.27	245	5
5264.wcsp	582	70	11400	6625331	2	0.10	138	2

Field	Predicat	Value

COST FUNCTION