Home Benchmark
Benchmark_cart
Your selection Format : wcsp
#Instances : 0
Format filter applied : wcsp
#Instance :
Benchmark



login



online
We have 103 guests online

Bench Home/real/tagsnp/Tagsnp_r0.5


Notice: Undefined property: BenchmarkViewBrowse::$hasAnonymousSelections in /var/www/html/components/com_benchmark/views/browse/tmpl/anonymous.php on line 21

Operations on selected instances
Add to a new selection





Select a selection

Benchmark description Extend/Reduce
tagsnp selection is set covering problem with additional criter.
in the current benchmark,the 25 biggest instances are proposed as benchmark.
originally extracted from HampMap chr1 data , the initial graph has been filtered with r2 threshold = 0.5 and connected component have been extract with FESTA** (using DFS) Tagsnp selection occurs in genetics and polymorphism analysis. Single nucleotide polymorphisms, or SNPs, are DNA sequence variations that occur when a single nucleotide (A,T,C,or G) in the genome sequence of an individual is altered. For example a SNP might change the DNA sequence "AAGGCTAA" to "A{T}GGCTAA". For a variation to be considered a SNP, it must occur in at least 1% of the population. There are several millions SNPs in the 3 billions nucleotides long human genome, explaining up to 90% of all human genetic variation. SNPs may explain a portion of the heritable risk of common diseases and can affect respond to pathogens, chemicals, drugs, vaccines, and other agents. The TagSNP problem is a sort of lossy compression problem which consists in selecting a small subset of SNPs such that the selected SNPs, called tag SNPs, will capture most of the genetic information. The goal is to capture a maximally informative subset of SNPs to make screening of large populations feasible. The correlation measure r^2 between a pair of SNPs can be computed on a first small population. A tag SNP is considered as representative of another SNP if the two SNPs are sufficiently linked. The simplest TagSNP problem is to select a minimum number of SNPs (primary criteria) such that all SNPs are represented. This is captured by the fact that the r^2 measure between the two SNPs is larger than a threshold theta (often set to theta = 0.8~cite{Carlson04}). We therefore consider a graph where each vertex is a SNP and where edges are labelled by the r^2 measure between pairs of nodes. Edges are filtered if their label is lower than the threshold heta. The btained may have different connected components. The TagSNP problem then reduces to a set covering problem (NP-hard) on these components. In practice the number of optimal solutions may still be extremely large and secondary criteria are considered by state-of-the-art tools such as FESTA** (relying on two incomplete algorithms). Between tag SNPs, a low r^2 is preferred, to maximize tag SNP dispersion. Between a non tag SNP and its representative tag SNP, a high r^2 is preferred to maximize the representativity. Model: For a given connected graph G=(V,E), we build a binary WCSP with integer costs capturing the TagSNP problem with the above secondary criteria. For each SNP i, two variables i_s and i_r are used. i_s is a boolean variable that indicates if the SNP is selected as a tag SNP or not. The domain of i_r is the set of neighbors of i together with i itself. It indicates the representative tag SNP which covers i. For a SNP i, hard binary cost functions (with 0 or infinite costs) enforces the fact that i_s Rightarrow (i_r = i). Similar hard cost functions enforce (i_r = j) Rightarrow j_s with neighbor SNPs j in G. A unary cost function on every variable i_s generates an elementary cost U if the variable is {true}. The resulting weighted CSP captures the set covering problem defined by TagSNP. To account for the representativity, a unary cost function is associated with every variable i_r that generates cost when i_r eq i. In this case, the cost generated is int (100.frac{1-r^2_{i,i_r}}{1-theta}). For dispersion between SNPs i and j, a binary cost function between the boolean i_s and j_s is created which generates a cost of lfloor 100.frac{r^2_{ij}- theta}{1-theta} floor when i_s=j_s = {true} The resulting WCSP captures both dispersion and representativity. In order to keep these criteria as secondary, we just use a large enough value for U (the elementary cost used for tag selection). This problem is similar to a set covering problem with additional binary costs. Such secondary criteria are ignored by~cite{ACNZ+08}. Here, c2d yields a compact compiled representation of the set of solutions of the pure set covering problem, but the number of solutions is so huge (typically more than billions) that applying the second criteria on solutions generated by {c2d} would be too expensive. A direct compilation of the criteria in the d-DNNF does not seem straightforward and would probably necessitate a Max-SAT formulation as the authors acknowledge in their conclusion. The instances considered have been derived from human chromosome 1 data provided by courtesy of Steve Qin~cite{Qin05}. Two values, theta= 0.8 and 0.5 have been tried. For theta=0.8, a usual value in tag SNP selection, 43,251 connected components are identified among which we selected the 82 largest ones. These problems, with 33 to 464 SNPs, define WCSP with domain sizes ranging from 15 to 224 and are relatively easy. Solving to optimality selects 359 tag SNPs in 2h37' instead of 487 in 3' for FESTA-greedy (21% improvement) or 370 in 39h17' for FESTA-hybrid (3% improvement, 15-fold speedup). To get more challenging problems, we lowered theta to 0.5. This defined 19,750 connected components, among which 516 are not solved to optimality by FESTA. We selected the 25 largest one. These problems, with 171 to 777 SNPs have graph densities between 6% and 37%. They define WCSP with max domain size ranging from 30 to 266 and include between 8000 to 250,000 cost functions. The decomposability of these problems, estimated by the ratio between the treewidth of the MCS tree-decomposition and the number of variables varies from 14% to 23%. data used in paper Sanchez & AL ijcaii 09 **Z. S. Qin, S. Gopalakrishnan, and G. R. Abecasis. An . Efficient Genome Wide Tagging efficient comprehensive search algorithm for tagsnp selection using linkage disequilibrium criteria. Bioinformatics, 22(2) :220–225, 2006. benchmark organization: ./*.wcsp ==> 25 instances in toulbar2 format *.ub ==> optimal solution obtained by PLNE using cyplex 11.0 txt/*.txt ==> original data beware the connected component are not filtred ( edges with r2 >= 0.5 need to be remove before instances building) cplex/*.cplex ==> cplex format available for PLNE experimentation. question : allouche@toulouse.inra.fr
Search
Field Predicat Value

Filename #var Max dom #const ub Arity max Connectivity Max Degree Min Degree Closed
9319.wcsp 562 58 14811 6477229 2 0.11 114 2
6389.wcsp 756 107 32004 14298409 2 0.19 213 4
14420.wcsp 626 70 13278 7360706 2 0.09 139 3
10442.wcsp 908 76 28554 21591913 2 0.09 151 2
6858.wcsp 992 260 103056 20162249 2 0.35 518 2
17034.wcsp 1142 123 47967 38318224 2 0.10 242 2
14226.wcsp 1058 95 36801 25665437 2 0.08 189 2
13931.wcsp 820 145 38775 21106731 2 0.23 289 2
11739.wcsp 506 50 11049 6130539 2 0.14 99 2
14359.wcsp 670 113 26739 11936390 2 0.19 225 2
16547.wcsp 624 77 13815 7254976 2 0.09 152 3
15757.wcsp 342 47 5091 2278611 2 0.10 92 2
9150.wcsp 1352 121 44217 43301891 2 0.06 241 2
4449.wcsp 464 64 12540 5094256 2 0.17 126 2
3792.wcsp 528 59 12084 6359805 2 0.13 117 2
6835.wcsp 496 90 18003 4571108 2 0.23 175 3
14007.wcsp 1554 100 54753 50290563 2 0.07 195 2
8956.wcsp 486 106 20832 6660308 2 0.23 209 3
13306.wcsp 594 93 17580 7737538 2 0.16 185 3
16421.wcsp 404 75 12138 3436849 2 0.20 149 4
16706.wcsp 438 30 6321 2632310 2 0.07 55 5
13964.wcsp 1040 266 149361 19542957 2 0.50 531 18
12976.wcsp 920 128 44823 21604644 2 0.27 255 2
4804.wcsp 716 123 40314 9089101 2 0.27 245 5
5264.wcsp 582 70 11400 6625331 2 0.10 138 2
English (United Kingdom)