Lagrangian Relaxation - Solving NP-hard Problems in Computational Biology via Combinatorial Optimization

Stefan Canzar

Abstract

This thesis is devoted to two $\mathcal{NP}$-complete combinatorial optimization problems arising in computational biology: the well-studied \emph{multiple sequence alignment} problem and the newly formulated \emph{interval constraint coloring} problem. It shows that advanced mathematical programming techniques are capable of solving large-scale real-world instances from biology to optimality. Furthermore, it presents alternative methods that provide approximate solutions.

In the first part of the thesis, we present a \emph{Lagrangian relaxation} approach for the multiple sequence alignment (MSA) problem. Multiple alignment is a common mathematical abstraction of the comparison of multiple biological sequences, such as DNA, RNA, or protein sequences. If the weight of a multiple alignment is measured by the sum of the projected pairwise weights of all pairs of sequences in the alignment, then finding a multiple alignment of maximum weight is $\mathcal{NP}$-complete if the number of sequences is not fixed. The majority of the available tools for aligning multiple sequences implement heuristic algorithms; no current exact method is able to solve moderately large instances or instances involving sequences that exhibit a low degree of similarity. We present a branch-and-bound (B\&B) algorithm for the MSA problem. We approximate the optimal integer solution in the nodes of the B\&B tree by a Lagrangian relaxation of an ILP formulation for MSA relative to an exponentially large class of inequalities that ensure that all pairwise alignments can be incorporated into a multiple alignment. By lifting these constraints prior to dualization, the Lagrangian subproblem becomes an \emph{extended pairwise alignment} (EPA) problem: compute the longest path in an acyclic graph, where the path is charged a penalty for entering ``obstacles''.
We describe an efficient algorithm that solves the EPA problem repeatedly to determine near-optimal \emph{Lagrangian multipliers} via subgradient optimization. Reformulating the dualized constraints with respect to additionally introduced variables improves the convergence rate dramatically. We account for the exponential number of dualized constraints by starting with an empty \emph{constraint pool} in the first iteration; in each iteration, we add to the pool the cuts that are most violated by a convex combination of a small number of preceding Lagrangian solutions (including the current solution). In this \emph{relax-and-cut} scheme, only inequalities from the constraint pool are dualized.

The interval constraint coloring problem appears in the interpretation of experimental data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spectroscopy is a method used to obtain information about protein tertiary structure. The output of these experiments provides aggregate data about the exchange rate of residues in overlapping fragments of the protein backbone. These fragments must be re-assembled in order to obtain a global picture of the protein structure. The interval constraint coloring problem is the mathematical abstraction of this re-assembly process. Its objective is to assign a color (exchange rate) to each element of a set of integers (protein residues) such that a set of constraints is satisfied. Each constraint is made up of a closed interval (protein fragment) and requirements on the number of elements in the interval that belong to each color class (exchange rates observed in the experiments). We introduce a polyhedral description of the interval constraint coloring problem, which serves as a basis for attacking the problem with integer linear programming (ILP) methods and tools that perform well in practice.
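The constraint structure described above can be made concrete with a small checker. The sketch below is illustrative only: the dictionary-based representation and the names `color_counts` and `satisfies` are assumptions made here, not taken from the thesis, and the code merely verifies a given coloring rather than solving the problem.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical representation: residues are integers, a coloring maps each
# residue to a color, and each constraint pairs a closed interval [start, end]
# with the required number of residues per color.
Coloring = Dict[int, str]
Constraint = Tuple[int, int, Dict[str, int]]


def color_counts(coloring: Coloring, start: int, end: int) -> Counter:
    """Count how often each color occurs on the closed interval [start, end]."""
    return Counter(coloring[i] for i in range(start, end + 1))


def satisfies(coloring: Coloring, constraints: List[Constraint]) -> bool:
    """Return True if every interval holds exactly the required count of each color."""
    for start, end, required in constraints:
        counts = color_counts(coloring, start, end)
        # Colors missing from either side count as zero.
        colors = set(required) | set(counts)
        if any(counts[c] != required.get(c, 0) for c in colors):
            return False
    return True


# Toy instance with four residues and two overlapping fragments.
coloring = {1: "slow", 2: "fast", 3: "fast", 4: "slow"}
constraints = [
    (1, 3, {"slow": 1, "fast": 2}),
    (2, 4, {"slow": 1, "fast": 2}),
]
print(satisfies(coloring, constraints))  # True
```

Because the fragments overlap, a color chosen for one residue affects several constraints at once, which is what makes the re-assembly a genuinely combinatorial problem.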
Since the goal is to provide biochemists with all possible candidate solutions, we group related solutions into equivalence classes in an improved ILP formulation in order to reduce the running time of our enumeration algorithm. Moreover, we establish the polynomial-time solvability of the two-color case by the integrality of the linear programming relaxation polytope $\mathcal{P}$, and also present a combinatorial polynomial-time algorithm for this case. We apply this algorithm as a subroutine to approximate solutions to instances with an arbitrary but fixed number of colors and achieve an order of magnitude improvement in running time over the (exact) ILP approach. We show that the problem is $\mathcal{NP}$-complete for an arbitrary number of colors, and we provide algorithms that, given an instance with $\mathcal{P}\neq\emptyset$, find a coloring that satisfies all the coloring requirements within $\pm 1$ of the prescribed value. In light of our $\mathcal{NP}$-completeness result, this is essentially the best one can hope for. Our approach is based on polyhedral theory and randomized rounding techniques.

In practice, data emanating from the experiments are noisy, which normally causes the instance to be infeasible and, in some cases, even forces $\mathcal{P}$ to be empty. To deal with this problem, the objective of the ILP is to minimize the total sum of absolute deviations from the coloring requirements over all intervals. The combinatorial approach for the two-color case optimizes the same objective function. Furthermore, we use this combinatorial method to compute, in a Lagrangian way, a bound on the minimum total error, which is exploited in a branch-and-bound manner to determine all optimal colorings. Alternatively, we study a variant of the problem in which we want to maximize the number of requirements that are satisfied.
We prove that this variant is $\mathcal{APX}$-hard even in the two-color case and thus does not admit a polynomial-time approximation scheme (PTAS) unless $\mathcal{P}=\mathcal{NP}$. Therefore, we slightly relax the condition under which a requirement counts as satisfied (by a factor of $(1+\epsilon)$) and propose a \emph{quasi-polynomial time approximation scheme} (QPTAS) that finds a coloring ``satisfying'' the requirements of as many intervals as possible.
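
The error-minimizing objective described above can be written compactly. The following is only one plausible formalization, not necessarily the formulation used in the thesis; the symbols $\mathcal{I}$ (set of intervals), $C$ (set of colors), $r(I,c)$ (required number of residues of color $c$ in interval $I$), and the binary variables $x_{i,c}$ (residue $i$ receives color $c$) are notation assumed here for illustration.

\begin{align*}
\min \quad & \sum_{I \in \mathcal{I}} \sum_{c \in C} \Bigl| \sum_{i \in I} x_{i,c} - r(I,c) \Bigr| \\
\text{s.t.} \quad & \sum_{c \in C} x_{i,c} = 1 && \text{for every residue } i, \\
& x_{i,c} \in \{0,1\} && \text{for every residue } i \text{ and color } c.
\end{align*}

Under this reading, each absolute value can be linearized with one auxiliary nonnegative variable bounded below by the deviation and by its negation, and an instance admits a coloring that meets all requirements exactly precisely when the optimal value is zero.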
