Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA

Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

Abstract

Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far limited their use to moderately sized networks. In this paper, we address the problem of aligning two metabolic networks, particularly when both of them are too large to be dealt with using existing methods. We develop a generic framework that can significantly improve the scale of the networks that can be aligned in practical time. Our framework has three major phases, namely the

Background

Biological networks provide a compact representation of the roles of different biochemical entities and the interactions between them. Depending on the types of entities and interactions, these networks are segregated into different types, where each network type encompasses a particular set of biological processes. Protein-protein interaction (PPI) networks comprise binding relationships between two or more proteins to carry out specific cellular functions such as signal transduction. Regulatory networks consist of interactions between genes and gene products to control the rates at which genes are transcribed. Metabolic networks represent sets of chemical reactions that are catalyzed by enzymes to transform a set of metabolites into others to maintain the stability of a cell and to meet its particular needs. Analysis of the connectivity properties of these networks has proven to be crucial in uncovering the details of the cell machinery and in revealing the functional modules and complexes involved in this mechanism

An essential type of network analysis is the comparative analysis that aims at identifying functionally similar elements or element sets shared among different organisms which would not be possible if these elements were only considered individually. This is often achieved through alignment of the networks of these organisms. Analogous to sequence alignment which identifies conserved sequences, network alignment reveals connectivity patterns that are conserved among two or more organisms. A number of studies have been done to systematically align different types of biological networks

Comparative analysis is important particulary for large metabolic networks such as organism-wide networks. Identification of the conserved patterns among metabolic networks across species provide insights for metabolic reconstruction of a newly sequenced genome

In this paper, we develop a framework that significantly improves the scale of the networks that can be aligned using existing algorithms. Our framework has three major phases, namely the

Aligning two metabolic networks with and without compression

**Aligning two metabolic networks with and without compression.** Top figures (a-c) illustrate the steps of alignment without compression. Bottom figures (d-g) demonstrate different phases of alignment with compression using our framework. (a) Two hypothetical metabolic networks with 5 and 4 reactions respectively. Directed edges represent the neighborhood relations between the reactions. (b) Support matrix of size 20

We can best motivate the need for such a framework on an example. Figure ^{2}^{2})) for IsoRank and for SubMAP when only subnetworks of size one are allowed. Figures

Notice that when we compress the network more (i.e., increase the number of compression levels), the compressed network gets smaller in terms of its number of nodes and edges. As a result, we can expect to align the compressed networks faster. However, this comes at the price of two drawbacks both due to the fact that each supernode contains multiple nodes from the original domain. First, once we find a mapping for the supernodes in the compressed domain, we still need to align the nodes of each supernode pair. For example, after mapping the supernodes (a, b) and (

Several key questions follow from these observations are:

1. How does compression affect the alignment accuracy with respect to the base network alignment method?

2. How far is our compression method from an optimal compression that produces the compressed network with the minimum number of nodes?

3. When is it a good idea to do the alignment in compressed domain taking into account the overhead of compression and refinement phases?

4. What is the right amount of compression? That is, when does compression minimize the running time of our overall framework?

In the rest of the paper we address each of these questions in detail. At this point, it is important to notice the potential for leveraging the alignment of larger scale networks by the framework we are proposing. The actual performance gain for an alignment will depend on the level of compression we use, the topologies of the query networks and complexity of the base alignment method.

Results overview

Our experiments on metabolic networks extracted from KEGG pathway database

Technical contributions

- We devise an efficient framework for the network alignment problem that employs a scalable compression method which shrinks the given networks while respecting their topology.

- We prove the optimality of our compression method under certain conditions and provide a bound on how much our compression results can deviate from the optimal solution in the worst case.

- We provide a mathematical formulation that serves as a guideline to select an optimal number of compression levels depending on the input characteristics of the alignment.

- We characterize the cases for which the proposed framework is expected to provide significant improvement in alignment performance.

In the next section, we report our experimental results on a set of large scale metabolic networks that are constructed by combining networks from KEGG Pathway database

Results and discussion

In this section, we experimentally evaluate the performance of our framework. First, we measure the compression rates achieved for different levels of compression with minimum degree selection (

Next, we further analyzed the changes in degree distribution and large scale organization of organism-wide metabolic networks with increasing compression levels. We, then, examine the gain in running time and memory utilization achieved by our framework for different values of compression level (

Dataset

We use the metabolic networks from the KEGG pathway database

In order to obtain our

1. Carbohydrate Metabolism

2. Energy Metabolism

3. Lipid Metabolism

4. Nucleotide Metabolism

5. Amino Acid Metabolism

6. Metabolism of Other Amino Acids

7. Glycan Biosynthesis and Metabolism

8. Metabolism of Cofactors and Vitamins

9. All Amino Acids (Amino Acid + Other Amino Acids)

Implementation and system details

We implemented our compression and alignment algorithms in C_{++}. We ran all the experiments on a desktop computer running Red Hat Enterprise Client 5.7 with 4 GB of RAM and two dual-core 2.40 GHz processors.

Evaluation of compression rates

The efficiency of our alignment framework depends on how much the query metabolic networks can be compressed. For this reason, in this experiment, we measure the number of nodes and edges of the metabolic networks in our large scale dataset before and after compression.

The minimum degree selection (

Table

Summary of compression rates for all the networks in our large scale dataset

**
Network size intervals
**

**
Average number of nodes
**

**
Average number of edges
**

**c = 0**

**c = 1**

**c = 2**

**c = 3**

**c = 0**

**c = 1**

**c = 2**

**c = 3**

[0, 100)

41.5

26.5

**26.5**

19.1

**19.1**

15

**14.8**

83.5

55.2

**55.5**

36.3

**36.5**

23.6

**23.5**

[100, 200)

154.8

92.4

**92.2**

61.3

**61.5**

48.6

**48.6**

310.1

174.9

**174**

116.5

**118.1**

96.3

**94.6**

[200, 300)

240.5

139.1

**139.4**

89.2

**89.1**

69.4

**69.7**

508.1

296.5

**298.4**

230.5

**228.4**

187.8

**188.1**

[300, 400]

344.9

207.3

**207.6**

133.1

**133.8**

103

**104.5**

585.7

372.9

**373.5**

302.7

**300.4**

261.6

**259.9**

[850, 1250]

1080.5

623.2

**623.7**

406.8

**407.9**

311.3

**311.9**

3727

2269

**2280.6**

1732.7

**1733.8**

1584.8

**1587.5**

[1500, 1615]

1576.5

909

**910**

582

**583**

447.8

**444.6**

4740

2955.2

**2964.3**

2283.5

**2279.3**

2128.8

**2129.6**

We create six intervals according to number of reactions in these networks. Each row, corresponding to one such interval, shows the average number of nodes and edges before compression (i.e., **bold **correspond to the averages of 100 different compressions which are gathered by randomizing the step at which a node is selected among the set of minimum degree nodes.

One conclusion that can be drawn from Table

This experimental setup also suggests that the

Changes in degree distributions with compression

Even though the compression rates we achieve with

In order to understand the reason behind different compression rates for different compression levels, we examined the degree distributions of the ten organism-wide networks we have in our dataset. For each of these networks, we plotted the histogram of out-degree distributions for different levels of compression. Figure

Shift of out-degree distributions from power law to uniform

**Shift of out-degree distributions from power law to uniform.** Changes in the out-degree distributions of ten organism-wide metabolic networks with increasing levels of compression. We calculate the frequencies of each out-degree in the range [2,40] for

A more interesting observation is that there is a consistent shift from the power-law degree distribution to uniform distribution with increasing

Evaluation of running time and memory utilization

In order to understand the capabilities and limitations of our framework, we examine its performance in terms of its running time and memory utilization on a set of large scale networks we constructed as described in the dataset section. We have ten networks for each of the ten organisms in our dataset. For each organism, nine of these networks constitute different metabolism categories and the tenth network is the organism-wide metabolic network. In total, we have 100 networks with sizes ranging from 5 to 1615. For each parameter setting (different combinations of

Figure ^{i+5}, 2^{i+6}] where

Resource utilization of our framework

**Resource utilization of our framework.** The average (a) running time and (b) memory utilization of our framework when each query network in our large scale dataset is aligned with all the networks (including itself) in the same dataset. x-axis is the query size which is calculated as the product of the sizes (i.e., number of reactions) of the metabolic networks aligned.

For

Similar trend of improved running times with increasing ^{13}, 2^{14}] compressing the networks more than only one level (

Gain/Loss in running time

**Gain/Loss in running time.** Gain/Loss in running time of alignment by using our framework with respect to the base alignment method (x-axis) versus the ratio of the number of all possible subnetwork mappings in compressed domain to this number in the original domain. The blue vertical line shows when the two methods take exact same amount of time or when both methods take very short amount of time in the case of small query networks. Points on the right (left) handside of this line means gain (loss) in the running time. The dashed line is our decision criteria for predicting whether there will be gain or loss before doing the alignment.

An important aspect of our framework is that it makes possible to align networks that could not be aligned with our base method. For ^{17}, 2^{18}], 96 did not complete successfully for

Figure

Accuracy of the alignment results

We conclude our experimental results by answering the first question introduced earlier in the paper, that is "How does compression affect the alignment accuracy?". In order to answer this, we calculate the correlation between the scores of each possible mapping in compressed domain and the scores that we obtain for these mappings from the original SubMAP method. We consider the scores of each possible subnetwork mapping of compressed nodes found by our framework. Since the mappings found by SubMAP are not of the same form with the mappings in compressed domain, we calculate a score value for each mapping in compressed domain by using the scores of the mappings found by SubMAP in the original domain. This way, we get two sets of score values one from SubMAP one from our framework for the same set of mappings. We calculate the Pearson's correlation coefficient between these two sets of scores as an indicator of the similarity between the results of the two methods.

Before looking at the correlation values we found, it is important to describe how we calculate the score for a mapping in compressed domain from the mappings of SubMAP. Let ^{1 }and _{1 }is a subnetwork of _{1 }and _{1 }= {_{1}, _{2}}, ^{1}} for ^{1}|. This subnetwork mapping in compressed domain contains six possible mappings in the original, namely _{i}^{c}

Table

Correlation of the mapping scores found with and without compression

**
k/c
**

**1**

**2**

**3**

1

0.89

0.56

0.53

2

0.85

0.58

0.50

3

0.84

0.57

0.49

We calculate the Pearson's correlation coefficient between the two sets of score values one from SubMAP (without compression) one from our framework (with compression) and report it as an indicator of the accuracy of alignment results of our framework for different parameter settings.

Conclusions

In this paper, we considered the problem of aligning two metabolic networks particularly when both of them are too large to be dealt with using existing methods. To solve this problem, we developed a framework that scales the size of the metabolic networks that existing methods can align significantly. Our framework is generic as it can be used to improve the scalability of any existing network alignment method. It has three major phases, namely the

Methods

In this section, we describe the method we develop to compress the query networks and the overall framework for aligning networks in this compressed domain. Before going into detail, it is important to state that we are using a reaction-based model for representing metabolic networks throughout this paper. Formally, we represent a metabolic network with _{ij }_{j}

Minimum degree selection (

Let ^{x }^{x}^{x}^{0 }= ^{x }^{x }^{- 1 }for each ^{x }^{x - 1 }or a supernode that contains two nodes of ^{x - 1}. In summary, we construct ^{x }^{x - 1 }in a number of consecutive steps. At each step, we choose a pair of connected nodes in ^{x - 1 }that are not compressed in earlier steps of the current compression level. We then merge this node pair into a supernode and add it to ^{x}^{x - 1}. Assume that the number of such steps is ^{x - 1 }or supernodes from ^{x}

One compression step of the

**One compression step of the MDS method.** Small circles represent reactions and big circles represent supernodes that result from earlier steps of compression. A solid arrow represents an edge between two non-compressed nodes in the current compression level. A dashed arrow denotes an edge between a supernode and another node in the network. While calculating the degrees of the non-compressed nodes, only the solid arrows are taken into account. (a) The state of network

We are now ready to discuss how we compress ^{x - 1 }to get ^{x}_{a}_{a}_{a}_{a}_{b}_{a}_{b}_{a }_{b }_{ab }_{a}_{b}_{ab}_{a }_{b }_{ab }_{a }_{b }_{ab }_{ab}^{x-1 }to get ^{x }

The discussion above describes the intermediate compression steps of the ^{x - 1 }= (^{x - 1}, ^{x - 1}) by initially treating ^{x - 1 }as a non-compressed network with no supernodes. As a result of this process, after finishing the ^{x }^{x}

Optimality analysis for

In the previous section, we described in detail the compression method (

We start by introducing the notation we use in this section to handle networks with more than one connected component. Let ^{x }^{x}^{x}^{x}^{x}

In the following, we first prove that the

**Lemma 1 **

**Proof 1 **

_{a }with deg_{a}_{a}. The node v_{a }has exactly one non-compressed neighbor, say v_{b}. Thus, MDS merges them to create the supernode v_{ab }(see _{a }and v_{b }while compressing _{a }and v_{b }are compressed at any point in the optimal method, then the optimal solution for

_{a }and v_{b }are not merged together in the optimal solution. This case implies v_{a }is left as a singleton at the end of the xth level as deg_{a}_{a }and all the edges connected to it can have at most _{a }is left as a singleton cannot be greater than one plus

**Lemma 2 **

**Proof 2 **_{a }be the first node in the list of minimum degree nodes in _{a}_{b }that also has deg_{b}_{a }and v_{b }to create the supernode v_{ab }at the compression step from _{a}, say v_{c}, and at most one neighbor of v_{b}, say v_{d}, to be merged with the corresponding node in later steps. Notice that v_{c }and v_{d }are not necessarily distinct. The MDS algorithm can also merge v_{c }and v_{d }in the next steps if they are also neighbors though we do not know it for sure at this point. This results in either one compression or two compressions using only the four nodes v_{a}, v_{b}, v_{c }and v_{d }by the MDS method. Next, we calculate the number of compression steps that the OPT method can take for compressing these four nodes. There are three cases to consider:

Case 1. The _{a }_{a }with v_{b }in the next step by MDS and then compressing the rest of the network by O PT. In other words, MDS already takes the optimal compression step. Hence,

Case 2. The _{a }_{c }_{c }is not connected to v_{d }and the OPT method merges v_{b }with v_{d }in a later step. This way the OPT method optimally compresses four nodes down to two supernodes, namely v_{ac }and v_{bd}. On the other hand the MDS method creates a single supernode, v_{ab}, and the nodes v_{c }and v_{d }remain as singleton However, even for this worst case, the MDS method prevents only one compression step to take place with respect to O PT. Hence,

Case 3. The _{b }_{d }

□

Using lemmas 1 and 2, Theorem 1 develops an upper bound on the number of compression that can be missed by

**Theorem 1 **(O^{c }by the MDS method

^{x - 1}, *^{x}^{x - 1}, *^{x}

^{c}^{c}

^{c}^{c}^{c}

**Proof 3 **^{x - 1}^{x - 1 }

^{c}^{c}

^{c}^{c}^{c}^{c}^{1}) ≤ 2 ^{1})^{1}) ≤ 2 ^{1})^{c}^{c}

□

Another way of interpreting Theorem 1 is to transform it to an upper bound on the size of the compressed network generated by ^{c}_{O PT }_{MDS }_{O PT }

If we examine the ratio ^{x}

Alignment framework

We described the first phase, namely the compression phase in detail in previous sections. Here, we first summarize the base alignment method, SubMAP

Overview of SubMAP

Here, we take a small detour and explain SubMAP, a recent method for aligning metabolic networks when they are not compressed. We pick SubMAP method for its high accuracy and biological relevance as it considers subnetworks of the given networks during the alignment. A

The first step of SubMAP is to create the set of all possible subnetworks of size at most _{k }_{k }

The step that dominates the time and space complexity of SubMAP is the third step. The aim of this step is to create a similarity score that combines pairwise similarities with the topological similarity of the networks. A data structure named the _{k}^{2 }_{k}^{2}) space. This complexity is very important as it is the dominating factor in the overall time and space complexity of SubMAP. The next two steps of the algorithm are to combine topological similarity with pairwise node similarities and to extract the alignment as a set of subnetwork mappings of

Alignment phase

The SubMAP method described above aligns the networks ^{c }

Let us first consider ^{c }^{c}^{c}_{a }^{c }_{a }^{c }_{a }_{b }^{c }_{a }_{b }^{c }^{c}^{c }

Refinement phase

Each mapping found by the alignment phase is a subnetwork pair where one is from ^{c }^{c }^{c }^{x }^{x }

Figure ^{c }^{c }

Complexity analysis

Having finished the discussion of all the three phases, now we can analyze the overall complexity of our framework. We start from the first phase which is compression of the input networks ^{2}). Since the input sizes of this level is larger than all the next levels, we can safely assume that each of these next levels also take ^{2}) and the complexity of compression by ^{2}). Even though this is not a tight bound, it is sufficient at this point for the complexity of the next two phases will dominate it. Since we compress both networks, the overall complexity for the compression phase is:

For the analysis of the next phases, we make two assumptions both of which are supported by experimental evidence on the topological properties of metabolic networks. Our first assumption is that at each level of compression our method reduces the network size by half. In other words, if the sizes of our query networks are _{MDS }_{MDS }

We are now ready to analyze the complexity of the second phase which is the alignment phase. By the first assumption, we know that the sizes of ^{c }_{MDS }_{MDS }_{MDS }_{MDS}

The complexity of the refinement phase has two factors in it. The first one is the number of mappings found by the alignment phase. Since we know that SubMAP allows each node of both networks to be reported in at most one mapping, we have a trivial upper bound on the number of possible mappings in terms of ^{c }^{c }^{c }^{c }^{2 }^{2 }2^{2c }^{2 }2^{2c}). We do this refinement for

Combining the results of Equations 4, 5 and 6, we can see that the overall complexity of our method is determined by the second or the third phase depending on the value of

When should we compress?

We discussed the potential of our framework improving the scalability of existing network alignment methods. However, there can be cases when the compression results in such network topologies which would enforce the alignment method to reach its worst case performance. In this section, we want to analyze when performing the alignment in compressed domain is the better alternative. For this purpose, we devise a criterion that is inspired by the results of a large number of network alignments that are done by both of the methods. We find that the gain/loss in running time is highly dependent on the number of all possible subnetworks of compressed and non-compressed networks. The numbers of these subnetworks can be determined in advance to the alignment. By formulating a criterion in terms of these numbers, we can make a decision between the two algorithms before actually performing an alignment.

Figure _{k}_{k }^{c }

How much should we compress?

In this section, we provide a guideline for selecting a value for compression level

**Theorem 2 **(O

**Proof 4 **

□

The value obtained from the above discussion is not necessarily an integer. We suggest using the nearest integer to this value as the number of compression levels in our alignment. Next, we want to give a few examples for to see what Theorem 2 implies in practice. Assume we have two networks with sizes

If we round this to the nearest integer, the Equation 7 suggests that we use two levels of compression for this alignment problem to be able to get the largest gain in running time. We can carry the calculations similarly for a bigger set of inputs

However, it is important to note that depending on how much of a tradeoff is desired between the running time gain and the alignment accuracy, the user can always use smaller (or bigger)

List of abbreviations

_{i }^{c}^{c }^{c}^{c}_{a}_{a}_{a}_{a}_{a}_{ab }_{a }_{b}_{k}_{k}

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FA, TK, and MD developed the method. MD and FA implemented the methods and gathered experimental results. FA and TK wrote the paper.

Acknowledgements and funding

This work was supported partially by NSF under grants IIS-0845439 and CCF-0829867. FA is partially supported by NSF under grant #1136996 to the Computing Research Association for the CIFellows project.

This article has been published as part of