RProteinInteraction

RProteinInteraction: A package of network based protein function prediction algorithms in R
 

Description

Numerous algorithms have been developed to infer protein function based on observed or inferred protein interactions. Proper evaluation of the various algorithms has been inhibited by the lack of availability of corresponding implementations. In addition, those algorithms that have publicly available implementations require a variety of environments and languages. This further inhibits cross-comparison.

We have developed an R package consisting of a suite of algorithms which are representative of many direct methods for network based protein function prediction. An evaluation framework is provided to facilitate comparison of the various methods. The framework includes measures to calculate precision/recall and ROC curves using cross validation. Standard protein interaction databases such as DIP and MINT can easily be integrated into the framework.
 

Getting Started

Creating PPI Graphs

PSIXMLToIGraph creates an annotated iGraph graph for use by the function prediction algorithms. The function has the following syntax:

PSIXMLToIGraph(intDir = <PATH WHERE PSI-MI XML FILES ARE STORED>, 
intName = <NAME OF PSI-MI XML FILE>, type = <ONE OF "BIOGRID.PSIMI25", 
"DIP.PSIMI25", "HPRD.PSIMI25", "INTACT.PSIMI25", "MINT.PSIMI25", "MIPS.PSIMI25">, 
annotDir = <PATH WHERE GOA FILE IS STORED>, annotName = <NAME OF GOA FILE>, 
mapDir = <PATH WHERE gp2protein FILE IS LOCATED>, 
mapName = <NAME OF gp2protein FILE>, 
cluster = <BOOLEAN INDICATING WHETHER TO PARALLELIZE CODE>, 
clusterConnectionType = <ONE OF "SOCK", "MPI" OR "PVM">, 
numWorkers = <NUMBER SPECIFYING NUMBER OF NODES TO USE (USED BY PVM AND MPI ONLY)>, 
clusterNodeNames = <VECTOR OF CHARACTERS LISTING THE NAMES OF THE NODES IN THE CLUSTER>)

Example: Creating an annotated protein interaction graph from a single PSI-MI XML file without parallelization

myPPIGraph = PSIXMLToIGraph(intDir = getwd(), intName = "MyPSIMIXMLFile.xml", 
type = "BIOGRID.PSIMI25", annotDir = getwd(), annotName = "MyGOAFile.goa", 
mapDir = getwd(), mapName = "gp2protein.sgd", cluster = false)

Example: Creating an annotated protein interaction graph from several PSI-MI XML files in folder with parallelization

myPPIGraph = PSIXMLToIGraph(intDir = getwd(), intName = NULL, 
type = "BIOGRID.PSIMI25", annotDir = getwd(), annotName = "MyGOAFile.goa", 
mapDir = getwd(), mapName = "gp2protein.sgd", cluster = true, 
clusterConnectionType = "SOCK", clusterNodeNames = c("node1", "node1", 
"node2", "node3"))

When intName is set to NULL PSIXMLToIGraph loads in all the files in a directory. This means that the directory should only contain the PSI-MI XML files you want to import and no other files. If specifying the nodes the run the computation on, as is done with the "SOCK" method, repeating the node name will cause PSIXMLToIGraph to use an additional core on that node. In the above example two cores on node1 are used.

Using Network Based PPI Based Protein Function Prediction Algorithms

All algorithms have the following parameters:

MyPredictionAlgorithm(<ALGORITHM SPECIFIC PARAMETERS>, 
graph = <GRAPH TO RUN ALGORITHM ON>, 
performance = <BOOLEAN INDICATING WHETHER TO MEASURE 
               PERFORMANCE OF ALGORITHM ON CHARACTERIZED 
               PROTEINS>, 
sampleAmount = <NULL, A NUMBER OF PROTEINS OR A VECTOR 
                OF GRAPH VERTEX IDS>, 
removeDuplicates = <BOOLEAN TO SPECIFY WHETHER TO REMOVE 
                    DUPLICATE ANNOTATION FROM CHARACTERIZED,
                    PROTEINS>,
cluster = <BOOLEAN INDICATING WHETHER TO PARALLELIZE CODE>, 
clusterConnectionType = <ONE OF "SOCK", "MPI" OR "PVM">, 
numWorkers = <NUMBER SPECIFYING NUMBER OF NODES TO USE (USED BY PVM AND MPI ONLY)>, 
clusterNodeNames = <VECTOR OF CHARACTERS LISTING THE NAMES OF THE NODES IN THE CLUSTER>,
saveDir = <DIRECTORY WHERE RESULT CAN BE SAVED>, 
saveName = <NAME OF SAVED OUTPUT FILE>)

If sampleAmount is specified as NULL then the functions of all uncharacterised proteins are predicted if performance = FALSE or the functions of all characterised proteins are predicted if performance = TRUE.

Neighbour Counting Algorithm

The neighbour counting algorithm predicts a protein's function based on the frequency with which its neighbouring proteins in the graph have this function. The specific parameter for the neighbour counting algorithm is threshold which specifies the minimum confidence value that a term must have to be included in the returned result.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Schwikowski(threshold = 0.0, graph = graph, performance = TRUE, 
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE) 

Chi-Squared Algorithm

The Chi-squared algorithm is similar to the neighbour counting algorithm except that a term's freqency is compared against the overall frequency of the term in the graph so as to determine whether the frequency is statistically significant. Its specific parameters are threshold, which specifies the minimum Chi-squared value that a prediction can have to be returned in the result, and order, which specifies the number of hops a protein can be away from another protein in the graph to be considered in its neighbourhood.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Hishigaki(threshold = 0.0, order=1, graph = graph, performance = TRUE, 
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE) 

Common Neighbour Algorithm

The Common Neighbour algorithm is based on the idea that a term that appears in interacting neighbours that are one and two hops away from the test protein are more likely to correctly characterize the protein. This algorithm has only one specific parameter called threshold which specifies the minimum confidence that a prediction must have in order that it be returned in the result.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Chua(threshold = 0.0, graph = graph, performance = TRUE, 
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE) 

Maximum Likeihood Neighbourhood Algorithm

A probabilistic model is constructed of how the frequency of a term in a protein's neighbourhood relates to whether that protein is annotated by that term or not. The parameters of this model are estimated based on the characterised subgraph of the underlying PPI network. Gibbs sampling is then used to estimate the probability that an uncharacterised protein has a given term.

The specific parameters of the model are threshold and numIter which specifies the number of iterations that the Gibbs sampler will go through.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Deng(threshold = 0.0, numIter = 2000, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)

Maximise the Number of Interacting Proteins Sharing Terms Algorithm

Uncharacterised proteins are assigned a term. These terms are perturbed using simulated annealing as as to maximise the number of pairs of proteins in the graph sharing a common term. The probability that a protein has a given term is estimated based on the number of times the simulated annealing algorithm assigns the term to the protein. The algorithm has a number of specific parameters. numIter refers to the number of times the simulated annealing algorithm is ran. startTemp refers to the start temperature at which the annealing algorithm starts at. decrement refers to the amount by which the temperature is decremented by in the annealing algorithm.

Because a suitable start temperature and decrement will vary from graph to graph additional parameters are available so as to aid exploration of the appropriate values that these parameters should have. maxIter limits the number of iterations the annealing algorithm can stay at a given temperature. stationaryThreshold refers to the maximum number of term changes the annealing algorithm can make in order to consider the annealing algorithm has stabilised at a given temperature. logDirAndName can be used to specify the name and location of a log file logging the algorithm's execution.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Vazquez(numIter = 200, startTemp = 10, decrement = 0.01,
maxIter = 1000, stationaryThreshold = 20,
logDirAndName="./Vazquez.Rdata", graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)

Functional Flow Algorithm

Proteins annotated with a given term are modelled as sources of flow in the PPI graph. Each edge in the graph has an associated capacity which limits the amount of flow that can pass through it. Flow is passed through the network for a fixed number of iterations. The level of confidence that a protein has a given term is based on the total amount of assocciated flow that has passed though that protein over all iterations.

The specific parameters are threshold and numIters which refers to the number of iterations that flow is allowed to pass from one protein to another in the graph.

Example: Testing the performance of the algorithm on all characterized proteins without parallelization

library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)

Generating a Precision/Recall Curve

Precision/recall curves are a standard measure of classifiers in the area of information retrieval. We provide methods to calculate the overall level of precision and recall of prediction on a graph and another measure which describes the expected level of precision and recall per protein.

Example: Generating a precision/recall curve

library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
precRecallCurve = OverallPrecisionAtConfidenceThreshold(graph, result, numSteps = 10)
plot(precRecallCurve$Recall, precRecallCurve$Precision)

Generating an ROC Curve

An algorithm's performance can be measured by the number of terms whose ROC curve area is greater than a given value. This performance measure is based on the measure described in this paper.

Example: Calculating the number of terms whose ROC curve is greater than a given value

library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
rocCurve = GetNumTermsVsROCAUC(graph, result, numSteps = 100)
plot(rocCurve$Threshold, rocCurve$Area)

Downloads

Note: Load data into an R environment using the "load(...)" function

Data: A small annotated interaction network (< 20 vertices)

Download Small Example Graph

Data: An annotated graph based on MINT protein interaction data for S. Cerevisiae annotated with terms from SGD's GOA file

Download MINT S. Cerevisiae Graph

Data: An annotated graph based on MINT protein interaction data for H. Sapiens annotated with terms from the Gene Ontology Consortium's GOA file

Download MINT H. Sapiens Graph

Data: An annotated graph based on MINT protein interaction data for D. Melanogaster annotated with terms from the Gene Ontology Consortium's GOA file

Download MINT D. Melanogaster Graph

Software: An R package called RProteinInteraction is provided for non-commercial use. Documentation and sample files are provided in the package and can be accessed using the standard approaches in R i.e. "data(<NAME OF FILE>, package = "RProteinInteraction")" to access sample data and "?" for help on the functions available in RProteinInteraction. To install use the commands:

tar -xvf RProteinInteraction*.gz

R CMD INSTALL --library=<WHERE TO INSTALL PACKAGE> RProteinInteraction

Download RProteinInteraction

Documentation

Lists the accessible methods and functions in RProteinInteraction

Download Documentation

Contact

For further information regarding the software and data, please contact Brendan Sheehan.

Social Network Analysis Group

Search

User login

Subscribe to Clique

Sponsors