RProteinInteraction
RProteinInteraction: A package of network based protein function prediction algorithms in R
Description
Numerous algorithms have been developed to infer protein function based on observed or inferred protein interactions. Proper evaluation of the various algorithms has been inhibited by the lack of availability of corresponding implementations. In addition, those algorithms that have publicly available implementations require a variety of environments and languages. This further inhibits cross-comparison.
We have developed an R package consisting of a suite of algorithms which are representative of many direct methods for network based protein function prediction. An evaluation framework is provided to facilitate comparison of the various methods. The framework includes measures to calculate precision/recall and ROC curves using cross validation. Standard protein interaction databases such as DIP and MINT can easily be integrated into the framework.
Getting Started
Creating PPI Graphs
PSIXMLToIGraph creates an annotated iGraph graph for use by the function prediction algorithms. The function has the following syntax:
PSIXMLToIGraph(intDir = <PATH WHERE PSI-MI XML FILES ARE STORED>, intName = <NAME OF PSI-MI XML FILE>, type = <ONE OF "BIOGRID.PSIMI25", "DIP.PSIMI25", "HPRD.PSIMI25", "INTACT.PSIMI25", "MINT.PSIMI25", "MIPS.PSIMI25">, annotDir = <PATH WHERE GOA FILE IS STORED>, annotName = <NAME OF GOA FILE>, mapDir = <PATH WHERE gp2protein FILE IS LOCATED>, mapName = <NAME OF gp2protein FILE>, cluster = <BOOLEAN INDICATING WHETHER TO PARALLELIZE CODE>, clusterConnectionType = <ONE OF "SOCK", "MPI" OR "PVM">, numWorkers = <NUMBER SPECIFYING NUMBER OF NODES TO USE (USED BY PVM AND MPI ONLY)>, clusterNodeNames = <VECTOR OF CHARACTERS LISTING THE NAMES OF THE NODES IN THE CLUSTER>)
Example: Creating an annotated protein interaction graph from a single PSI-MI XML file without parallelization
myPPIGraph = PSIXMLToIGraph(intDir = getwd(), intName = "MyPSIMIXMLFile.xml", type = "BIOGRID.PSIMI25", annotDir = getwd(), annotName = "MyGOAFile.goa", mapDir = getwd(), mapName = "gp2protein.sgd", cluster = false)
Example: Creating an annotated protein interaction graph from several PSI-MI XML files in folder with parallelization
myPPIGraph = PSIXMLToIGraph(intDir = getwd(), intName = NULL,
type = "BIOGRID.PSIMI25", annotDir = getwd(), annotName = "MyGOAFile.goa",
mapDir = getwd(), mapName = "gp2protein.sgd", cluster = true,
clusterConnectionType = "SOCK", clusterNodeNames = c("node1", "node1",
"node2", "node3"))
When intName is set to NULL PSIXMLToIGraph loads in all the files in a directory. This means that the directory should only contain the PSI-MI XML files you want to import and no other files. If specifying the nodes the run the computation on, as is done with the "SOCK" method, repeating the node name will cause PSIXMLToIGraph to use an additional core on that node. In the above example two cores on node1 are used.
Using Network Based PPI Based Protein Function Prediction Algorithms
All algorithms have the following parameters:
MyPredictionAlgorithm(<ALGORITHM SPECIFIC PARAMETERS>,
graph = <GRAPH TO RUN ALGORITHM ON>,
performance = <BOOLEAN INDICATING WHETHER TO MEASURE
PERFORMANCE OF ALGORITHM ON CHARACTERIZED
PROTEINS>,
sampleAmount = <NULL, A NUMBER OF PROTEINS OR A VECTOR
OF GRAPH VERTEX IDS>,
removeDuplicates = <BOOLEAN TO SPECIFY WHETHER TO REMOVE
DUPLICATE ANNOTATION FROM CHARACTERIZED,
PROTEINS>,
cluster = <BOOLEAN INDICATING WHETHER TO PARALLELIZE CODE>,
clusterConnectionType = <ONE OF "SOCK", "MPI" OR "PVM">,
numWorkers = <NUMBER SPECIFYING NUMBER OF NODES TO USE (USED BY PVM AND MPI ONLY)>,
clusterNodeNames = <VECTOR OF CHARACTERS LISTING THE NAMES OF THE NODES IN THE CLUSTER>,
saveDir = <DIRECTORY WHERE RESULT CAN BE SAVED>,
saveName = <NAME OF SAVED OUTPUT FILE>)
If sampleAmount is specified as NULL then the functions of all uncharacterised proteins are predicted if performance = FALSE or the functions of all characterised proteins are predicted if performance = TRUE.
Neighbour Counting Algorithm
The neighbour counting algorithm predicts a protein's function based on the frequency with which its neighbouring proteins in the graph have this function. The specific parameter for the neighbour counting algorithm is threshold which specifies the minimum confidence value that a term must have to be included in the returned result.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Schwikowski(threshold = 0.0, graph = graph, performance = TRUE,
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE)
Chi-Squared Algorithm
The Chi-squared algorithm is similar to the neighbour counting algorithm except that a term's freqency is compared against the overall frequency of the term in the graph so as to determine whether the frequency is statistically significant. Its specific parameters are threshold, which specifies the minimum Chi-squared value that a prediction can have to be returned in the result, and order, which specifies the number of hops a protein can be away from another protein in the graph to be considered in its neighbourhood.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Hishigaki(threshold = 0.0, order=1, graph = graph, performance = TRUE,
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE)
Common Neighbour Algorithm
The Common Neighbour algorithm is based on the idea that a term that appears in interacting neighbours that are one and two hops away from the test protein are more likely to correctly characterize the protein. This algorithm has only one specific parameter called threshold which specifies the minimum confidence that a prediction must have in order that it be returned in the result.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Chua(threshold = 0.0, graph = graph, performance = TRUE,
sampleAmount = NULL, removeDuplicates = TRUE, cluster = FALSE)
Maximum Likeihood Neighbourhood Algorithm
A probabilistic model is constructed of how the frequency of a term in a protein's neighbourhood relates to whether that protein is annotated by that term or not. The parameters of this model are estimated based on the characterised subgraph of the underlying PPI network. Gibbs sampling is then used to estimate the probability that an uncharacterised protein has a given term.
The specific parameters of the model are threshold and numIter which specifies the number of iterations that the Gibbs sampler will go through.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Deng(threshold = 0.0, numIter = 2000, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
Maximise the Number of Interacting Proteins Sharing Terms Algorithm
Uncharacterised proteins are assigned a term. These terms are perturbed using simulated annealing as as to maximise the number of pairs of proteins in the graph sharing a common term. The probability that a protein has a given term is estimated based on the number of times the simulated annealing algorithm assigns the term to the protein. The algorithm has a number of specific parameters. numIter refers to the number of times the simulated annealing algorithm is ran. startTemp refers to the start temperature at which the annealing algorithm starts at. decrement refers to the amount by which the temperature is decremented by in the annealing algorithm.
Because a suitable start temperature and decrement will vary from graph to graph additional parameters are available so as to aid exploration of the appropriate values that these parameters should have. maxIter limits the number of iterations the annealing algorithm can stay at a given temperature. stationaryThreshold refers to the maximum number of term changes the annealing algorithm can make in order to consider the annealing algorithm has stabilised at a given temperature. logDirAndName can be used to specify the name and location of a log file logging the algorithm's execution.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Vazquez(numIter = 200, startTemp = 10, decrement = 0.01,
maxIter = 1000, stationaryThreshold = 20,
logDirAndName="./Vazquez.Rdata", graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
Functional Flow Algorithm
Proteins annotated with a given term are modelled as sources of flow in the PPI graph. Each edge in the graph has an associated capacity which limits the amount of flow that can pass through it. Flow is passed through the network for a fixed number of iterations. The level of confidence that a protein has a given term is based on the total amount of assocciated flow that has passed though that protein over all iterations.
The specific parameters are threshold and numIters which refers to the number of iterations that flow is allowed to pass from one protein to another in the graph.
Example: Testing the performance of the algorithm on all characterized proteins without parallelization
library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
Generating a Precision/Recall Curve
Precision/recall curves are a standard measure of classifiers in the area of information retrieval. We provide methods to calculate the overall level of precision and recall of prediction on a graph and another measure which describes the expected level of precision and recall per protein.
Example: Generating a precision/recall curve
library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
precRecallCurve = OverallPrecisionAtConfidenceThreshold(graph, result, numSteps = 10)
plot(precRecallCurve$Recall, precRecallCurve$Precision)
Generating an ROC Curve
An algorithm's performance can be measured by the number of terms whose ROC curve area is greater than a given value. This performance measure is based on the measure described in this paper.
Example: Calculating the number of terms whose ROC curve is greater than a given value
library("RProteinInteraction")
data(smallgraph)
result = Nabieva(threshold = 0.0, numIter = 6, graph = graph,
performance = TRUE, sampleAmount = NULL, removeDuplicates = TRUE,
cluster = FALSE)
rocCurve = GetNumTermsVsROCAUC(graph, result, numSteps = 100)
plot(rocCurve$Threshold, rocCurve$Area)
Downloads
Note: Load data into an R environment using the "load(...)" function
Data: A small annotated interaction network (< 20 vertices)
Data: An annotated graph based on MINT protein interaction data for S. Cerevisiae annotated with terms from SGD's GOA file
Download MINT S. Cerevisiae Graph
Data: An annotated graph based on MINT protein interaction data for H. Sapiens annotated with terms from the Gene Ontology Consortium's GOA file
Download MINT H. Sapiens Graph
Data: An annotated graph based on MINT protein interaction data for D. Melanogaster annotated with terms from the Gene Ontology Consortium's GOA file
Download MINT D. Melanogaster Graph
Software: An R package called RProteinInteraction is provided for non-commercial use. Documentation and sample files are provided in the package and can be accessed using the standard approaches in R i.e. "data(<NAME OF FILE>, package = "RProteinInteraction")" to access sample data and "?" for help on the functions available in RProteinInteraction. To install use the commands:
tar -xvf RProteinInteraction*.gz
R CMD INSTALL --library=<WHERE TO INSTALL PACKAGE> RProteinInteraction
Documentation
Lists the accessible methods and functions in RProteinInteraction
Contact
For further information regarding the software and data, please contact Brendan Sheehan.

Clique News Feed