A Bioinformatics Grid Alignment Toolkit
EPICOCO, Italo; CAFARO, Massimo; ALOISIO, Giovanni
2008-01-01
Abstract
In bioinformatics, a key issue in sequence analysis is determining the similarity between biological sequences, that is, the percentage of matching positions among nucleotide or protein sequences. The underlying hypothesis is that similarity relates to function: if two sequences are similar, they are likely to have related functions. Sequence databases are growing at an exponential rate, currently doubling in about 12 months, which outpaces the growth in compute cycles, which double only every 18 months (Moore’s law). Whenever sequence databases are significantly updated, alignments should be repeated to discover new information. Grid computing, together with parallel alignment tools wrapped as Web Services, is a crucial technique for maintaining and improving the effectiveness of sequence comparison tools. Partitioning the load into separate jobs for each simulation is a good choice when alignment is applied to a large dataset, because the input data, composed of a set of sequences (generally in FastA format), are compared with biological databases that are themselves sets of sequences in various formats (flat files and FastA). It is therefore possible to parallelize the execution by splitting the input dataset, the database, or both, sending each data partition to a Grid node, and merging the results. The solution proposed in this paper harnesses the untapped processing power of desktop PCs within an enterprise Grid network to run computationally intensive jobs for scientific applications, in particular in the bioinformatics domain. This paper describes BioGAT, the Bioinformatics Grid Alignment Toolkit, which offers optimized brokering and a data management system to exploit various bioinformatics alignment tools wrapped as Web Services in a Grid architecture.
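The split, dispatch, and merge pattern described in the abstract can be illustrated with a minimal Python sketch. It is not BioGAT's implementation: every function name here is hypothetical, local threads stand in for Grid nodes, and a toy ungapped percent-identity score stands in for a real alignment tool. Only the input dataset is split; the database is assumed to be non-empty and is passed whole to each job.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) records from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition(records, n_parts):
    """Split records round-robin into at most n_parts non-empty partitions."""
    parts = [records[i::n_parts] for i in range(n_parts)]
    return [p for p in parts if p]


def percent_identity(a, b):
    """Toy similarity (assumption): matching positions over the longer length, no gaps."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def align_partition(queries, database):
    """Stand-in for one job on a Grid node: best database hit for each query."""
    hits = []
    for q_id, q_seq in queries:
        best_id, best_seq = max(
            database, key=lambda rec: percent_identity(q_seq, rec[1])
        )
        hits.append((q_id, best_id, percent_identity(q_seq, best_seq)))
    return hits


def run(queries, database, n_parts):
    """Split the queries, process each partition concurrently, and merge the hits."""
    parts = partition(queries, n_parts)
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        results = pool.map(align_partition, parts, [database] * len(parts))
        merged = [hit for part in results for hit in part]
    return sorted(merged)  # sort by query id so the merged order is deterministic
```

Splitting the database instead of the queries would follow the same shape, except that the merge step would need to keep the best hit per query across partitions rather than simply concatenating the partial results.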