Tuesday, June 4, 2019

Ensuring All Stages Pipelining and Accuracy in PASQUAL

Nachiket D. More

Abstract

Genome is the term applied to the genetic material of an organism. It is encoded in the deoxyribonucleic acid (DNA) of organisms, or in the ribonucleic acid (RNA) of various kinds of viruses, and it contains both coding and non-coding parts of the DNA/RNA. Nowadays genomes are constructed for almost all animals, viruses, and bacteria. This information is mostly used in medical research, as well as to predict diseases such as cancer, HIV, and many more.

A genome is reconstructed from reads. Sequencing machines output short overlapping substrings called reads, and these reads are so numerous that they are hard to manipulate, store, and maintain. Sequence assembly reconstructs the genome from these reads as long, continuous sequences. Assembly software for Next Generation Sequencing (NGS) must be very accurate and fast, and must have low memory consumption.

PASQUAL is a tool for faster NGS genome assembly. To address the challenges of NGS assembly, PASQUAL uses parallel algorithms and compressed data structures. It delivers better execution speed, lower memory consumption, and better solution quality.

Keywords: Parallel algorithm, parallel suffix array construction, high performance bioinformatics, de novo sequence assembly, shared memory parallelism, DNA sequence, genome assembly.

Introduction

The term genome refers to the instruction set of a cell, that is, its genetic material. A genome consists of one or more individual chromosomes. Chromosomes consist of deoxyribonucleic acid (DNA); for many viruses the genome consists of ribonucleic acid (RNA) instead. DNA is made of simple units called nucleotides (nt), of which there are four types: A, C, G, and T.
In a sequence, the start and end are denoted 5' and 3', respectively.

Deducing the order of nucleotides in a cell and encoding it as a string of letters is called DNA sequencing. This process cannot read the whole sequence continuously, so it breaks the DNA molecules into small parts, which are used as templates in a chemical reaction to produce short sub-sequences called reads. The major problem is to reconstruct the original genome sequence from the reads. Genome assembly algorithms are used for this purpose. A genome assembly goes through many automated rounds of improvement, but it is also inspected and edited by specialists. The long contiguous sequences assembled from reads are called contigs.

Genome sequencing is the process of reading a sequence of base pairs (bp). An organism's genome consists of base pairs, each formed by complementary bases on the two strands. This is a main part of the study of genomes in bioinformatics. No current sequencing machine can read a whole genome in one pass, which is why Whole Genome Shotgun (WGS) sequencing is used. De novo assembly reconstructs the original sequence without the aid of a reference sequence, and this is the approach PASQUAL takes.

A large number of reads has to be generated in a small amount of time, and Next Generation Sequencing (NGS) technologies are used for this purpose. They greatly reduce the experimental cost per base. They help to study organisms at the genome level and to gain a deeper understanding of biological mechanisms and genome regulation. Rapid genome sequencing also helps researchers study the evolution of viruses and bacteria, because bacteria and viruses adapt their behavior easily and mutate readily at every step of replication.

Next Generation Sequencing (NGS)

Decoding DNA sequences is essential in all branches of biological research.
For this purpose, scientists have used capillary electrophoresis (CE) based Sanger sequencing, which lets them elucidate genetic information from any biological system. It has therefore been adopted by many research laboratories. However, it has limitations in throughput, scalability, speed, and resolution that hinder scientists' research.

To overcome these problems, a new technology was introduced, Next-Generation Sequencing (NGS), which has boosted research in bioinformatics and genomic science. NGS is responsible for a major advance in how information is retrieved about biological systems and about the genome and epigenome of species. This has led to important breakthroughs in fields such as human disease and agricultural research.

The principle behind NGS is similar to that of CE. In CE, the bases of a small fragment of DNA are identified sequentially as the fragment is re-synthesized from a DNA template. NGS performs the same work in parallel, across a population of millions of reactions rather than a single or a few DNA fragments. As a result, NGS produces hundreds of gigabases of data in a single sequencing run.

NGS operates as follows. A single genomic DNA (gDNA) sample is first fragmented into a number of small segments, known as a library of segments. These segments are uniformly and accurately sequenced in millions of parallel reactions. The resulting strings of bases are called reads. The reads are then reassembled by one of two techniques: the first aligns them to a known reference genome used as a scaffold (resequencing), and the second works without any reference genome (de novo sequencing). The output is a set of aligned reads that represents the entire sequence of each chromosome in the gDNA.

Fig. Conceptual Overview of Whole-Genome Sequencing
Extracted gDNA.
gDNA is fragmented into a library of small segments that are each sequenced in parallel.
Individual sequence reads are reassembled by aligning to a reference genome.
The whole-genome sequence is derived from the consensus of aligned reads.

NGS output has increased at a rate that outpaces Moore's law. When the technology was introduced in 2007, a single run could produce up to one gigabase (Gb) of data. By 2011 this had reached a terabase (Tb) of data in a single sequencing run, almost a 1000-fold increase in four years. Because of this ability, researchers can move from an idea to extensive data sets in a few hours or days. Sequencing the human genome with CE technology took around 10 years, whereas NGS can generate five human genomes in a single run. This reduces the cost of genome projects.

NGS also lets researchers tune the resolution of genome experiments. It is possible to produce more or less data, zooming in on particular regions of the genome at high resolution or taking a more expansive view at lower resolution. To do this, researchers tune the coverage generated in an experiment. This ability gives a number of experimental design advantages.

Because of these advantages, NGS has permeated many areas of study. Using NGS, researchers have developed a broad range of applications that have transformed study designs and revealed information never before imaginable.

PASQUAL

PASQUAL can process large data sets during assembly efficiently in terms of memory consumption and running time. PASQUAL stands for PArallel SeQUence AssembLer. It uses OpenMP for shared memory parallelism because of its good trade-off between programmer productivity and performance. PASQUAL uses the OLC approach and delivers high-quality solutions through a combination of tailored algorithms.

PASQUAL can handle billions of bases. It uses de novo assembly because this does not need any reference to produce the original sequence.
PASQUAL's algorithm builds a suffix array over the biological sequences in parallel, which is key to its parallel performance and memory optimization. An index stage and string graph construction are used to find overlaps. PASQUAL produces significantly fewer misassemblies of the genome sequence than other assemblers.

PASQUAL can handle billions of bases in less time because it uses pipelined stages and compressed data. Among other assemblers, SOAPdenovo is the only tool with comparable speed, and k-mer based assembly is restricted to k-mer lengths below 128. PASQUAL also produces fewer errors than the other tools.

4. Literature Survey

4.1 De Novo Genome Sequence Assembly

Between 2008 and 2012 many sequencing techniques were developed, and as a result there was a major drop in cost, down to as little as 1/100000th of the price. The de novo algorithm is inherited from the SOAPdenovo2 framework. De novo sequencing involves a novel genome and requires a dedicated assembly of the sequencing reads. It requires a suitable combination of read length and read depth, as well as a flexible paired-end insert size. High-quality raw reads enable confident and efficient production of long contig assemblies. De novo sequence assembly is preferred for the study of non-model organisms because it is a cheaper and easier way to construct a genome.

Reference-based assembly maps reads onto a reference genome, and because of this it cannot account for structural alterations of mRNA transcripts. De novo assembly provides a means to discover new and unknown sequences in biological research, which is important for understanding biodiversity. Because the ability to read a whole sequence at once is limited, de novo methods are irreplaceable.

4.2 Overlap/Layout/Consensus (OLC) Approach

The Overlap Layout Consensus (OLC) method is used in de novo assembly. It has three steps: overlap, layout, and consensus.
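As a rough illustration of the overlap step named above, the following Python sketch finds the longest suffix-prefix overlap between every ordered pair of reads, trying the second read both as given and as its reverse complement. It is a naive all-pairs comparison meant for tiny inputs, not PASQUAL's suffix-array method; the function names, the `min_len` threshold, and the toy reads are assumptions made purely for illustration.

```python
from itertools import permutations

# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return read.translate(COMPLEMENT)[::-1]


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def find_overlaps(reads, min_len: int = 3):
    """Naive all-pairs overlap detection (quadratic in the number of reads).

    Yields (i, j, orientation, length), where a suffix of reads[i] matches a
    prefix of reads[j] taken forward ("+") or reverse-complemented ("-").
    """
    for i, j in permutations(range(len(reads)), 2):
        candidates = (("+", reads[j]), ("-", reverse_complement(reads[j])))
        for orientation, candidate in candidates:
            length = overlap_length(reads[i], candidate, min_len)
            if length:
                yield (i, j, orientation, length)


if __name__ == "__main__":
    reads = ["ATGCGT", "CGTTAC", "GTAACG"]  # hypothetical toy reads
    for edge in find_overlaps(reads):
        print(edge)
```

Each yielded tuple can be read as one edge of an overlap graph. The quadratic loop is exactly the cost that makes this step slow on large read sets, which is why real assemblers rely on an index instead of comparing every pair directly.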
In the overlap stage a graph is constructed that forms the basis of the assembly. In the layout stage this graph is compressed. In the consensus stage the genome sequence is determined from the graph data generated in the previous two steps.

Overlap: In the overlap stage, every read is compared with every other read, in both the forward and the reverse complement orientation. This is a very time-consuming procedure, especially for large sets of reads.

Layout: Finding a path in the OLC graph is not an easy task, because the graph has millions of nodes and edges and it is very tedious to find a path that visits each node exactly once. In this stage the OLC assembly graph is simplified, and segments of the assembly graph are compressed into contigs.

Consensus: This is the final stage of the OLC approach. At this step the assembly graph is reduced to large scaffolds, ideally a single scaffold. Starting from the leftmost read of each scaffold, the OLC algorithm computes the consensus of all the reads composing that scaffold. Gaps in the genome may still be present if the consensus step had insufficient mate-pair or repeat contig information. If an assembly has gaps, the result is a fragmented genome composed of multiple scaffolds, because the gaps between the scaffolds could not be joined.

4.3 Shotgun Sequencing

The Sanger DNA sequencing technique can read only a limited distance from the sequencing primer, roughly 30 to 350 nt; this is the read length. Because of chain termination, very few products extend into long chains. At best this approach can sequence perhaps 500 bases a day, which is infeasible for the human genome with its billions of bases. Another approach is to first divide the DNA into smaller fragments that are sequenced individually. These fragments are then reassembled into the original sequence based on their overlaps. This strategy is known as shotgun sequencing, or shotgun cloning.

In shotgun sequencing, the DNA is randomly sheared into small pieces (usually about 1 kb) and subcloned into a universal cloning vector.
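The random shearing step can be mimicked in a few lines of Python. This is only a toy simulation under stated assumptions: the function name `shear_randomly`, the fragment length, the fragment count, and the seeded random generator are hypothetical choices for illustration, and real library preparation is a laboratory process rather than a string operation.

```python
import random


def shear_randomly(dna: str, fragment_len: int, n_fragments: int, seed: int = 0):
    """Sample `n_fragments` substrings of length `fragment_len` at random
    start positions, imitating random shearing of a DNA molecule.

    Returns a list of (start, fragment) pairs.
    """
    if fragment_len > len(dna):
        raise ValueError("fragment_len must not exceed the sequence length")
    rng = random.Random(seed)  # seeded so the toy run is repeatable
    last_start = len(dna) - fragment_len
    starts = [rng.randint(0, last_start) for _ in range(n_fragments)]
    return [(s, dna[s:s + fragment_len]) for s in starts]


if __name__ == "__main__":
    rng = random.Random(1)
    # Made-up 5000 nt sequence standing in for the molecule to be sheared.
    molecule = "".join(rng.choice("ACGT") for _ in range(5000))
    for start, fragment in shear_randomly(molecule, fragment_len=1000, n_fragments=5):
        print(start, fragment[:20] + "...")
```

Because start positions are drawn independently, fragments overlap by chance, and it is these overlaps that later make reassembly possible.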
The library of subfragments is sampled at random, and sequence reads are generated. These reads are assembled into contigs, and from this procedure the complete sequence of the clone is generated. The shotgun technique can identify gaps (regions where no sequence is available) and single-stranded regions (where there is sequence for only one strand). These are targeted for additional sequencing to produce a fully sequenced molecule.

5. Full Stage Pipelining and Accuracy in PASQUAL

5.1 Motivation for this topic

With the explosive growth of genome research and of genome sequencing data, there is huge demand for tools and systems that enable researchers to work more efficiently and effectively. NGS technology produces shorter reads than previous sequencing technologies but delivers higher coverage. Coverage is the ratio of the total length of the reads to the genome length. Typically NGS generates from millions to a few billion reads, depending on genome size and coverage. As the technologies improve, data sets grow larger, and assembly becomes more demanding in time and memory consumption.

5.2 Selected area

NGS mainly covers DNA and RNA sequencing. I studied research papers on genome sequencing techniques. These techniques change rapidly and become more and more advanced over time. Nowadays genome sequencing is used not only in research but also in the treatment of many diseases.

I chose full-stage pipelining and improved accuracy in PASQUAL because many bioinformatics research topics today use genome sequencing, and it is also used in biodiversity research. I have studied many papers in which NGS is suggested for genome sequencing. I apply full-stage pipelining and improved accuracy to PASQUAL-based NGS genome sequencing.

6. Problem statement

The purpose of this research work is to achieve full-stage pipelining and improved accuracy in PASQUAL genome sequencing.

7. Proposed Solution

This system is completely new and uses different techniques to make genome sequencing efficient. Currently PASQUAL does not offer pipelining across all stages. Scaffolding and support for paired-end reads rely on third-party tools. Error correction has to be improved, the assembly process accelerated, and memory consumption reduced.

8. Work done till Today

Study of the different features of PASQUAL.
Code for different sequence assembler techniques.
Study of different sequencing and assembly algorithms.

9. Objectives

Applying full-stage pipelining in all stages of PASQUAL.
Improving error correction.
Accelerating the assembly process.
Reducing memory consumption.

10. References

X. Liu, P.R. Pande, H. Meyerhenke, and D.A. Bader, PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly.
B.H. Bloom, Space/Time Trade-Offs in Hash Coding with Allowable Errors, Comm. ACM, vol. 13, pp. 422-426, 1970.
D. Bryant, W. Wong, and T. Mockler, QSRA: A Quality-Value Guided de Novo Short Read Assembler, BMC Bioinformatics, vol. 10, no. 1, p. 69, 2009.
J. Butler, I. MacCallum, M. Kleber, I.A. Shlyakhter, M.K. Belmonte, E.S. Lander, C. Nusbaum, and D.B. Jaffe, ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads, Genome Research, vol. 18, no. 5, pp. 810-820, 2008.
H. Dinh and S. Rajasekaran, A Memory-Efficient Data Structure Representing Exact-Match Overlap Graphs with Application for Next-Generation DNA Assembly, Bioinformatics, vol. 27, pp. 1901-1907, 2011.
J. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, SHARCGS, A Fast and Highly Accurate Short-Read Assembly Algorithm for de Novo Genomic Sequencing, Genome Research, vol. 17, no. 11, pp. 1697-1706, 2007.
U. Manber and G. Myers, Suffix Arrays: A New Method for On-Line String Searches, Proc. First Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 319-327, 1990.
www.wikipedia.com
