SNPNB 1.01
Introduction
SNPNB is a user-friendly application for analyzing Single Nucleotide Polymorphism NeighBoring compositions. The current version is SNPNB 1.01. It has two main functions: analysis of SNP neighboring-nucleotide patterns and evaluation of the effective SNP size for the observed patterns. The software was implemented in Java and Perl.
System Requirements
Operating system: Windows or Linux. It should also work on Unix.
Memory: 512 MB minimum (1 GB is recommended for large data sets).
Hard disk: 2 times the size of the source data.
Running environment: JRE 1.4 or higher (JRE 1.4 is recommended), and Perl 5.6 or higher (Perl 5.8 is recommended).
(Installation instructions for JRE 1.4.2 and Perl 5.6/5.8 are available here)
Download
Program: Windows: snpnb.zip; Linux or Unix: snpnb.tgz (updated 2005-02-02)
Test data:
Human Chr22 dbSNP build 121 (58 MB);
Human dbSNP build 121 (3.8 GB);
Mouse dbSNP build 121 (69 MB);
Sample fasta format data with SNP ancestor information (168 KB);
Fasta format used in dbSNP (6 KB); Celera (4 KB)
Instructions
- Download snpnb.zip (for Windows) or snpnb.tgz (for Linux or Unix)
- Uncompress the downloaded file
Windows: use WinZip to uncompress.
Linux or Unix: type "tar -xzvf snpnb.tgz".
- Start SNPNB program
Open a console window. Go to the directory "snpnb" and find the file "snpnb.pl". Type "perl snpnb.pl" to start SNPNB.
The program's main menu window will pop up.
- Set parameters in the main menu
- "Analysis menu"
There are two options. The user may select one by clicking the checkbox.
- Option 1: "Neighboring Effect Analysis"
Choose to analyze "Unknown Mutation Direction" or "Known Mutation Direction" SNP data. The program sets the default values of "Sites or Ranges" to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 250. The user may change these values, but the numbers must be in increasing order.
- Option 2: "Sample Size Evaluation"
Two parameters need to be set.
1) "Sample Size": the program sets the default value to 10,000. The user may change it from the drop-down menu or type a specific number.
2) "Sites": the program sets the default values to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100. The user may change these values, but the numbers must be in increasing order.
- "Data Files"
Click the "Browse Files" button to select the source files in the dbSNP or Celera fasta format. Hold the "Ctrl" or "Shift" key to choose multiple files at one time.
- "Reference Values (%)"
The user may enter the reference values manually, select them from the drop-down menu, or choose reference sequence files from which to calculate them. Currently, SNPNB provides the reference values for the mouse genome (NCBI build 32) and the human genome (NCBI build 34).
- Run and log
Click the "Run" button in the main menu window after the parameters are set. Because analyzing genome-wide data usually takes hours, the "Running logs" field displays the current status. When the analysis finishes, the program presents the result window automatically. Both the main menu window and the result window contain a "Running logs" field.
- Display the results
The results can be displayed by clicking on the buttons. There are two parts, "SNP" and "Region", in the left panel of the result window. If "Known Mutation Direction" is selected in the main menu, the user may choose to see the results for "directed" or "undirected" SNPs. In the main panel, the user may choose the "SNP Neighboring Analysis" or "Sample Size Evaluation" button.
- "SNP Neighboring Analysis":
There are three options to display results.
(1) Results of "Count" in table.
(2) Results of "Frequency" in table.
(3) Results of "Bias" in table or graphics.
- "Sample Size Evaluation":
The results are shown in table or graphics.
- Close the program
We tested the performance of SNPNB using the mouse and human dbSNP data (build 121) on a Dell PC (CPU 1.3 GHz, Memory 2 GB, Windows XP, JDK 1.4 and Perl 5.8.3) and a Dell Workstation (CPU 2×3.0 GHz, Memory 4 GB, Red Hat Linux WS, JDK 1.4 and Perl 5.8.0).
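| System | Process | Step | Test data: Mouse (469 445 SNPs) | Test data: Human (8 043 656 SNPs) |
| --- | --- | --- | --- | --- |
| PC (Windows) | Data preprocess | | 0 h 2 m 3 s | 1 h 35 m 4 s |
| PC (Windows) | Neighboring effect analysis | | 0 h 24 m 32 s | 10 h 36 m 11 s |
| PC (Windows) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Random number generation | 4 h 17 m 32 s | 4 h 38 m 17 s |
| PC (Windows) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Random number sort | 0 h 19 m 15 s | 4 h 48 m 45 s |
| PC (Windows) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Sample size evaluation | 0 h 59 m 39 s | 3 h 43 m 18 s |
| PC (Windows) | Total elapsed time | | 6 h 3 m 1 s | 25 h 21 m 35 s |
| Workstation (Linux) | Data preprocess | | 0 h 2 m 51 s | 2 h 50 m 25 s |
| Workstation (Linux) | Neighboring effect analysis | | 0 h 29 m 18 s | 11 h 17 m 50 s |
| Workstation (Linux) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Random number generation | 6 h 9 m 37 s | 7 h 14 m 15 s |
| Workstation (Linux) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Random number sort | 0 h 13 m 9 s | 3 h 4 m 40 s |
| Workstation (Linux) | Sample size evaluation (Sample size: 50 000; Repeat: 1000) | Sample size evaluation | 1 h 6 m 9 s | 3 h 19 m 16 s |
| Workstation (Linux) | Total elapsed time | | 8 h 1 m 4 s | 27 h 46 m 26 s |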
We add some screenshots here to show the major functions of SNPNB. The screenshots were obtained using the human SNP data (build 121) downloaded from the dbSNP database.
Discussion
- Intermediate file
SNPNB preprocesses the source data file(s) into a single intermediate file. This simplifies later computation.
- Interface in Java
Java Swing was used to implement the user interface and graphics.
- Random number generation
To evaluate how many SNPs are sufficient to represent the bias patterns of the whole dataset (e.g. genome data), random numbers are generated to draw samples. For example, the program may generate 1,000 samples, each consisting of 50,000 non-duplicated SNPs, from the total of 8,043,656 human SNPs. We first implemented the re-sampling algorithm in Perl and found it extremely slow when the sample size was greater than 20,000 for each subset. A Java program that uses a hash set was then used instead. In addition, we applied a stratified sampling strategy. This improved the performance significantly; for example, it reduced the running time for generating 1,000 samples of size 50,000 from 30 hours to 4 hours.
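The idea of drawing non-duplicated SNP indices with a hash set, optionally by strata, can be sketched as follows. This is only an illustration and not the SNPNB source code: the class name DistinctSampler, its method names, and the choice of equal-width contiguous index blocks as strata are all assumptions, since the text does not describe how the strata were defined.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of SNPNB.
final class DistinctSampler {

    /** Draws sampleSize distinct indices from [0, total); the HashSet rejects duplicates. */
    static Set<Integer> sample(int total, int sampleSize, Random rng) {
        if (sampleSize < 0 || sampleSize > total) {
            throw new IllegalArgumentException("sampleSize must be in [0, total]");
        }
        Set<Integer> chosen = new HashSet<>();
        while (chosen.size() < sampleSize) {
            chosen.add(rng.nextInt(total)); // add() is a no-op for a repeated index
        }
        return chosen;
    }

    /**
     * Assumed stratification: split [0, total) into contiguous blocks and draw
     * a proportional share of the sample from each block.
     */
    static Set<Integer> stratifiedSample(int total, int sampleSize, int strata, Random rng) {
        if (strata <= 0 || sampleSize < 0 || sampleSize > total) {
            throw new IllegalArgumentException("invalid arguments");
        }
        Set<Integer> chosen = new HashSet<>();
        for (int s = 0; s < strata; s++) {
            // long arithmetic avoids int overflow for genome-sized totals
            int lo = (int) ((long) total * s / strata);
            int hi = (int) ((long) total * (s + 1) / strata);
            int quota = (int) ((long) sampleSize * (s + 1) / strata)
                      - (int) ((long) sampleSize * s / strata);
            if (quota > hi - lo) {
                throw new IllegalArgumentException("stratum " + s + " is too small for its quota");
            }
            int target = chosen.size() + quota;
            while (chosen.size() < target) {
                chosen.add(lo + rng.nextInt(hi - lo)); // strata are disjoint, so one set suffices
            }
        }
        return chosen;
    }
}
```

For instance, DistinctSampler.stratifiedSample(8043656, 50000, 100, new Random()) would return 50,000 distinct indices, 500 from each of 100 blocks; the number of strata here is an arbitrary choice for illustration.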
- Object-oriented design
It is not efficient to compute 1,000 samples by scanning the whole dataset 1,000 times. We designed and implemented an object-oriented program in Perl, in which each sample is one object. The program traverses the whole dataset only once and therefore improved the performance approximately 1,000-fold.
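The single-traversal design can be illustrated with a small sketch. SNPNB implements this part in Perl; the Java version below is only an analogy, and the names SinglePassSamples, Sample, observe, and scanOnce, as well as the use of one neighboring nucleotide per SNP as the record, are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not the SNPNB implementation.
final class SinglePassSamples {

    /** One resampled subset: knows its member SNP indices and keeps its own tallies. */
    static final class Sample {
        final Set<Integer> members;
        final Map<Character, Integer> counts = new HashMap<>();

        Sample(Set<Integer> members) {
            this.members = members;
        }

        /** Updates the tally only if this sample contains the given SNP. */
        void observe(int snpIndex, char neighbor) {
            if (members.contains(snpIndex)) {
                counts.merge(neighbor, 1, Integer::sum);
            }
        }
    }

    /** Traverses the data once; every sample object sees every record. */
    static void scanOnce(char[] neighbors, List<Sample> samples) {
        for (int i = 0; i < neighbors.length; i++) {
            for (Sample s : samples) {
                s.observe(i, neighbors[i]);
            }
        }
    }
}
```

The data is read a single time regardless of the number of samples, which is the property the text relies on when the dataset is a large file on disk.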
- Random number sort
We sorted the random numbers by SNP index to take advantage of the object-oriented implementation above. Memory usage is very high when the dataset is large (e.g. human dbSNP). We therefore modified the sorting algorithm to scan the random numbers multiple times, one sub-range at a time. This significantly reduced memory usage while keeping the computational time acceptable. For example, it needed 200 MB and ran within 5 hours for 1,000 samples of size 50,000 from the human dbSNP data. With this modification, the statistical evaluation of the human dataset can be performed on a standard PC.
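A minimal sketch of sorting by repeated scans over sub-ranges is given below. It is an assumption about the general technique, not the SNPNB code: the class SubRangeSort, its parameters, and the equal-width split of the value range are hypothetical. The source is passed as a supplier so that it can be re-read on every pass, as a file on disk would be.

```java
import java.util.function.IntConsumer;
import java.util.function.Supplier;
import java.util.stream.IntStream;

// Hypothetical illustration, not the SNPNB implementation.
final class SubRangeSort {

    /**
     * Emits the values of the source in ascending order using several passes.
     * Each pass keeps only the values in one sub-range of [0, upperBound) in
     * memory, sorts them, and hands them to the sink. Values outside
     * [0, upperBound) are ignored.
     */
    static void sortInSubRanges(Supplier<IntStream> source, int upperBound,
                                int passes, IntConsumer sink) {
        if (passes <= 0 || upperBound < 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        for (int p = 0; p < passes; p++) {
            // long arithmetic avoids int overflow for large bounds
            final int lo = (int) ((long) upperBound * p / passes);
            final int hi = (int) ((long) upperBound * (p + 1) / passes);
            int[] bucket = source.get()
                    .filter(v -> v >= lo && v < hi) // only this sub-range is held in memory
                    .sorted()
                    .toArray();
            for (int v : bucket) {
                sink.accept(v);
            }
        }
    }
}
```

With more passes, each bucket is smaller, so peak memory falls while the number of scans grows; this is the trade-off between memory usage and computational time described above.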
- Operating systems
SNPNB was tested on Windows XP and Red Hat Enterprise Linux. It should run on any Windows version and on Unix-based operating systems, provided that JRE and Perl are installed.
- Exceptions thrown by Java
Some exceptions may be thrown by Java and displayed in the console window. These exceptions have no impact on the performance or accuracy of SNPNB 1.01.
Citation
Zhang F, Zhao Z. (2005) SNPNB: analyzing neighboring-nucleotide biases on single nucleotide polymorphisms (SNPs). Bioinformatics 21:2517-2519.
Contact
Zhongming Zhao <zzhao@vcu.edu>