Bioinformatics Lab @ VIPBG

SNPNB 1.01

(Previous version: SNPNB 1.0)

Introduction

SNPNB is a user-friendly application for analyzing Single Nucleotide Polymorphism (SNP) NeighBoring compositions. The current version is SNPNB 1.01. It has two main functions: analysis of SNP neighboring-nucleotide patterns, and evaluation of the effective SNP size, that is, how many SNPs are sufficient to represent the observed patterns. The software is implemented in Java and Perl.

Design

    1. Data preprocessing
      All source data files are preprocessed and saved in a single intermediate file. Currently, SNPNB can automatically process data in the dbSNP FASTA format and the Celera FASTA format.
    2. Set reference nucleotide frequency values (%)
      Reference nucleotide frequency values (e.g. genome average) can be manually entered or computed from sequence data (fasta format).
    3. Analyze SNP neighboring effect
      • Choose to analyze SNPs with unknown or known SNP mutation direction.
      • Set SNP neighboring nucleotide positions or ranges.
      • Count each nucleotide at each position for each type of SNP.
      • Calculate the corresponding frequencies.
      • Obtain the proportion biases relative to the reference values.
    4. Evaluate the effective SNP size
      • Generate random numbers by a re-sampling strategy based on the size of the whole data (N).
      • Reorganize and sort the random numbers using each SNP as an object index.
      • Process each randomized SNP subset and evaluate the likelihood of bias patterns.
    The software design is illustrated in the flowchart. The annotation of the flanking sites of a SNP is shown here.
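
The counting, frequency and bias steps of the neighboring-effect analysis can be sketched as follows. This is a minimal illustration in Python, not SNPNB's own Java/Perl code. The record layout `(snp_type, sequence, snp_index)`, the function names, the uniform 25% reference values, and the definition of bias as "observed percentage minus reference percentage" are all assumptions made for the example; the exact bias formula SNPNB uses is not stated here.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def neighbor_counts(records, offsets):
    """Count nucleotides at each offset from the SNP site, per SNP type.

    records: iterable of (snp_type, sequence, snp_index) tuples (hypothetical layout).
    offsets: positions relative to the SNP; negative = 5' side, positive = 3' side.
    """
    counts = {}
    for snp_type, seq, snp_index in records:
        for off in offsets:
            pos = snp_index + off
            # Skip positions outside the flanking sequence or ambiguous bases.
            if 0 <= pos < len(seq) and seq[pos] in NUCLEOTIDES:
                counts.setdefault((snp_type, off), Counter())[seq[pos]] += 1
    return counts

def frequencies(counter):
    """Convert nucleotide counts to percentages."""
    total = sum(counter[n] for n in NUCLEOTIDES)
    if total == 0:
        return {n: 0.0 for n in NUCLEOTIDES}
    return {n: 100.0 * counter[n] / total for n in NUCLEOTIDES}

def bias(freqs, reference):
    """Assumed definition: observed percentage minus reference percentage."""
    return {n: freqs[n] - reference[n] for n in NUCLEOTIDES}

# Toy data: the SNP site is marked "N" at index 4 of each flanking sequence.
records = [
    ("A/G", "ACGTNACGT", 4),
    ("A/G", "TTGCNGCAA", 4),
    ("C/T", "GGGANTCCC", 4),
]
reference = {n: 25.0 for n in NUCLEOTIDES}  # placeholder, not a real genome average

counts = neighbor_counts(records, offsets=[-1, 1])
freqs_ag_minus1 = frequencies(counts[("A/G", -1)])
bias_ag_minus1 = bias(freqs_ag_minus1, reference)
```

For real data the reference values would be the genome averages entered or computed in step 2, and the offsets would be the sites or ranges chosen in the main menu.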

System Requirements

Operating system: Windows or Linux. It should also work on Unix.
Memory: 512 MB minimum (1 GB is recommended for large data sets).
Hard disk: twice the size of the source data.
Running environment: JRE 1.4 or higher (JRE 1.4 is recommended), and Perl 5.6 or higher (Perl 5.8 is recommended).
(Installation instructions for JRE 1.4.2 and Perl 5.6/5.8 are available here)

Download

Program: Windows: snpnb.zip; Linux or Unix: snpnb.tgz (updated 2005-02-02)

Test data: human Chr22 dbSNP build 121 (58 MB); human dbSNP build 121 (3.8 GB); mouse dbSNP build 121 (69 MB); sample FASTA format data with SNP ancestor information (168 KB)

Fasta format used in dbSNP (6KB); Celera (4KB)

Instructions

  1. Download snpnb.zip (for Windows) or snpnb.tgz (for Linux or Unix)
  2. Uncompress the downloaded file
    Windows: use WinZip to uncompress.
    Linux or Unix: type "tar -xzvf snpnb.tgz".
  3. Start the SNPNB program
    Open a console window. Go to the directory "snpnb" and find the file "snpnb.pl". Type "perl snpnb.pl" to start SNPNB. The main menu window of the program will pop up.
  4. Set parameters in the main menu
    1. "Analysis menu"
      There are two options. The user may select one by clicking the checkbox.
      1. Option 1: "Neighboring Effect Analysis"
        Choose to analyze "Unknown Mutation Direction" or "Known Mutation Direction" SNP data. The program sets the default values of "Sites or Ranges" to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 250. The user may change these values, but the numbers must be in increasing order.
      2. Option 2: "Sample Size Evaluation"
        Two parameters need to be set.
        1) "Sample Size": the default value is 10,000. The user may change it from the drop-down menu or type a specific number.
        2) "Sites": the default values are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100. The user may change these values, but the numbers must be in increasing order.
    2. "Data Files"
      Click the "Browse Files" button to select source files in the dbSNP or Celera FASTA format. Hold the "Ctrl" or "Shift" key to select multiple files at once.
    3. "Reference Values (%)"
      The user may manually enter the reference values, select them from the drop-down menu, or choose reference sequence files from which to calculate them. Currently, SNPNB provides reference values for the mouse genome (NCBI build 32) and the human genome (NCBI build 34).
  5. Run and log
    Click the "Run" button in the main menu window after the parameters are set. Because analyzing genome-wide data usually takes hours, the "Running logs" field displays the current status. When the run finishes, the program opens the result window automatically. Both the main menu window and the result window contain a "Running logs" field.
  6. Display the results
    The results can be displayed by clicking the buttons. The left panel of the result window has two parts, "SNP" and "Region". If "Known Mutation Direction" is selected in the main menu, the user may choose to see the results for "directed" or "undirected" SNPs. In the main panel, the user may choose the button "SNP Neighboring Analysis" or "Sample Size Evaluation".
    1. "SNP Neighboring Analysis":
      There are three options to display results.
      (1) Results of "Count" in table.
      (2) Results of "Frequency" in table.
      (3) Results of "Bias" in table or graphics.
    2. "Sample Size Evaluation":
      The results are shown in table or graphics.
  7. Close the program
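
The requirement that "Sites or Ranges" and "Sites" values be in increasing order can be checked before a long run is started. The sketch below is a hypothetical helper in Python; the function name `parse_sites` is an assumption, and so is the reading of "increasing" as strictly increasing. It does not describe how SNPNB itself validates its input.

```python
def parse_sites(text):
    """Parse a comma-separated list of sites and check that it is strictly increasing."""
    values = [int(token) for token in text.split(",") if token.strip()]
    for previous, current in zip(values, values[1:]):
        if current <= previous:
            raise ValueError(
                f"sites must be in increasing order: {current} follows {previous}"
            )
    return values

# Default "Sites or Ranges" values from the Neighboring Effect Analysis option.
default_sites = parse_sites("0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 250")

# An out-of-order list is rejected.
try:
    parse_sites("1, 3, 2")
    rejected = False
except ValueError:
    rejected = True
```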

Performance

We tested the performance of SNPNB using the mouse and human dbSNP data (build 121) on a Dell PC (1.3 GHz CPU, 2 GB memory, Windows XP, JDK 1.4 and Perl5.83) and a Dell workstation (2×3.0 GHz CPU, 4 GB memory, Red Hat Linux WS, JDK 1.4 and Perl5.80).

Sample size evaluation was run with sample size 50 000 and 1000 repeats.

System                 Process                          Test data
                                                        Mouse              Human
                                                        (469 445 SNPs)     (8 043 656 SNPs)
---------------------  -------------------------------  -----------------  -----------------
PC (Windows)           Data preprocess                  0 h 2 m 3 s        1 h 35 m 4 s
                       Neighboring effect analysis      0 h 24 m 32 s      10 h 36 m 11 s
                       Sample size evaluation:
                         Random number generation       4 h 17 m 32 s      4 h 38 m 17 s
                         Random number sort             0 h 19 m 15 s      4 h 48 m 45 s
                         Sample size evaluation         0 h 59 m 39 s      3 h 43 m 18 s
                       Total elapsed time               6 h 3 m 1 s        25 h 21 m 35 s

Workstation (Linux)    Data preprocess                  0 h 2 m 51 s       2 h 50 m 25 s
                       Neighboring effect analysis      0 h 29 m 18 s      11 h 17 m 50 s
                       Sample size evaluation:
                         Random number generation       6 h 9 m 37 s       7 h 14 m 15 s
                         Random number sort             0 h 13 m 9 s       3 h 4 m 40 s
                         Sample size evaluation         1 h 6 m 9 s        3 h 19 m 16 s
                       Total elapsed time               8 h 1 m 4 s        27 h 46 m 26 s
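
The "Total elapsed time" entries are the sums of the five per-step times for each system and data set. The short Python sketch below re-adds the reported step times to confirm this; the helper names (`to_seconds`, `to_text`) and the dictionary layout are assumptions made for the example, while the durations are copied from the performance figures above.

```python
import re

DURATION = re.compile(r"(\d+) h (\d+) m (\d+) s")

def to_seconds(text):
    """Convert a string such as '1 h 35 m 4 s' to seconds."""
    match = DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def to_text(total_seconds):
    """Convert seconds back to the 'H h M m S s' form."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours} h {minutes} m {seconds} s"

# Step order: preprocess, neighboring effect analysis, random number
# generation, random number sort, sample size evaluation.
STEP_TIMES = {
    ("PC", "mouse"): ["0 h 2 m 3 s", "0 h 24 m 32 s", "4 h 17 m 32 s",
                      "0 h 19 m 15 s", "0 h 59 m 39 s"],
    ("PC", "human"): ["1 h 35 m 4 s", "10 h 36 m 11 s", "4 h 38 m 17 s",
                      "4 h 48 m 45 s", "3 h 43 m 18 s"],
    ("Workstation", "mouse"): ["0 h 2 m 51 s", "0 h 29 m 18 s", "6 h 9 m 37 s",
                               "0 h 13 m 9 s", "1 h 6 m 9 s"],
    ("Workstation", "human"): ["2 h 50 m 25 s", "11 h 17 m 50 s", "7 h 14 m 15 s",
                               "3 h 4 m 40 s", "3 h 19 m 16 s"],
}

totals = {
    key: to_text(sum(to_seconds(step) for step in steps))
    for key, steps in STEP_TIMES.items()
}
```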

Screenshots

We add some screenshots here to show the major functions of SNPNB. The screenshots were obtained using the human SNP data (build 121) downloaded from the dbSNP database:

Discussion

  1. Intermediate file
    SNPNB preprocesses the source data file(s) into a single intermediate file. This simplifies later computation.
  2. Interface in Java
    Java Swing was used to implement the user interface and graphics.
  3. Random number generation
    To evaluate how many SNPs are sufficient to represent the bias patterns of the whole data set (e.g. genome data), SNPNB generates random numbers to draw samples. For example, the program may generate 1,000 samples, each of 50,000 non-duplicated SNPs, from the total of 8,043,656 human SNPs. We first implemented the re-sampling algorithm in Perl and found it extremely slow when the sample size exceeded 20,000 per subset. We then switched to a Java program that uses a hash set and, in addition, applied a stratified sampling strategy. This improved performance significantly: for example, it reduced the running time for generating 1,000 samples of size 50,000 from 30 hours to 4 hours.
  4. Object-oriented design
    Computing 1,000 samples by scanning the whole data set 1,000 times is inefficient. We designed and implemented an object-oriented program in Perl in which each sample is one object. The program traverses the whole data set only once, which improved performance approximately 1,000-fold.
  5. Random number sort
    We sorted the random numbers by SNP index to take advantage of the object-oriented implementation above. Memory usage is very high when the data set is large (e.g. human dbSNP). We therefore modified the sorting algorithm to scan the random numbers multiple times in sub-ranges. This significantly reduced memory usage while keeping the computation time acceptable. For example, it needed 200 MB and ran within 5 hours for 1,000 samples of size 50,000 from the human dbSNP data. With this modification, the statistical evaluation of the human data set can be performed on a standard PC.
  6. Operating systems
    SNPNB was tested on Windows XP and Red Hat Enterprise Linux. It should run on any Windows version and on Unix-based operating systems, provided JRE and Perl are installed.
  7. Exceptions thrown by Java
    Some exceptions may be thrown by Java and displayed in the console window. These exceptions have no impact on the performance or accuracy of SNPNB 1.01.
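
The re-sampling ideas discussed in items 3 to 5 (non-duplicated samples drawn with a hash set, random numbers indexed by SNP, and a single traversal of the data) can be illustrated with the following Python sketch. It is an assumption-laden illustration rather than the SNPNB implementation: the function names, the per-SNP `snp_values` list and the summed statistic are hypothetical, and the stratified sampling strategy and the sub-range sorting are not reproduced because their details are not given here.

```python
import random
from collections import defaultdict

def draw_samples(n_total, sample_size, n_samples, seed=0):
    """Draw n_samples sets of sample_size distinct SNP indices from range(n_total)."""
    if sample_size > n_total:
        raise ValueError("sample_size cannot exceed n_total")
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        chosen = set()  # the set rejects duplicate indices automatically
        while len(chosen) < sample_size:
            chosen.add(rng.randrange(n_total))
        samples.append(chosen)
    return samples

def index_by_snp(samples):
    """Invert the samples into {snp_index: [ids of samples containing it]}."""
    by_snp = defaultdict(list)
    for sample_id, chosen in enumerate(samples):
        for snp_index in chosen:
            by_snp[snp_index].append(sample_id)
    return by_snp

def single_pass(snp_values, by_snp, n_samples):
    """Traverse the data once, updating one accumulator per sample."""
    totals = [0] * n_samples
    for snp_index, value in enumerate(snp_values):
        for sample_id in by_snp.get(snp_index, ()):
            totals[sample_id] += value
    return totals

# Toy run: 20 samples of 50 SNPs each from 1,000 SNPs.
snp_values = [i % 4 for i in range(1000)]  # hypothetical per-SNP statistic
samples = draw_samples(n_total=1000, sample_size=50, n_samples=20)
totals = single_pass(snp_values, index_by_snp(samples), n_samples=20)

# Reference result: scan the data once per sample (the slow approach).
naive_totals = [sum(snp_values[i] for i in chosen) for chosen in samples]
```

The single-pass result matches the per-sample scan while reading each SNP record only once, which is the property that matters when the records live in a large intermediate file rather than in memory.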

Citation

Zhang F, Zhao Z. (2005) SNPNB: analyzing neighboring-nucleotide biases on single nucleotide polymorphisms (SNPs). Bioinformatics 21:2517-2519.

Contact

Zhongming Zhao <zzhao@vcu.edu>


 
Copyright © 2004, Bioinformatics Lab @ VIPBG, VCU. All Rights Reserved.