PCAtag Software for Selecting Tagging-SNPs using Principal Component Analysis

Download

Linux  - PCAtag package, fastPHASE and example files.  Linux users must download and install R before using the software.

PC  - PCAtag package, fastPHASE, R and example files. 

Decompress (unzip) the downloaded file before running the software.

 

Getting Started

Requirements

Execute PCAtag

Examples

 

Welcome to PCAtag Home Page!

To comprehensively test the role of candidate genes in association studies, the selection of informative SNPs is paramount.
Specifically, it is important to select tagging-SNPs (tSNPs) that represent a large portion (>90%) of the genetic variation of a gene.
Here we describe a new software tool, PCAtag, that performs tSNP selection using principal component analysis (PCA)
as described in Horne and Camp (2004). The advantage of PCA for tSNP selection is that LD groups do not need to be contiguous and can overlap. This flexible framework does not impose over-simplified assumptions on the genetic architecture, and likely fits reality much better.

Algorithms

  • A Bayesian method for reconstructing haplotypes is used by interfacing with the software fastPHASE (Stephens et al. 2006).
  • Principal Component Analysis (PCA) using a varimax rotation is performed by interfacing with the FactoMineR add-on package available in 'R'.
  • The procedure for determining LD groups and selecting tSNPs extends the two-step PCA method outlined in Horne and Camp (2004) to a multi-step PCA.
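
As a rough illustration of the last step, the sketch below groups SNPs by their loadings on rotated factors and picks one tSNP per group. It is not PCAtag's code: the loading values, the 0.4 cutoff and the function names ld_groups and pick_tsnps are all assumptions made for this example. It does show how one SNP can belong to more than one LD group.

```python
# Hypothetical rotated loadings: each SNP maps to its loadings on two factors.
# The values and the 0.4 cutoff are illustrative assumptions, not PCAtag defaults.
loadings = {
    "snp1": [0.91, 0.05],
    "snp2": [0.85, 0.45],
    "snp3": [0.10, 0.88],
    "snp4": [0.02, 0.79],
}

def ld_groups(loadings, cutoff=0.4):
    """Assign each SNP to every factor on which its absolute loading meets the cutoff."""
    n_factors = len(next(iter(loadings.values())))
    groups = {f: [] for f in range(n_factors)}
    for snp, row in loadings.items():
        for f, value in enumerate(row):
            if abs(value) >= cutoff:
                groups[f].append(snp)
    return groups

def pick_tsnps(loadings, groups):
    """For each non-empty group, pick the SNP with the largest absolute loading."""
    return {
        f: max(members, key=lambda s: abs(loadings[s][f]))
        for f, members in groups.items()
        if members
    }

groups = ld_groups(loadings)
tsnps = pick_tsnps(loadings, groups)
print(groups)
print(tsnps)
```

In this toy input, snp2 passes the cutoff on both factors, so the two groups overlap.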

Novel Features

   Genotype Data:

  • One issue in performing the PCA based on haplotype data is that haplotypes are not directly observed and must be estimated.
  • PCAtag has an option to perform the PCA based on genotype data directly.
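
One way to picture a PCA on genotype data is to code each genotype as an allele count (0, 1 or 2) and compute SNP-by-SNP correlations, which a PCA could then decompose. The sketch below assumes that coding; the source does not state how PCAtag codes genotypes internally. The genotype values and the helper names allele_counts and pearson are made up for illustration.

```python
import math

# Hypothetical genotypes for three SNPs in five individuals (illustrative only).
genotypes = {
    "snp1": ["AA", "AG", "GG", "AG", "AA"],
    "snp2": ["CC", "CT", "TT", "CT", "CC"],
    "snp3": ["GG", "GG", "GT", "TT", "GT"],
}

def allele_counts(calls, counted_allele):
    """Code each genotype as the number of copies (0, 1 or 2) of one allele."""
    return [call.count(counted_allele) for call in calls]

def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The allele counted for each SNP is chosen arbitrarily for this example.
coded = {
    "snp1": allele_counts(genotypes["snp1"], "G"),
    "snp2": allele_counts(genotypes["snp2"], "T"),
    "snp3": allele_counts(genotypes["snp3"], "T"),
}
names = list(coded)
# Correlation matrix between SNPs; no haplotype phasing is needed to build it.
corr = [[pearson(coded[a], coded[b]) for b in names] for a in names]
for name, row in zip(names, corr):
    print(name, [round(r, 3) for r in row])
```

The input here is unphased genotypes, so no haplotype estimation step is needed before computing the correlations.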

   Phenotype Data:

  • Allele frequencies, haplotype frequencies and LD structure may differ between cases and controls.
  • If phenotype data (or any dichotomous subset criterion) is entered, tagging will be performed in the cases and controls separately, as well as together.
  • Knowledge of such differences at the tSNP selection stage will allow for more powerful subsequent association analyses.
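
The stratified analysis described above amounts to running the same selection on three sample sets. The sketch below shows only that bookkeeping. The sample records, the 1 = case / 0 = control coding and the placeholder select_tsnps function are assumptions for illustration and are not part of PCAtag's interface.

```python
# Hypothetical sample records: (sample id, phenotype), where 1 = case and 0 = control.
samples = [("s1", 1), ("s2", 0), ("s3", 1), ("s4", 0), ("s5", 0)]

def stratify(samples):
    """Return the three sample sets on which tagging would be run."""
    cases = [sid for sid, status in samples if status == 1]
    controls = [sid for sid, status in samples if status == 0]
    everyone = [sid for sid, _ in samples]
    return {"cases": cases, "controls": controls, "all": everyone}

def select_tsnps(sample_ids):
    """Placeholder for a tSNP selection routine (hypothetical, not PCAtag's API)."""
    return f"tagging run on {len(sample_ids)} samples"

subsets = stratify(samples)
results = {name: select_tsnps(ids) for name, ids in subsets.items()}
for name, summary in results.items():
    print(name, "->", summary)
```

Comparing the tSNP sets from the three runs is what would reveal differences in LD structure between cases and controls.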

 

fastPHASE   implements a Bayesian statistical method for reconstructing haplotypes from population genotype data.

R   is a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0

 

Last update June 1, 2009