FATHMM-XF: Enhanced Accuracy in Predicting the Functional Consequences of Non-Coding and Coding Single Nucleotide Variants (SNVs)

Enter your mutations:

Mutations entered by hand must use a comma-separated (chromosome,position,reference,mutant) format, with all positions relative to the GRCh37/hg19 (ENSEMBL release 87) version of the human genome. Mutations uploaded from a file should use the VCF format with a minimum of five columns (chromosome, position, id, reference, mutant). Note: if a VCF file is uploaded, any entries in the User Input field will be ignored.

Publication:

If you use the data on this website, please cite the following publication:

Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. FATHMM-XF: enhanced accuracy in the prediction of pathogenic sequence variants via an extended feature set, Bioinformatics, September 2017.

Input Format:


Our web form accepts comma-separated mutation data in the following format:

  1. Chromosome
  2. Position
  3. Reference Base
  4. Mutant Base

Note that FATHMM-XF predictions are based on the GRCh37/hg19 genome build.


For example:

1,20915172,C,T
2,48025976,G,T
4,80977297,T,A
5,1293898,G,A
6,51713769,C,T
9,79852917,G,C
11,1094690,C,T
11,14992735,C,G

Note: 'Chr' should be omitted when specifying the chromosome above (e.g. '1', not 'Chr1'). All predictions are derived using the forward strand.
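The format rules above can be checked before submission. The following is a minimal sketch, not part of FATHMM-XF: the function name `parse_mutation` is hypothetical, and the checks that each base is a single A/C/G/T character and that the reference differs from the mutant are assumptions drawn from the fact that the tool scores SNVs.

```python
VALID_BASES = {"A", "C", "G", "T"}  # assumption: SNVs use single unambiguous bases


def parse_mutation(line):
    """Parse one 'chromosome,position,reference,mutant' line into a tuple."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError("expected 4 comma-separated fields: %r" % line)
    chrom, pos_text, ref, alt = fields
    # The page asks for '1', not 'Chr1'.
    if not chrom or chrom.lower().startswith("chr"):
        raise ValueError("chromosome must be given without 'Chr': %r" % chrom)
    pos = int(pos_text)  # raises ValueError if the position is not an integer
    if pos < 1:
        raise ValueError("position must be a positive integer: %r" % pos_text)
    ref, alt = ref.upper(), alt.upper()
    if ref not in VALID_BASES or alt not in VALID_BASES:
        raise ValueError("bases must be one of A, C, G, T: %r" % line)
    if ref == alt:
        # Assumption: a variant whose mutant base equals the reference is an input error.
        raise ValueError("reference and mutant bases are identical: %r" % line)
    return chrom, pos, ref, alt
```

For instance, `parse_mutation("1,20915172,C,T")` yields a four-element tuple, while `parse_mutation("Chr1,20915172,C,T")` raises a `ValueError`.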

VCF files

The software also accepts Variant Call Format (VCF) files with up to 100,000 queries. This is a tab-delimited format that must have, at a minimum, these first five columns:

  1. Chromosome
  2. Position
  3. Identifier
  4. Reference Base
  5. Mutant Base

For example:

1	20915172	.	C	T
2	48025976	.	G	T
4	80977297	.	T	A
5	1293898	.	G	A
6	51713769	.	C	T
9	79852917	.	G	C
11	1094690	.	C	T
11	14992735	.	C	G

The VCF format specification requires eight columns, but here only the chromosome, position, reference and mutant bases are used and reported.
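As an illustration of the five-column layout described above, the sketch below turns (chromosome, position, reference, mutant) tuples into tab-delimited lines suitable for upload. The function name `to_vcf_lines` is hypothetical. The 100,000-query limit is the one stated on this page, and the identifier is filled with "." as in the example above.

```python
MAX_QUERIES = 100000  # upload limit stated on this page


def to_vcf_lines(records, identifier="."):
    """Format (chrom, pos, ref, alt) tuples as minimal five-column VCF lines."""
    records = list(records)
    if len(records) > MAX_QUERIES:
        raise ValueError("too many queries: %d > %d" % (len(records), MAX_QUERIES))
    lines = []
    for chrom, pos, ref, alt in records:
        # Column order: chromosome, position, identifier, reference, mutant.
        lines.append("\t".join([str(chrom), str(pos), identifier, ref, alt]))
    return lines
```

Writing `"\n".join(to_vcf_lines(records))` to a file gives content in the same shape as the example above.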




Prediction Interpretation:


Predictions are given as p-values in the range [0, 1]: values above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. Values close to the extremes (0 or 1) are the highest-confidence predictions and yield the highest accuracy.
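The thresholding rule above can be sketched in a few lines. This is an illustration only: the function name `interpret_score` is hypothetical, the "undetermined" label for a value of exactly 0.5 is an assumption (the text does not say how that case is treated), and the `distance` value is a made-up convenience measure of how far a prediction lies from the threshold, not a quantity reported by FATHMM-XF.

```python
def interpret_score(score, threshold=0.5):
    """Return (label, distance) for a FATHMM-XF prediction in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]: %r" % score)
    if score > threshold:
        label = "deleterious"
    elif score < threshold:
        label = "neutral or benign"
    else:
        # Assumption: the source does not define the outcome at exactly 0.5.
        label = "undetermined"
    # Hypothetical measure: 0 at the threshold, 1 at either extreme (0 or 1).
    distance = abs(score - threshold) * 2
    return label, distance
```

Predictions with a distance near 1 correspond to the high-confidence extremes mentioned above.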

We use distinct predictors for positions in coding regions (positions within coding-sequence exons) and for positions in non-coding regions (intergenic regions, introns or non-coding genes). The coding predictor is based on six groups of features representing sequence conservation, nucleotide sequence characteristics, genomic features (codons, splice sites, etc.), amino acid features and expression levels in different tissues. The non-coding predictor uses five feature groups that cover nearly the same kinds of data, the primary exception being evidence for open chromatin.

Publications that use these data should cite the following publication:

Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. FATHMM-XF: enhanced accuracy in the prediction of pathogenic sequence variants via an extended feature set, Bioinformatics, September 2017.

Because this work is related to FATHMM and FATHMM-MKL, publications that use these data may also wish to cite the following:

Shihab HA, Rogers MF, Gough J, Mort M, Cooper DN, Day INM, Gaunt TR, Campbell C. An Integrative Approach to Predicting the Functional Consequences of Non-coding and Coding Sequence Variation. Bioinformatics 2015 May 15;31(10):1536-43.



Download:

If you wish to run FATHMM-XF locally, you will need to download the coding and non-coding databases, along with a Python script that accepts queries in the same VCF format as the website. The only system requirements are Python (v2.7.6 or later) and tabix.

Data

Coding region database (630MB): fathmm_xf_coding.vcf.gz
Coding region tabix index (630KB): fathmm_xf_coding.vcf.gz.tbi
Noncoding region database (36GB): fathmm_xf_noncoding.vcf.gz
Noncoding region tabix index (3MB): fathmm_xf_noncoding.vcf.gz.tbi
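Because the databases ship with tabix indexes, a single position can in principle be looked up directly with the `tabix` command-line tool. The sketch below is an assumption-laden illustration rather than the method used by the query script: the helper names `tabix_command` and `lookup` are hypothetical, and the column layout of the database records is not documented here, so the raw tab-separated fields are returned without interpretation. It is written for Python 3 and assumes `tabix` is on the PATH and that the `.tbi` file sits next to the database.

```python
import subprocess


def tabix_command(database, chrom, pos):
    """Build a tabix call for one 1-based position (region 'chrom:pos-pos')."""
    region = "%s:%d-%d" % (chrom, pos, pos)
    return ["tabix", database, region]


def lookup(database, chrom, pos):
    """Run tabix and return each matching record as a list of raw fields."""
    output = subprocess.check_output(tabix_command(database, chrom, pos))
    # Assumption: records are tab-delimited; their meaning is not decoded here.
    return [line.split("\t") for line in output.decode().splitlines()]
```

For example, `tabix_command("fathmm_xf_coding.vcf.gz", "1", 69094)` builds the argument list for querying position 69094 on chromosome 1.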

Script

The query script is fathmm_xf_query.py. It is a rudimentary tool that looks for the database files in the current directory and returns tabular output similar to the results presented on the website.


Usage: fathmm_xf_query.py query-file [options]

Predict the pathogenic potential of single nucleotide variants (SNVs).  The query
file must be a list of queries in VCF format.  Note: the id column and columns
beyond the first five are ignored.

chromosome <tab> position <tab> id <tab> reference <tab> mutant <tab> ...

Example:

1   69094   .   G   A
11  168961  .   T   A
18  119888  .   G   A

Options:
  -h, --help  show this help message and exit
  -c CDB      CScape coding database [default: fathmm_xf_coding.vcf.gz]
  -n NDB      CScape noncoding database [default: fathmm_xf_noncoding.vcf.gz]
  -o OUTPUT   Output file [default: stdout]
  -v          Verbose mode [default: False]
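The usage message above maps onto a command line in a straightforward way. The following sketch assembles that command from Python. The helper name `query_command` and the choice of `python` as the interpreter are assumptions, while the option letters come from the usage message itself.

```python
def query_command(query_file, coding_db=None, noncoding_db=None,
                  output=None, verbose=False, interpreter="python"):
    """Build an argument list for fathmm_xf_query.py from the documented options."""
    command = [interpreter, "fathmm_xf_query.py", query_file]
    if coding_db:
        command += ["-c", coding_db]      # overrides fathmm_xf_coding.vcf.gz
    if noncoding_db:
        command += ["-n", noncoding_db]   # overrides fathmm_xf_noncoding.vcf.gz
    if output:
        command += ["-o", output]         # default is stdout
    if verbose:
        command.append("-v")
    return command
```

The resulting list can be passed to `subprocess.check_call` from the directory that holds the script and the database files.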
