mutation fields

The mutation endpoint contains data from TCGA about genes and mutations.

A . between words in a column name denotes nesting: the term after the . is a field nested under the term before it. The nesting structure is easier to browse in the mutation JSON schema.
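
As a minimal sketch of the dotted-name convention, assuming a record has already been loaded into nested Python dictionaries, a dotted column name can be resolved one level at a time. The helper `get_nested` and the field names in the sample record are hypothetical and are not taken from the mutation JSON schema.

```python
from typing import Any


def get_nested(record: dict, dotted_name: str, default: Any = None) -> Any:
    """Follow a dotted column name through nested dictionaries."""
    current: Any = record
    for key in dotted_name.split("."):
        # Stop as soon as a level is missing or is not a dictionary.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record; these field names are illustrative only.
record = {"gene": {"symbol": "TP53"}}
print(get_nested(record, "gene.symbol"))
print(get_nested(record, "gene.missing", default="n/a"))
```

Returning a default instead of raising keeps the lookup usable on records where a nested field is absent.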

column_name	description	data_type
AA_MAF	Non-reference allele and frequency of existing variant in NHLBI-ESP African American population	STRING
AFR_MAF	Non-reference allele and frequency of existing variant in 1000 Genomes combined African population	FLOAT
ALLELE_NUM	Allele number from input; 0 is reference, 1 is first alternate etc.	STRING
AMR_MAF	Non-reference allele and frequency of existing variant in 1000 Genomes combined American population	FLOAT
Allele	The variant allele used to calculate the consequence	STRING
Amino_acids	Amino acid substitution caused by the mutation. Only given if the variation affects the protein-coding sequence	STRING
BIOTYPE	Biotype of transcript	STRING
CANONICAL	A flag (YES) indicating that the VEP-based canonical transcript, the longest translation, was used for this gene. If not, the value is null	STRING
CCDS	The CCDS identifier for this transcript, where applicable	STRING
CDS_position	Relative position of base pair in coding sequence. A - symbol is displayed as the numerator if the variant does not appear in coding sequence	STRING
CLIN_SIG	Clinical significance of variant from dbSNP	STRING
CONTEXT	The reference allele per VCF specs, and its five flanking base pairs	STRING
COSMIC	Overlapping COSMIC variants	STRING
Center	One or more genome sequencing centers reporting the variant	STRING
Chromosome	Chromosome, possible values: chr1-22, and chrX	STRING
Codons	The alternative codons with the variant base in upper case	STRING
Consequence	Consequence type of this variant; sequence ontology terms	STRING
DISTANCE	Shortest distance from the variant to transcript	INTEGER
DOMAINS	The source and identifier of any overlapping protein domains	STRING
EAS_MAF	Non-reference allele and frequency of existing variant in 1000 Genomes combined East Asian population	FLOAT
EA_MAF	Non-reference allele and frequency of existing variant in NHLBI-ESP European American population	STRING
ENSP	The Ensembl protein identifier of the affected transcript	STRING
EUR_MAF	Non-reference allele and frequency of existing variant in 1000 Genomes combined European population	FLOAT
EXON	The exon number (out of total number)	STRING
End_Position	Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate	INTEGER
Entrez_Gene_Id	Entrez gene ID (an integer). 0 is used for regions that do not correspond to a gene region or Ensembl ID	INTEGER
ExAC_AF	Global Allele Frequency from ExAC	STRING
ExAC_AF_AFR	African/African American Allele Frequency from ExAC	STRING
ExAC_AF_AMR	American Allele Frequency from ExAC	STRING
ExAC_AF_Adj	Adjusted Global Allele Frequency from ExAC	STRING
ExAC_AF_EAS	East Asian Allele Frequency from ExAC	STRING
ExAC_AF_FIN	Finnish Allele Frequency from ExAC	STRING
ExAC_AF_NFE	Non-Finnish European Allele Frequency from ExAC	STRING
ExAC_AF_OTH	Other Allele Frequency from ExAC	STRING
ExAC_AF_SAS	South Asian Allele Frequency from ExAC	STRING
Existing_variation	Known identifier of existing variation	STRING
Exon_Number	The exon number (out of total number)	STRING
FILTER	Copied from input VCF. This includes filters implemented directly by the variant caller and other external software used in the DNA-Seq pipeline. See below for additional details.	STRING
Feature	Stable Ensembl ID of feature (transcript, regulatory, motif)	STRING
Feature_type	Type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature (or blank)	STRING
GDC_FILTER	GDC filters applied universally across all MAFs	STRING
GDC_Validation_Status	GDC implementation of validation checks. See notes section (#5) below for details	STRING
GMAF	Non-reference allele and frequency of existing variant in 1000 Genomes	FLOAT
Gene	The gene symbol. In this table, gene symbol is gene name e.g. ACADVL	STRING
HGNC_ID	Gene identifier from the HUGO Gene Nomenclature Committee if applicable	STRING
HGVS_OFFSET	Indicates by how many bases the HGVS notations for this variant have been shifted	INTEGER
HGVSc	The coding sequence of the variant in HGVS recommended format	STRING
HGVSp	The protein sequence of the variant in HGVS recommended format. p.= signifies no change in the protein	STRING
HGVSp_Short	Same as the HGVSp column, but using 1-letter amino-acid codes	STRING
Hugo_Symbol	HUGO symbol for the gene (HUGO symbols are always in all caps). Unknown is used for regions that do not correspond to a gene	STRING
IMPACT	The impact modifier for the consequence type	STRING
INTRON	The intron number (out of total number)	STRING
MC3_Overlap	Indicates whether this region overlaps with an MC3 variant for the same sample pair	STRING
MINIMISED	Alleles in this variant have been converted to minimal representation before consequence calculation (1 or null)	STRING
Matched_Norm_Sample_UUID	Unique GDC identifier for normal aliquot (10189 unique)	STRING
Mutation_Status	An assessment of the mutation as somatic, germline, LOH, post transcriptional modification, unknown, or none. The values allowed in this field are constrained by the value in the Validation_Status field	STRING
NCBI_Build	The reference genome used for the alignment (GRCh38)	STRING
One_Consequence	The single consequence of the canonical transcript in sequence ontology terms, eg missense_variant	STRING
PHENO	Indicates if existing variant is associated with a phenotype, disease or trait (0, 1, or null)	STRING
PICK	Indicates if this block of consequence data was picked by VEP's pick feature (1 or null)	STRING
PUBMED	Pubmed ID(s) of publications that cite existing variant	STRING
PolyPhen	The PolyPhen prediction and/or score	STRING
Protein_position	Relative position of affected amino acid in protein. A - symbol is displayed as the numerator if the variant does not appear in coding sequence	STRING
RefSeq	RefSeq identifier for this transcript	STRING
Reference_Allele	The plus strand reference allele at this position. Includes the deleted sequence for a deletion or - for an insertion	STRING
SAS_MAF	Non-reference allele and frequency of existing variant in 1000 Genomes combined South Asian population	FLOAT
SIFT	The SIFT prediction and/or score, with both given as prediction (score)	STRING
SOMATIC	Somatic status of each ID reported under Existing_variation (0, 1, or null)	STRING
SWISSPROT	UniProtKB/Swiss-Prot accession	STRING
SYMBOL	E.g. TP53, LRP1B, etc. (same as the Hugo_Symbol field, except blank instead of Unknown)	STRING
SYMBOL_SOURCE	The source of the gene symbol, usually HGNC, rarely blank, other sources include Uniprot_gn, EntrezGene, etc	STRING
Sequencer	Instrument used to produce primary sequence data	STRING
Start_Position	Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate	INTEGER
Strand	Either + or - to denote whether read mapped to the sense (+) or anti-sense (-) strand	STRING
TRANSCRIPT_STRAND	The DNA strand (1 or -1) on which the transcript/feature lies	INTEGER
TREMBL	UniProtKB/TrEMBL identifier of protein product	STRING
TSL	Transcript support level, which is based on independent RNA analyses	INTEGER
Transcript_ID	Ensembl ID of the transcript affected by the variant	STRING
Tumor_Sample_UUID	Unique GDC identifier for tumor aliquot (10189 unique)	STRING
Tumor_Seq_Allele1	Primary data genotype for tumor sequencing (discovery) allele 1. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases	STRING
Tumor_Seq_Allele2	Primary data genotype for tumor sequencing (discovery) allele 2. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases	STRING
Tumor_Validation_Allele1	Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases	STRING
Tumor_Validation_Allele2	Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2	STRING
UNIPARC	UniParc identifier of protein product	STRING
VARIANT_CLASS	Sequence Ontology variant class	STRING
Validation_Method	The assay platforms used for the validation call	STRING
Variant_Classification	Translational effect of variant allele	STRING
Variant_Type	Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP (di-nucleotide polymorphism) but for three consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of four or more (SNP, DNP, TNP, ONP, INS, DEL, or Consolidated)	STRING
aliquot_barcode_normal	TCGA aliquot barcode for the normal control, eg TCGA-12-1089-01A-01D-0517-01	STRING
aliquot_barcode_tumor	TCGA aliquot barcode for the tumor, eg TCGA-12-1089-01A-01D-0517-01	STRING
all_effects	A semicolon delimited list of all possible variant effects, sorted by priority ([Symbol,Consequence,HGVSp_Short,Transcript_ID,RefSeq,HGVSc,Impact,Canonical,Sift,PolyPhen,Strand])	STRING
cDNA_position	Relative position of base pair in the cDNA sequence as a fraction. A - symbol is displayed as the numerator if the variant does not appear in cDNA	STRING
callerName	|-delimited list of mutation caller(s) that agreed on this particular call, always in alphabetical order: muse, mutect, somaticsniper, varscan	STRING
case_barcode	Original TCGA case barcode, eg TCGA-DX-A8BN	STRING
case_id	Unique GDC identifier for the underlying case	STRING
dbSNP_RS	The rs-IDs from the dbSNP database, novel if not found in any database used, or null if there is no dbSNP record, but it is found in other databases	STRING
dbSNP_Val_Status	The dbSNP validation status is reported as a semicolon-separated list of statuses. The union of all rs-IDs is taken when there are multiple	STRING
fileName	|-delimited list of names of the underlying MAF files	STRING
fileUUID	|-delimited list of unique GDC identifiers for the underlying MAF files	STRING
n_depth	Read depth across this locus in normal BAM	STRING
normal_bam_uuid	Unique GDC identifier for the underlying normal bam file	STRING
project_short_name	Project name abbreviation: the program name followed by a project abbreviation, e.g. TCGA-OV	STRING
sample_barcode_normal	TCGA sample barcode for the normal control, eg TCGA-12-1089-01A. One sample may have multiple sets of CN segmentations corresponding to multiple aliquots; use GROUP BY appropriately in queries	STRING
sample_barcode_tumor	TCGA sample barcode for the tumor, eg TCGA-12-1089-01A. One sample may have multiple sets of CN segmentations corresponding to multiple aliquots; use GROUP BY appropriately in queries	STRING
src_vcf_id	|-delimited list of GDC VCF file identifiers	STRING
t_alt_count	Read depth supporting the variant allele in tumor BAM	STRING
t_depth	Read depth across this locus in tumor BAM	STRING
t_ref_count	Read depth supporting the reference allele in tumor BAM	STRING
tumor_bam_uuid	Unique GDC identifier for the underlying tumor BAM file	STRING
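
Several STRING columns in the table pack more than one value into a single string: the position columns (cDNA_position, CDS_position, Protein_position) are fractions with - as a possible numerator, SIFT and PolyPhen give a prediction and/or a score, and all_effects is a semicolon-delimited list. The sketch below shows one way such values might be unpacked. The function names are hypothetical, the sample values are made up for illustration, and the comma separator inside each all_effects entry is an assumption based on the bracketed field list in the description.

```python
import re
from typing import Dict, List, Optional, Tuple

# Field order as listed in the all_effects description.
EFFECT_FIELDS = [
    "Symbol", "Consequence", "HGVSp_Short", "Transcript_ID", "RefSeq",
    "HGVSc", "Impact", "Canonical", "Sift", "PolyPhen", "Strand",
]

# Optional prediction text followed by an optional score in parentheses.
_PREDICTION = re.compile(r"^(?P<pred>[^()]*?)\s*(?:\((?P<score>[^()]+)\))?$")


def parse_position(value: Optional[str]) -> Tuple[Optional[str], Optional[int]]:
    """Split 'position/total'; a '-' numerator becomes None."""
    if not value:
        return None, None
    numerator, _, denominator = value.partition("/")
    position = None if numerator in ("", "-") else numerator
    total = int(denominator) if denominator.isdigit() else None
    return position, total


def parse_prediction(value: Optional[str]) -> Tuple[Optional[str], Optional[float]]:
    """Split 'prediction(score)'; either part may be absent."""
    if not value:
        return None, None
    text = value.strip()
    try:
        return None, float(text)  # score given without a prediction
    except ValueError:
        pass
    match = _PREDICTION.match(text)
    if match is None:
        return text, None
    prediction = match.group("pred") or None
    raw_score = match.group("score")
    try:
        score = float(raw_score) if raw_score is not None else None
    except ValueError:
        score = None
    return prediction, score


def parse_all_effects(value: Optional[str]) -> List[Dict[str, str]]:
    """Turn the semicolon-delimited list into one dict per effect."""
    effects = []
    for chunk in (value or "").split(";"):
        if not chunk.strip():
            continue
        # Assumed comma separator; zip drops any fields beyond EFFECT_FIELDS.
        effects.append(dict(zip(EFFECT_FIELDS, chunk.split(","))))
    return effects


# Made-up sample values, for illustration only.
print(parse_position("-/1182"))
print(parse_prediction("deleterious(0.02)"))
print(parse_all_effects("GENE1,missense_variant,p.A1B,TX1,,c.1A>G,MODERATE,YES,,,1;"))
```

The position numerator is kept as a string rather than converted to an integer, since the table only guarantees that it is either a position or -.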

Last update: 2022-09-28