Available Search Terms

Before we do any work, we need to import several functions from cdapython:

Q and query, which power the search
columns, which lets us view entity field names
unique_terms, which lets us view entity field contents

We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt settings preconfigure how itables displays our tables, with scrolling and paging enabled. Finally, we're telling cdapython to report its version so we can be sure we're using the one we mean to:

In [1]:

            
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())

2022.12.21

You can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
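As a rough illustration of the spreadsheet analogy, here is a sketch in plain Python. It does not use cdapython: the three rows are invented for the example and the helper name `select_and_filter` is hypothetical. Only the column names are borrowed from real CDA fields.

```python
# A toy "spreadsheet": each dict is one row, each key is a column.
# The values are made up for illustration and are not CDA data.
rows = [
    {"subject_id": "S1", "sex": "female", "vital_status": "Alive"},
    {"subject_id": "S2", "sex": "male", "vital_status": "Dead"},
    {"subject_id": "S3", "sex": "female", "vital_status": "Dead"},
]


def select_and_filter(table, wanted_columns, keep_row):
    """Keep rows where keep_row(row) is true, then keep only wanted_columns.

    Rows are filtered before columns are dropped, so the filter may use
    a column that is not part of the final selection.
    """
    return [
        {name: row[name] for name in wanted_columns}
        for row in table
        if keep_row(row)
    ]


result = select_and_filter(
    rows,
    wanted_columns=["subject_id", "vital_status"],
    keep_row=lambda row: row["sex"] == "female",
)
print(result)
```

The CDA search follows the same two ideas at a much larger scale: field names play the role of the columns, and the values you search for decide which rows come back.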

columns

Accordingly, to see what search fields are available, we use the command columns:

In [2]:

            
columns()

Out[2]:

Number of Fields 174

This output tells us that there are 174 searchable fields, but it doesn't list them. Running a CDA command like this first gives you an overall summary of the data you're going to get, which makes it a handy gut check. To see the field details on screen, we can have columns() return its contents as a list instead:

In [3]:

            
columns().to_list()

Out[3]:

[{'fieldName': 'ALLELE_NUM',
  'endpoint': 'mutation',
  'description': 'Allele number from input; 0 is reference, 1 is first alternate etc.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'fileName',
  'endpoint': 'mutation',
  'description': '|-delimited list of name of underlying MAF file',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'PICK',
  'endpoint': 'mutation',
  'description': "Indicates if this block of consequence data was picked by VEP's   pick feature (1 or null)",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'subject_associated_project',
  'endpoint': 'subject',
  'description': 'The list of Projects associated with the Subject.',
  'type': 'STRING',
  'mode': 'REPEATED'},
 {'fieldName': 'days_to_treatment_start',
  'endpoint': 'treatment',
  'description': 'The timepoint at which the treatment started.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'UNIPARC',
  'endpoint': 'mutation',
  'description': 'UniParc identifier of protein product',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'n_depth',
  'endpoint': 'mutation',
  'description': 'Read depth across this locus in normal BAM',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 't_alt_count',
  'endpoint': 'mutation',
  'description': 'Read depth supporting the variant allele in tumor BAM',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Feature',
  'endpoint': 'mutation',
  'description': 'Stable Ensembl ID of feature (transcript, regulatory, motif)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'CONTEXT',
  'endpoint': 'mutation',
  'description': 'The reference allele per VCF specs, and its five flanking base pairs',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'CLIN_SIG',
  'endpoint': 'mutation',
  'description': 'Clinical significance of variant from dbSNP',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Transcript_ID',
  'endpoint': 'mutation',
  'description': 'Ensembl ID of the transcript affected by the variant',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'age_at_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The age in days of the individual at the time of diagnosis.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'Gene',
  'endpoint': 'mutation',
  'description': 'The gene symbol. In this table, gene symbol is gene name e.g. ACADVL',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'HGNC_ID',
  'endpoint': 'mutation',
  'description': 'Gene identifier from the HUGO Gene Nomenclature Committee if applicable',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_AMR',
  'endpoint': 'mutation',
  'description': 'American Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 't_depth',
  'endpoint': 'mutation',
  'description': 'Read depth across this locus in tumor BAM',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'HGVSp_Short',
  'endpoint': 'mutation',
  'description': 'Same as the HGVSp column, but using 1-letter amino-acid codes',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Reference_Allele',
  'endpoint': 'mutation',
  'description': 'The plus strand reference allele at this position. Includes the deleted sequence for a deletion or - for an insertion',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Strand',
  'endpoint': 'mutation',
  'description': 'Either + or - to denote whether read mapped to the sense (+) or anti-sense (-) strand',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'imaging_modality',
  'endpoint': 'file',
  'description': 'An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'fileUUID',
  'endpoint': 'mutation',
  'description': '|-delimited list of unique GDC identifiers for underlying MAF file',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'DISTANCE',
  'endpoint': 'mutation',
  'description': 'Shortest distance from the variant to transcript',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'SYMBOL_SOURCE',
  'endpoint': 'mutation',
  'description': 'The source of the gene symbol, usually HGNC, rarely blank, other sources include Uniprot_gn, EntrezGene, etc',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'morphology',
  'endpoint': 'diagnosis',
  'description': 'Code that represents the histology of the disease using the third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'MC3_Overlap',
  'endpoint': 'mutation',
  'description': 'Indicates whether this region overlaps with an MC3 variant for the same sample pair',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'drs_uri',
  'endpoint': 'file',
  'description': 'A string of characters used to identify a resource on the Data Repo Service(DRS).',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'diagnosis_identifier_system',
  'endpoint': 'diagnosis',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_identifier_value',
  'endpoint': 'treatment',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Start_Position',
  'endpoint': 'mutation',
  'description': 'Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'GDC_Validation_Status',
  'endpoint': 'mutation',
  'description': 'GDC implementation of validation checks. See notes section (#5) below for details',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'member_of_research_project',
  'endpoint': 'researchsubject',
  'description': 'A reference to the Study(s) of which this ResearchSubject is a member.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'specimen_type',
  'endpoint': 'specimen',
  'description': 'The high-level type of the specimen, based on its how it has been derived from the original extracted sample. \n',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Existing_variation',
  'endpoint': 'mutation',
  'description': 'Known identifier of existing variation',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'SYMBOL',
  'endpoint': 'mutation',
  'description': 'Eg TP53, LRP1B, etc (same as Hugo_Symbol field except blank instead of Unknown',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_SAS',
  'endpoint': 'mutation',
  'description': 'South Asian Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'researchsubject_id',
  'endpoint': 'researchsubject',
  'description': "The 'logical' identifier of the entity in the system of record, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'VARIANT_CLASS',
  'endpoint': 'mutation',
  'description': 'Sequence Ontology variant class',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'grade',
  'endpoint': 'diagnosis',
  'description': 'The degree of abnormality of cancer cells, a measure of differentiation, the extent to which cancer cells are similar in appearance and function to healthy cells of the same tissue type. The degree of differentiation often relates to the clinical behavior of the particular tumor. Based on the microscopic findings, tumor grade is commonly described by one of four degrees of severity. Histopathologic grade of a tumor may be used to plan treatment and estimate the future course, outcome, and overall prognosis of disease. Certain types of cancers, such as soft tissue sarcoma, primary brain tumors, lymphomas, and breast have special grading systems.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'AA_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in NHLBI-ESP African American population',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'project_short_name',
  'endpoint': 'mutation',
  'description': 'Project name abbreviation; the program name appended with a project name abbreviation; eg. TCGA-OV, etc.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'days_to_death',
  'endpoint': 'subject',
  'description': "Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.",
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_AFR',
  'endpoint': 'mutation',
  'description': 'African/African American Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'file_associated_project',
  'endpoint': 'file',
  'description': 'A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the $ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.',
  'type': 'STRING',
  'mode': 'REPEATED'},
 {'fieldName': 'primary_diagnosis_condition',
  'endpoint': 'researchsubject',
  'description': "The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O).   This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Entrez_Gene_Id',
  'endpoint': 'mutation',
  'description': 'Entrez gene ID (an integer). 0 is used for regions that do not correspond to a gene region or Ensembl ID',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'RefSeq',
  'endpoint': 'mutation',
  'description': 'RefSeq identifier for this transcript',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'dbSNP_RS',
  'endpoint': 'mutation',
  'description': 'The rs-IDs from the   dbSNP database, novel if not found in any database used, or null if there is no dbSNP record, but it is found in other databases',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Validation_Method',
  'endpoint': 'mutation',
  'description': 'The assay platforms used for the validation call',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'dbgap_accession_number',
  'endpoint': 'file',
  'description': 'The dbgap accession number for the project.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ethnicity',
  'endpoint': 'subject',
  'description': "An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Business and used by the U.S. Census Bureau.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'PHENO',
  'endpoint': 'mutation',
  'description': 'Indicates if existing variant is associated with a phenotype, disease or trait (0, 1, or null)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'BIOTYPE',
  'endpoint': 'mutation',
  'description': 'Biotype of transcript',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'researchsubject_identifier_system',
  'endpoint': 'researchsubject',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'dbSNP_Val_Status',
  'endpoint': 'mutation',
  'description': 'The dbSNP validation status is reported as a semicolon-separated list of statuses. The union of all rs-IDs is taken when there are multiple',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'GDC_FILTER',
  'endpoint': 'mutation',
  'description': 'GDC filters applied universally across all MAFs',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'file_identifier_value',
  'endpoint': 'file',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'AFR_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in 1000 Genomes combined African population',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'days_to_collection',
  'endpoint': 'specimen',
  'description': 'The number of days from the index date to either the date a sample was collected for a specific study or project, or the date a subject underwent a procedure (e.g. surgical resection) yielding a sample that was eventually used for research.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'Variant_Type',
  'endpoint': 'mutation',
  'description': 'Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP (di-nucleotide polymorphism) but for three consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of four or more (SNP, DNP, TNP, ONP, INS, DEL, or Consolidated)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'DOMAINS',
  'endpoint': 'mutation',
  'description': 'The source and identifier of any overlapping protein domains',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'method_of_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The method used to confirm the subjects malignant diagnosis.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Tumor_Sample_UUID',
  'endpoint': 'mutation',
  'description': 'Unique GDC identifier for tumor aliquot (10189 unique)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Center',
  'endpoint': 'mutation',
  'description': 'One or more genome sequencing center reporting the variant',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'anatomical_site',
  'endpoint': 'specimen',
  'description': 'Per GDC Dictionary, the text term that represents the name of the primary disease site of the submitted tumor sample; recommend dropping tumor; biospecimen_anatomic_site.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'specimen_associated_project',
  'endpoint': 'specimen',
  'description': 'The Project associated with the specimen.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'byte_size',
  'endpoint': 'file',
  'description': 'Size of the file in bytes. Maps to dcat:byteSize.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'sample_barcode_tumor',
  'endpoint': 'mutation',
  'description': 'TCGA sample barcode for the tumor, eg TCGA-12-1089-01A. One sample may have multiple sets of CN segmentations corresponding to multiple aliquots; use GROUP BY appropriately in queries',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'sex',
  'endpoint': 'subject',
  'description': "The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'number_of_cycles',
  'endpoint': 'treatment',
  'description': 'The number of treatment cycles the subject received.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'Amino_acids',
  'endpoint': 'mutation',
  'description': 'Amino acid substitution caused by the mutation. Only given if the variation affects the protein-coding sequence',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'days_to_birth',
  'endpoint': 'subject',
  'description': "Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.",
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'EA_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in NHLBI-ESP European American population',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Allele',
  'endpoint': 'mutation',
  'description': 'The variant allele used to calculate the consequence',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'cDNA_position',
  'endpoint': 'mutation',
  'description': 'Relative position of base pair in the cDNA sequence as a fraction. A - symbol is displayed as the numerator if the variant does not appear in cDNA',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'case_barcode',
  'endpoint': 'mutation',
  'description': 'Original TCGA case barcode, eg TCGA-DX-A8BN',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'stage',
  'endpoint': 'diagnosis',
  'description': 'The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'species',
  'endpoint': 'subject',
  'description': 'The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text.  Ultimately, this will be a term returned by the vocabulary service.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'derived_from_specimen',
  'endpoint': 'specimen',
  'description': 'A source/parent specimen from which this one was directly derived.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_Adj',
  'endpoint': 'mutation',
  'description': 'Adjusted Global Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'diagnosis_id',
  'endpoint': 'diagnosis',
  'description': "The 'logical' identifier of the entity in the repository, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'treatment_outcome',
  'endpoint': 'treatment',
  'description': 'The final outcome of the treatment.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_NFE',
  'endpoint': 'mutation',
  'description': 'Non-Finnish European Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'source_material_type',
  'endpoint': 'specimen',
  'description': 'The general kind of material from which the specimen was derived, indicating the physical nature of the source material. ',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'One_Consequence',
  'endpoint': 'mutation',
  'description': 'The single consequence of the canonical transcript in  sequence ontology terms, eg missense_variant',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'aliquot_barcode_normal',
  'endpoint': 'mutation',
  'description': 'TCGA aliquot barcode for the normal control, eg TCGA-12-1089-01A-01D-0517-01',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'SIFT',
  'endpoint': 'mutation',
  'description': 'The   SIFT prediction and/or score, with both given as prediction (score)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'days_to_treatment_end',
  'endpoint': 'treatment',
  'description': ' The timepoint at which the treatment ended.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'Matched_Norm_Sample_UUID',
  'endpoint': 'mutation',
  'description': 'Unique GDC identifier for normal aliquot (10189 unique)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'specimen_identifier_system',
  'endpoint': 'specimen',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'INTRON',
  'endpoint': 'mutation',
  'description': 'The intron number (out of total number)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'src_vcf_id',
  'endpoint': 'mutation',
  'description': '|-delimited list of GDC VCF file identifiers',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_type',
  'endpoint': 'treatment',
  'description': 'The treatment type including medication/therapeutics or other procedures.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'TREMBL',
  'endpoint': 'mutation',
  'description': 'UniProtKB/TrEMBL identifier of protein product',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'derived_from_subject',
  'endpoint': 'specimen',
  'description': 'The Patient/ResearchSubject, or Biologically Derived Materal (e.g. a cell line, tissue culture, organoid) from which the specimen was directly or indirectly derived.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Variant_Classification',
  'endpoint': 'mutation',
  'description': 'Translational effect of variant allele',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'callerName',
  'endpoint': 'mutation',
  'description': '|-delimited list of mutation caller(s) that agreed on this particular call, always in alphabetical order: muse, mutect, somaticsniper, varscan',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'specimen_id',
  'endpoint': 'specimen',
  'description': "The 'logical' identifier of the entity in the system of record, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'Mutation_Status',
  'endpoint': 'mutation',
  'description': 'An assessment of the mutation as somatic, germline, LOH, post transcriptional modification, unknown, or none. The values allowed in this field are constrained by the value in the Validation_Status field',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'AMR_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in 1000 Genomes combined American population',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_end_reason',
  'endpoint': 'treatment',
  'description': 'The reason the treatment ended.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'therapeutic_agent',
  'endpoint': 'treatment',
  'description': 'One or more therapeutic agents as part of this treatment.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'case_id',
  'endpoint': 'mutation',
  'description': 'Unique GDC identifier for the underlying case',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'imaging_series',
  'endpoint': 'file',
  'description': "The 'logical' identifier of the series or grouping of imaging files in the system of record which the file is a part of.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'EAS_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in 1000 Genomes combined East Asian population',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'Hugo_Symbol',
  'endpoint': 'mutation',
  'description': 'HUGO symbol for the gene (HUGO symbols are always in all caps). Unknown is used for regions that do not correspond to a gene',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Sequencer',
  'endpoint': 'mutation',
  'description': 'Instrument used to produce primary sequence data',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'CANONICAL',
  'endpoint': 'mutation',
  'description': 'A flag (YES) indicating that the VEP-based canonical transcript, the longest translation, was used for this gene. If not, the value is null',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'vital_status',
  'endpoint': 'subject',
  'description': 'Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'diagnosis_identifier_value',
  'endpoint': 'diagnosis',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'all_effects',
  'endpoint': 'mutation',
  'description': 'A semicolon delimited list of all possible variant effects, sorted by priority ([Symbol,Consequence,HGVSp_Short,Transcript_ID,RefSeq,HGVSc,Impact,Canonical,Sift,PolyPhen,Strand])',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_EAS',
  'endpoint': 'mutation',
  'description': 'East Asian Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'TRANSCRIPT_STRAND',
  'endpoint': 'mutation',
  'description': 'The DNA strand (1 or -1) on which the transcript/feature lies',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'Consequence',
  'endpoint': 'mutation',
  'description': 'Consequence type of this variant; sequence ontology terms',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'data_category',
  'endpoint': 'file',
  'description': 'Broad categorization of the contents of the data file.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_id',
  'endpoint': 'treatment',
  'description': "The 'logical' identifier of the entity in the repository, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'GMAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in   1000 Genomes',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'TSL',
  'endpoint': 'mutation',
  'description': 'Transcript support level, which is based on independent RNA analyses',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'SOMATIC',
  'endpoint': 'mutation',
  'description': 'Somatic status of each ID reported under Existing_variation (0, 1, or null)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'IMPACT',
  'endpoint': 'mutation',
  'description': 'The impact modifier for the consequence type',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'CDS_position',
  'endpoint': 'mutation',
  'description': 'Relative position of base pair in coding sequence. A - symbol is displayed as the numerator if the variant does not appear in coding sequence',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'COSMIC',
  'endpoint': 'mutation',
  'description': 'Overlapping COSMIC variants',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_anatomic_site',
  'endpoint': 'treatment',
  'description': 'The anatomical site that the treatment targets.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'tumor_bam_uuid',
  'endpoint': 'mutation',
  'description': 'Unique GDC identifier for the underlying bam file',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_diagnosis_site',
  'endpoint': 'researchsubject',
  'description': "The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This categorization groups cases into general categories.  This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'file_id',
  'endpoint': 'file',
  'description': "The 'logical' identifier of the entity in the repository, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'sample_barcode_normal',
  'endpoint': 'mutation',
  'description': 'TCGA sample barcode for the normal control, eg TCGA-12-1089-01A. One sample may have multiple sets of CN segmentations corresponding to multiple aliquots; use GROUP BY appropriately in queries',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Protein_position',
  'endpoint': 'mutation',
  'description': 'Relative position of affected amino acid in protein. A - symbol is displayed as the numerator if the variant does not appear in coding sequence',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Codons',
  'endpoint': 'mutation',
  'description': 'The alternative codons with the variant base in upper case',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'subject_id',
  'endpoint': 'subject',
  'description': "The 'logical' identifier of the entity in the system of record, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'cause_of_death',
  'endpoint': 'subject',
  'description': 'Coded value indicating the circumstance or condition that results in the death of the subject.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_disease_type',
  'endpoint': 'specimen',
  'description': "The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O).   This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The diagnosis instance that qualified a subject for inclusion on a ResearchProject.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'checksum',
  'endpoint': 'file',
  'description': 'A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'aliquot_barcode_tumor',
  'endpoint': 'mutation',
  'description': 'TCGA aliquot barcode for the tumor, eg TCGA-12-1089-01A-01D-0517-01',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'SWISSPROT',
  'endpoint': 'mutation',
  'description': 'UniProtKB/Swiss-Prot accession',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_FIN',
  'endpoint': 'mutation',
  'description': 'Finnish Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'EUR_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in 1000 Genomes combined European population',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'Feature_type',
  'endpoint': 'mutation',
  'description': 'Type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature (or blank)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'file_format',
  'endpoint': 'file',
  'description': 'Format of the data files.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'HGVS_OFFSET',
  'endpoint': 'mutation',
  'description': 'Indicates by how many bases the HGVS notations for this variant have been shifted',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'PolyPhen',
  'endpoint': 'mutation',
  'description': 'The PolyPhen prediction and/or score',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'FILTER',
  'endpoint': 'mutation',
  'description': 'Copied from input VCF. This includes filters implemented directly by the variant caller and other external software used in the DNA-Seq pipeline. See below for additional details.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_identifier_system',
  'endpoint': 'treatment',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'race',
  'endpoint': 'subject',
  'description': 'An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Business and used by the U.S. Census Bureau.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'End_Position',
  'endpoint': 'mutation',
  'description': 'Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'ENSP',
  'endpoint': 'mutation',
  'description': 'The Ensembl protein identifier of the affected transcript',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF',
  'endpoint': 'mutation',
  'description': 'Global Allele Frequency from   ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Chromosome',
  'endpoint': 'mutation',
  'description': 'Chromosome, possible values: chr1-22, and chrX',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 't_ref_count',
  'endpoint': 'mutation',
  'description': 'Read depth supporting the reference allele in tumor BAM',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'subject_identifier_system',
  'endpoint': 'subject',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'subject_identifier_value',
  'endpoint': 'subject',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'label',
  'endpoint': 'file',
  'description': 'Short name or abbreviation for dataset. Maps to rdfs:label.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_effect',
  'endpoint': 'treatment',
  'description': 'The effect of a treatment on the diagnosis or tumor.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Tumor_Seq_Allele2',
  'endpoint': 'mutation',
  'description': 'Primary data genotype for tumor sequencing (discovery) allele 2. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'HGVSp',
  'endpoint': 'mutation',
  'description': 'The protein sequence of the variant in HGVS recommended format. p.= signifies no change in the protein',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'data_modality',
  'endpoint': 'file',
  'description': 'Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'researchsubject_identifier_value',
  'endpoint': 'researchsubject',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Tumor_Seq_Allele1',
  'endpoint': 'mutation',
  'description': 'Primary data genotype for tumor sequencing (discovery) allele 1. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'specimen_identifier_value',
  'endpoint': 'specimen',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'CCDS',
  'endpoint': 'mutation',
  'description': 'The  CCDS identifier for this transcript, where applicable',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'EXON',
  'endpoint': 'mutation',
  'description': 'The exon number (out of total number)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'ExAC_AF_OTH',
  'endpoint': 'mutation',
  'description': 'Other Allele Frequency from ExAC',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'file_identifier_system',
  'endpoint': 'file',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Tumor_Validation_Allele2',
  'endpoint': 'mutation',
  'description': 'Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'SAS_MAF',
  'endpoint': 'mutation',
  'description': 'Non-reference allele and frequency of existing variant in 1000 Genomes combined South Asian population',
  'type': 'FLOAT',
  'mode': 'NULLABLE'},
 {'fieldName': 'Tumor_Validation_Allele1',
  'endpoint': 'mutation',
  'description': 'Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. A - symbol for a deletion represents a variant. A - symbol for an insertion represents wild-type allele. Novel inserted sequence for insertion does not include flanking reference bases',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'NCBI_Build',
  'endpoint': 'mutation',
  'description': 'The reference genome used for the alignment (GRCh38)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'data_type',
  'endpoint': 'file',
  'description': 'Specific content type of the data file.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'Exon_Number',
  'endpoint': 'mutation',
  'description': 'The exon number (out of total number)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'normal_bam_uuid',
  'endpoint': 'mutation',
  'description': 'Unique GDC identifier for the underlying normal bam file',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'MINIMISED',
  'endpoint': 'mutation',
  'description': 'Alleles in this variant have been converted to minimal representation before consequence calculation (1 or null)',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'HGVSc',
  'endpoint': 'mutation',
  'description': 'The coding sequence of the variant in HGVS recommended format',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'PUBMED',
  'endpoint': 'mutation',
  'description': 'Pubmed ID(s) of publications that cite existing variant',
  'type': 'STRING',
  'mode': 'NULLABLE'}]

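By default, columns() returns all columns. If that is too many, you can filter the list for terms that match your interests. As the output below shows, the filter is not limited to field names: an entry is also returned when the match appears in its endpoint or description: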

In [4]:

            
columns().to_list(filters="diagnosis")

Out[4]:

[{'fieldName': 'age_at_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The age in days of the individual at the time of diagnosis.',
  'type': 'INTEGER',
  'mode': 'NULLABLE'},
 {'fieldName': 'morphology',
  'endpoint': 'diagnosis',
  'description': 'Code that represents the histology of the disease using the third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'diagnosis_identifier_system',
  'endpoint': 'diagnosis',
  'description': 'The system or namespace that defines the identifier.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'grade',
  'endpoint': 'diagnosis',
  'description': 'The degree of abnormality of cancer cells, a measure of differentiation, the extent to which cancer cells are similar in appearance and function to healthy cells of the same tissue type. The degree of differentiation often relates to the clinical behavior of the particular tumor. Based on the microscopic findings, tumor grade is commonly described by one of four degrees of severity. Histopathologic grade of a tumor may be used to plan treatment and estimate the future course, outcome, and overall prognosis of disease. Certain types of cancers, such as soft tissue sarcoma, primary brain tumors, lymphomas, and breast have special grading systems.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_diagnosis_condition',
  'endpoint': 'researchsubject',
  'description': "The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O).   This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'method_of_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The method used to confirm the subjects malignant diagnosis.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'stage',
  'endpoint': 'diagnosis',
  'description': 'The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'diagnosis_id',
  'endpoint': 'diagnosis',
  'description': "The 'logical' identifier of the entity in the repository, e.g. a UUID.  This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.",
  'type': 'STRING',
  'mode': 'REQUIRED'},
 {'fieldName': 'diagnosis_identifier_value',
  'endpoint': 'diagnosis',
  'description': 'The value of the identifier, as defined by the system.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_diagnosis_site',
  'endpoint': 'researchsubject',
  'description': "The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This categorization groups cases into general categories.  This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.",
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'primary_diagnosis',
  'endpoint': 'diagnosis',
  'description': 'The diagnosis instance that qualified a subject for inclusion on a ResearchProject.',
  'type': 'STRING',
  'mode': 'NULLABLE'},
 {'fieldName': 'treatment_effect',
  'endpoint': 'treatment',
  'description': 'The effect of a treatment on the diagnosis or tumor.',
  'type': 'STRING',
  'mode': 'NULLABLE'}]
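
Since the list form is made of plain Python dictionaries, it can be processed with ordinary Python. The following is a minimal sketch, not part of cdapython: it uses a few entries copied from the output above as a stand-in for the result of columns().to_list(), and the helper name group_by_endpoint is made up for this illustration.

```python
from collections import defaultdict

# A few entries copied from the output above, standing in for
# the list that columns().to_list() returns (descriptions omitted).
fields = [
    {"fieldName": "age_at_diagnosis", "endpoint": "diagnosis",
     "type": "INTEGER", "mode": "NULLABLE"},
    {"fieldName": "morphology", "endpoint": "diagnosis",
     "type": "STRING", "mode": "NULLABLE"},
    {"fieldName": "primary_diagnosis_site", "endpoint": "researchsubject",
     "type": "STRING", "mode": "NULLABLE"},
    {"fieldName": "diagnosis_id", "endpoint": "diagnosis",
     "type": "STRING", "mode": "REQUIRED"},
    {"fieldName": "treatment_effect", "endpoint": "treatment",
     "type": "STRING", "mode": "NULLABLE"},
]


def group_by_endpoint(entries):
    """Map each endpoint to the field names that belong to it (hypothetical helper)."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["endpoint"]].append(entry["fieldName"])
    return dict(grouped)


by_endpoint = group_by_endpoint(fields)
# Fields whose mode is REQUIRED
required = [entry["fieldName"] for entry in fields if entry["mode"] == "REQUIRED"]

print(by_endpoint)
print(required)
```

With the real list in place of the sample, the same helper would give a quick overview of which fields belong to which endpoint.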

Lists are good for processing the information computationally, but for browsing, you may prefer to view columns as a dataframe:

In [5]:

            
columns().to_dataframe()

Out[5]:

fieldName	endpoint	description	type	mode

Because we loaded itables, there is a built-in search for dataframes. If you are working on the command line, or just don't want to use itables, you can also search any field(s) for a desired value:

In [6]:

            
columns().to_dataframe(search_fields = ["description", "fieldName"], search_value= "tumor")

Out[6]:

	fieldName	endpoint	description	type	mode
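
If you already hold the fields in an ordinary pandas dataframe, a comparable search can be written with pandas alone. This is a rough sketch and not the cdapython implementation: the helper search_frame is hypothetical, the sample rows are copied from the columns() output above, and case-insensitive substring matching is an assumption about the desired behavior.

```python
import pandas as pd

# Sample rows copied from the columns() output above
fields_df = pd.DataFrame([
    {"fieldName": "aliquot_barcode_tumor", "endpoint": "mutation",
     "description": "TCGA aliquot barcode for the tumor, eg TCGA-12-1089-01A-01D-0517-01"},
    {"fieldName": "file_format", "endpoint": "file",
     "description": "Format of the data files."},
    {"fieldName": "treatment_effect", "endpoint": "treatment",
     "description": "The effect of a treatment on the diagnosis or tumor."},
    {"fieldName": "Tumor_Validation_Allele2", "endpoint": "mutation",
     "description": "Secondary data from orthogonal technology. "
                    "Tumor genotyping (validation) for allele 2"},
])


def search_frame(df, search_fields, search_value):
    """Keep rows where any named column contains search_value.

    Hypothetical helper: matching is a case-insensitive plain substring
    test, which is an assumption made for this sketch.
    """
    mask = pd.Series(False, index=df.index)
    for field in search_fields:
        mask |= df[field].str.contains(search_value, case=False, regex=False, na=False)
    return df[mask]


hits = search_frame(fields_df, ["description", "fieldName"], "tumor")
print(hits["fieldName"].tolist())
```

Passing regex=False treats the search value as literal text, so characters such as "(" or "." in a search value are not interpreted as a pattern.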

Check your search criteria! While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. Be sure you understand the description before relying on any given data point. For further information, see the field name mappings at CDA Schema Field Mapping.

unique_terms¶

We can directly get information about what data populates any of these fields using the unique_terms() function. Like columns, unique_terms defaults to giving us an overview of the results, and we view them the same way:

In [7]:

            
unique_terms("primary_diagnosis_site").to_list()

Out[7]:

[None,
 'Abdomen',
 'Abdomen, Mediastinum',
 'Abdomen, Pelvis',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other and ill-defined digestive organs',
 'Other and ill-defined sites',
 'Other and ill-defined sites in lip, oral cavity and pharynx',
 'Other and ill-defined sites within respiratory system and intrathoracic organs',
 'Other and unspecified female genital organs',
 'Other and unspecified major salivary glands',
 'Other and unspecified male genital organs',
 'Other and unspecified parts of biliary tract',
 'Other and unspecified parts of mouth',
 'Other and unspecified parts of tongue',
 'Other and unspecified urinary organs',
 'Other endocrine glands and related structures',
 'Ovary',
 'Palate',
 'Pancreas',
 'Pancreas ',
 'Pelvis, Prostate, Anus',
 'Penis',
 'Peripheral nerves and autonomic nervous system',
 'Phantom',
 'Prostate',
 'Prostate gland',
 'Rectosigmoid junction',
 'Rectum',
 'Renal pelvis',
 'Retroperitoneum and peritoneum',
 'Skin',
 'Small intestine',
 'Spinal cord, cranial nerves, and other parts of central nervous system',
 'Stomach',
 'Testicles',
 'Testis',
 'Thymus',
 'Thyroid',
 'Thyroid gland',
 'Tonsil',
 'Trachea',
 'Unknown',
 'Ureter',
 'Uterus',
 'Uterus, NOS',
 'Vagina',
 'Various',
 'Various (11 locations)',
 'Vulva']

When you are browsing for possible search terms, it is often useful to see how much data each one has. A quick way to see the overall volume of data for any given term is to use the show_counts option. The result needs to be viewed as a dataframe, since it is two-dimensional data (a term paired with its count):

In [8]:

            
unique_terms("primary_diagnosis_site", show_counts = True).to_dataframe()

Out[8]:

primary_diagnosis_site	Count
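
To illustrate why counts call for a table rather than a list, here is a toy sketch in plain pandas. The records are made up for illustration and are not CDA data; only the two column names are taken from the output header above.

```python
import pandas as pd

# Made-up records for illustration only; these are NOT CDA data
records = pd.Series(
    ["Lung", "Breast", "Lung", "Kidney", "Lung", "Breast"],
    name="primary_diagnosis_site",
)

# Count how often each term occurs, then shape the result like the
# two-column table above: one row per term, one column for the count.
counts = (
    records.value_counts()
    .rename_axis("primary_diagnosis_site")
    .reset_index(name="Count")
)
print(counts)
```

Each row pairs a term with a number, which is why a flat list cannot carry the same information.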

We can also use the filters option here to search for only diagnosis sites that we're interested in:

In [9]:

            
unique_terms("primary_diagnosis_site").to_list(filters="lung")

Out[9]:

['Bronchus and lung', 'Lung', 'Lung Phantom']

filters looks for both full and partial matches, which is useful for searching unharmonized data. For instance, if I'm not sure whether the data I'm interested in would be labeled as "uterine" or "uterus", I might search for just "uter":

In [10]:

            
unique_terms("primary_diagnosis_site").to_list(filters="uter")

Out[10]:

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

Success! Not only are there multiple ways that "Uterus" is specified in the CDA data, I now know that there are also data for specific uterine tissues.

Check your search terms! If you run into unexpected results when running a search, be sure that you're searching all the terms you want. CDA data is not yet harmonized across centers, so there are many cases where a single-term search will not return all the information you need. However, the CDA provides tools that make it easy to search all forms of a term, which enables cross-dataset search.
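
Some variants are hard to spot by eye. In the full list returned earlier, for example, 'Pancreas' appears twice, once with a trailing space. The following sketch is one possible way to surface such near-duplicates; it is not a cdapython feature, the helper spelling_variants is hypothetical, and the sample terms are copied from the unique_terms output above.

```python
from collections import defaultdict

# A sample of the terms returned by unique_terms("primary_diagnosis_site")
# above, including the None entry and the trailing-space variant.
terms = [None, "Lung", "Ovary", "Pancreas", "Pancreas ", "Prostate",
         "Prostate gland", "Uterus", "Uterus, NOS"]


def spelling_variants(values):
    """Group terms that differ only by case or whitespace (hypothetical helper)."""
    groups = defaultdict(list)
    for value in values:
        if value is None:  # skip missing values
            continue
        # Collapse runs of whitespace, trim the ends, and ignore case
        key = " ".join(value.split()).casefold()
        groups[key].append(value)
    # Keep only keys that have more than one raw spelling
    return {key: raw for key, raw in groups.items() if len(raw) > 1}


duplicates = spelling_variants(terms)
print(duplicates)
```

Terms that differ in wording, such as 'Prostate' and 'Prostate gland', are not caught by this normalization; a partial-match filter is the better tool for those.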

If your filter is very short, or is a very common word, partial matching might return too many results. To force the search to find only exact matches, add the exact = True option:

In [11]:

            
unique_terms("primary_diagnosis_site").to_list(filters="lung", exact = True)

Out[11]:

['Lung']

In [12]:

            
unique_terms("primary_diagnosis_site").to_list(filters="uter", exact = True)

Out[12]:

[]
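
The difference between the two modes can be mimicked in a few lines of plain Python. This is only a sketch of the behavior seen in the outputs above, not the cdapython implementation; the function filter_terms is hypothetical, and the case-insensitive comparison is inferred from the fact that filters="lung" matched 'Lung'.

```python
# A sample of terms copied from the unique_terms output above
terms = [None, "Bronchus and lung", "Cervix uteri", "Corpus uteri",
         "Lung", "Lung Phantom", "Uterus", "Uterus, NOS"]


def filter_terms(values, filters, exact=False):
    """Return terms matching `filters`, ignoring case (hypothetical helper).

    exact=False keeps any term that contains the filter text;
    exact=True keeps only terms equal to the filter text.
    """
    needle = filters.casefold()
    kept = []
    for value in values:
        if value is None:  # skip missing values
            continue
        candidate = value.casefold()
        matched = candidate == needle if exact else needle in candidate
        if matched:
            kept.append(value)
    return kept


print(filter_terms(terms, "lung"))              # partial match
print(filter_terms(terms, "lung", exact=True))  # exact match
print(filter_terms(terms, "uter"))              # partial match
print(filter_terms(terms, "uter", exact=True))  # exact match: nothing equals "uter"
```

An exact search on a word fragment such as "uter" returns an empty list because no term consists of that fragment alone.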

Explore the available terms by changing the filters, the number of results you view, and the fields you pass to unique_terms. Once you have found terms you're interested in, head to Basic Search to build simple queries.

Last update: 2022-11-03