Summarize Search Results¶

The CDA provides a custom Python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:

Q() builds a query that can be used by run() or count()
Q.run() returns data for the specified search
Q.count() returns summary information (counts) for data that fit the specified search
columns() returns entity field names
unique_terms() returns entity field contents

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets us view entity field contents

We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt settings pre-configure how itables should display our tables, with scrolling and paging enabled. Finally, we're telling cdapython to report its version so we can be sure we're using the one we mean to:

In [1]:

            
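# CDA search tools: Q and query build searches; columns and unique_terms inspect fields
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
# itables renders tables interactively inside the notebook
from itables import init_notebook_mode, show
import itables.options as opt

init_notebook_mode(all_interactive=True)

# Table display settings: no size limits, horizontal scrolling, paging enabled
opt.maxBytes = 0
opt.scrollX = "200px"
opt.scrollCollapse = True
opt.paging = True
opt.maxColumns = 0

# Report the installed cdapython version
print(Q.get_version())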

2022.12.21

CDA data comes from three sources:

The Proteomic Data Commons (PDC)
The Genomic Data Commons (GDC)
The Imaging Data Commons (IDC)

The CDA makes this data searchable in four main endpoints:

subject: A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.
researchsubject: A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. A subject who participates in 3 studies will have 3 researchsubject IDs.
specimen: Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
mutation: Molecular data about specific mutations, currently limited to the TCGA-READ project from GDC.

one endpoint that can be added to the subject, researchsubject, or specimen to get the relevant files:

file: A unit of data about subjects, researchsubjects, specimens, or their associated information.

and two endpoints that offer deeper information about data in the researchsubject endpoint:

diagnosis: A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.
treatment: Represent medication administration or other treatment types.

If you are looking to build a cohort of distinct individuals who meet some criteria, search by subject. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by researchsubject. If you are looking for biosamples that can be ordered or a specific format of information (e.g., histological slides), start with specimen. If you are primarily looking for files you can reuse for your own analysis, add .file to your call.

In the subject, researchsubject, and specimen tables, every row has one or more files associated with it, and these can be reached directly by chaining, as in specimen.file. Diagnosis and treatment do not have files directly associated with them, so a query statement of diagnosis.file or treatment.file will not work. The mutation table does have files associated with it, but currently they cannot be accessed with mutation.file. Look for this feature in a later release.

Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns and then filter rows.
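
As a loose illustration of the spreadsheet analogy (not of how cdapython works internally), the sketch below filters rows and then selects columns from a tiny hand-made table. Every row is made up for this example, and subject_id is a hypothetical column name; only sex and primary_diagnosis_site are field names that appear on this page.

```python
# Toy "spreadsheet": every row below is invented for illustration only.
rows = [
    {"subject_id": "S1", "sex": "female", "primary_diagnosis_site": "brain"},
    {"subject_id": "S2", "sex": "male", "primary_diagnosis_site": "lung"},
    {"subject_id": "S3", "sex": "male", "primary_diagnosis_site": "brain"},
]

# Filter rows: keep only rows whose primary_diagnosis_site is "brain".
brain_rows = [row for row in rows if row["primary_diagnosis_site"] == "brain"]

# Select columns: keep only the fields we want returned.
wanted = ["subject_id", "sex"]
result = [{name: row[name] for name in wanted} for row in brain_rows]

for row in result:
    print(row)
```

In CDA terms, the filter step roughly corresponds to the query string passed to Q, and the column selection to the endpoint you choose to run it against.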

Getting simple summary data¶

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in Q and save it to a variable myquery. This is the same query we ran in the Basic Search notebook:

In [2]:

            
myquery = Q('primary_diagnosis_site = "brain"')

Where did those terms come from?

If you aren't sure how we knew what terms to put in our search, please refer back to the What search terms are available? notebook.

Overall summary¶

You can get a quick summary of how many unique specimens, treatments, diagnoses, researchsubjects and subjects meet your search criteria by chaining a count command into the basic run call.

In [3]:

            
myquery.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.24 sec 3240 ms

specimen_count : 39220

treatment_count : 2396

diagnosis_count : 1757

mutation_count : 904

researchsubject_count : 4347

subject_count : 3015

Out[3]:

These numbers are how many total rows of data will come back when querying the various endpoints.

subject summary¶

We can also add count to the other run calls we did in the Basic Search notebook to get more detailed summaries:

In [4]:

            
subjectresults = myquery.subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.37 sec 3370 ms

In [5]:

            
myquery.subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.269 sec 3269 ms

  files : 4924982

    total : 3015

sex	count
None	1378
female	653
male	981
not reported	3

race	count
None	1378
white	1312
not reported	136
asian	33
black or african american	96
american indian or alaska native	4
Unknown	21
other	9
not allowed to collect	25
native hawaiian or other pacific islander	1

ethnicity	count
None	1378
not hispanic or latino	1286
not reported	219
hispanic or latino	85
Unknown	22
not allowed to collect	25

cause_of_death	count
None	2746
Not Reported	199
Cancer Related	48
Surgical Complications	2
Unknown	8
Not Cancer Related	9
Infection	3

subject_identifier_system	count
IDC	2585
PDC	309
GDC	1455

Out[5]:

Since we saved the output of the first call to a variable, subjectresults, we need to look at that variable to see the same results:

In [6]:

            
subjectresults

  files : 4924982

    total : 3015

sex	count
None	1378
female	653
male	981
not reported	3

race	count
None	1378
white	1312
not reported	136
asian	33
black or african american	96
american indian or alaska native	4
Unknown	21
other	9
not allowed to collect	25
native hawaiian or other pacific islander	1

ethnicity	count
None	1378
not hispanic or latino	1286
not reported	219
hispanic or latino	85
Unknown	22
not allowed to collect	25

cause_of_death	count
None	2746
Not Reported	199
Cancer Related	48
Surgical Complications	2
Unknown	8
Not Cancer Related	9
Infection	3

subject_identifier_system	count
IDC	2585
PDC	309
GDC	1455

Out[6]:

By default, the results are displayed as a table for easy previewing of the data. Since we queried the subject endpoint, our default results give us subject-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using count gives us back a pivot-table-style summary of the countable fields for subjects. Note that above the table it also tells you the total subject count, as well as how many files are associated with those subjects.

Subject Field Definitions

A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.

id (`total`)	The overall number of subjects returned.
files	The number of files that match this search.
identifier.value(`system`)	The identifier for the data provider.
species	The taxonomic group (e.g. species) of the subject.
sex	The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.
race	An arbitrary classification of a taxonomic group that is a division of a species.
ethnicity	An individual's self-described social and cultural grouping.
cause_of_death	The cause of death, if known

This gives you a quick way to assess whether the full search results will have the data fields you require. But if you want to get the underlying data for your own downstream applications, you can also get the raw numbers by taking the first element (index 0) of the variable:

In [7]:

            
subjectresults[0]

Out[7]:

{'files': 4924982,
 'total': 3015,
 'sex': [{'sex': 'NULL', 'count': 1378},
  {'sex': 'female', 'count': 653},
  {'sex': 'male', 'count': 981},
  {'sex': 'not reported', 'count': 3}],
 'race': [{'race': 'NULL', 'count': 1378},
  {'race': 'white', 'count': 1312},
  {'race': 'not reported', 'count': 136},
  {'race': 'asian', 'count': 33},
  {'race': 'black or african american', 'count': 96},
  {'race': 'american indian or alaska native', 'count': 4},
  {'race': 'Unknown', 'count': 21},
  {'race': 'other', 'count': 9},
  {'race': 'not allowed to collect', 'count': 25},
  {'race': 'native hawaiian or other pacific islander', 'count': 1}],
 'ethnicity': [{'ethnicity': 'NULL', 'count': 1378},
  {'ethnicity': 'not hispanic or latino', 'count': 1286},
  {'ethnicity': 'not reported', 'count': 219},
  {'ethnicity': 'hispanic or latino', 'count': 85},
  {'ethnicity': 'Unknown', 'count': 22},
  {'ethnicity': 'not allowed to collect', 'count': 25}],
 'cause_of_death': [{'cause_of_death': 'NULL', 'count': 2746},
  {'cause_of_death': 'Not Reported', 'count': 199},
  {'cause_of_death': 'Cancer Related', 'count': 48},
  {'cause_of_death': 'Surgical Complications', 'count': 2},
  {'cause_of_death': 'Unknown', 'count': 8},
  {'cause_of_death': 'Not Cancer Related', 'count': 9},
  {'cause_of_death': 'Infection', 'count': 3}],
 'subject_identifier_system': [{'subject_identifier_system': 'IDC',
   'count': 2585},
  {'subject_identifier_system': 'PDC', 'count': 309},
  {'subject_identifier_system': 'GDC', 'count': 1455}]}
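
The raw result is a plain dictionary in which each counted field holds a list of small records. A minimal sketch of flattening those records into simple value-to-count mappings is shown below. It uses an abbreviated copy of the output above instead of a live subjectresults[0], so it runs without a connection to the CDA, and to_counts is a hypothetical helper name, not part of cdapython.

```python
# Abbreviated copy of the Out[7] dictionary shown above (only two fields kept).
raw = {
    "files": 4924982,
    "total": 3015,
    "sex": [
        {"sex": "NULL", "count": 1378},
        {"sex": "female", "count": 653},
        {"sex": "male", "count": 981},
        {"sex": "not reported", "count": 3},
    ],
    "subject_identifier_system": [
        {"subject_identifier_system": "IDC", "count": 2585},
        {"subject_identifier_system": "PDC", "count": 309},
        {"subject_identifier_system": "GDC", "count": 1455},
    ],
}


def to_counts(raw_counts, field):
    """Hypothetical helper: flatten one field into a {value: count} mapping."""
    return {entry[field]: entry["count"] for entry in raw_counts[field]}


sex_counts = to_counts(raw, "sex")
system_counts = to_counts(raw, "subject_identifier_system")

# In this output the sex counts add up to the subject total.
print(sum(sex_counts.values()) == raw["total"])

# The identifier-system counts add up to more than the subject total.
print(sum(system_counts.values()), "identifiers for", raw["total"], "subjects")
```

The identifier-system counts exceeding the total would be consistent with some subjects being listed by more than one data source, though the output alone does not confirm that. If you prefer tables, passing one of these record lists to pd.DataFrame should give the same values as a two-column table.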

researchsubject¶

If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint. Let's run it without saving to a variable this time to make it a bit quicker:

In [8]:

            
myquery.researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.302 sec 3302 ms

  files : 4924962

    total : 4347

primary_diagnosis_condition	count
Gliomas	1247
Glioblastoma	100
Other	10
None	2583
Pediatric/AYA Brain Tumors	199
Neoplasms, NOS	66
Germ Cell Neoplasms	104
Not Reported	11
Not Applicable	9
Malignant Lymphomas, NOS or Diffuse	14
Mature B-Cell Lymphomas	2
Neuroepitheliomatous Neoplasms	2

primary_diagnosis_site	count
Brain	4347

researchsubject_identifier_system	count
GDC	1455
PDC	309
IDC	2583

Out[8]:

ResearchSubject Field Definitions

A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.

id (`total`)	The overall number of researchsubjects returned.
files	The number of files that match this search.
identifier.value(`system`)	The identifier for the data provider.
primary_diagnosis_condition	The text term used to describe the type of malignant disease.
primary_diagnosis_site	The text term used to describe the primary site of disease.

diagnosis¶

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [9]:

            
myquery.diagnosis.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.32 sec 3320 ms

    total : 1757

primary_diagnosis	count
Glioblastoma	822
Glioblastoma multiforme	4
Astrocytoma, anaplastic	130
Craniopharyngioma	16
Oligodendroglioma, anaplastic	78
Astrocytoma, NOS	64
Mixed glioma	131
Ependymoma, NOS	32
Malignant lymphoma, NOS	14
Glioma, malignant	26
Ganglioglioma, NOS	18
Oligodendroglioma, NOS	112
Yolk sac tumor	8
Neoplasm, malignant	50
Medulloblastoma, NOS	22
Glioma, NOS	93
Mixed germ cell tumor	79
Malignant lymphoma, large B-cell, diffuse, NOS	2
Gliosarcoma	1
Embryonal carcinoma, NOS	8
Atypical teratoid/rhabdoid tumor	12
Neoplasm, uncertain whether benign or malignant	13
Teratoma, benign	3
Germinoma	4
Not Reported	10
Papillary glioneuronal tumor	2
Teratoma, malignant, NOS	2
Oligoastrocytoma	1

stage	count
None	1428
Unknown	219
Not Reported	110

grade	count
High Grade	26
not reported	1116
Not Reported	392
G1	98
G2	52
G4	36
None	28
Low Grade	9

diagnosis_identifier_system	count
GDC	1428
PDC	329

Out[9]:

Diagnosis Field Definitions

A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.

id (`total`)	The overall number of diagnoses returned.
identifier.value(`system`)	The identifier for the data provider.
primary_diagnosis	The diagnosis instance that qualified a subject for inclusion on a ResearchProject.
stage	The extent of a cancer in the body.
grade	The degree of abnormality of cancer cells.

treatment¶

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [10]:

            
myquery.treatment.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.293 sec 3293 ms

    total : 2396

treatment_type	count
None	2396

treatment_effect	count
None	2396

treatment_identifier_system	count
GDC	2396

Out[10]:

Treatment Field Definitions

Medication administration or other treatment types. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses and different treatments across different studies.

id (`total`)	The overall number of treatments returned.
identifier.value(`system`)	The identifier for the data provider.
treatment_type	The treatment type including medication/therapeutics or other procedures.
treatment_effect	The effect of a treatment on the diagnosis or tumor.

specimens¶

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [11]:

            
myquery.specimen.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.34 sec 3340 ms

   files : 47327

   total : 39220

primary_disease_type	count
Gliomas	37586
Glioblastoma	200
Other	20
Not Reported	121
Pediatric/AYA Brain Tumors	438
Germ Cell Neoplasms	416
Neoplasms, NOS	285
Not Applicable	36
Malignant Lymphomas, NOS or Diffuse	56
Mature B-Cell Lymphomas	54
Neuroepitheliomatous Neoplasms	8

source_material_type	count
Primary Tumor	27578
Blood Derived Normal	10078
Recurrent Tumor	513
Solid Tissue Normal	538
Next Generation Cancer Model	176
Metastatic	252
Expanded Next Generation Cancer Model	35
Not Reported	36
Buccal Cell Normal	14

specimen_type	count
aliquot	18701
analyte	6676
slide	3754
sample	4093
portion	5996

specimen_identifier_system	count
GDC	38562
PDC	658

Out[11]:

Nearly 40,000 specimens with over 47,000 files meet our search criteria! We would typically expect the number of specimens to be much larger than our number of subjects or researchsubjects: first, because studies will often take more than one sample per subject, and second, because any given specimen might be aliquoted out to be used in multiple tests.
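
To put a rough number on that expectation, the sketch below divides the specimen total by the subject and researchsubject totals reported earlier on this page. The values are copied by hand from those outputs, and the averages are crude: they ignore that some subjects may have no specimens at all.

```python
# Totals copied from the count outputs earlier on this page.
subject_total = 3015
researchsubject_total = 4347
specimen_total = 39220

# Average number of specimens per subject and per researchsubject.
per_subject = specimen_total / subject_total
per_researchsubject = specimen_total / researchsubject_total

print(f"specimens per subject: {per_subject:.1f}")
print(f"specimens per researchsubject: {per_researchsubject:.1f}")
```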

Specimen Field Definitions

Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation. A given specimen will have only a single subject ID and a single research subject ID.

id (`total`)	The overall number of specimens returned.
files	The number of files that match this search.
identifier.value(`system`)	The identifier for the data provider.
primary_disease_type	The text term used to describe the type of malignant disease.
source_material_type	The general kind of material from which the specimen was derived.
specimen_type	The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide.

file¶

The file endpoint returns all files that match our query:

In [12]:

            
myquery.file.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.298 sec 3298 ms

  total : 4924982

data_category	count
Imaging	4874560
Sequencing Reads	5897
Simple Nucleotide Variation	16970
Peptide Spectral Matches	1524
Raw Mass Spectra	762
Structural Variation	3384
Biospecimen	5583
Copy Number Variation	6913
Processed Mass Spectra	762
DNA Methylation	3342
Transcriptome Profiling	3116
Clinical	1187
Proteome Profiling	679
Somatic Structural Variation	303

data_type	count
None	4874560
Gene Level Copy Number	1222
Transcript Fusion	3385
Aligned Reads	5897
Proprietary	762
Protein Expression Quantification	679
Raw Simple Somatic Mutation	5033
Annotated Somatic Mutation	9462
Masked Intensities	2228
Open Standard	1524
Allele-specific Copy Number Segment	1071
Clinical Supplement	1182
Biospecimen Supplement	1954
Text	762
Slide Image	3629
Copy Number Segment	2336
Masked Copy Number Segment	2185
Structural Rearrangement	302
Masked Somatic Mutation	1146
Masked Annotated Somatic Mutation	183
Isoform Expression Quantification	643
Gene Expression Quantification	906
Methylation Beta Value	1114
Aggregated Somatic Mutation	1146
miRNA Expression Quantification	643
Splice Junction Quantification	870
Differential Gene Expression	18
Gene Level Copy Number Scores	99
Pathology Report	5
Single Cell Analysis	36

file_format	count
DICOM	4874560
MAF	7206
mzML	762
IDAT	2228
VCF	9915
mzIdentML	762
vendor-specific	762
BAM	5897
tsv	762
BCR Biotab	49
SVS	3629
TXT	8851
BCR SSF XML	758
TSV	4561
BEDPE	1892
BCR XML	2282
HDF5	18
MEX	36
CDC JSON	28
BCR OMF XML	19
PDF	5

file_identifier_system	count
IDC	4874560
GDC	47374
PDC	3048

Out[12]:

There are a huge number of files (4924982) that match our search. We would likely want to filter the results further by file format or data type to get only files we can use. See all the ways you can filter and refine searches with more search terms in the Operators notebook.
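
A quick look at the counts already shows why filtering matters here. The sketch below, using numbers copied by hand from the output above, works out how many of the matching files are DICOM images and how many remain once those are set aside.

```python
# file_identifier_system counts copied from the output above.
files_by_system = {"IDC": 4874560, "GDC": 47374, "PDC": 3048}
# file_format count for DICOM, also copied from the output above.
dicom_files = 4874560

total_files = sum(files_by_system.values())
non_dicom = total_files - dicom_files

print(total_files)
print(non_dicom)
print(f"{dicom_files / total_files:.1%} of matching files are DICOM images")
```

In this snapshot the DICOM count equals the IDC count, so setting imaging aside leaves only the GDC and PDC files.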

File Field Definitions

A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.

id (`total`)	The overall number of files returned.
identifier.value(`system`)	The identifier for the data provider.
data_category	Broad categorization of the contents of the data file.
data_type	Specific content type of the data file.
file_format	Format of the data files.

mutation¶

The mutation endpoint returns all mutations that match our query:

In [13]:

            
myquery.mutation.count.run()

Getting results from database

                Http Status: 500
                Error Message: Unrecognized name: case_barcode at [1:784]

                        Total execution time: 0
                        min 0.643 sec 643 ms

Files from a single endpoint (endpoint chaining)¶

If you want all file formats and data types, but only from a specific endpoint, you can also filter the file results by chaining endpoints together. This will return all the files that match our search AND that are specifically from specimens:

In [14]:

            
myquery.specimen.file.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.295 sec 3295 ms

   total : 47327

data_category	count
Sequencing Reads	5897
Copy Number Variation	6959
Peptide Spectral Matches	1524
Simple Nucleotide Variation	16970
Structural Variation	3384
Processed Mass Spectra	762
Raw Mass Spectra	762
Biospecimen	3629
Proteome Profiling	679
Transcriptome Profiling	3116
Somatic Structural Variation	303
DNA Methylation	3342

data_type	count
Masked Copy Number Segment	2231
Raw Simple Somatic Mutation	5033
Aligned Reads	5897
Open Standard	1524
Gene Expression Quantification	906
Copy Number Segment	2336
Splice Junction Quantification	870
Allele-specific Copy Number Segment	1071
Annotated Somatic Mutation	9462
Proprietary	762
Transcript Fusion	3385
Text	762
Gene Level Copy Number	1222
Protein Expression Quantification	679
Masked Somatic Mutation	1146
Aggregated Somatic Mutation	1146
miRNA Expression Quantification	643
Slide Image	3629
Masked Intensities	2228
Isoform Expression Quantification	643
Methylation Beta Value	1114
Masked Annotated Somatic Mutation	183
Structural Rearrangement	302
Single Cell Analysis	36
Gene Level Copy Number Scores	99
Differential Gene Expression	18

file_format	count
VCF	9915
TSV	4561
TXT	8897
MAF	7206
BAM	5897
tsv	762
vendor-specific	762
mzML	762
mzIdentML	762
IDAT	2228
SVS	3629
BEDPE	1892
MEX	36
HDF5	18

file_identifier_system	count
GDC	44279
PDC	3048

Out[14]:

Learn more about chaining endpoints in the Chaining endpoints notebook.
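
To see what the specimen chain leaves out, one option is to compare the data_category counts from the two file summaries. The sketch below does this with values copied by hand from the myquery.file and myquery.specimen.file outputs on this page; it is an offline comparison, not a cdapython feature.

```python
# data_category counts from myquery.file.count.run()
all_files = {
    "Imaging": 4874560, "Sequencing Reads": 5897,
    "Simple Nucleotide Variation": 16970, "Peptide Spectral Matches": 1524,
    "Raw Mass Spectra": 762, "Structural Variation": 3384,
    "Biospecimen": 5583, "Copy Number Variation": 6913,
    "Processed Mass Spectra": 762, "DNA Methylation": 3342,
    "Transcriptome Profiling": 3116, "Clinical": 1187,
    "Proteome Profiling": 679, "Somatic Structural Variation": 303,
}

# data_category counts from myquery.specimen.file.count.run()
specimen_files = {
    "Sequencing Reads": 5897, "Copy Number Variation": 6959,
    "Peptide Spectral Matches": 1524, "Simple Nucleotide Variation": 16970,
    "Structural Variation": 3384, "Processed Mass Spectra": 762,
    "Raw Mass Spectra": 762, "Biospecimen": 3629,
    "Proteome Profiling": 679, "Transcriptome Profiling": 3116,
    "Somatic Structural Variation": 303, "DNA Methylation": 3342,
}

# Keep only the categories whose counts differ between the two results.
diff = {
    cat: all_files.get(cat, 0) - specimen_files.get(cat, 0)
    for cat in sorted(set(all_files) | set(specimen_files))
    if all_files.get(cat, 0) != specimen_files.get(cat, 0)
}

for cat, delta in diff.items():
    print(f"{cat}: {delta}")
```

Most of the gap is imaging and clinical files, which are not tied to specimens in these results. Copy Number Variation comes out slightly negative, meaning the specimen chain reports a few more of those files than the plain file summary does in this snapshot.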

Last update: 2022-11-03