Summarize Search Results¶
The CDA provides a custom Python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:
- Q() builds a query that can be used by run() or count()
- Q.run() returns data for the specified search
- Q.count() returns summary information (counts) for data that fit the specified search
- columns() returns entity field names
- unique_terms() returns entity field contents
Before we do any work, we need to import several functions from cdapython:
- Q and query, which power the search
- columns, which lets us view entity field names
- unique_terms, which lets us view entity field contents
We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt. settings pre-configure how itables should display our tables, with scrolling and paging enabled.
Finally, we're telling cdapython to report its version so we can be sure we're using the one we mean to:
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
2022.12.21
- The Proteomic Data Commons (PDC)
- The Genomic Data Commons (GDC)
- The Imaging Data Commons (IDC)
- subject: A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subjects' privacy.
- researchsubject: A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subjects' privacy. This entity plays the role of the case_id in existing data. A subject who participates in 3 studies will have 3 researchsubject IDs.
- specimen: Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
- mutation: Molecular data about specific mutations, currently limited to the TCGA-READ project from GDC.
- file: A unit of data about subjects, researchsubjects, specimens, or their associated information.
- diagnosis: A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.
- treatment: Represents medication administration or other treatment types.
If you are looking to build a cohort of distinct individuals who meet some criteria, search by subject. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by researchsubject. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with specimen. If you are primarily looking for files you can reuse for your own analysis, add .file to your call.
In the subject, researchsubject, or specimen tables, all of the rows will have one or more files associated with them that can be directly found by chaining, as in specimen.file. Diagnosis and treatment do not have files directly associated with them, so a query statement of diagnosis.file or treatment.file will not work. The mutation table does have files associated with it, but currently they cannot be accessed with mutation.file. Look for this feature in a later release.
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
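To make the spreadsheet analogy concrete, here is a minimal sketch in plain Python. The rows, the column names other than primary_diagnosis_site, and the helper `search` are invented for illustration; this is not how cdapython works internally, only the "filter rows, then select columns" idea.

```python
# Toy stand-in for the "enormous spreadsheet"; these rows are invented for illustration.
rows = [
    {"subject_id": "S1", "sex": "female", "primary_diagnosis_site": "brain"},
    {"subject_id": "S2", "sex": "male", "primary_diagnosis_site": "lung"},
    {"subject_id": "S3", "sex": "male", "primary_diagnosis_site": "brain"},
]


def search(table, wanted_columns, **criteria):
    """Keep rows that match every criterion exactly, then keep only the wanted columns."""
    matching = [
        row for row in table
        if all(row.get(key) == value for key, value in criteria.items())
    ]
    return [{col: row[col] for col in wanted_columns} for row in matching]


# Filter rows (brain only), then select two columns.
brain_subjects = search(rows, ["subject_id", "sex"], primary_diagnosis_site="brain")
print(brain_subjects)
```

In the CDA, the filter corresponds to the query string passed to Q, and the choice of endpoint decides which columns come back by default.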
Getting simple summary data¶
Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in Q and save it to a variable myquery. This is the same query we ran in the Basic Search notebook:
myquery = Q('primary_diagnosis_site = "brain"')
Where did those terms come from?
If you aren't sure how we knew what terms to put in our search, please refer back to the What search terms are available? notebook.
Overall summary¶
You can get a quick summary of how many unique specimens, treatments, diagnoses, researchsubjects and subjects meet your search criteria by chaining a count command into the basic run call.
myquery.count.run()
Getting results from database
Total execution time: 0 min 3.24 sec 3240 ms
specimen_count : 39220
treatment_count : 2396
diagnosis_count : 1757
mutation_count : 904
researchsubject_count : 4347
subject_count : 3015
These numbers are how many total rows of data will come back when querying the various endpoints.
subject summary¶
We can also add count to the other run calls we did in the Basic Search notebook to get more detailed summaries:
subjectresults = myquery.subject.count.run()
Getting results from database
Total execution time: 0 min 3.37 sec 3370 ms
myquery.subject.count.run()
Getting results from database
Total execution time: 0 min 3.269 sec 3269 ms
files : 4924982
total : 3015
sex | count |
---|---|
None | 1378 |
female | 653 |
male | 981 |
not reported | 3 |
race | count |
---|---|
None | 1378 |
white | 1312 |
not reported | 136 |
asian | 33 |
black or african american | 96 |
american indian or alaska native | 4 |
Unknown | 21 |
other | 9 |
not allowed to collect | 25 |
native hawaiian or other pacific islander | 1 |
ethnicity | count |
---|---|
None | 1378 |
not hispanic or latino | 1286 |
not reported | 219 |
hispanic or latino | 85 |
Unknown | 22 |
not allowed to collect | 25 |
cause_of_death | count |
---|---|
None | 2746 |
Not Reported | 199 |
Cancer Related | 48 |
Surgical Complications | 2 |
Unknown | 8 |
Not Cancer Related | 9 |
Infection | 3 |
subject_identifier_system | count |
---|---|
IDC | 2585 |
PDC | 309 |
GDC | 1455 |
Since we saved the output of the first call to the variable subjectresults, we need to look at the variable to see those results:
subjectresults
files : 4924982
total : 3015
sex | count |
---|---|
None | 1378 |
female | 653 |
male | 981 |
not reported | 3 |
race | count |
---|---|
None | 1378 |
white | 1312 |
not reported | 136 |
asian | 33 |
black or african american | 96 |
american indian or alaska native | 4 |
Unknown | 21 |
other | 9 |
not allowed to collect | 25 |
native hawaiian or other pacific islander | 1 |
ethnicity | count |
---|---|
None | 1378 |
not hispanic or latino | 1286 |
not reported | 219 |
hispanic or latino | 85 |
Unknown | 22 |
not allowed to collect | 25 |
cause_of_death | count |
---|---|
None | 2746 |
Not Reported | 199 |
Cancer Related | 48 |
Surgical Complications | 2 |
Unknown | 8 |
Not Cancer Related | 9 |
Infection | 3 |
subject_identifier_system | count |
---|---|
IDC | 2585 |
PDC | 309 |
GDC | 1455 |
By default, the results are displayed as a table for easy previewing of the data. Since we queried the subject endpoint, our default results tell us subject-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using counts gives us back a nice pivot-table-style summary of the countable fields for subjects. Note that above the table it also tells you the total subject count, as well as how many files are associated with those subjects.
Subject Field Definitions
A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subjects' privacy.
id (`total`) | The overall number of subjects returned. |
files | The number of files that match this search. |
identifier.value(`system`) | The identifier for the data provider. |
species | The taxonomic group (e.g. species) of the subject. |
sex | The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics. |
race | An arbitrary classification of a taxonomic group that is a division of a species. |
ethnicity | An individual's self-described social and cultural grouping. |
cause_of_death | The cause of death, if known |
This gives you a quick way to assess whether the full search results will have the data fields you require. If you want the underlying data for your own downstream applications, you can also get the raw numbers by indexing the first (zeroth) element of the variable:
subjectresults[0]
{'files': 4924982, 'total': 3015, 'sex': [{'sex': 'NULL', 'count': 1378}, {'sex': 'female', 'count': 653}, {'sex': 'male', 'count': 981}, {'sex': 'not reported', 'count': 3}], 'race': [{'race': 'NULL', 'count': 1378}, {'race': 'white', 'count': 1312}, {'race': 'not reported', 'count': 136}, {'race': 'asian', 'count': 33}, {'race': 'black or african american', 'count': 96}, {'race': 'american indian or alaska native', 'count': 4}, {'race': 'Unknown', 'count': 21}, {'race': 'other', 'count': 9}, {'race': 'not allowed to collect', 'count': 25}, {'race': 'native hawaiian or other pacific islander', 'count': 1}], 'ethnicity': [{'ethnicity': 'NULL', 'count': 1378}, {'ethnicity': 'not hispanic or latino', 'count': 1286}, {'ethnicity': 'not reported', 'count': 219}, {'ethnicity': 'hispanic or latino', 'count': 85}, {'ethnicity': 'Unknown', 'count': 22}, {'ethnicity': 'not allowed to collect', 'count': 25}], 'cause_of_death': [{'cause_of_death': 'NULL', 'count': 2746}, {'cause_of_death': 'Not Reported', 'count': 199}, {'cause_of_death': 'Cancer Related', 'count': 48}, {'cause_of_death': 'Surgical Complications', 'count': 2}, {'cause_of_death': 'Unknown', 'count': 8}, {'cause_of_death': 'Not Cancer Related', 'count': 9}, {'cause_of_death': 'Infection', 'count': 3}], 'subject_identifier_system': [{'subject_identifier_system': 'IDC', 'count': 2585}, {'subject_identifier_system': 'PDC', 'count': 309}, {'subject_identifier_system': 'GDC', 'count': 1455}]}
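The raw output is a plain dictionary, so it can be reshaped with ordinary Python. The sketch below copies only the `total` and `sex` entries from the output above and turns them into rows with a percentage column; the helper name `count_table` is hypothetical and not part of cdapython.

```python
# Subset of the raw summary shown above (only `total` and `sex` are copied here).
summary = {
    "total": 3015,
    "sex": [
        {"sex": "NULL", "count": 1378},
        {"sex": "female", "count": 653},
        {"sex": "male", "count": 981},
        {"sex": "not reported", "count": 3},
    ],
}


def count_table(summary, field):
    """Return (value, count, percent of total) rows for one field, largest count first."""
    total = summary["total"]
    rows = [
        (entry[field], entry["count"], round(100 * entry["count"] / total, 1))
        for entry in summary[field]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


sex_rows = count_table(summary, "sex")
for value, count, percent in sex_rows:
    print(f"{value:<15}{count:>6}{percent:>7}%")
```

In the notebook, the same helper could be applied to the full result, for example `count_table(subjectresults[0], "race")`, assuming the element has the dictionary shape shown above.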
researchsubject¶
If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint. Let's run it without saving to a variable this time to make it a bit quicker:
myquery.researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.302 sec 3302 ms
files : 4924962
total : 4347
primary_diagnosis_condition | count |
---|---|
Gliomas | 1247 |
Glioblastoma | 100 |
Other | 10 |
None | 2583 |
Pediatric/AYA Brain Tumors | 199 |
Neoplasms, NOS | 66 |
Germ Cell Neoplasms | 104 |
Not Reported | 11 |
Not Applicable | 9 |
Malignant Lymphomas, NOS or Diffuse | 14 |
Mature B-Cell Lymphomas | 2 |
Neuroepitheliomatous Neoplasms | 2 |
primary_diagnosis_site | count |
---|---|
Brain | 4347 |
researchsubject_identifier_system | count |
---|---|
GDC | 1455 |
PDC | 309 |
IDC | 2583 |
ResearchSubject Field Definitions
A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subjects' privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.
id (`total`) | The overall number of researchsubjects returned. |
files | The number of files that match this search. |
identifier.value(`system`) | The identifier for the data provider. |
primary_diagnosis_condition | The text term used to describe the type of malignant disease. |
primary_diagnosis_site | The text term used to describe the primary site of disease. |
diagnosis¶
The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:
myquery.diagnosis.count.run()
Getting results from database
Total execution time: 0 min 3.32 sec 3320 ms
total : 1757
primary_diagnosis | count |
---|---|
Glioblastoma | 822 |
Glioblastoma multiforme | 4 |
Astrocytoma, anaplastic | 130 |
Craniopharyngioma | 16 |
Oligodendroglioma, anaplastic | 78 |
Astrocytoma, NOS | 64 |
Mixed glioma | 131 |
Ependymoma, NOS | 32 |
Malignant lymphoma, NOS | 14 |
Glioma, malignant | 26 |
Ganglioglioma, NOS | 18 |
Oligodendroglioma, NOS | 112 |
Yolk sac tumor | 8 |
Neoplasm, malignant | 50 |
Medulloblastoma, NOS | 22 |
Glioma, NOS | 93 |
Mixed germ cell tumor | 79 |
Malignant lymphoma, large B-cell, diffuse, NOS | 2 |
Gliosarcoma | 1 |
Embryonal carcinoma, NOS | 8 |
Atypical teratoid/rhabdoid tumor | 12 |
Neoplasm, uncertain whether benign or malignant | 13 |
Teratoma, benign | 3 |
Germinoma | 4 |
Not Reported | 10 |
Papillary glioneuronal tumor | 2 |
Teratoma, malignant, NOS | 2 |
Oligoastrocytoma | 1 |
stage | count |
---|---|
None | 1428 |
Unknown | 219 |
Not Reported | 110 |
grade | count |
---|---|
High Grade | 26 |
not reported | 1116 |
Not Reported | 392 |
G1 | 98 |
G2 | 52 |
G4 | 36 |
None | 28 |
Low Grade | 9 |
diagnosis_identifier_system | count |
---|---|
GDC | 1428 |
PDC | 329 |
Diagnosis Field Definitions
A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.
id (`total`) | The overall number of diagnoses returned. |
identifier.value(`system`) | The identifier for the data provider. |
primary_diagnosis | The diagnosis instance that qualified a subject for inclusion on a ResearchProject. |
stage | The extent of a cancer in the body. |
grade | The degree of abnormality of cancer cells. |
treatment¶
The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:
myquery.treatment.count.run()
Getting results from database
Total execution time: 0 min 3.293 sec 3293 ms
total : 2396
treatment_type | count |
---|---|
None | 2396 |
treatment_effect | count |
---|---|
None | 2396 |
treatment_identifier_system | count |
---|---|
GDC | 2396 |
Treatment Field Definitions
Medication administration or other treatment types. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.
id (`total`) | The overall number of treatments returned. |
identifier.value(`system`) | The identifier for the data provider. |
treatment_type | The treatment type including medication/therapeutics or other procedures. |
treatment_effect | The effect of a treatment on the diagnosis or tumor. |
specimens¶
We can use this same query to see what specimens are available for brain tissue at the CDA:
myquery.specimen.count.run()
Getting results from database
Total execution time: 0 min 3.34 sec 3340 ms
files : 47327
total : 39220
primary_disease_type | count |
---|---|
Gliomas | 37586 |
Glioblastoma | 200 |
Other | 20 |
Not Reported | 121 |
Pediatric/AYA Brain Tumors | 438 |
Germ Cell Neoplasms | 416 |
Neoplasms, NOS | 285 |
Not Applicable | 36 |
Malignant Lymphomas, NOS or Diffuse | 56 |
Mature B-Cell Lymphomas | 54 |
Neuroepitheliomatous Neoplasms | 8 |
source_material_type | count |
---|---|
Primary Tumor | 27578 |
Blood Derived Normal | 10078 |
Recurrent Tumor | 513 |
Solid Tissue Normal | 538 |
Next Generation Cancer Model | 176 |
Metastatic | 252 |
Expanded Next Generation Cancer Model | 35 |
Not Reported | 36 |
Buccal Cell Normal | 14 |
specimen_type | count |
---|---|
aliquot | 18701 |
analyte | 6676 |
slide | 3754 |
sample | 4093 |
portion | 5996 |
specimen_identifier_system | count |
---|---|
GDC | 38562 |
PDC | 658 |
Nearly 40,000 specimens with over 47,000 files meet our search criteria! We would typically expect the number of specimens to be much larger than our number of subjects or researchsubjects: first, because studies will often take more than one sample per subject, and second, because any given specimen might be aliquoted out to be used in multiple tests.
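As a rough check of that expectation, the counts from the overall summary can be turned into specimens-per-entity ratios. This is a minimal sketch using the numbers printed earlier on this page; the helper `per_entity` is hypothetical, and the result is an average only, not a per-subject breakdown.

```python
# Counts copied from the overall summary output earlier on this page.
counts = {
    "specimen_count": 39220,
    "researchsubject_count": 4347,
    "subject_count": 3015,
}


def per_entity(counts, numerator, denominator):
    """Average number of `numerator` rows per `denominator` row."""
    return counts[numerator] / counts[denominator]


specimens_per_subject = per_entity(counts, "specimen_count", "subject_count")
specimens_per_researchsubject = per_entity(counts, "specimen_count", "researchsubject_count")

print(f"specimens per subject:         {specimens_per_subject:.1f}")
print(f"specimens per researchsubject: {specimens_per_researchsubject:.1f}")
```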
Specimen Field Definitions
Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation. A given specimen will have only a single subject ID and a single research subject ID.
id (`total`) | The overall number of specimens returned. |
files | The number of files that match this search. |
identifier.value(`system`) | The identifier for the data provider. |
primary_disease_type | The text term used to describe the type of malignant disease. |
source_material_type | The general kind of material from which the specimen was derived. |
specimen_type | The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide. |
file¶
The file endpoint returns all files that match our query:
myquery.file.count.run()
Getting results from database
Total execution time: 0 min 3.298 sec 3298 ms
total : 4924982
data_category | count |
---|---|
Imaging | 4874560 |
Sequencing Reads | 5897 |
Simple Nucleotide Variation | 16970 |
Peptide Spectral Matches | 1524 |
Raw Mass Spectra | 762 |
Structural Variation | 3384 |
Biospecimen | 5583 |
Copy Number Variation | 6913 |
Processed Mass Spectra | 762 |
DNA Methylation | 3342 |
Transcriptome Profiling | 3116 |
Clinical | 1187 |
Proteome Profiling | 679 |
Somatic Structural Variation | 303 |
data_type | count |
---|---|
None | 4874560 |
Gene Level Copy Number | 1222 |
Transcript Fusion | 3385 |
Aligned Reads | 5897 |
Proprietary | 762 |
Protein Expression Quantification | 679 |
Raw Simple Somatic Mutation | 5033 |
Annotated Somatic Mutation | 9462 |
Masked Intensities | 2228 |
Open Standard | 1524 |
Allele-specific Copy Number Segment | 1071 |
Clinical Supplement | 1182 |
Biospecimen Supplement | 1954 |
Text | 762 |
Slide Image | 3629 |
Copy Number Segment | 2336 |
Masked Copy Number Segment | 2185 |
Structural Rearrangement | 302 |
Masked Somatic Mutation | 1146 |
Masked Annotated Somatic Mutation | 183 |
Isoform Expression Quantification | 643 |
Gene Expression Quantification | 906 |
Methylation Beta Value | 1114 |
Aggregated Somatic Mutation | 1146 |
miRNA Expression Quantification | 643 |
Splice Junction Quantification | 870 |
Differential Gene Expression | 18 |
Gene Level Copy Number Scores | 99 |
Pathology Report | 5 |
Single Cell Analysis | 36 |
file_format | count |
---|---|
DICOM | 4874560 |
MAF | 7206 |
mzML | 762 |
IDAT | 2228 |
VCF | 9915 |
mzIdentML | 762 |
vendor-specific | 762 |
BAM | 5897 |
tsv | 762 |
BCR Biotab | 49 |
SVS | 3629 |
TXT | 8851 |
BCR SSF XML | 758 |
TSV | 4561 |
BEDPE | 1892 |
BCR XML | 2282 |
HDF5 | 18 |
MEX | 36 |
CDC JSON | 28 |
BCR OMF XML | 19 |
5 |
file_identifier_system | count |
---|---|
IDC | 4874560 |
GDC | 47374 |
PDC | 3048 |
There are a huge number of files (4924982) that match our search. We would likely want to filter the results further by file format or data type to get only files we can use. See all the ways you can filter and refine searches with more search terms in the Operators notebook.
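Before writing a narrower query, the count summary itself can give a rough idea of how many files a format filter would leave. The sketch below copies a few file_format counts from the output above; the helper `files_remaining` is hypothetical, and it assumes each file is counted under exactly one format.

```python
# A few file_format counts copied from the summary above (not the full list).
file_format_counts = {
    "DICOM": 4874560,
    "MAF": 7206,
    "VCF": 9915,
    "BAM": 5897,
    "SVS": 3629,
}


def files_remaining(format_counts, keep):
    """Sum the counts of the formats we want to keep.

    Assumes each file is counted under exactly one format.
    """
    return sum(count for fmt, count in format_counts.items() if fmt in keep)


remaining = files_remaining(file_format_counts, {"BAM", "VCF"})
print(f"Files left after keeping BAM and VCF only: {remaining}")
```

This only estimates the size of a filtered result; the filtering itself still has to be expressed in the query.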
File Field Definitions
A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.
id (`total`) | The overall number of files returned. |
identifier.value(`system`) | The identifier for the data provider. |
data_category | Broad categorization of the contents of the data file. |
data_type | Specific content type of the data file. |
file_format | Format of the data files. |
mutation¶
The mutation endpoint returns all mutations that match our query:
myquery.mutation.count.run()
Getting results from database
Http Status: 500 Error Message: Unrecognized name: case_barcode at [1:784]
Total execution time: 0 min 0.643 sec 643 ms
Files from a single endpoint (endpoint chaining)¶
If you want all file formats and data types, but only from a specific endpoint, you can also filter the file results by chaining endpoints together. This will return all the files that match our search AND that are specifically from specimens:
myquery.specimen.file.count.run()
Getting results from database
Total execution time: 0 min 3.295 sec 3295 ms
total : 47327
data_category | count |
---|---|
Sequencing Reads | 5897 |
Copy Number Variation | 6959 |
Peptide Spectral Matches | 1524 |
Simple Nucleotide Variation | 16970 |
Structural Variation | 3384 |
Processed Mass Spectra | 762 |
Raw Mass Spectra | 762 |
Biospecimen | 3629 |
Proteome Profiling | 679 |
Transcriptome Profiling | 3116 |
Somatic Structural Variation | 303 |
DNA Methylation | 3342 |
data_type | count |
---|---|
Masked Copy Number Segment | 2231 |
Raw Simple Somatic Mutation | 5033 |
Aligned Reads | 5897 |
Open Standard | 1524 |
Gene Expression Quantification | 906 |
Copy Number Segment | 2336 |
Splice Junction Quantification | 870 |
Allele-specific Copy Number Segment | 1071 |
Annotated Somatic Mutation | 9462 |
Proprietary | 762 |
Transcript Fusion | 3385 |
Text | 762 |
Gene Level Copy Number | 1222 |
Protein Expression Quantification | 679 |
Masked Somatic Mutation | 1146 |
Aggregated Somatic Mutation | 1146 |
miRNA Expression Quantification | 643 |
Slide Image | 3629 |
Masked Intensities | 2228 |
Isoform Expression Quantification | 643 |
Methylation Beta Value | 1114 |
Masked Annotated Somatic Mutation | 183 |
Structural Rearrangement | 302 |
Single Cell Analysis | 36 |
Gene Level Copy Number Scores | 99 |
Differential Gene Expression | 18 |
file_format | count |
---|---|
VCF | 9915 |
TSV | 4561 |
TXT | 8897 |
MAF | 7206 |
BAM | 5897 |
tsv | 762 |
vendor-specific | 762 |
mzML | 762 |
mzIdentML | 762 |
IDAT | 2228 |
SVS | 3629 |
BEDPE | 1892 |
MEX | 36 |
HDF5 | 18 |
file_identifier_system | count |
---|---|
GDC | 44279 |
PDC | 3048 |
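To see what the chained search leaves out, the data_category counts from the file endpoint can be compared with those from specimen.file. This is a minimal sketch using four categories copied from the two outputs on this page; the helper `category_difference` is hypothetical and only subtracts the printed counts.

```python
# data_category counts copied from the file endpoint output (subset).
all_files = {
    "Imaging": 4874560,
    "Biospecimen": 5583,
    "Clinical": 1187,
    "Sequencing Reads": 5897,
}

# data_category counts copied from the specimen.file output (subset).
specimen_files = {
    "Biospecimen": 3629,
    "Sequencing Reads": 5897,
}


def category_difference(all_counts, subset_counts):
    """Return, per category, how many files are in `all_counts` but not in `subset_counts`."""
    return {
        category: count - subset_counts.get(category, 0)
        for category, count in all_counts.items()
        if count != subset_counts.get(category, 0)
    }


not_from_specimens = category_difference(all_files, specimen_files)
print(not_from_specimens)
```

Categories that are absent from the specimen.file output, such as Imaging and Clinical here, show up with their full count.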
Learn more about chaining endpoints in the Chaining endpoints notebook.