Cohort Building¶
Example use case:
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes to test for their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.
Getting Started¶
The CDA provides a custom Python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:
- Q() builds a query that can be used by run() or count()
- Q.run() returns data for the specified search
- Q.count() returns summary information (counts) for data that fit the specified search
- columns() returns entity field names
- unique_terms() returns entity field contents
Before Julia does any work, she needs to import these functions from cdapython.
She'll also need to import pandas to work with dataframes and itables to display them nicely. The opt. settings pre-configure how itables should display her tables, with scrolling and paging enabled.
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
2022.12.21
In this release, the CDA searches data from the following NCI data commons:
- The Proteomic Data Commons (PDC)
- The Genomic Data Commons (GDC)
- The Imaging Data Commons (IDC)
The data is organized into the following entities:
- subject: A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.
- researchsubject: A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.
- specimen: Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
- file: A unit of data about subjects, researchsubjects, specimens, or their associated information
- mutation: Molecular data about specific mutations, currently limited to the TCGA-READ project from GDC.
- diagnosis: A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.
- treatment: Represent medication administration or other treatment types.
Finding Search Terms¶
To see what search fields are available, Julia starts by using the columns command:
columns().to_dataframe()
fieldName | endpoint | description | type | mode |
---|---|---|---|---|
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she searches the description and fieldName columns for "diagnosis":
columns().to_dataframe(search_fields=["description", "fieldName"],search_value="diagnosis")
fieldName | endpoint | description | type | mode | |
---|---|---|---|---|---|
Since Julia is interested specifically in uterine cancers, she looks for columns that appear to hold anatomical data. She then uses the unique_terms function to see what values are available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site', checking whether 'uterine' appears. She uses the show_counts flag so she can also see how much data is associated with each term:
unique_terms("treatment_anatomic_site", show_counts = True).to_dataframe()
treatment_anatomic_site | Count |
---|---|
unique_terms("primary_diagnosis_site", show_counts = True).to_dataframe()
primary_diagnosis_site | Count |
---|---|
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:
unique_terms("primary_diagnosis_site").to_list(filters="uter")
['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
Just to be sure, Julia also searches for any other instances of "cervix":
unique_terms("primary_diagnosis_site").to_list(filters="cerv")
['Cervix', 'Cervix uteri']
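To make the effect of the filters argument concrete, here is a minimal plain-Python sketch of a case-insensitive substring filter applied to the site terms seen in this tutorial. The helper filter_terms is hypothetical and is not how cdapython implements its filter; it only illustrates, under the assumption of case-insensitive matching, why 'uter' also catches 'Cervix uteri'.

```python
# Illustrative only: filter_terms is a hypothetical helper, not part of cdapython.
site_terms = ["Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]


def filter_terms(terms, fragment):
    """Return the terms that contain `fragment`, ignoring case."""
    needle = fragment.lower()
    return [term for term in terms if needle in term.lower()]


uter_matches = filter_terms(site_terms, "uter")
cerv_matches = filter_terms(site_terms, "cerv")
print(uter_matches)  # ['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
print(cerv_matches)  # ['Cervix', 'Cervix uteri']
```

Because 'Cervix uteri' shows up under both fragments, any query built from the two needs to treat them as alternatives rather than as separate result sets.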
Building a Query¶
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of Q statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the = operator to get only exact matches:
Tsite = Q('treatment_anatomic_site = "Cervix"')
However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, Q can also run fuzzy searches, and it can search more than one term at a time, so Julia writes one big Q statement to grab everything that matches either 'uter' or 'cerv':
Dsite = Q('primary_diagnosis_site = "%uter%" OR primary_diagnosis_site = "%cerv%"')
Finally, Julia adds her two queries together into one large one:
ALLDATA = Tsite.OR(Dsite)
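The logic of the combined query can be sketched with pandas boolean masks on a tiny made-up table. This is only an analogy for what the query selects, not a description of how the CDA service evaluates it; the rows are hypothetical and case-insensitive matching for the "%...%" patterns is an assumption based on the terms returned earlier.

```python
import pandas as pd

# Hypothetical rows; the real CDA tables are much larger.
toy = pd.DataFrame({
    "treatment_anatomic_site": ["Cervix", None, None, "Lung"],
    "primary_diagnosis_site": ["Cervix uteri", "Uterus, NOS", "Breast", "Lung"],
})

# Exact match, analogous to Q('treatment_anatomic_site = "Cervix"')
tsite = toy["treatment_anatomic_site"] == "Cervix"

# Substring match, analogous to the "%uter%" and "%cerv%" patterns
site = toy["primary_diagnosis_site"].str.lower()
dsite = site.str.contains("uter", regex=False) | site.str.contains("cerv", regex=False)

# Analogous to Tsite.OR(Dsite): keep a row if either condition holds
all_data = toy[tsite | dsite]
print(all_data)
```

Only the first two rows survive: the first satisfies both conditions and the second satisfies the diagnosis-site condition alone.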
Looking at Summary Data¶
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using count:
ALLDATA.count.run()
Getting results from database
Total execution time: 0 min 3.281 sec 3281 ms
specimen_count : 41069
treatment_count : 3049
diagnosis_count : 3685
mutation_count : 903
researchsubject_count : 4869
subject_count : 3742
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using count:
ALLDATA.researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.321 sec 3321 ms
files : 322958
total : 4869
primary_diagnosis_condition | count |
---|---|
Myomatous Neoplasms | 188 |
Squamous Cell Neoplasms | 609 |
Uterine Corpus Endometrial Carcinoma | 104 |
Adenomas and Adenocarcinomas | 1672 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Not Reported | 12 |
Complex Mixed and Stromal Neoplasms | 320 |
None | 1175 |
Epithelial Neoplasms, NOS | 230 |
Complex Epithelial Neoplasms | 27 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Cervix uteri | 915 |
Corpus uteri | 780 |
Uterus, NOS | 2000 |
Uterus | 867 |
Cervix | 307 |
researchsubject_identifier_system | count |
---|---|
GDC | 3591 |
PDC | 104 |
IDC | 1174 |
Refining Queries¶
Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects have a condition of Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as they might have different mechanisms. So she adds a new filter to her query:
Noadeno = Q('primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')
NoAdenoData = ALLDATA.AND(Noadeno)
NoAdenoData.researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.358 sec 3358 ms
files : 297379
total : 3197
primary_diagnosis_condition | count |
---|---|
Myomatous Neoplasms | 188 |
Squamous Cell Neoplasms | 609 |
Uterine Corpus Endometrial Carcinoma | 104 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Not Reported | 12 |
Complex Mixed and Stromal Neoplasms | 320 |
None | 1175 |
Epithelial Neoplasms, NOS | 230 |
Complex Epithelial Neoplasms | 27 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Cervix uteri | 688 |
Uterus, NOS | 962 |
Corpus uteri | 373 |
Uterus | 867 |
Cervix | 307 |
researchsubject_identifier_system | count |
---|---|
GDC | 1919 |
PDC | 104 |
IDC | 1174 |
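As a rough analogy for the refinement above, the pandas sketch below combines an "in scope" mask with a "not equal" mask using AND. The rows are made up, and the sketch says nothing about how the CDA service handles the comparison internally; it simply shows that in pandas a row with no recorded condition passes a != test, which is consistent with the None row remaining in the summary.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "primary_diagnosis_site": ["Uterus, NOS", "Cervix uteri", "Uterus", "Breast"],
    "primary_diagnosis_condition": [
        "Adenomas and Adenocarcinomas",
        "Squamous Cell Neoplasms",
        None,
        "Squamous Cell Neoplasms",
    ],
})

site = toy["primary_diagnosis_site"].str.lower()
in_scope = site.str.contains("uter", regex=False) | site.str.contains("cerv", regex=False)

# Rows with a missing condition compare as "not equal" and are therefore kept.
not_adeno = toy["primary_diagnosis_condition"] != "Adenomas and Adenocarcinomas"

# Analogous to ALLDATA.AND(Noadeno): both conditions must hold.
no_adeno_data = toy[in_scope & not_adeno]
print(no_adeno_data)
```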
She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work. Since she's mostly interested in the kinds of data available from each endpoint, she views each result as a dataframe:
NoAdenoData.researchsubject.run().to_dataframe() # view the dataframe
Getting results from database
Total execution time: 0 min 3.308 sec 3308 ms
researchsubject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
---|---|---|---|---|---|
ResearchSubject Field Definitions
A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.
- id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.
- identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
- identifier.system: The system or namespace that defines the identifier.
- identifier.value: The value of the identifier, as defined by the system.
- member_of_research_project: A reference to the Study(s) of which this ResearchSubject is a member.
- primary_diagnosis_condition: The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.
- primary_diagnosis_site: The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.
- subject_id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the `id` field from subject results
NoAdenoData.subject.run().to_dataframe() # view the dataframe
Getting results from database
Total execution time: 0 min 3.337 sec 3337 ms
subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death |
---|---|---|---|---|---|---|---|---|---|---|
Subject Field Definitions
A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.
- id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.
- identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
- identifier.system: The system or namespace that defines the identifier.
- identifier.value: The value of the identifier, as defined by the system.
- species: The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.
- sex: The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.
- race: An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
- ethnicity: An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
- days_to_birth: Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.
- subject_associated_project: The list of Projects associated with the Subject.
- vital_status: Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.
- days_to_death: Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.
- cause_of_death: Coded value indicating the circumstance or condition that results in the death of the subject.
NoAdenoData.subject.file.run().to_dataframe() # view the dataframe
Getting results from database
Total execution time: 0 min 3.479 sec 3479 ms
file_id | file_identifier | label | data_category | data_type | file_format | file_associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | imaging_series | subject_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
File Field Definitions
A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.
- id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.
- identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
- identifier.system: The system or namespace that defines the identifier.
- identifier.value: The value of the identifier, as defined by the system.
- label: Short name or abbreviation for dataset. Maps to rdfs:label.
- data_category: Broad categorization of the contents of the data file.
- data_type: Specific content type of the data file.
- file_format: Format of the data files.
- associated_project: A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.
- drs_uri: A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.
- byte_size: Size of the file in bytes. Maps to dcat:byteSize.
- checksum: The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.
- data_modality: Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".
- imaging_modality: An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.
- dbgap_accession_number: The dbgap accession number for the project.
Working with Results (pagination)¶
Finally, Julia wants to save these results to use in the future. Since the preview dataframes only show the first 100 results of each search, she uses the auto_paginator function to get all the data from the subject and researchsubject endpoints into their own dataframes:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = researchsubs.auto_paginator(to_df=True)
Getting results from database
Total execution time: 0 min 3.355 sec 3355 ms
subs = NoAdenoData.subject.run()
subsdf = subs.auto_paginator(to_df=True)
Getting results from database
Total execution time: 0 min 3.318 sec 3318 ms
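The general idea behind pagination can be sketched without any network access. In the sketch below, fetch_page is a hypothetical stand-in for a single paged request (it is not a cdapython function), and fetch_all keeps requesting pages of 100 rows until an empty page comes back. It is a simplified illustration of the pattern, not the internals of auto_paginator.

```python
import pandas as pd

PAGE_SIZE = 100  # matches the preview size mentioned above


def fetch_page(offset, limit):
    """Hypothetical stand-in for one request to a paged API (250 fake rows)."""
    rows = [{"researchsubject_id": f"rs-{i}"} for i in range(250)]
    return rows[offset:offset + limit]


def fetch_all(page_size=PAGE_SIZE):
    """Request pages until an empty one comes back, then build one dataframe."""
    pages = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        pages.append(pd.DataFrame(page))
        offset += page_size
    return pd.concat(pages, ignore_index=True)


all_rows = fetch_all()
print(len(all_rows))  # 250
```

Comparing the final row count against the total reported by the count summary is a quick way to confirm that no page was missed.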
rsdf # view the researchsubject dataframe
researchsubject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id | |
---|---|---|---|---|---|---|
subsdf # view the subject dataframe
subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|
Merging Results across Endpoints¶
Then Julia uses the subject_id field, which appears in both results, to merge them together into one big dataset. She also specifies that any other columns that are in both tables should be kept and have a suffix added to their name. This will help her to check that her merge worked correctly:
allmetadata = pd.merge(rsdf,
                       subsdf,
                       left_on="subject_id",
                       right_on="subject_id",
                       suffixes=("_rs", "_sub"))
allmetadata
researchsubject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The subject_id values from the researchsubject results appear to match the subject table perfectly, and because both tables share that key name it appears only once in the merged result. Julia then checks that her dataframe is the right size. She had 3197 researchsubject rows, so she expects 3197 rows here as well:
allmetadata.count()
Satisfied with her results, Julia saves the data out to a csv so she can browse it with Excel:
allmetadata.to_csv("allmetadata.csv")
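The merge and the size check above can also be made more explicit. The sketch below uses tiny made-up stand-ins for rsdf and subsdf and leans on two pandas merge options: validate, which raises an error if subject_id is repeated in the subject table, and indicator, which records whether each row found a match. Using how="left" here is a variation on the default inner join used above, chosen so that unmatched researchsubject rows would stay visible rather than silently disappear. Note that DataFrame.count() reports non-null values per column, while len() reports the number of rows.

```python
import pandas as pd

# Hypothetical miniature versions of rsdf and subsdf.
rs = pd.DataFrame({
    "researchsubject_id": ["rs1", "rs2", "rs3"],
    "subject_id": ["s1", "s1", "s2"],
})
subs = pd.DataFrame({
    "subject_id": ["s1", "s2"],
    "sex": ["female", "female"],
})

merged = pd.merge(
    rs,
    subs,
    on="subject_id",
    how="left",              # keep every researchsubject row
    validate="many_to_one",  # raise if subject_id repeats in subs
    indicator=True,          # add a _merge column with the match status
)

same_size = len(merged) == len(rs)
all_matched = (merged["_merge"] == "both").all()
print(same_size, all_matched)
```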
Julia knows from her researchsubject count summary that there are more than 200,000 files associated with her research subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:
NoAdenoData.researchsubject.file.count.run()
Getting results from database
Total execution time: 0 min 3.528 sec 3528 ms
total : 297379
data_category | count |
---|---|
Imaging | 265303 |
Sequencing Reads | 4142 |
Peptide Spectral Matches | 1280 |
Simple Nucleotide Variation | 9987 |
Biospecimen | 2866 |
Processed Mass Spectra | 640 |
Clinical | 774 |
Copy Number Variation | 4079 |
Raw Mass Spectra | 640 |
Structural Variation | 2192 |
DNA Methylation | 1947 |
Transcriptome Profiling | 2820 |
Proteome Profiling | 339 |
Somatic Structural Variation | 370 |
data_type | count |
---|---|
Aggregated Somatic Mutation | 732 |
None | 265303 |
Masked Intensities | 1298 |
Masked Annotated Somatic Mutation | 1024 |
Biospecimen Supplement | 1755 |
Slide Image | 1111 |
Masked Copy Number Segment | 998 |
Text | 640 |
Proprietary | 640 |
Isoform Expression Quantification | 700 |
Clinical Supplement | 773 |
Transcript Fusion | 2278 |
Masked Somatic Mutation | 549 |
Open Standard | 1280 |
Raw Simple Somatic Mutation | 2811 |
Allele-specific Copy Number Segment | 495 |
Gene Level Copy Number | 637 |
Annotated Somatic Mutation | 4871 |
Splice Junction Quantification | 710 |
Copy Number Segment | 1140 |
miRNA Expression Quantification | 700 |
Aligned Reads | 4142 |
Gene Level Copy Number Scores | 809 |
Protein Expression Quantification | 339 |
Gene Expression Quantification | 710 |
Methylation Beta Value | 649 |
Structural Rearrangement | 284 |
Pathology Report | 1 |
file_format | count |
---|---|
DICOM | 265303 |
mzIdentML | 640 |
VCF | 5480 |
TXT | 4827 |
vendor-specific | 640 |
tsv | 640 |
MAF | 4649 |
TSV | 3980 |
IDAT | 1298 |
BAM | 4142 |
mzML | 640 |
BEDPE | 1504 |
BCR XML | 1217 |
BCR Biotab | 76 |
BCR SSF XML | 517 |
BCR OMF XML | 39 |
SVS | 1111 |
BCR Auxiliary XML | 474 |
BCR PPS XML | 193 |
XLSX | 7 |
1 | |
CDC JSON | 1 |
file_identifier_system | count |
---|---|
IDC | 265303 |
GDC | 29516 |
PDC | 2560 |
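Once file metadata is in a dataframe, a similar per-column summary can be reproduced locally with pandas. The sketch below runs value_counts on a few made-up file rows; the column names follow the file results shown in this tutorial, but the values are hypothetical. Passing dropna=False keeps missing values in the tally, comparable to the None row in the data_type summary.

```python
import pandas as pd

# Hypothetical file rows for illustration only.
files = pd.DataFrame({
    "data_category": ["Imaging", "Imaging", "Biospecimen", "Sequencing Reads"],
    "data_type": [None, None, "Slide Image", "Aligned Reads"],
    "file_format": ["DICOM", "DICOM", "SVS", "BAM"],
})

# One tally per column of interest, missing values included.
summaries = {
    column: files[column].value_counts(dropna=False)
    for column in ["data_category", "data_type", "file_format"]
}

for column, counts in summaries.items():
    print(counts, end="\n\n")
```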
Julia decides that a good place to start would be with Slide Images. There are only 1111, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter on her search:
JustSlides = Q('data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()
Getting results from database
Total execution time: 0 min 3.342 sec 3342 ms
total : 1111
data_category | count |
---|---|
Biospecimen | 1111 |
data_type | count |
---|---|
Slide Image | 1111 |
file_format | count |
---|---|
SVS | 1111 |
file_identifier_system | count |
---|---|
GDC | 1111 |
Finally, Julia uses the auto_paginator function again to get all the slide files, and merges her metadata file with this file information. This way she will be able to review what phenotypes each slide is associated with:
slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = slides.auto_paginator(to_df=True)
Getting results from database
Total execution time: 0 min 3.255 sec 3255 ms
slidemetadata = pd.merge(slidesdf,
                         allmetadata,
                         left_on=["subject_id", "researchsubject_id"],
                         right_on=["subject_id", "researchsubject_id"],
                         suffixes=("_slide", "_all"))
slidemetadata
file_id | file_identifier | label | data_category | data_type | file_format | file_associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | imaging_series | researchsubject_id | subject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
slidemetadata.count()
Julia saves this dataframe to a csv as well, and now she has all the information she needs to begin work on her project. She can use the drs_uri column information to directly download the images she is interested in using a DRS resolver, or she can input the DRS URIs at a cloud workspace such as Terra or the Cancer Genomics Cloud to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.
slidemetadata.to_csv("slidemetadata.csv")
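If a resolver or workspace expects a plain list of DRS URIs, one way to prepare it is to pull the unique values out of the drs_uri column. The sketch below uses made-up rows and example.org URIs in place of real CDA results, and the one-URI-per-line manifest format is an assumption; the format a given tool accepts should be checked in its own documentation.

```python
import pandas as pd

# Hypothetical slide rows; real drs_uri values come from the CDA results.
slide_rows = pd.DataFrame({
    "file_id": ["f1", "f2", "f3"],
    "drs_uri": [
        "drs://example.org/aaa",
        "drs://example.org/bbb",
        "drs://example.org/aaa",
    ],
    "primary_diagnosis_site": ["Uterus", "Cervix uteri", "Uterus"],
})

# Drop missing and repeated URIs while keeping first-seen order.
drs_uris = slide_rows["drs_uri"].dropna().drop_duplicates().tolist()

# One URI per line; this text could be written to a file for later use.
manifest_text = "\n".join(drs_uris) + "\n"
print(manifest_text)
```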