Cohort Building

Example use case:

Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes and test their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.

Getting Started

The CDA provides a custom Python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:

Q() builds a query that can be used by run() or count()
Q.run() returns data for the specified search
Q.count() returns summary information (counts) for data that fit the specified search
columns() returns entity field names
unique_terms() returns entity field contents

Before Julia does any work, she needs to import these functions from cdapython. She'll also need to import pandas to work with dataframes and itables to display them nicely. The opt settings preconfigure how itables should display her tables, with scrolling and paging enabled.

In [1]:

from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())

2022.12.21

CDA data comes from three sources:

The Proteomic Data Commons (PDC)
The Genomic Data Commons (GDC)
The Imaging Data Commons (IDC)

The CDA makes this data searchable in five main endpoints:

subject: A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.
researchsubject: A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.
specimen: Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
file: A unit of data about subjects, researchsubjects, specimens, or their associated information
mutation: Molecular data about specific mutations, currently limited to the TCGA-READ project from GDC.

and two endpoints that offer deeper information about data in the researchsubject endpoint:

diagnosis: A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.
treatment: Represent medication administration or other treatment types.

Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
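
The spreadsheet analogy can be made concrete with a small pandas sketch. The table below is a made-up stand-in: the rows and IDs are invented for illustration and are not CDA data, and only the column names are borrowed from this page.

```python
import pandas as pd

# Toy stand-in for the "enormous spreadsheet"; every value here is invented.
toy = pd.DataFrame(
    {
        "researchsubject_id": ["rs-1", "rs-2", "rs-3", "rs-4"],
        "primary_diagnosis_site": ["Uterus", "Cervix uteri", "Lung", "Uterus, NOS"],
        "primary_diagnosis_condition": [
            "Myomatous Neoplasms",
            "Squamous Cell Neoplasms",
            "Adenomas and Adenocarcinomas",
            "Adenomas and Adenocarcinomas",
        ],
    }
)

# Filter rows: keep only rows whose site mentions "uter" (case-insensitive).
rows = toy[toy["primary_diagnosis_site"].str.contains("uter", case=False)]

# Select columns: keep only the fields of interest.
result = rows[["researchsubject_id", "primary_diagnosis_site"]]
print(result)
```

A CDA query plays the role of the row filter, and the chosen endpoint decides which columns come back by default.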

Finding Search Terms

Accordingly, to see what search fields are available, Julia starts by using the command columns:

In [2]:

columns().to_dataframe()

Out[2]:

fieldName	endpoint	description	type	mode

There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she searches the description and fieldName columns for "diagnosis":

In [3]:

columns().to_dataframe(search_fields=["description", "fieldName"],search_value="diagnosis")

Out[3]:

	fieldName	endpoint	description	type	mode

To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.

Since Julia is interested specifically in uterine cancers, she looks for columns that appear to hold anatomical data. She then uses the unique_terms function on 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site' to see whether 'uterine' appears. She sets the show_counts flag so she can also see how much data is associated with each term:

In [4]:

unique_terms("treatment_anatomic_site", show_counts = True).to_dataframe()

Out[4]:

treatment_anatomic_site	Count

In [5]:

unique_terms("primary_diagnosis_site", show_counts = True).to_dataframe()

Out[5]:

primary_diagnosis_site	Count

CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.

Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So she runs a fuzzy match filter on "ResearchSubject.primary_diagnosis_site" for 'uter', as that should cover all variants:

In [6]:

unique_terms("primary_diagnosis_site").to_list(filters="uter")

Out[6]:

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

Just to be sure, Julia also searches for any other instances of "cervix":

In [7]:

unique_terms("primary_diagnosis_site").to_list(filters="cerv")

Out[7]:

['Cervix', 'Cervix uteri']
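
Because the terms are not harmonized, it can help to flag near-duplicate spellings for manual review. The following is a minimal sketch in plain Python, using only the terms shown in the outputs above. The normalization rule (lowercasing and dropping a trailing ", NOS") is an illustrative assumption, not something the CDA applies, and terms it groups together are only candidates for review, not guaranteed synonyms.

```python
from collections import defaultdict

# Terms copied from the two filtered outputs above.
terms = ["Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]


def normalize(term: str) -> str:
    """Lowercase and drop a trailing ', NOS' qualifier (illustrative rule only)."""
    key = term.lower().strip()
    if key.endswith(", nos"):
        key = key[: -len(", nos")]
    return key


groups = defaultdict(list)
for term in terms:
    groups[normalize(term)].append(term)

# Keys that collect more than one spelling are candidates for manual review.
variants = {key: names for key, names in groups.items() if len(names) > 1}
print(variants)  # prints {'uterus': ['Uterus', 'Uterus, NOS']}
```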

Building a Query

With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of Q statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the = operator to get only exact matches:

In [8]:

Tsite = Q('treatment_anatomic_site = "Cervix"')

However, for "primary_diagnosis_site", Julia has several terms she wants to search for. Luckily, Q can also run fuzzy searches, written here with % on either side of a term fragment. It can also search for more than one term at a time, so Julia writes one big Q statement to grab everything that contains either 'uter' or 'cerv':

In [9]:

Dsite = Q('primary_diagnosis_site = "%uter%" OR primary_diagnosis_site = "%cerv%"')
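
To build intuition for which terms a pattern such as "%uter%" would match, here is a small sketch in plain Python. It assumes that % behaves like a SQL-style wildcard (any run of characters) and that matching ignores case; both are inferred from the results shown on this page rather than taken from cdapython itself. The helper name like_to_regex is hypothetical, and "Ovary" is an invented non-matching term.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern where % matches any run of characters into a regex."""
    parts = (re.escape(piece) for piece in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


# Terms from the outputs above, plus one invented term that should not match.
terms = ["Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS", "Ovary"]
patterns = [like_to_regex("%uter%"), like_to_regex("%cerv%")]

# A term is kept if it matches either pattern, mirroring the OR in the query.
matched = [t for t in terms if any(p.match(t) for p in patterns)]
print(matched)
```

Under these assumptions, every term except the invented "Ovary" is kept, which mirrors the five site values that appear in the summary results.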

Finally, Julia combines her two queries into one large one using OR:

In [10]:

ALLDATA = Tsite.OR(Dsite)
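
Combining queries this way is ordinary boolean composition. The sketch below illustrates the idea with a toy class; ToyFilter is a hypothetical stand-in written for this example and is not how cdapython implements Q, and the rows are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class ToyFilter:
    """Hypothetical stand-in for a query object; this is not cdapython's Q."""

    test: Callable[[Row], bool]

    def OR(self, other: "ToyFilter") -> "ToyFilter":
        # A row passes if either filter accepts it.
        return ToyFilter(lambda row: self.test(row) or other.test(row))

    def AND(self, other: "ToyFilter") -> "ToyFilter":
        # A row passes only if both filters accept it.
        return ToyFilter(lambda row: self.test(row) and other.test(row))

    def apply(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if self.test(row)]


# Invented rows for illustration only.
rows = [
    {"id": "a", "treatment_anatomic_site": "Cervix", "primary_diagnosis_site": "Ovary"},
    {"id": "b", "treatment_anatomic_site": "Pelvis", "primary_diagnosis_site": "Corpus uteri"},
    {"id": "c", "treatment_anatomic_site": "Pelvis", "primary_diagnosis_site": "Lung"},
]

t_site = ToyFilter(lambda r: r["treatment_anatomic_site"] == "Cervix")
d_site = ToyFilter(
    lambda r: any(frag in r["primary_diagnosis_site"].lower() for frag in ("uter", "cerv"))
)

either = t_site.OR(d_site)
print([r["id"] for r in either.apply(rows)])  # prints ['a', 'b']
```

Row "a" is kept because of its treatment site and row "b" because of its diagnosis site, so an OR query can return rows that satisfy only one of the two conditions.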

Looking at Summary Data

Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using count:

In [11]:

ALLDATA.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.281 sec 3281 ms

specimen_count : 41069

treatment_count : 3049

diagnosis_count : 3685

mutation_count : 903

researchsubject_count : 4869

subject_count : 3742

Out[11]:

It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using count:

In [12]:

ALLDATA.researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.321 sec 3321 ms

   files : 322958

    total : 4869


primary_diagnosis_condition	count
Myomatous Neoplasms	188
Squamous Cell Neoplasms	609
Uterine Corpus Endometrial Carcinoma	104
Adenomas and Adenocarcinomas	1672
Cystic, Mucinous and Serous Neoplasms	487
Not Reported	12
Complex Mixed and Stromal Neoplasms	320
None	1175
Epithelial Neoplasms, NOS	230
Complex Epithelial Neoplasms	27
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Mesonephromas	5
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Cervix uteri	915
Corpus uteri	780
Uterus, NOS	2000
Uterus	867
Cervix	307

researchsubject_identifier_system	count
GDC	3591
PDC	104
IDC	1174

Out[12]:

Refining Queries

Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects are classified as Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as they might have different mechanisms. So she adds a new filter to her query:

In [13]:

Noadeno = Q('primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.358 sec 3358 ms

   files : 297379

    total : 3197

primary_diagnosis_condition	count
Myomatous Neoplasms	188
Squamous Cell Neoplasms	609
Uterine Corpus Endometrial Carcinoma	104
Cystic, Mucinous and Serous Neoplasms	487
Not Reported	12
Complex Mixed and Stromal Neoplasms	320
None	1175
Epithelial Neoplasms, NOS	230
Complex Epithelial Neoplasms	27
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Mesonephromas	5
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Cervix uteri	688
Uterus, NOS	962
Corpus uteri	373
Uterus	867
Cervix	307

researchsubject_identifier_system	count
GDC	1919
PDC	104
IDC	1174

Out[13]:
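
A quick arithmetic check confirms that the new filter removed exactly the Adenomas and Adenocarcinomas rows. The numbers in this sketch are copied from the two researchsubject summaries above; nothing is fetched from the CDA.

```python
# Counts copied from the researchsubject summaries before and after the filter.
total_before = 4869
adeno_count = 1672
total_after = 3197

site_before = {
    "Cervix uteri": 915,
    "Corpus uteri": 780,
    "Uterus, NOS": 2000,
    "Uterus": 867,
    "Cervix": 307,
}
site_after = {
    "Cervix uteri": 688,
    "Corpus uteri": 373,
    "Uterus, NOS": 962,
    "Uterus": 867,
    "Cervix": 307,
}

# The drop in the total should equal the excluded category's count.
assert total_before - adeno_count == total_after

# The per-site drops should add up to the same number.
removed_by_site = {site: site_before[site] - site_after[site] for site in site_before}
assert sum(removed_by_site.values()) == adeno_count
print(removed_by_site)
```

The per-site differences also show where the excluded rows came from: 'Uterus' and 'Cervix' lose none, so all of the removed research subjects were recorded under 'Cervix uteri', 'Corpus uteri', or 'Uterus, NOS'.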

She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work. Since she's mostly interested in the kinds of data available from each endpoint, she views each result as a dataframe:

In [14]:

NoAdenoData.researchsubject.run().to_dataframe() # view the dataframe

Getting results from database

                        Total execution time: 0
                        min 3.308 sec 3308 ms

Out[14]:

researchsubject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id

ResearchSubject Field Definitions

A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.

id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.
identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
identifier.system: The system or namespace that defines the identifier.
identifier.value: The value of the identifier, as defined by the system.
member_of_research_project: A reference to the Study(s) of which this ResearchSubject is a member.
primary_diagnosis_condition: The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.
primary_diagnosis_site: The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.
subject_id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the subject_id field in subject results.

In [15]:

NoAdenoData.subject.run().to_dataframe() # view the dataframe

Getting results from database

                        Total execution time: 0
                        min 3.337 sec 3337 ms

Out[15]:

subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death

Subject Field Definitions

A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.

id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.
identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
identifier.system: The system or namespace that defines the identifier.
identifier.value: The value of the identifier, as defined by the system.
species: The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.
sex: The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.
race: An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
ethnicity: An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
days_to_birth: Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.
subject_associated_project: The list of Projects associated with the Subject.
vital_status: Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.
days_to_death: Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.
cause_of_death: Coded value indicating the circumstance or condition that results in the death of the subject.

In [16]:

NoAdenoData.subject.file.run().to_dataframe() # view the dataframe

Getting results from database

                        Total execution time: 0
                        min 3.479 sec 3479 ms

Out[16]:

file_id	file_identifier	label	data_category	data_type	file_format	file_associated_project	drs_uri	byte_size	checksum	data_modality	imaging_modality	dbgap_accession_number	imaging_series	subject_id

File Field Definitions

A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.

id: The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.
identifier: A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.
identifier.system: The system or namespace that defines the identifier.
identifier.value: The value of the identifier, as defined by the system.
label: Short name or abbreviation for dataset. Maps to rdfs:label.
data_category: Broad categorization of the contents of the data file.
data_type: Specific content type of the data file.
file_format: Format of the data files.
associated_project: A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.
drs_uri: A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.
byte_size: Size of the file in bytes. Maps to dcat:byteSize.
checksum: The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.
data_modality: Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".
imaging_modality: An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.
dbgap_accession_number: The dbgap accession number for the project.

Working with Results (pagination)

Finally, Julia wants to save these results for future use. Since the preview dataframes only show the first 100 results of each search, she uses the auto_paginator function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [17]:

researchsubs = NoAdenoData.researchsubject.run()
rsdf = researchsubs.auto_paginator(to_df=True)

Getting results from database

                        Total execution time: 0
                        min 3.355 sec 3355 ms

In [18]:

subs = NoAdenoData.subject.run()
subsdf = subs.auto_paginator(to_df=True)

Getting results from database

                        Total execution time: 0
                        min 3.318 sec 3318 ms
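
Conceptually, pagination means requesting one page of rows at a time until nothing is left, then stitching the pages together. The sketch below shows that loop in plain Python against an in-memory list. The functions fetch_page and iter_pages are hypothetical helpers written for this example, and this is not a description of how auto_paginator is implemented; only the page size of 100 and the total of 3197 rows come from this page.

```python
from typing import Dict, Iterator, List

PAGE_SIZE = 100  # the preview size mentioned above

Row = Dict[str, str]


def fetch_page(rows: List[Row], offset: int, limit: int) -> List[Row]:
    """Hypothetical stand-in for one request to a paged service."""
    return rows[offset : offset + limit]


def iter_pages(rows: List[Row], limit: int = PAGE_SIZE) -> Iterator[List[Row]]:
    """Yield successive pages until an empty page signals the end."""
    offset = 0
    while True:
        page = fetch_page(rows, offset, limit)
        if not page:
            break
        yield page
        offset += limit


# Invented rows; only the total of 3197 matches the summary above.
fake_rows = [{"researchsubject_id": f"rs-{i}"} for i in range(3197)]

pages = list(iter_pages(fake_rows))
all_rows = [row for page in pages for row in page]
print(len(pages), len(all_rows))  # prints 32 3197
```

With 3197 rows and pages of 100, the loop makes 32 requests, and the last page holds the remaining 97 rows.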

In [19]:

rsdf # view the researchsubject dataframe

Out[19]:

	researchsubject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id

In [20]:

subsdf # view the subject dataframe

Out[20]:

	subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death

Merging Results across Endpoints

Then Julia uses the subject_id field, which appears in both results, to merge them together into one big dataset. She also specifies that any other columns that are in both tables should be kept and have a suffix added to their name. This will help her to check that her merge worked correctly:

In [21]:

allmetadata = pd.merge(rsdf,
                subsdf,
                on="subject_id",
                suffixes=("_rs", "_sub"))

allmetadata

Out[21]:

	researchsubject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death

Because subject_id was the merge key, the merged table keeps a single subject_id column that links each researchsubject row to its subject record. Julia then checks that her dataframe is the right size. She had 3197 researchsubject rows, so she expects 3197 rows here as well:

In [22]:

allmetadata.count()

Out[22]:
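
Note that DataFrame.count() reports the number of non-null values in each column rather than a single row count. The following is a minimal sketch, using invented toy tables rather than CDA results, of checking the row count directly and asking pandas to verify the shape of the join. It uses how="left" so that unmatched rows would stay visible, which differs from the default inner join used above.

```python
import pandas as pd

# Invented toy tables: two research subjects belong to the same subject.
rs = pd.DataFrame(
    {
        "researchsubject_id": ["rs-1", "rs-2", "rs-3"],
        "subject_id": ["s-1", "s-1", "s-2"],
    }
)
subs = pd.DataFrame({"subject_id": ["s-1", "s-2"], "sex": ["female", "female"]})

merged = pd.merge(
    rs,
    subs,
    on="subject_id",
    how="left",              # keep every researchsubject row
    validate="many_to_one",  # raise if subject_id repeats in the subject table
    indicator=True,          # add a _merge column recording where each row matched
)

# One output row per researchsubject row means no rows were duplicated or dropped.
assert len(merged) == len(rs)

# Rows without a matching subject would be marked "left_only".
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))  # prints 3 0
```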


Satisfied with her results, Julia saves the data out to a CSV file so she can browse it with Excel:

In [23]:

allmetadata.to_csv("allmetadata.csv")

Julia knows from her researchsubject count summary that there are nearly 300,000 files associated with her subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:

In [24]:

NoAdenoData.researchsubject.file.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.528 sec 3528 ms

   total : 297379

data_category	count
Imaging	265303
Sequencing Reads	4142
Peptide Spectral Matches	1280
Simple Nucleotide Variation	9987
Biospecimen	2866
Processed Mass Spectra	640
Clinical	774
Copy Number Variation	4079
Raw Mass Spectra	640
Structural Variation	2192
DNA Methylation	1947
Transcriptome Profiling	2820
Proteome Profiling	339
Somatic Structural Variation	370

data_type	count
Aggregated Somatic Mutation	732
None	265303
Masked Intensities	1298
Masked Annotated Somatic Mutation	1024
Biospecimen Supplement	1755
Slide Image	1111
Masked Copy Number Segment	998
Text	640
Proprietary	640
Isoform Expression Quantification	700
Clinical Supplement	773
Transcript Fusion	2278
Masked Somatic Mutation	549
Open Standard	1280
Raw Simple Somatic Mutation	2811
Allele-specific Copy Number Segment	495
Gene Level Copy Number	637
Annotated Somatic Mutation	4871
Splice Junction Quantification	710
Copy Number Segment	1140
miRNA Expression Quantification	700
Aligned Reads	4142
Gene Level Copy Number Scores	809
Protein Expression Quantification	339
Gene Expression Quantification	710
Methylation Beta Value	649
Structural Rearrangement	284
Pathology Report	1

file_format	count
DICOM	265303
mzIdentML	640
VCF	5480
TXT	4827
vendor-specific	640
tsv	640
MAF	4649
TSV	3980
IDAT	1298
BAM	4142
mzML	640
BEDPE	1504
BCR XML	1217
BCR Biotab	76
BCR SSF XML	517
BCR OMF XML	39
SVS	1111
BCR Auxiliary XML	474
BCR PPS XML	193
XLSX	7
PDF	1
CDC JSON	1

file_identifier_system	count
IDC	265303
GDC	29516
PDC	2560

Out[24]:

Julia decides that a good place to start would be with Slide Images. There are only 1111 of them, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter to her search:

In [25]:

JustSlides = Q('data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.342 sec 3342 ms

    total : 1111

data_category	count
Biospecimen	1111

data_type	count
Slide Image	1111

file_format	count
SVS	1111

file_identifier_system	count
GDC	1111

Out[25]:

Finally, Julia uses the auto_paginator function again to get all the slide files, and merges her metadata file with this file information. This way she will be able to review what phenotypes each slide is associated with:

In [26]:

slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = slides.auto_paginator(to_df=True)

Getting results from database

                        Total execution time: 0
                        min 3.255 sec 3255 ms

In [27]:

slidemetadata = pd.merge(slidesdf,
                         allmetadata,
                         on=["subject_id", "researchsubject_id"],
                         suffixes=("_slide", "_all"))
slidemetadata

Out[27]:

	file_id	file_identifier	label	data_category	data_type	file_format	file_associated_project	drs_uri	byte_size	checksum	data_modality	imaging_modality	dbgap_accession_number	imaging_series	researchsubject_id	subject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death

In [28]:

slidemetadata.count()

Out[28]:


Julia saves this dataframe to a CSV file as well, and now she has all the information she needs to begin work on her project. She can use the drs_uri column to download the images she is interested in directly with a DRS resolver, or she can enter the DRS URIs in a cloud workspace such as Terra or the Cancer Genomics Cloud to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.

In [29]:

slidemetadata.to_csv("slidemetadata.csv")
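
As a rough illustration of what a DRS resolver does with a value from the drs_uri column, the sketch below maps a hostname-based DRS URI to the HTTPS endpoint laid out by the GA4GH DRS convention. The function name drs_to_https and the host drs.example.org are placeholders invented for this example. Compact-identifier URIs (a prefix and ID with no path) need a registry lookup that this sketch does not attempt, so check which form your drs_uri values take before relying on it.

```python
from urllib.parse import urlparse


def drs_to_https(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its HTTPS object endpoint."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        # Compact-identifier forms have no path and land here.
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Placeholder host and ID, not a real CDA record.
print(drs_to_https("drs://drs.example.org/314159"))
```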

Last update: 2022-11-03