Using Operators¶

Operators allow us to make more complex queries by adding, subtracting, or filtering data.

Q uses the following operators:

= : Equals
!= : Not Equal
OR
AND
FROM
IN and NOT IN
%: pattern matching a wildcard
IS and IS NOT
>, >, >=, <=: Greater and Less than
ORDER_BY

We use these operators to build more and more complex Q statements before sending our query to run() or count().

The CDA provides a custom python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:

Q.() builds a query that can be used by run() or count()
Q.run() returns data for the specified search
Q.count() returns summary information (counts) data that fit the specified search
columns() returns entity field names
unique_terms() returns entity field contents

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets view entity field contents

We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt. settings are pre-configuring how itables should display our tables, with scrolling and paging enabled. Finally, we're telling cdapython to report it's version so we can be sure we're using the one we mean to:

In [1]:

            
                Copied!
                
                    
                    
                
                

        
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())

2022.12.21

Equals: `=`¶

In the other tutorials, we have always used the same query, which uses the = operator.

Q('primary_diagnosis_site = "brain"')

This operator will only return data where the primary_diagnosis_site is exactly "brain". Here let's to a similar search, but for "uterus". We'll look at the researchsubject summary:

In [2]:

            
                Copied!
                
Q('primary_diagnosis_site = "uterus"').researchsubject.count.run()
Q('primary_diagnosis_site = "uterus"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.241 sec 3241 ms

   files : 243236

    total : 867

primary_diagnosis_condition	count
None	867

primary_diagnosis_site	count
Uterus	867

researchsubject_identifier_system	count
IDC	867

Out[2]:

Not Equal: `!=`¶

The != operator does the opposite of the = operator, it returns everything that is not exactly the term you give it:

In [3]:

            
                Copied!
                
Q('primary_diagnosis_site != "uterus"').researchsubject.count.run()
Q('primary_diagnosis_site != "uterus"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.317 sec 3317 ms

  files : 49187062

   total : 154087

primary_diagnosis_condition	count
Paragangliomas and Glomus Tumors	241
Adenomas and Adenocarcinomas	32975
Squamous Cell Neoplasms	5077
Neuroepitheliomatous Neoplasms	1332
Ductal and Lobular Neoplasms	7894
None	65251
Nevi and Melanomas	3158
Transitional Cell Papillomas and Carcinomas	1885
Gliomas	4775
Lipomatous Neoplasms	343
Breast Invasive Carcinoma	316
Lymphoid Leukemias	2072
Other	212
Fibromatous Neoplasms	322
Plasma Cell Tumors	1066
Mature B-Cell Lymphomas	1019
Myomatous Neoplasms	633
Myeloid Leukemias	4545
Adnexal and Skin Appendage Neoplasms	58
Glioblastoma	100
Epithelial Neoplasms, NOS	5696
Cystic, Mucinous and Serous Neoplasms	3723
Acute Myeloid Leukemia	41
Colon Adenocarcinoma	164
Lung Squamous Cell Carcinoma	118
Pancreatic Ductal Adenocarcinoma	144
Not Reported	271
Specialized Gonadal Neoplasms	124
Complex Mixed and Stromal Neoplasms	1827
Clear Cell Renal Cell Carcinoma	116
Lung Adenocarcinoma	216
Osseous and Chondromatous Neoplasms	615
Germ Cell Neoplasms	703
Uterine Corpus Endometrial Carcinoma	104
Thymic Epithelial Neoplasms	262
Ovarian Serous Cystadenocarcinoma	283
Soft Tissue Tumors and Sarcomas, NOS	315
Oral Squamous Cell Carcinoma	38
Leukemias, NOS	123
Neoplasms, NOS	1359
Complex Epithelial Neoplasms	254
Meningiomas	289
Acinar Cell Neoplasms	300
Mesothelial Neoplasms	647
Hepatocellular Carcinoma	170
Nerve Sheath Tumors	115
Not Applicable	441
Pediatric/AYA Brain Tumors	199
Blood Vessel Tumors	156
Head and Neck Squamous Cell Carcinoma	110
Early Onset Gastric Cancer	80
Unknown	63
Rectum Adenocarcinoma	30
Chronic Myeloproliferative Disorders	477
Myelodysplastic Syndromes	389
Miscellaneous Tumors	89
Miscellaneous Bone Tumors	130
Trophoblastic neoplasms	21
Neoplasms of Histiocytes and Accessory Lymphoid Cells	66
Synovial-like Neoplasms	98
Basal Cell Neoplasms	45
Hodgkin Lymphoma	11
Mucoepidermoid Neoplasms	60
Precursor Cell Lymphoblastic Lymphoma	12
Papillary Renal Cell Carcinoma	2
Mature T- and NK-Cell Lymphomas	94
Malignant Lymphomas, NOS or Diffuse	42
Other Leukemias	68
Mesonephromas	5
Other Hematologic Disorders	20
Mast Cell Tumors	10
Granular Cell Tumors and Alveolar Soft Part Sarcomas	23
Fibroepithelial Neoplasms	25
Myxomatous Neoplasms	18
Giant Cell Tumors	3
Immunoproliferative Diseases	4
Chromophobe Renal Cell Carcinoma	1
Odontogenic Tumors	3
Lymphatic Vessel Tumors	1

primary_diagnosis_site	count
Thyroid gland	1880
Uterus, NOS	2000
Not Reported	512
Brain	4347
Kidney	4954
Stomach	1876
Breast	24108
Small intestine	269
Colon	8575
Head and Neck	148
Liver and intrahepatic bile ducts	1609
Hematopoietic and reticuloendothelial systems	9635
Corpus uteri	780
Bronchus and lung	12385
Pancreas	3399
Head-Neck	3053
Larynx	169
Skin	3517
Ovary	4349
Other and ill-defined sites	1189
Esophagus	1588
Bladder	2156
Other and ill-defined sites in lip, oral cavity and pharynx	361
Cervix uteri	915
Other and unspecified parts of tongue	133
Prostate gland	2354
Chest	28221
Lung	4848
Adrenal gland	851
Retroperitoneum and peritoneum	384
Other and unspecified parts of biliary tract	226
Lymph nodes	538
Nasopharynx	101
Meninges	243
Eye and adnexa	222
Rectum	1310
Oropharynx	194
Bones, joints and articular cartilage of limbs	268
Peripheral nerves and autonomic nervous system	418
Other and unspecified female genital organs	161
Other and unspecified major salivary glands	615
Unknown	3233
Abdomen	92
Spinal cord, cranial nerves, and other parts of central nervous system	3703
Other and ill-defined digestive organs	720
Other and unspecified parts of mouth	43
Connective, subcutaneous and other soft tissues	1574
Testis	542
Thymus	431
Rectosigmoid junction	80
Liver	667
Heart, mediastinum, and pleura	706
Floor of mouth	56
Gallbladder	265
Gum	11
Abdomen, Mediastinum	176
Various (11 locations)	89
Marrow, Blood	89
Head	105
Prostate	2139
Various	449
Pelvis, Prostate, Anus	58
Phantom	33
Extremities	51
Abdomen, Pelvis	230
Testicles	150
Chest-Abdomen-Pelvis, Leg, TSpine	261
Bile Duct	51
Cervix	307
Thyroid	507
Adrenal Glands	271
Mesothelium	87
Intraocular	80
Ear	242
Tonsil	46
Bones, joints and articular cartilage of other and unspecified sites	455
Penis	33
Base of tongue	24
Vagina	72
Anus and anal canal	235
Hypopharynx	25
Nasal cavity and middle ear	40
None	341
Other and unspecified urinary organs	217
Other endocrine glands and related structures	181
Trachea	7
Pancreas	1
Lung Phantom	8
Lip	9
Ureter	15
Vulva	10
Renal pelvis	1
Other and ill-defined sites within respiratory system and intrathoracic organs	2
Palate	5
Other and unspecified male genital organs	1

researchsubject_identifier_system	count
GDC	86394
PDC	2444
IDC	65249

Out[3]:

Note that in our != results, there are 2000 "Uterus, NOS" samples. These don't appear in our = search because "Uterus, NOS" is not exactly "Uterus".

There are several ways to change our search to get both "Uterus" and "Uterus, NOS", and which we choose will depend on both our interests, and on how different the terms are that we care about.

OR¶

If we have a small enough number of search criteria to reliably type them out, we can use the OR operator to combine results. In an OR query, each data point only needs to meet a single piece of criteria to be returned, this makes OR good for early, broad searches. It increases the amount of data returned.

OR can be used both inside a Q statement:

In [4]:

            
                Copied!
                
Q('primary_diagnosis_site = "uterus" OR primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Q('primary_diagnosis_site = "uterus" OR primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.272 sec 3272 ms

   files : 258365

    total : 2867

primary_diagnosis_condition	count
Cystic, Mucinous and Serous Neoplasms	313
Complex Mixed and Stromal Neoplasms	294
Uterine Corpus Endometrial Carcinoma	104
Adenomas and Adenocarcinomas	1038
None	867
Myomatous Neoplasms	184
Epithelial Neoplasms, NOS	20
Trophoblastic neoplasms	13
Soft Tissue Tumors and Sarcomas, NOS	14
Not Reported	12
Mesonephromas	2
Complex Epithelial Neoplasms	2
Neoplasms, NOS	3
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Uterus, NOS	2000
Uterus	867

researchsubject_identifier_system	count
PDC	104
GDC	1896
IDC	867

Out[4]:

and to combine 2 or more Q statements:

In [5]:

            
                Copied!
                
Query1 = Q('primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()
Query1 = Q('primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.254 sec 3254 ms

   files : 15129

    total : 2000

primary_diagnosis_condition	count
Myomatous Neoplasms	184
Uterine Corpus Endometrial Carcinoma	104
Cystic, Mucinous and Serous Neoplasms	313
Adenomas and Adenocarcinomas	1038
Complex Mixed and Stromal Neoplasms	294
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	3
Trophoblastic neoplasms	13
Epithelial Neoplasms, NOS	20
Not Reported	12
Mesonephromas	2
Complex Epithelial Neoplasms	2
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Uterus, NOS	2000

researchsubject_identifier_system	count
GDC	1896
PDC	104

Out[5]:

For each OR you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the OR operator is flexible enough to run searches across columns, or even across endpoints.

AND¶

Like OR, AND can be used both inside a Q statement, and to join multiple Q statements. AND requires that both statements be true simultaneously for each returned bit of data. This makes AND good for filtering down results. It decreases the amount of data returned.

For searches where you are interested in subsetting multiple columns, AND can help you to quickly filter to only the set you want. Note that AND can be used inside a Q statement, or to add multiple Q statements together:

In [6]:

            
                Copied!
                
Q('primary_diagnosis_site = "uterus, NOS" AND primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()
Q('primary_diagnosis_site = "uterus, NOS" AND primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.339 sec 3339 ms

    files : 2560

    total : 104

primary_diagnosis_condition	count
Uterine Corpus Endometrial Carcinoma	104

primary_diagnosis_site	count
Uterus, NOS	104

researchsubject_identifier_system	count
PDC	104

Out[6]:

In [7]:

            
                Copied!
                
Query1 = Q('primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.AND(Query2).researchsubject.count.run()
Query1 = Q('primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.AND(Query2).researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.429 sec 3429 ms

    files : 2560

    total : 104

primary_diagnosis_condition	count
Uterine Corpus Endometrial Carcinoma	104

primary_diagnosis_site	count
Uterus, NOS	104

researchsubject_identifier_system	count
PDC	104

Out[7]:

Note that in those examples, we searched for one term in 'primary_diagnosis_site' and the other in 'primary_diagnosis_condition'. AND always filters, so in most cases using AND to search two different values in the same column will return no results. For example, primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":

In [8]:

            
                Copied!
                
Q('primary_diagnosis_site = "uterus" AND primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Q('primary_diagnosis_site = "uterus" AND primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.267 sec 3267 ms

     files : 0

     total : 0

Out[8]:

`FROM`¶

There are some instances where asking for multiple values in the same column is a valid question: columns where a single row can have more than one value in the same field. A good example of this is 'subject_identifier_system', which is the field in the subject table that lists all of the sources for data for that subject:

In [9]:

            
                Copied!
                
Q('primary_diagnosis_site = "uterus"').subject.run(limit=3).to_dataframe()
Q('primary_diagnosis_site = "uterus"').subject.run(limit=3).to_dataframe()

Getting results from database

                        Total execution time: 0
                        min 3.23 sec 3230 ms

Out[9]:

subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death
Loading... (need help?)

To ask for all the subjects that have 'primary_diagnosis_site = "uterus"' AND that have data from GDC, PDC, and IDC we use the special operator FROM, which lets us make a query, and then search within the answer. This multi-step query allows us to search for multiple values in the same column.

For instance, to get 'primary_diagnosis_site = "uterus"' from subjects who have data at GDC and PDC we would start by building a query for uterus data from GDC, and save it to a variable:

In [10]:

            
                Copied!
                
firstquery = Q('primary_diagnosis_site = "uterus" FROM subject_identifier_system = "GDC"')
firstquery = Q('primary_diagnosis_site = "uterus" FROM subject_identifier_system = "GDC"')

We can check to see what those results look like overall by running the query:

In [11]:

            
                Copied!
                
firstquery.subject.count.run()
firstquery.subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.289 sec 3289 ms

   files : 157369

    total : 717

sex	count
female	705
None	12

race	count
white	476
black or african american	121
not reported	71
american indian or alaska native	4
asian	24
None	12
native hawaiian or other pacific islander	9

ethnicity	count
not hispanic or latino	461
hispanic or latino	20
not reported	224
None	12

cause_of_death	count
Not Reported	96
None	617
Cancer Related	3
Not Cancer Related	1

subject_identifier_system	count
IDC	717
PDC	100
GDC	717

Out[11]:

That looks fine, so lets add a query that looks within that result for PDC:

In [12]:

            
                Copied!
                
secondquery = firstquery.FROM('subject_identifier_system = "PDC"')
secondquery = firstquery.FROM('subject_identifier_system = "PDC"')

and finally run that query on the table of our choice:

In [13]:

            
                Copied!
                
secondquery.subject.count.run()
secondquery.subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.359 sec 3359 ms

   files : 47447

    total : 100

sex	count
female	100

race	count
white	58
not reported	38
black or african american	3
asian	1

ethnicity	count
not hispanic or latino	41
hispanic or latino	4
not reported	55

cause_of_death	count
Cancer Related	3
Not Reported	96
Not Cancer Related	1

subject_identifier_system	count
PDC	100
GDC	100
IDC	100

Out[13]:

To further filter the results to only subjects with data in GDC, PDC, and IDC, we would add a third query for IDC and run it:

In [14]:

            
                Copied!
                
secondquery.FROM('subject_identifier_system = "IDC"').subject.count.run()
secondquery.FROM('subject_identifier_system = "IDC"').subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.305 sec 3305 ms

   files : 47447

    total : 100

sex	count
female	100

race	count
white	58
not reported	38
asian	1
black or african american	3

ethnicity	count
not hispanic or latino	41
hispanic or latino	4
not reported	55

cause_of_death	count
Not Reported	96
Cancer Related	3
Not Cancer Related	1

subject_identifier_system	count
GDC	100
PDC	100
IDC	100

Out[14]:

You can also do all of these steps in a single call if you prefer (the triple quotes allow us to make the query on multiple lines that are easier to read)

In [15]:

            
                Copied!
                
Q("""primary_diagnosis_site = 'uterus' 
    FROM subject_identifier_system = 'GDC' 
    FROM subject_identifier_system = 'PDC' 
    FROM subject_identifier_system = 'IDC'""").subject.count.run()
Q("""primary_diagnosis_site = 'uterus' 
    FROM subject_identifier_system = 'GDC' 
    FROM subject_identifier_system = 'PDC' 
    FROM subject_identifier_system = 'IDC'""").subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.37 sec 3370 ms

   files : 47447

    total : 100

sex	count
female	100

race	count
white	58
not reported	38
asian	1
black or african american	3

ethnicity	count
not hispanic or latino	41
hispanic or latino	4
not reported	55

cause_of_death	count
Not Reported	96
Cancer Related	3
Not Cancer Related	1

subject_identifier_system	count
GDC	100
PDC	100
IDC	100

Out[15]:

Remember that just because we're searching a subject field, doesn't mean we have to search the subject table. We could instead find all of the researchsubjects for primary_diagnosis_site = "uterus" that are associated with a subject who has data from all 3 data sources:

In [16]:

            
                Copied!
                
secondquery.FROM('subject_identifier_system = "IDC"').researchsubject.run()
secondquery.FROM('subject_identifier_system = "IDC"').researchsubject.run()

Getting results from database

                        Total execution time: 0
                        min 3.306 sec 3306 ms

Out[16]:

            
            Offset: 0
            Count: 100
            Total Row Count: 100
            More pages: True

`IN` and `NOT IN`¶

For instances where you have many search terms, it may be easier (and more readable) to use IN. With IN you make a list of all the terms you are interested in, and ask whether they are IN a given field:

In [17]:

            
                Copied!
                
Q('primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Q('primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.279 sec 3279 ms

   files : 299897

    total : 4089

primary_diagnosis_condition	count
Squamous Cell Neoplasms	609
Adenomas and Adenocarcinomas	1265
Cystic, Mucinous and Serous Neoplasms	348
Complex Mixed and Stromal Neoplasms	294
Uterine Corpus Endometrial Carcinoma	104
Myomatous Neoplasms	184
None	1175
Not Reported	12
Complex Epithelial Neoplasms	27
Epithelial Neoplasms, NOS	26
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Soft Tissue Tumors and Sarcomas, NOS	14
Neuroepitheliomatous Neoplasms	1
Mesonephromas	5

primary_diagnosis_site	count
Cervix uteri	915
Uterus, NOS	2000
Uterus	867
Cervix	307

researchsubject_identifier_system	count
GDC	2811
PDC	104
IDC	1174

Out[17]:

The equivalent request without IN would require a large number of OR statements. (The triple quotes surrounding this example are to allow a multi-line Q statement):

Q("""primary_diagnosis_site = "uterus, NOS" OR 
     primary_diagnosis_site = "uterus" OR 
     primary_diagnosis_site = "Cervix" OR 
     primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()

NOT IN is the opposite of IN, and so gives the inverse results. If we add NOT to our above query, we get all the researchsubjects who's primary_diagnosis_site was not in our list:

In [18]:

            
                Copied!
                
Q('primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Q('primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.282 sec 3282 ms

  files : 49133001

   total : 150865

primary_diagnosis_condition	count
Adenomas and Adenocarcinomas	31710
Epithelial Neoplasms, NOS	5670
Ductal and Lobular Neoplasms	7894
Squamous Cell Neoplasms	4468
Nevi and Melanomas	3158
Plasma Cell Tumors	1066
Transitional Cell Papillomas and Carcinomas	1885
None	64943
Other	212
Neoplasms, NOS	1347
Neuroepitheliomatous Neoplasms	1331
Cystic, Mucinous and Serous Neoplasms	3375
Myeloid Leukemias	4545
Gliomas	4775
Myomatous Neoplasms	449
Pancreatic Ductal Adenocarcinoma	144
Breast Invasive Carcinoma	316
Germ Cell Neoplasms	703
Clear Cell Renal Cell Carcinoma	116
Colon Adenocarcinoma	164
Ovarian Serous Cystadenocarcinoma	283
Glioblastoma	100
Soft Tissue Tumors and Sarcomas, NOS	301
Lymphoid Leukemias	2072
Lung Squamous Cell Carcinoma	118
Lung Adenocarcinoma	216
Acinar Cell Neoplasms	300
Mature B-Cell Lymphomas	1019
Not Reported	259
Paragangliomas and Glomus Tumors	241
Fibromatous Neoplasms	322
Complex Mixed and Stromal Neoplasms	1533
Thymic Epithelial Neoplasms	262
Lipomatous Neoplasms	343
Acute Myeloid Leukemia	41
Complex Epithelial Neoplasms	227
Hepatocellular Carcinoma	170
Mesothelial Neoplasms	647
Leukemias, NOS	123
Myelodysplastic Syndromes	389
Synovial-like Neoplasms	98
Oral Squamous Cell Carcinoma	38
Meningiomas	289
Osseous and Chondromatous Neoplasms	615
Pediatric/AYA Brain Tumors	199
Not Applicable	441
Head and Neck Squamous Cell Carcinoma	110
Granular Cell Tumors and Alveolar Soft Part Sarcomas	23
Early Onset Gastric Cancer	80
Chronic Myeloproliferative Disorders	477
Hodgkin Lymphoma	11
Trophoblastic neoplasms	8
Unknown	63
Adnexal and Skin Appendage Neoplasms	58
Malignant Lymphomas, NOS or Diffuse	42
Nerve Sheath Tumors	115
Mature T- and NK-Cell Lymphomas	94
Neoplasms of Histiocytes and Accessory Lymphoid Cells	66
Rectum Adenocarcinoma	30
Miscellaneous Tumors	89
Specialized Gonadal Neoplasms	124
Blood Vessel Tumors	156
Immunoproliferative Diseases	4
Fibroepithelial Neoplasms	25
Mucoepidermoid Neoplasms	60
Miscellaneous Bone Tumors	130
Other Leukemias	68
Precursor Cell Lymphoblastic Lymphoma	12
Mast Cell Tumors	10
Basal Cell Neoplasms	45
Myxomatous Neoplasms	18
Other Hematologic Disorders	20
Odontogenic Tumors	3
Giant Cell Tumors	3
Papillary Renal Cell Carcinoma	2
Lymphatic Vessel Tumors	1
Chromophobe Renal Cell Carcinoma	1

primary_diagnosis_site	count
Rectum	1310
Corpus uteri	780
Ovary	4349
Thyroid gland	1880
Kidney	4954
Colon	8575
Bones, joints and articular cartilage of limbs	268
Hematopoietic and reticuloendothelial systems	9635
Spinal cord, cranial nerves, and other parts of central nervous system	3703
Breast	24108
Heart, mediastinum, and pleura	706
Bronchus and lung	12385
Head-Neck	3053
Adrenal gland	851
Other and ill-defined sites	1189
Esophagus	1588
Connective, subcutaneous and other soft tissues	1574
Liver and intrahepatic bile ducts	1609
Brain	4347
Base of tongue	24
Lymph nodes	538
Unknown	3233
None	341
Other and unspecified major salivary glands	615
Other and unspecified parts of biliary tract	226
Stomach	1876
Chest	28221
Lung	4848
Head and Neck	148
Pancreas	3399
Prostate gland	2354
Nasopharynx	101
Thymus	431
Tonsil	46
Not Reported	512
Other and unspecified female genital organs	161
Other and unspecified parts of tongue	133
Bladder	2156
Retroperitoneum and peritoneum	384
Skin	3517
Palate	5
Abdomen	92
Ureter	15
Testis	542
Small intestine	269
Other and ill-defined sites in lip, oral cavity and pharynx	361
Peripheral nerves and autonomic nervous system	418
Other and ill-defined digestive organs	720
Liver	667
Larynx	169
Rectosigmoid junction	80
Other and unspecified parts of mouth	43
Abdomen, Mediastinum	176
Various (11 locations)	89
Head	105
Prostate	2139
Lung Phantom	8
Various	449
Pelvis, Prostate, Anus	58
Phantom	33
Extremities	51
Abdomen, Pelvis	230
Testicles	150
Thyroid	507
Chest-Abdomen-Pelvis, Leg, TSpine	261
Adrenal Glands	271
Mesothelium	87
Intraocular	80
Ear	242
Floor of mouth	56
Oropharynx	194
Anus and anal canal	235
Eye and adnexa	222
Vulva	10
Marrow, Blood	89
Bile Duct	51
Gallbladder	265
Other endocrine glands and related structures	181
Other and unspecified urinary organs	217
Lip	9
Bones, joints and articular cartilage of other and unspecified sites	455
Meninges	243
Hypopharynx	25
Vagina	72
Penis	33
Gum	11
Trachea	7
Nasal cavity and middle ear	40
Pancreas	1
Other and ill-defined sites within respiratory system and intrathoracic organs	2
Other and unspecified male genital organs	1
Renal pelvis	1

researchsubject_identifier_system	count
GDC	83583
IDC	64942
PDC	2340

Out[18]:

`%` pattern matching¶

While OR is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything with "uter" we get:

In [19]:

            
                Copied!
                
unique_terms("primary_diagnosis_site").to_list(filters="uter")
unique_terms("primary_diagnosis_site").to_list(filters="uter")

Out[19]:

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

The % operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:

In [20]:

            
                Copied!
                
Q('primary_diagnosis_site = "uter%"').researchsubject.count.run()
Q('primary_diagnosis_site = "uter%"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.397 sec 3397 ms

   files : 258365

    total : 2867

primary_diagnosis_condition	count
Myomatous Neoplasms	184
Uterine Corpus Endometrial Carcinoma	104
Adenomas and Adenocarcinomas	1038
Not Reported	12
Complex Mixed and Stromal Neoplasms	294
None	867
Epithelial Neoplasms, NOS	20
Cystic, Mucinous and Serous Neoplasms	313
Soft Tissue Tumors and Sarcomas, NOS	14
Trophoblastic neoplasms	13
Neoplasms, NOS	3
Complex Epithelial Neoplasms	2
Mesonephromas	2
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Uterus, NOS	2000
Uterus	867

researchsubject_identifier_system	count
GDC	1896
PDC	104
IDC	867

Out[20]:

Because the % is at the end of "uter" this query returns anything that starts with "uter", depending on your question, you may want to move the %, or add more of them:

In [21]:

            
                Copied!
                
Q('primary_diagnosis_site = "%uter"').researchsubject.count.run()
Q('primary_diagnosis_site = "%uter"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.279 sec 3279 ms

     files : 0

     total : 0

Out[21]:

In [22]:

            
                Copied!
                
Q('primary_diagnosis_site = "%uter%"').researchsubject.count.run()
Q('primary_diagnosis_site = "%uter%"').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.219 sec 3219 ms

   files : 300891

    total : 4562

primary_diagnosis_condition	count
Uterine Corpus Endometrial Carcinoma	104
Adenomas and Adenocarcinomas	1672
Cystic, Mucinous and Serous Neoplasms	487
Myomatous Neoplasms	188
Squamous Cell Neoplasms	609
Complex Mixed and Stromal Neoplasms	320
Not Reported	12
None	868
Complex Epithelial Neoplasms	27
Epithelial Neoplasms, NOS	230
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Mesonephromas	5
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Cervix uteri	915
Uterus, NOS	2000
Corpus uteri	780
Uterus	867

researchsubject_identifier_system	count
PDC	104
GDC	3591
IDC	867

Out[22]:

There may be cases in which you want to filter out all of the data with some partial word in it, in which case, you can combine % with !=:

In [23]:

            
                Copied!
                
Q('sex != "f%"').subject.count.run()
Q('sex != "f%"').subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.237 sec 3237 ms

  files : 47800162

   total : 94904

sex	count
None	54299
male	40252
unknown	83
not reported	265
unspecified	5

race	count
None	54299
american indian or alaska native	57
not reported	9992
white	23670
chinese	65
Unknown	2079
black or african american	1832
asian	1351
other	418
not allowed to collect	1106
native hawaiian or other pacific islander	27
unknown	8

ethnicity	count
not hispanic or latino	23258
not reported	11917
None	54299
Unknown	2357
hispanic or latino	1479
not allowed to collect	1586
unknown	8

cause_of_death	count
Not Reported	307
None	94449
HCC recurrence	5
Infection	3
Cancer Related	98
Not Cancer Related	10
Unknown	25
Surgical Complications	3
Cerebral Hemorrhage	1
Cardiovascular Disorder, NOS	2
Metastasis	1

subject_identifier_system	count
IDC	59747
PDC	1119
GDC	40676

Out[23]:

IS and IS NOT¶

In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually null, that is, the field is empty. In order to search for emptiness, you need to use the special function IS:

In [24]:

            
                Copied!
                
Q('primary_diagnosis_condition IS null').researchsubject.count.run()
Q('primary_diagnosis_condition IS null').researchsubject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.29 sec 3290 ms

  files : 48488955

   total : 66118

primary_diagnosis_condition	count
None	66118

primary_diagnosis_site	count
Breast	14676
Head-Neck	3053
Colon	1491
Chest	28221
Lung	4728
Ovary	664
Abdomen	92
Abdomen, Mediastinum	176
Brain	2583
Bladder	431
Skin	628
Uterus	867
Various (11 locations)	89
Marrow, Blood	89
Kidney	1373
Pancreas	502
Head	105
Liver	497
Prostate	2139
Lung Phantom	8
Various	449
Pelvis, Prostate, Anus	58
Phantom	33
Extremities	51
Abdomen, Pelvis	230
Testicles	150
Rectum	171
Stomach	443
Thyroid	507
Chest-Abdomen-Pelvis, Leg, TSpine	261
Cervix	307
Esophagus	187
Mesothelium	87
Adrenal Glands	271
Thymus	125
Ear	242
Bile Duct	51
Intraocular	80
Pancreas	1
Cervix uteri	1
None	1

researchsubject_identifier_system	count
IDC	66116
GDC	2

Out[24]:

Probably more common, is to want to filter out the empty fields, in which case you use its companion function IS_NOT:

In [25]:

            
                Copied!
                
Q('sex IS NOT null').subject.count.run()
Q('sex IS NOT null').subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.408 sec 3408 ms

  files : 3624760

   total : 86503

sex	count
not reported	265
female	45898
male	40252
unspecified	5
unknown	83

race	count
white	49591
black or african american	4617
not reported	21980
chinese	90
asian	2965
unknown	14
Unknown	4063
not allowed to collect	2058
native hawaiian or other pacific islander	56
american indian or alaska native	117
other	952

ethnicity	count
not hispanic or latino	48814
not reported	26272
hispanic or latino	3179
unknown	19
Unknown	4567
not allowed to collect	3652

cause_of_death	count
Not Reported	750
None	85496
HCC recurrence	7
Cancer Related	175
Not Cancer Related	14
Unknown	45
Infection	6
Metastasis	2
Surgical Complications	3
Cancer cell proliferation	1
Cerebral Hemorrhage	1
Cardiovascular Disorder, NOS	3

subject_identifier_system	count
PDC	2338
GDC	85721
IDC	12299

Out[25]:

Greater and Less than¶

While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:

> : Greater than
< : Less than
>= : Greater than or Equal to
<= : Less than or Equal to

These can all be used in place of the = sign in queries where you are filtering by a numeric value. In this search, we find all the subjects who were over 50 years old when they entered the study. As the study entry date is day 0, days_to_birth is reported as a negative number.

In [26]:

            
                Copied!
                
Q('days_to_birth <= 50 * -365').specimen.run().to_dataframe()
Q('days_to_birth <= 50 * -365').specimen.run().to_dataframe()

Getting results from database

                        Total execution time: 0
                        min 3.353 sec 3353 ms

Out[26]:

specimen_id	specimen_identifier	specimen_associated_project	days_to_collection	primary_disease_type	anatomical_site	source_material_type	specimen_type	derived_from_specimen	derived_from_subject	subject_id	researchsubject_id
Loading... (need help?)

Note that since the number is negative, greater than and less than signs need to be reversed from how they would work with positive numbers. So to get specimens between 20 and 50 years old we need to convert those ages to days_to_birth and then search for days_to_birth numbers less than the 20 year calculation and more than the 50 year:

20-365 = -7300 50-365 = -18250

In [27]:

            
                Copied!
                
Q('days_to_birth >= 50* -365 AND days_to_birth <= 20 * -365').specimen.run().to_dataframe()
Q('days_to_birth >= 50* -365 AND days_to_birth <= 20 * -365').specimen.run().to_dataframe()

Getting results from database

                        Total execution time: 0
                        min 3.26 sec 3260 ms

Out[27]:

specimen_id	specimen_identifier	specimen_associated_project	days_to_collection	primary_disease_type	anatomical_site	source_material_type	specimen_type	derived_from_specimen	derived_from_subject	subject_id	researchsubject_id
Loading... (need help?)

In [28]:

            
                Copied!
                
Query1 = Q('days_to_birth >= 50 * -365') 
Query2 = Q('days_to_birth <= 20 * -365')

Query1.AND(Query2).subject.count.run()
Query1 = Q('days_to_birth >= 50 * -365') 
Query2 = Q('days_to_birth <= 20 * -365')

Query1.AND(Query2).subject.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.26 sec 3260 ms

   files : 91994

    total : 167

sex	count
female	105
male	62

race	count
unknown	3
white	71
not reported	75
asian	10
black or african american	6
Unknown	1
american indian or alaska native	1

ethnicity	count
not hispanic or latino	62
hispanic or latino	9
not reported	91
unknown	5

cause_of_death	count
Cancer Related	25
Not Reported	29
None	104
Infection	1
Unknown	7
Surgical Complications	1

subject_identifier_system	count
PDC	167
GDC	152
IDC	153

Out[28]:

In [29]:

            
                Copied!
                
Q('days_to_birth >= 50 * -365').specimen.count.run()
Q('days_to_birth >= 50 * -365').specimen.count.run()

Getting results from database

                        Total execution time: 0
                        min 3.735 sec 3735 ms

   files : 31643

    total : 4841

primary_disease_type	count
Adenomas and Adenocarcinomas	1136
Glioblastoma	48
Breast Invasive Carcinoma	52
Other	54
Cystic, Mucinous and Serous Neoplasms	1343
Ductal and Lobular Neoplasms	1227
Acute Myeloid Leukemia	36
Lung Adenocarcinoma	48
Clear Cell Renal Cell Carcinoma	90
Uterine Corpus Endometrial Carcinoma	18
Ovarian Serous Cystadenocarcinoma	64
Pancreatic Ductal Adenocarcinoma	50
Head and Neck Squamous Cell Carcinoma	22
Gliomas	322
Squamous Cell Neoplasms	275
Rectum Adenocarcinoma	8
Lung Squamous Cell Carcinoma	28
Chromophobe Renal Cell Carcinoma	4
Papillary Renal Cell Carcinoma	8
Colon Adenocarcinoma	8

source_material_type	count
Primary Tumor	2896
Metastatic	29
Solid Tissue Normal	646
Recurrent Tumor	49
Blood Derived Normal	1095
Cell Lines	36
Blood Derived Cancer - Peripheral Blood, Post-treatment	18
Blood Derived Cancer - Peripheral Blood	10
Normal	18
Tumor	18
Not Reported	18
Blood Derived Cancer - Bone Marrow	6
Blood Derived Cancer - Bone Marrow, Post-treatment	2

specimen_type	count
portion	798
aliquot	2070
sample	879
analyte	887
slide	207

specimen_identifier_system	count
PDC	538
GDC	4303

Out[29]:

ORDER_BY¶

It can be useful to have results sorted before they are returned. The CDA database by default gives unordered results, which means that the first 100 results for a query with more than 100 total results will change slightly each time it is run. If you want to always see the same first 100, or order results for some other reason, use the ORDER_BY function to choose the column to sort, and how it is sorted:

In [30]:

            
                Copied!
                
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id").researchsubject.run().to_dataframe()
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id").researchsubject.run().to_dataframe()

Getting results from database

                        Total execution time: 0
                        min 3.286 sec 3286 ms

Out[30]:

researchsubject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id
Loading... (need help?)

By default, sorting is lowest to highest/A-Z. To reverse the sort, add :-1 to the ORDER_BY statement:

In [31]:

            
                Copied!
                
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id:-1").researchsubject.run().to_dataframe()
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id:-1").researchsubject.run().to_dataframe()

Getting results from database

                        Total execution time: 0
                        min 3.195 sec 3195 ms

Out[31]:

researchsubject_id	researchsubject_identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id
Loading... (need help?)

Note that if you are using itables, as we are here, you can sort the results using the arrows next to the column names. However this sorts the results after they are returned to you, whereas ORDER_BY sorts the results in the database before they are returned. If you always want to get the same first 100 results from a query that has 1000 total results, use ORDER_BY. If you just want the results that you are looking at sorted, use the arrows.

Last update: 2022-11-03

Using Operators¶

Equals: =¶

Not Equal: !=¶

OR¶