Using Operators¶
Operators allow us to make more complex queries by adding, subtracting, or filtering data.
Q
uses the following operators:
=
: Equals!=
: Not EqualOR
AND
FROM
IN
andNOT IN
%
: pattern matching a wildcardIS
andIS NOT
>
,>
,>=
,<=
: Greater and Less thanORDER_BY
We use these operators to build more and more complex Q statements before sending our query to run()
or count()
.
The CDA provides a custom python tool for searching CDA data. Q
(short for Query) offers several ways to search and filter data, and several input modes:
- Q.() builds a query that can be used by
run()
orcount()
- Q.run() returns data for the specified search
- Q.count() returns summary information (counts) data that fit the specified search
- columns() returns entity field names
- unique_terms() returns entity field contents
Before we do any work, we need to import several functions from cdapython:
Q
andquery
which power the searchcolumns
which lets us view entity field namesunique_terms
which lets view entity field contents
We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt.
settings are pre-configuring how itables should display our tables, with scrolling and paging enabled.
Finally, we're telling cdapython to report it's version so we can be sure we're using the one we mean to:
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
2022.12.21
Equals: =
¶
In the other tutorials, we have always used the same query, which uses the =
operator.
Q('primary_diagnosis_site = "brain"')
This operator will only return data where the primary_diagnosis_site is exactly "brain". Here let's to a similar search, but for "uterus". We'll look at the researchsubject summary:
Q('primary_diagnosis_site = "uterus"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.241 sec 3241 ms
files : 243236
total : 867
primary_diagnosis_condition | count |
---|---|
None | 867 |
primary_diagnosis_site | count |
---|---|
Uterus | 867 |
researchsubject_identifier_system | count |
---|---|
IDC | 867 |
Not Equal: !=
¶
The !=
operator does the opposite of the =
operator, it returns everything that is not exactly the term you give it:
Q('primary_diagnosis_site != "uterus"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.317 sec 3317 ms
files : 49187062
total : 154087
primary_diagnosis_condition | count |
---|---|
Paragangliomas and Glomus Tumors | 241 |
Adenomas and Adenocarcinomas | 32975 |
Squamous Cell Neoplasms | 5077 |
Neuroepitheliomatous Neoplasms | 1332 |
Ductal and Lobular Neoplasms | 7894 |
None | 65251 |
Nevi and Melanomas | 3158 |
Transitional Cell Papillomas and Carcinomas | 1885 |
Gliomas | 4775 |
Lipomatous Neoplasms | 343 |
Breast Invasive Carcinoma | 316 |
Lymphoid Leukemias | 2072 |
Other | 212 |
Fibromatous Neoplasms | 322 |
Plasma Cell Tumors | 1066 |
Mature B-Cell Lymphomas | 1019 |
Myomatous Neoplasms | 633 |
Myeloid Leukemias | 4545 |
Adnexal and Skin Appendage Neoplasms | 58 |
Glioblastoma | 100 |
Epithelial Neoplasms, NOS | 5696 |
Cystic, Mucinous and Serous Neoplasms | 3723 |
Acute Myeloid Leukemia | 41 |
Colon Adenocarcinoma | 164 |
Lung Squamous Cell Carcinoma | 118 |
Pancreatic Ductal Adenocarcinoma | 144 |
Not Reported | 271 |
Specialized Gonadal Neoplasms | 124 |
Complex Mixed and Stromal Neoplasms | 1827 |
Clear Cell Renal Cell Carcinoma | 116 |
Lung Adenocarcinoma | 216 |
Osseous and Chondromatous Neoplasms | 615 |
Germ Cell Neoplasms | 703 |
Uterine Corpus Endometrial Carcinoma | 104 |
Thymic Epithelial Neoplasms | 262 |
Ovarian Serous Cystadenocarcinoma | 283 |
Soft Tissue Tumors and Sarcomas, NOS | 315 |
Oral Squamous Cell Carcinoma | 38 |
Leukemias, NOS | 123 |
Neoplasms, NOS | 1359 |
Complex Epithelial Neoplasms | 254 |
Meningiomas | 289 |
Acinar Cell Neoplasms | 300 |
Mesothelial Neoplasms | 647 |
Hepatocellular Carcinoma | 170 |
Nerve Sheath Tumors | 115 |
Not Applicable | 441 |
Pediatric/AYA Brain Tumors | 199 |
Blood Vessel Tumors | 156 |
Head and Neck Squamous Cell Carcinoma | 110 |
Early Onset Gastric Cancer | 80 |
Unknown | 63 |
Rectum Adenocarcinoma | 30 |
Chronic Myeloproliferative Disorders | 477 |
Myelodysplastic Syndromes | 389 |
Miscellaneous Tumors | 89 |
Miscellaneous Bone Tumors | 130 |
Trophoblastic neoplasms | 21 |
Neoplasms of Histiocytes and Accessory Lymphoid Cells | 66 |
Synovial-like Neoplasms | 98 |
Basal Cell Neoplasms | 45 |
Hodgkin Lymphoma | 11 |
Mucoepidermoid Neoplasms | 60 |
Precursor Cell Lymphoblastic Lymphoma | 12 |
Papillary Renal Cell Carcinoma | 2 |
Mature T- and NK-Cell Lymphomas | 94 |
Malignant Lymphomas, NOS or Diffuse | 42 |
Other Leukemias | 68 |
Mesonephromas | 5 |
Other Hematologic Disorders | 20 |
Mast Cell Tumors | 10 |
Granular Cell Tumors and Alveolar Soft Part Sarcomas | 23 |
Fibroepithelial Neoplasms | 25 |
Myxomatous Neoplasms | 18 |
Giant Cell Tumors | 3 |
Immunoproliferative Diseases | 4 |
Chromophobe Renal Cell Carcinoma | 1 |
Odontogenic Tumors | 3 |
Lymphatic Vessel Tumors | 1 |
primary_diagnosis_site | count |
---|---|
Thyroid gland | 1880 |
Uterus, NOS | 2000 |
Not Reported | 512 |
Brain | 4347 |
Kidney | 4954 |
Stomach | 1876 |
Breast | 24108 |
Small intestine | 269 |
Colon | 8575 |
Head and Neck | 148 |
Liver and intrahepatic bile ducts | 1609 |
Hematopoietic and reticuloendothelial systems | 9635 |
Corpus uteri | 780 |
Bronchus and lung | 12385 |
Pancreas | 3399 |
Head-Neck | 3053 |
Larynx | 169 |
Skin | 3517 |
Ovary | 4349 |
Other and ill-defined sites | 1189 |
Esophagus | 1588 |
Bladder | 2156 |
Other and ill-defined sites in lip, oral cavity and pharynx | 361 |
Cervix uteri | 915 |
Other and unspecified parts of tongue | 133 |
Prostate gland | 2354 |
Chest | 28221 |
Lung | 4848 |
Adrenal gland | 851 |
Retroperitoneum and peritoneum | 384 |
Other and unspecified parts of biliary tract | 226 |
Lymph nodes | 538 |
Nasopharynx | 101 |
Meninges | 243 |
Eye and adnexa | 222 |
Rectum | 1310 |
Oropharynx | 194 |
Bones, joints and articular cartilage of limbs | 268 |
Peripheral nerves and autonomic nervous system | 418 |
Other and unspecified female genital organs | 161 |
Other and unspecified major salivary glands | 615 |
Unknown | 3233 |
Abdomen | 92 |
Spinal cord, cranial nerves, and other parts of central nervous system | 3703 |
Other and ill-defined digestive organs | 720 |
Other and unspecified parts of mouth | 43 |
Connective, subcutaneous and other soft tissues | 1574 |
Testis | 542 |
Thymus | 431 |
Rectosigmoid junction | 80 |
Liver | 667 |
Heart, mediastinum, and pleura | 706 |
Floor of mouth | 56 |
Gallbladder | 265 |
Gum | 11 |
Abdomen, Mediastinum | 176 |
Various (11 locations) | 89 |
Marrow, Blood | 89 |
Head | 105 |
Prostate | 2139 |
Various | 449 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Abdomen, Pelvis | 230 |
Testicles | 150 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Bile Duct | 51 |
Cervix | 307 |
Thyroid | 507 |
Adrenal Glands | 271 |
Mesothelium | 87 |
Intraocular | 80 |
Ear | 242 |
Tonsil | 46 |
Bones, joints and articular cartilage of other and unspecified sites | 455 |
Penis | 33 |
Base of tongue | 24 |
Vagina | 72 |
Anus and anal canal | 235 |
Hypopharynx | 25 |
Nasal cavity and middle ear | 40 |
None | 341 |
Other and unspecified urinary organs | 217 |
Other endocrine glands and related structures | 181 |
Trachea | 7 |
Pancreas | 1 |
Lung Phantom | 8 |
Lip | 9 |
Ureter | 15 |
Vulva | 10 |
Renal pelvis | 1 |
Other and ill-defined sites within respiratory system and intrathoracic organs | 2 |
Palate | 5 |
Other and unspecified male genital organs | 1 |
researchsubject_identifier_system | count |
---|---|
GDC | 86394 |
PDC | 2444 |
IDC | 65249 |
Note that in our !=
results, there are 2000 "Uterus, NOS" samples. These don't appear in our =
search because "Uterus, NOS" is not exactly "Uterus".
There are several ways to change our search to get both "Uterus" and "Uterus, NOS", and which we choose will depend on both our interests, and on how different the terms are that we care about.
OR¶
If we have a small enough number of search criteria to reliably type them out, we can use the OR operator to combine results. In an OR
query, each data point only needs to meet a single piece of criteria to be returned, this makes OR
good for early, broad searches. It increases the amount of data returned.
OR
can be used both inside a Q statement:
Q('primary_diagnosis_site = "uterus" OR primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.272 sec 3272 ms
files : 258365
total : 2867
primary_diagnosis_condition | count |
---|---|
Cystic, Mucinous and Serous Neoplasms | 313 |
Complex Mixed and Stromal Neoplasms | 294 |
Uterine Corpus Endometrial Carcinoma | 104 |
Adenomas and Adenocarcinomas | 1038 |
None | 867 |
Myomatous Neoplasms | 184 |
Epithelial Neoplasms, NOS | 20 |
Trophoblastic neoplasms | 13 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Not Reported | 12 |
Mesonephromas | 2 |
Complex Epithelial Neoplasms | 2 |
Neoplasms, NOS | 3 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 2000 |
Uterus | 867 |
researchsubject_identifier_system | count |
---|---|
PDC | 104 |
GDC | 1896 |
IDC | 867 |
and to combine 2 or more Q statements:
Query1 = Q('primary_diagnosis_site = "uterus, NOS"')
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')
Query1.OR(Query2).researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.254 sec 3254 ms
files : 15129
total : 2000
primary_diagnosis_condition | count |
---|---|
Myomatous Neoplasms | 184 |
Uterine Corpus Endometrial Carcinoma | 104 |
Cystic, Mucinous and Serous Neoplasms | 313 |
Adenomas and Adenocarcinomas | 1038 |
Complex Mixed and Stromal Neoplasms | 294 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 3 |
Trophoblastic neoplasms | 13 |
Epithelial Neoplasms, NOS | 20 |
Not Reported | 12 |
Mesonephromas | 2 |
Complex Epithelial Neoplasms | 2 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 2000 |
researchsubject_identifier_system | count |
---|---|
GDC | 1896 |
PDC | 104 |
For each OR
you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the OR
operator is flexible enough to run searches across columns, or even across endpoints.
AND¶
Like OR
, AND
can be used both inside a Q statement, and to join multiple Q statements. AND
requires that both statements be true simultaneously for each returned bit of data. This makes AND
good for filtering down results. It decreases the amount of data returned.
For searches where you are interested in subsetting multiple columns, AND
can help you to quickly filter to only the set you want. Note that AND
can be used inside a Q
statement, or to add multiple Q
statements together:
Q('primary_diagnosis_site = "uterus, NOS" AND primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.339 sec 3339 ms
files : 2560
total : 104
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 104 |
researchsubject_identifier_system | count |
---|---|
PDC | 104 |
Query1 = Q('primary_diagnosis_site = "uterus, NOS"')
Query2 = Q('primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')
Query1.AND(Query2).researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.429 sec 3429 ms
files : 2560
total : 104
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 104 |
researchsubject_identifier_system | count |
---|---|
PDC | 104 |
Note that in those examples, we searched for one term in 'primary_diagnosis_site' and the other in 'primary_diagnosis_condition'. AND
always filters, so in most cases using AND
to search two different values in the same column will return no results. For example, primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":
Q('primary_diagnosis_site = "uterus" AND primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.267 sec 3267 ms
files : 0
total : 0
FROM
¶
There are some instances where asking for multiple values in the same column is a valid question: columns where a single row can have more than one value in the same field. A good example of this is 'subject_identifier_system', which is the field in the subject table that lists all of the sources for data for that subject:
Q('primary_diagnosis_site = "uterus"').subject.run(limit=3).to_dataframe()
Getting results from database
Total execution time: 0 min 3.23 sec 3230 ms
subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death |
---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
To ask for all the subjects that have 'primary_diagnosis_site = "uterus"' AND that have data from GDC, PDC, and IDC we use the special operator FROM
, which lets us make a query, and then search within the answer. This multi-step query allows us to search for multiple values in the same column.
For instance, to get 'primary_diagnosis_site = "uterus"' from subjects who have data at GDC and PDC we would start by building a query for uterus data from GDC, and save it to a variable:
firstquery = Q('primary_diagnosis_site = "uterus" FROM subject_identifier_system = "GDC"')
We can check to see what those results look like overall by running the query:
firstquery.subject.count.run()
Getting results from database
Total execution time: 0 min 3.289 sec 3289 ms
files : 157369
total : 717
sex | count |
---|---|
female | 705 |
None | 12 |
race | count |
---|---|
white | 476 |
black or african american | 121 |
not reported | 71 |
american indian or alaska native | 4 |
asian | 24 |
None | 12 |
native hawaiian or other pacific islander | 9 |
ethnicity | count |
---|---|
not hispanic or latino | 461 |
hispanic or latino | 20 |
not reported | 224 |
None | 12 |
cause_of_death | count |
---|---|
Not Reported | 96 |
None | 617 |
Cancer Related | 3 |
Not Cancer Related | 1 |
subject_identifier_system | count |
---|---|
IDC | 717 |
PDC | 100 |
GDC | 717 |
That looks fine, so lets add a query that looks within that result for PDC:
secondquery = firstquery.FROM('subject_identifier_system = "PDC"')
and finally run that query on the table of our choice:
secondquery.subject.count.run()
Getting results from database
Total execution time: 0 min 3.359 sec 3359 ms
files : 47447
total : 100
sex | count |
---|---|
female | 100 |
race | count |
---|---|
white | 58 |
not reported | 38 |
black or african american | 3 |
asian | 1 |
ethnicity | count |
---|---|
not hispanic or latino | 41 |
hispanic or latino | 4 |
not reported | 55 |
cause_of_death | count |
---|---|
Cancer Related | 3 |
Not Reported | 96 |
Not Cancer Related | 1 |
subject_identifier_system | count |
---|---|
PDC | 100 |
GDC | 100 |
IDC | 100 |
To further filter the results to only subjects with data in GDC, PDC, and IDC, we would add a third query for IDC and run it:
secondquery.FROM('subject_identifier_system = "IDC"').subject.count.run()
Getting results from database
Total execution time: 0 min 3.305 sec 3305 ms
files : 47447
total : 100
sex | count |
---|---|
female | 100 |
race | count |
---|---|
white | 58 |
not reported | 38 |
asian | 1 |
black or african american | 3 |
ethnicity | count |
---|---|
not hispanic or latino | 41 |
hispanic or latino | 4 |
not reported | 55 |
cause_of_death | count |
---|---|
Not Reported | 96 |
Cancer Related | 3 |
Not Cancer Related | 1 |
subject_identifier_system | count |
---|---|
GDC | 100 |
PDC | 100 |
IDC | 100 |
You can also do all of these steps in a single call if you prefer (the triple quotes allow us to make the query on multiple lines that are easier to read)
Q("""primary_diagnosis_site = 'uterus'
FROM subject_identifier_system = 'GDC'
FROM subject_identifier_system = 'PDC'
FROM subject_identifier_system = 'IDC'""").subject.count.run()
Getting results from database
Total execution time: 0 min 3.37 sec 3370 ms
files : 47447
total : 100
sex | count |
---|---|
female | 100 |
race | count |
---|---|
white | 58 |
not reported | 38 |
asian | 1 |
black or african american | 3 |
ethnicity | count |
---|---|
not hispanic or latino | 41 |
hispanic or latino | 4 |
not reported | 55 |
cause_of_death | count |
---|---|
Not Reported | 96 |
Cancer Related | 3 |
Not Cancer Related | 1 |
subject_identifier_system | count |
---|---|
GDC | 100 |
PDC | 100 |
IDC | 100 |
Remember that just because we're searching a subject field, doesn't mean we have to search the subject table. We could instead find all of the researchsubjects for primary_diagnosis_site = "uterus"
that are associated with a subject who has data from all 3 data sources:
secondquery.FROM('subject_identifier_system = "IDC"').researchsubject.run()
Getting results from database
Total execution time: 0 min 3.306 sec 3306 ms
Offset: 0 Count: 100 Total Row Count: 100 More pages: True
IN
and NOT IN
¶
For instances where you have many search terms, it may be easier (and more readable) to use IN
. With IN
you make a list of all the terms you are interested in, and ask whether they are IN
a given field:
Q('primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.279 sec 3279 ms
files : 299897
total : 4089
primary_diagnosis_condition | count |
---|---|
Squamous Cell Neoplasms | 609 |
Adenomas and Adenocarcinomas | 1265 |
Cystic, Mucinous and Serous Neoplasms | 348 |
Complex Mixed and Stromal Neoplasms | 294 |
Uterine Corpus Endometrial Carcinoma | 104 |
Myomatous Neoplasms | 184 |
None | 1175 |
Not Reported | 12 |
Complex Epithelial Neoplasms | 27 |
Epithelial Neoplasms, NOS | 26 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neuroepitheliomatous Neoplasms | 1 |
Mesonephromas | 5 |
primary_diagnosis_site | count |
---|---|
Cervix uteri | 915 |
Uterus, NOS | 2000 |
Uterus | 867 |
Cervix | 307 |
researchsubject_identifier_system | count |
---|---|
GDC | 2811 |
PDC | 104 |
IDC | 1174 |
The equivalent request without IN
would require a large number of OR
statements. (The triple quotes surrounding this example are to allow a multi-line Q statement):
Q("""primary_diagnosis_site = "uterus, NOS" OR
primary_diagnosis_site = "uterus" OR
primary_diagnosis_site = "Cervix" OR
primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()
NOT IN
is the opposite of IN
, and so gives the inverse results. If we add NOT
to our above query, we get all the researchsubjects who's primary_diagnosis_site was not in our list:
Q('primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.282 sec 3282 ms
files : 49133001
total : 150865
primary_diagnosis_condition | count |
---|---|
Adenomas and Adenocarcinomas | 31710 |
Epithelial Neoplasms, NOS | 5670 |
Ductal and Lobular Neoplasms | 7894 |
Squamous Cell Neoplasms | 4468 |
Nevi and Melanomas | 3158 |
Plasma Cell Tumors | 1066 |
Transitional Cell Papillomas and Carcinomas | 1885 |
None | 64943 |
Other | 212 |
Neoplasms, NOS | 1347 |
Neuroepitheliomatous Neoplasms | 1331 |
Cystic, Mucinous and Serous Neoplasms | 3375 |
Myeloid Leukemias | 4545 |
Gliomas | 4775 |
Myomatous Neoplasms | 449 |
Pancreatic Ductal Adenocarcinoma | 144 |
Breast Invasive Carcinoma | 316 |
Germ Cell Neoplasms | 703 |
Clear Cell Renal Cell Carcinoma | 116 |
Colon Adenocarcinoma | 164 |
Ovarian Serous Cystadenocarcinoma | 283 |
Glioblastoma | 100 |
Soft Tissue Tumors and Sarcomas, NOS | 301 |
Lymphoid Leukemias | 2072 |
Lung Squamous Cell Carcinoma | 118 |
Lung Adenocarcinoma | 216 |
Acinar Cell Neoplasms | 300 |
Mature B-Cell Lymphomas | 1019 |
Not Reported | 259 |
Paragangliomas and Glomus Tumors | 241 |
Fibromatous Neoplasms | 322 |
Complex Mixed and Stromal Neoplasms | 1533 |
Thymic Epithelial Neoplasms | 262 |
Lipomatous Neoplasms | 343 |
Acute Myeloid Leukemia | 41 |
Complex Epithelial Neoplasms | 227 |
Hepatocellular Carcinoma | 170 |
Mesothelial Neoplasms | 647 |
Leukemias, NOS | 123 |
Myelodysplastic Syndromes | 389 |
Synovial-like Neoplasms | 98 |
Oral Squamous Cell Carcinoma | 38 |
Meningiomas | 289 |
Osseous and Chondromatous Neoplasms | 615 |
Pediatric/AYA Brain Tumors | 199 |
Not Applicable | 441 |
Head and Neck Squamous Cell Carcinoma | 110 |
Granular Cell Tumors and Alveolar Soft Part Sarcomas | 23 |
Early Onset Gastric Cancer | 80 |
Chronic Myeloproliferative Disorders | 477 |
Hodgkin Lymphoma | 11 |
Trophoblastic neoplasms | 8 |
Unknown | 63 |
Adnexal and Skin Appendage Neoplasms | 58 |
Malignant Lymphomas, NOS or Diffuse | 42 |
Nerve Sheath Tumors | 115 |
Mature T- and NK-Cell Lymphomas | 94 |
Neoplasms of Histiocytes and Accessory Lymphoid Cells | 66 |
Rectum Adenocarcinoma | 30 |
Miscellaneous Tumors | 89 |
Specialized Gonadal Neoplasms | 124 |
Blood Vessel Tumors | 156 |
Immunoproliferative Diseases | 4 |
Fibroepithelial Neoplasms | 25 |
Mucoepidermoid Neoplasms | 60 |
Miscellaneous Bone Tumors | 130 |
Other Leukemias | 68 |
Precursor Cell Lymphoblastic Lymphoma | 12 |
Mast Cell Tumors | 10 |
Basal Cell Neoplasms | 45 |
Myxomatous Neoplasms | 18 |
Other Hematologic Disorders | 20 |
Odontogenic Tumors | 3 |
Giant Cell Tumors | 3 |
Papillary Renal Cell Carcinoma | 2 |
Lymphatic Vessel Tumors | 1 |
Chromophobe Renal Cell Carcinoma | 1 |
primary_diagnosis_site | count |
---|---|
Rectum | 1310 |
Corpus uteri | 780 |
Ovary | 4349 |
Thyroid gland | 1880 |
Kidney | 4954 |
Colon | 8575 |
Bones, joints and articular cartilage of limbs | 268 |
Hematopoietic and reticuloendothelial systems | 9635 |
Spinal cord, cranial nerves, and other parts of central nervous system | 3703 |
Breast | 24108 |
Heart, mediastinum, and pleura | 706 |
Bronchus and lung | 12385 |
Head-Neck | 3053 |
Adrenal gland | 851 |
Other and ill-defined sites | 1189 |
Esophagus | 1588 |
Connective, subcutaneous and other soft tissues | 1574 |
Liver and intrahepatic bile ducts | 1609 |
Brain | 4347 |
Base of tongue | 24 |
Lymph nodes | 538 |
Unknown | 3233 |
None | 341 |
Other and unspecified major salivary glands | 615 |
Other and unspecified parts of biliary tract | 226 |
Stomach | 1876 |
Chest | 28221 |
Lung | 4848 |
Head and Neck | 148 |
Pancreas | 3399 |
Prostate gland | 2354 |
Nasopharynx | 101 |
Thymus | 431 |
Tonsil | 46 |
Not Reported | 512 |
Other and unspecified female genital organs | 161 |
Other and unspecified parts of tongue | 133 |
Bladder | 2156 |
Retroperitoneum and peritoneum | 384 |
Skin | 3517 |
Palate | 5 |
Abdomen | 92 |
Ureter | 15 |
Testis | 542 |
Small intestine | 269 |
Other and ill-defined sites in lip, oral cavity and pharynx | 361 |
Peripheral nerves and autonomic nervous system | 418 |
Other and ill-defined digestive organs | 720 |
Liver | 667 |
Larynx | 169 |
Rectosigmoid junction | 80 |
Other and unspecified parts of mouth | 43 |
Abdomen, Mediastinum | 176 |
Various (11 locations) | 89 |
Head | 105 |
Prostate | 2139 |
Lung Phantom | 8 |
Various | 449 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Abdomen, Pelvis | 230 |
Testicles | 150 |
Thyroid | 507 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Adrenal Glands | 271 |
Mesothelium | 87 |
Intraocular | 80 |
Ear | 242 |
Floor of mouth | 56 |
Oropharynx | 194 |
Anus and anal canal | 235 |
Eye and adnexa | 222 |
Vulva | 10 |
Marrow, Blood | 89 |
Bile Duct | 51 |
Gallbladder | 265 |
Other endocrine glands and related structures | 181 |
Other and unspecified urinary organs | 217 |
Lip | 9 |
Bones, joints and articular cartilage of other and unspecified sites | 455 |
Meninges | 243 |
Hypopharynx | 25 |
Vagina | 72 |
Penis | 33 |
Gum | 11 |
Trachea | 7 |
Nasal cavity and middle ear | 40 |
Pancreas | 1 |
Other and ill-defined sites within respiratory system and intrathoracic organs | 2 |
Other and unspecified male genital organs | 1 |
Renal pelvis | 1 |
researchsubject_identifier_system | count |
---|---|
GDC | 83583 |
IDC | 64942 |
PDC | 2340 |
%
pattern matching¶
While OR
is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything with "uter" we get:
unique_terms("primary_diagnosis_site").to_list(filters="uter")
['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
The %
operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:
Q('primary_diagnosis_site = "uter%"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.397 sec 3397 ms
files : 258365
total : 2867
primary_diagnosis_condition | count |
---|---|
Myomatous Neoplasms | 184 |
Uterine Corpus Endometrial Carcinoma | 104 |
Adenomas and Adenocarcinomas | 1038 |
Not Reported | 12 |
Complex Mixed and Stromal Neoplasms | 294 |
None | 867 |
Epithelial Neoplasms, NOS | 20 |
Cystic, Mucinous and Serous Neoplasms | 313 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Trophoblastic neoplasms | 13 |
Neoplasms, NOS | 3 |
Complex Epithelial Neoplasms | 2 |
Mesonephromas | 2 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 2000 |
Uterus | 867 |
researchsubject_identifier_system | count |
---|---|
GDC | 1896 |
PDC | 104 |
IDC | 867 |
Because the %
is at the end of "uter" this query returns anything that starts with "uter", depending on your question, you may want to move the %
, or add more of them:
Q('primary_diagnosis_site = "%uter"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.279 sec 3279 ms
files : 0
total : 0
Q('primary_diagnosis_site = "%uter%"').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.219 sec 3219 ms
files : 300891
total : 4562
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
Adenomas and Adenocarcinomas | 1672 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Myomatous Neoplasms | 188 |
Squamous Cell Neoplasms | 609 |
Complex Mixed and Stromal Neoplasms | 320 |
Not Reported | 12 |
None | 868 |
Complex Epithelial Neoplasms | 27 |
Epithelial Neoplasms, NOS | 230 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Cervix uteri | 915 |
Uterus, NOS | 2000 |
Corpus uteri | 780 |
Uterus | 867 |
researchsubject_identifier_system | count |
---|---|
PDC | 104 |
GDC | 3591 |
IDC | 867 |
There may be cases in which you want to filter out all of the data with some partial word in it, in which case, you can combine %
with !=
:
Q('sex != "f%"').subject.count.run()
Getting results from database
Total execution time: 0 min 3.237 sec 3237 ms
files : 47800162
total : 94904
sex | count |
---|---|
None | 54299 |
male | 40252 |
unknown | 83 |
not reported | 265 |
unspecified | 5 |
race | count |
---|---|
None | 54299 |
american indian or alaska native | 57 |
not reported | 9992 |
white | 23670 |
chinese | 65 |
Unknown | 2079 |
black or african american | 1832 |
asian | 1351 |
other | 418 |
not allowed to collect | 1106 |
native hawaiian or other pacific islander | 27 |
unknown | 8 |
ethnicity | count |
---|---|
not hispanic or latino | 23258 |
not reported | 11917 |
None | 54299 |
Unknown | 2357 |
hispanic or latino | 1479 |
not allowed to collect | 1586 |
unknown | 8 |
cause_of_death | count |
---|---|
Not Reported | 307 |
None | 94449 |
HCC recurrence | 5 |
Infection | 3 |
Cancer Related | 98 |
Not Cancer Related | 10 |
Unknown | 25 |
Surgical Complications | 3 |
Cerebral Hemorrhage | 1 |
Cardiovascular Disorder, NOS | 2 |
Metastasis | 1 |
subject_identifier_system | count |
---|---|
IDC | 59747 |
PDC | 1119 |
GDC | 40676 |
IS and IS NOT¶
In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually null
, that is, the field is empty. In order to search for emptiness, you need to use the special function IS
:
Q('primary_diagnosis_condition IS null').researchsubject.count.run()
Getting results from database
Total execution time: 0 min 3.29 sec 3290 ms
files : 48488955
total : 66118
primary_diagnosis_condition | count |
---|---|
None | 66118 |
primary_diagnosis_site | count |
---|---|
Breast | 14676 |
Head-Neck | 3053 |
Colon | 1491 |
Chest | 28221 |
Lung | 4728 |
Ovary | 664 |
Abdomen | 92 |
Abdomen, Mediastinum | 176 |
Brain | 2583 |
Bladder | 431 |
Skin | 628 |
Uterus | 867 |
Various (11 locations) | 89 |
Marrow, Blood | 89 |
Kidney | 1373 |
Pancreas | 502 |
Head | 105 |
Liver | 497 |
Prostate | 2139 |
Lung Phantom | 8 |
Various | 449 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Abdomen, Pelvis | 230 |
Testicles | 150 |
Rectum | 171 |
Stomach | 443 |
Thyroid | 507 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Cervix | 307 |
Esophagus | 187 |
Mesothelium | 87 |
Adrenal Glands | 271 |
Thymus | 125 |
Ear | 242 |
Bile Duct | 51 |
Intraocular | 80 |
Pancreas | 1 |
Cervix uteri | 1 |
None | 1 |
researchsubject_identifier_system | count |
---|---|
IDC | 66116 |
GDC | 2 |
Probably more common, is to want to filter out the empty fields, in which case you use its companion function IS_NOT
:
Q('sex IS NOT null').subject.count.run()
Getting results from database
Total execution time: 0 min 3.408 sec 3408 ms
files : 3624760
total : 86503
sex | count |
---|---|
not reported | 265 |
female | 45898 |
male | 40252 |
unspecified | 5 |
unknown | 83 |
race | count |
---|---|
white | 49591 |
black or african american | 4617 |
not reported | 21980 |
chinese | 90 |
asian | 2965 |
unknown | 14 |
Unknown | 4063 |
not allowed to collect | 2058 |
native hawaiian or other pacific islander | 56 |
american indian or alaska native | 117 |
other | 952 |
ethnicity | count |
---|---|
not hispanic or latino | 48814 |
not reported | 26272 |
hispanic or latino | 3179 |
unknown | 19 |
Unknown | 4567 |
not allowed to collect | 3652 |
cause_of_death | count |
---|---|
Not Reported | 750 |
None | 85496 |
HCC recurrence | 7 |
Cancer Related | 175 |
Not Cancer Related | 14 |
Unknown | 45 |
Infection | 6 |
Metastasis | 2 |
Surgical Complications | 3 |
Cancer cell proliferation | 1 |
Cerebral Hemorrhage | 1 |
Cardiovascular Disorder, NOS | 3 |
subject_identifier_system | count |
---|---|
PDC | 2338 |
GDC | 85721 |
IDC | 12299 |
Greater and Less than¶
While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:
>
: Greater than<
: Less than>=
: Greater than or Equal to<=
: Less than or Equal to
These can all be used in place of the =
sign in queries where you are filtering by a numeric value. In this search, we find all the subjects who were over 50 years old when they entered the study. As the study entry date is day 0, days_to_birth
is reported as a negative number.
Q('days_to_birth <= 50 * -365').specimen.run().to_dataframe()
Getting results from database
Total execution time: 0 min 3.353 sec 3353 ms
specimen_id | specimen_identifier | specimen_associated_project | days_to_collection | primary_disease_type | anatomical_site | source_material_type | specimen_type | derived_from_specimen | derived_from_subject | subject_id | researchsubject_id |
---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
Note that since the number is negative, greater than and less than signs need to be reversed from how they would work with positive numbers. So to get specimens between 20 and 50 years old we need to convert those ages to days_to_birth
and then search for days_to_birth
numbers less than the 20 year calculation and more than the 50 year:
20-365 = -7300 50-365 = -18250
Q('days_to_birth >= 50* -365 AND days_to_birth <= 20 * -365').specimen.run().to_dataframe()
Getting results from database
Total execution time: 0 min 3.26 sec 3260 ms
specimen_id | specimen_identifier | specimen_associated_project | days_to_collection | primary_disease_type | anatomical_site | source_material_type | specimen_type | derived_from_specimen | derived_from_subject | subject_id | researchsubject_id |
---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
Query1 = Q('days_to_birth >= 50 * -365')
Query2 = Q('days_to_birth <= 20 * -365')
Query1.AND(Query2).subject.count.run()
Getting results from database
Total execution time: 0 min 3.26 sec 3260 ms
files : 91994
total : 167
sex | count |
---|---|
female | 105 |
male | 62 |
race | count |
---|---|
unknown | 3 |
white | 71 |
not reported | 75 |
asian | 10 |
black or african american | 6 |
Unknown | 1 |
american indian or alaska native | 1 |
ethnicity | count |
---|---|
not hispanic or latino | 62 |
hispanic or latino | 9 |
not reported | 91 |
unknown | 5 |
cause_of_death | count |
---|---|
Cancer Related | 25 |
Not Reported | 29 |
None | 104 |
Infection | 1 |
Unknown | 7 |
Surgical Complications | 1 |
subject_identifier_system | count |
---|---|
PDC | 167 |
GDC | 152 |
IDC | 153 |
Q('days_to_birth >= 50 * -365').specimen.count.run()
Getting results from database
Total execution time: 0 min 3.735 sec 3735 ms
files : 31643
total : 4841
primary_disease_type | count |
---|---|
Adenomas and Adenocarcinomas | 1136 |
Glioblastoma | 48 |
Breast Invasive Carcinoma | 52 |
Other | 54 |
Cystic, Mucinous and Serous Neoplasms | 1343 |
Ductal and Lobular Neoplasms | 1227 |
Acute Myeloid Leukemia | 36 |
Lung Adenocarcinoma | 48 |
Clear Cell Renal Cell Carcinoma | 90 |
Uterine Corpus Endometrial Carcinoma | 18 |
Ovarian Serous Cystadenocarcinoma | 64 |
Pancreatic Ductal Adenocarcinoma | 50 |
Head and Neck Squamous Cell Carcinoma | 22 |
Gliomas | 322 |
Squamous Cell Neoplasms | 275 |
Rectum Adenocarcinoma | 8 |
Lung Squamous Cell Carcinoma | 28 |
Chromophobe Renal Cell Carcinoma | 4 |
Papillary Renal Cell Carcinoma | 8 |
Colon Adenocarcinoma | 8 |
source_material_type | count |
---|---|
Primary Tumor | 2896 |
Metastatic | 29 |
Solid Tissue Normal | 646 |
Recurrent Tumor | 49 |
Blood Derived Normal | 1095 |
Cell Lines | 36 |
Blood Derived Cancer - Peripheral Blood, Post-treatment | 18 |
Blood Derived Cancer - Peripheral Blood | 10 |
Normal | 18 |
Tumor | 18 |
Not Reported | 18 |
Blood Derived Cancer - Bone Marrow | 6 |
Blood Derived Cancer - Bone Marrow, Post-treatment | 2 |
specimen_type | count |
---|---|
portion | 798 |
aliquot | 2070 |
sample | 879 |
analyte | 887 |
slide | 207 |
specimen_identifier_system | count |
---|---|
PDC | 538 |
GDC | 4303 |
ORDER_BY¶
It can be useful to have results sorted before they are returned. The CDA database by default gives unordered results, which means that the first 100 results for a query with more than 100 total results will change slightly each time it is run. If you want to always see the same first 100, or order results for some other reason, use the ORDER_BY
function to choose the column to sort, and how it is sorted:
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id").researchsubject.run().to_dataframe()
Getting results from database
Total execution time: 0 min 3.286 sec 3286 ms
researchsubject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
---|---|---|---|---|---|
Loading... (need help?) |
By default, sorting is lowest to highest/A-Z. To reverse the sort, add :-1
to the ORDER_BY
statement:
Q('primary_diagnosis_site = "%uter%"').ORDER_BY("subject_id:-1").researchsubject.run().to_dataframe()
Getting results from database
Total execution time: 0 min 3.195 sec 3195 ms
researchsubject_id | researchsubject_identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
---|---|---|---|---|---|
Loading... (need help?) |
Note that if you are using itables, as we are here, you can sort the results using the arrows next to the column names. However this sorts the results after they are returned to you, whereas ORDER_BY
sorts the results in the database before they are returned. If you always want to get the same first 100 results from a query that has 1000 total results, use ORDER_BY
. If you just want the results that you are looking at sorted, use the arrows.