Getting Results (Pagination)¶
The CDA provides a custom python tool for searching CDA data. Q
(short for Query) offers several ways to search and filter data, and several input modes:
- Q.() builds a query that can be used by
run()
orcount()
- Q.run() returns data for the specified search
- Q.count() returns summary information (counts) data that fit the specified search
- columns() returns entity field names
- unique_terms() returns entity field contents
Before we do any work, we need to import several functions from cdapython:
Q
andquery
which power the searchcolumns
which lets us view entity field namesunique_terms
which lets view entity field contents
We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt.
settings are pre-configuring how itables should display our tables, with scrolling and paging enabled.
Finally, we're telling cdapython to report it's version so we can be sure we're using the one we mean to:
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
2022.12.21
The CDA indexes tens of thousands of subjects, researchsubjects, speciments, and their diagnosis data. CDA also indexes more than 45 million files. To keep search results from being overwhelming, CDA limits search results to the first 100 records by default:
myquery = Q('primary_diagnosis_site = "brain"')
brainresults = myquery.subject.run()
brainresults
Getting results from database
Total execution time: 0 min 3.332 sec 3332 ms
Offset: 0 Count: 100 Total Row Count: 3015 More pages: True
- Offset: This is how many rows of information we've told the query to skip in the data, here we didn't tell it to skip anything, so the offset is zero
- Count: This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- Total Row Count: This is how many rows are in the full results table
- More pages: This is always a True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.
This preview behaviour is great for search, but not when you are ready to download the information. The pagination feature is how you retrieve all the results for your final query and not just the first 100.
You can have your full results output to a dataframe or a list.
Results to a dataframe¶
The easiest way to get results to a dataframe is to use the auto_paginator()
, which directly turns your results into a full dataframe:
paginated_brain_results = brainresults.auto_paginator(to_df=True)
paginated_brain_results # view the dataframe
subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
If you'd like more control over your output, you can use the paginator
. It will also get all the data, but requires you to write the looping code.
In the simplest case, you create an empty dataframe for the data to land in, then use the paginator in a loop to get all the results:
mydf = pd.DataFrame()
for i in myquery.subject.run().paginator(to_df=True):
mydf = pd.concat([mydf, i])
Getting results from database
Total execution time: 0 min 3.439 sec 3439 ms
mydf # view the dataframe
subject_id | subject_identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
Results to a list¶
If you want results in a list, you'll need to use the paginator. Saving to a list works similarly to the dataframe call. The differences are:
- initiate a list not a DataFrame
- change
to_list=True
- change the concat index to
extend()
mylist = []
for i in myquery.subject.run().paginator(to_list=True):
mylist.extend(i)
Getting results from database
Total execution time: 0 min 3.423 sec 3423 ms
This gives back the correct number of results:
len(mylist)
3015
And we can preview the first result to see that it has the same values:
mylist[0:1]
[{'subject_id': '900-00-5341', 'subject_identifier': [{'system': 'IDC', 'value': '900-00-5341'}], 'species': 'homo sapiens', 'sex': None, 'race': None, 'ethnicity': None, 'days_to_birth': None, 'subject_associated_project': ['rembrandt'], 'vital_status': None, 'days_to_death': None, 'cause_of_death': None}]