Getting Results (Pagination)¶

The CDA provides a custom python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:

Q.() builds a query that can be used by run() or count()
Q.run() returns data for the specified search
Q.count() returns summary information (counts) data that fit the specified search
columns() returns entity field names
unique_terms() returns entity field contents

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets view entity field contents

We're also importing functions from several other packages to make viewing and manipulating tables easier. The opt. settings are pre-configuring how itables should display our tables, with scrolling and paging enabled. Finally, we're telling cdapython to report it's version so we can be sure we're using the one we mean to:

In [1]:

            
                Copied!
                
                    
                    
                
                

        
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())
from cdapython import Q, columns, unique_terms, query
import numpy as np
import pandas as pd
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes=0
opt.scrollX="200px"
opt.scrollCollapse=True
opt.paging=True
opt.maxColumns=0
print(Q.get_version())

2022.12.21

The CDA indexes tens of thousands of subjects, researchsubjects, speciments, and their diagnosis data. CDA also indexes more than 45 million files. To keep search results from being overwhelming, CDA limits search results to the first 100 records by default:

In [2]:

            
                Copied!
                
myquery = Q('primary_diagnosis_site = "brain"')
brainresults = myquery.subject.run()
brainresults
myquery = Q('primary_diagnosis_site = "brain"')
brainresults = myquery.subject.run()
brainresults

Getting results from database

                        Total execution time: 0
                        min 3.332 sec 3332 ms

Out[2]:

            
            Offset: 0
            Count: 100
            Total Row Count: 3015
            More pages: True

Offset: This is how many rows of information we've told the query to skip in the data, here we didn't tell it to skip anything, so the offset is zero
Count: This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
Total Row Count: This is how many rows are in the full results table
More pages: This is always a True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.

This preview behaviour is great for search, but not when you are ready to download the information. The pagination feature is how you retrieve all the results for your final query and not just the first 100.

You can have your full results output to a dataframe or a list.

Results to a dataframe¶

The easiest way to get results to a dataframe is to use the auto_paginator(), which directly turns your results into a full dataframe:

In [3]:

            
                Copied!
                
paginated_brain_results = brainresults.auto_paginator(to_df=True)
paginated_brain_results = brainresults.auto_paginator(to_df=True)

In [4]:

            
                Copied!
                
paginated_brain_results # view the dataframe
paginated_brain_results # view the dataframe

Out[4]:

	subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death
Loading... (need help?)

If you'd like more control over your output, you can use the paginator. It will also get all the data, but requires you to write the looping code.

In the simplest case, you create an empty dataframe for the data to land in, then use the paginator in a loop to get all the results:

In [5]:

            
                Copied!
                
mydf = pd.DataFrame()
for i in myquery.subject.run().paginator(to_df=True):
    mydf = pd.concat([mydf, i])
mydf = pd.DataFrame()
for i in myquery.subject.run().paginator(to_df=True):
    mydf = pd.concat([mydf, i])

Getting results from database

                        Total execution time: 0
                        min 3.439 sec 3439 ms

In [6]:

            
                Copied!
                
mydf  # view the dataframe
mydf  # view the dataframe

Out[6]:

	subject_id	subject_identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	days_to_death	cause_of_death
Loading... (need help?)

Results to a list¶

If you want results in a list, you'll need to use the paginator. Saving to a list works similarly to the dataframe call. The differences are:

initiate a list not a DataFrame
change to_list=True
change the concat index to extend()

In [7]:

            
                Copied!
                
mylist = []
for i in myquery.subject.run().paginator(to_list=True):
    mylist.extend(i)
mylist = []
for i in myquery.subject.run().paginator(to_list=True):
    mylist.extend(i)

Getting results from database

                        Total execution time: 0
                        min 3.423 sec 3423 ms

This gives back the correct number of results:

In [8]:

            
                Copied!
                
len(mylist)
len(mylist)

Out[8]:

And we can preview the first result to see that it has the same values:

In [9]:

            
                Copied!
                
mylist[0:1]
mylist[0:1]

Out[9]:

[{'subject_id': '900-00-5341',
  'subject_identifier': [{'system': 'IDC', 'value': '900-00-5341'}],
  'species': 'homo sapiens',
  'sex': None,
  'race': None,
  'ethnicity': None,
  'days_to_birth': None,
  'subject_associated_project': ['rembrandt'],
  'vital_status': None,
  'days_to_death': None,
  'cause_of_death': None}]

Last update: 2022-11-03