Note: This is not (yet) a manuscript. We are still at the data cleaning/alignment stage and it is far too early to draw conclusions. Rather, this is a regularly updated report that I am sharing with you to keep you in the loop on my work and/or because you are also working on NAACCR, i2b2, Epic, or Sunrise. I value your perspective, and my results might be useful to your own work.

Only de-identified data has been used to generate these results. Any dates or patient_num values you see here are also de-identified (with the size of time intervals preserved).

This portion of the study is under Dr. Michalek’s exempt project IRB number HSC20170563N. If you are a researcher who would like a copy of the data, please email me and I will get back to you with further instructions and any additional information needed for our records.

Yellow highlights are items that I know I need to deal with soon. Verbatim names of files, variables/elements, or values are displayed in a special style, like this. Data element names are in addition linked to a glossary at the end of this document, e.g. Surgical Oncology. This is where any relevant cleaning or transformation steps will be described (in progress). Data elements from NAACCR usually have a NAACCR ID preceding them, e.g. 1780 Quality of Survival. I try to use the term ‘data element’ to describe data in its raw state and ‘variable’ to refer to analysis-ready data that I have already processed. Often one variable incorporates information from multiple data elements. Tables, figures, and sections are also linked from text that references them. If you have a Word version of this document, to follow a link, please hold down the ‘control’ key and click on it. The most current version of this document can be found online at https://rpubs.com/bokov/kidneycancer and it has a built-in chat session.

1 Overview

A recent study of state death records [1] reports that among US-born Texans of Hispanic ancestry (7.3 million, 27% of the State’s population), annual age-adjusted mortality rates for kidney cancer are 1.5-fold and 1.4-fold those of non-Hispanic whites for males and females respectively. My goal is to determine whether these findings can be replicated at UT Health (Aim 2) and Massachusetts General Hospital (Aim 3). If there is evidence for an ethnic disparity, I will look for possible mediators of this disparity among socioeconomic, lifestyle, and family history variables (Aim 2a). Otherwise the focus will shift to determining which of these same variables are the best predictors of mortality and recurrence.

At the Clinical Informatics Research Division (CIRD) we operate an i2b2 [2] data warehouse containing de-identified data for over 1.3 million patients from the electronic medical record (EMR) systems of the UT Health faculty practice and the University Health System (UHS) county hospital. We use the HERON [3] extract, transform, load (ETL) process to link data from multiple sources. These include copies of the monthly reports that the Mays Cancer Center sends to the Texas Cancer Registry, which contain detailed information on cancer cases including dates of diagnosis, surgery, and recurrence along with stage and grade at presentation. My first-pass eligibility query returns 2327 patients having one or more of the following in their records: an ICD9 code of 189.0 or any ICD10 code starting with C64; the NAACCR item 0400 Primary Site having a value starting with C64 (Kidney, NOS); or the SEER Primary Site having a value of Kidney and Renal Pelvis.
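The first-pass criteria above can be sketched as a single predicate. This is an illustrative Python sketch, not the i2b2 query or the R scripts used for this project; the function name and its parameters are hypothetical, and it assumes codes arrive as plain strings.

```python
def is_first_pass_eligible(icd9_codes=(), icd10_codes=(),
                           naaccr_primary_site=None, seer_primary_site=None):
    """Return True if a patient meets at least one first-pass criterion.

    All argument names are hypothetical; codes are assumed to be strings.
    """
    # EMR criterion 1: ICD9 code of exactly 189.0
    if "189.0" in icd9_codes:
        return True
    # EMR criterion 2: any ICD10 code starting with C64
    if any(code.startswith("C64") for code in icd10_codes):
        return True
    # NAACCR 0400 Primary Site starting with C64 (Kidney, NOS)
    if naaccr_primary_site is not None and naaccr_primary_site.startswith("C64"):
        return True
    # SEER Primary Site equal to Kidney and Renal Pelvis
    return seer_primary_site == "Kidney and Renal Pelvis"
```

The criteria are combined with a logical OR, so a patient documented only in the EMR or only in NAACCR is still returned by the first pass.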

My second-pass criteria narrow the initial cohort to patients that have NAACCR records, defined as having a non-missing 0390 Date of Diagnosis and one or both of Kidney, NOS or Kidney and Renal Pelvis. As can be seen from table I, only 486 of the patients met these criteria and 1841 did not. A total of 673 patients actually had NAACCR records, but 187 of them had kidney cancer documented only in the EMR, with neither Kidney, NOS nor Kidney and Renal Pelvis in NAACCR. The next time I re-run my i2b2 query I will include all site-of-occurrence information from NAACCR, not just kidney. This will allow me to find out what types of cancer these patients do in fact have. In Appendix 3.2.1-Appendix 3.2.3 I identified additional exclusion criteria, which I will implement in the next major revision of this document.
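The cohort arithmetic in this paragraph can be checked directly. The sketch below is illustrative Python with hypothetical names; only the four counts come from the text.

```python
def meets_second_pass(date_of_diagnosis, has_kidney_nos, has_seer_kidney):
    """Second-pass rule: non-missing diagnosis date and at least one kidney site flag."""
    return date_of_diagnosis is not None and (has_kidney_nos or has_seer_kidney)

# Counts reported in the text
first_pass_total = 2327
with_naaccr_records = 673
naaccr_without_kidney_site = 187

# Patients meeting the second-pass criteria, and those that do not
second_pass = with_naaccr_records - naaccr_without_kidney_site
not_second_pass = first_pass_total - second_pass
```

With these inputs `second_pass` is 486 and `not_second_pass` is 1841, matching the counts in table I.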

In sec. 2.1 I summarize the evidence that NAACCR and EMR records are correctly matched with each other. In sec. 2.2 I summarize the minimum set of NAACCR data elements that is sufficient to replicate my analysis in an independent NAACCR data set. In sec. 2.3 I report the extent to which the completeness of NAACCR records can be improved by using EMR records of the same patients. Sec. 3 is a technical demonstration of the data analysis scripts (on a small random sample). Sec. 4 characterizes the full (N=2327) patient cohort. Finally, in sec. 5 I present my plans for overcoming the data issues I found, replicating the analysis on independent data, preparing additional variables, and starting work on Aim 1.

2 Data preparation

2.1 Verifying correct patient linkage

Since this is the first study at our site to make such extensive use of combined EMR and NAACCR data, it is important to first validate the data linkage done by our ETL.

The following data elements exist in both NAACCR and the EMR, respectively: date of birth (0240 Date of Birth and birth_date), marital status (0150 Marital Status at DX and Marital Status), sex (0220 Sex and sex_cd), race (Race (NAACCR 0160-0164) and race_cd), and Hispanic ethnicity (0190 Spanish/Hispanic Origin and Hispanic or Latino). The agreement between NAACCR and the EMR is never going to be 100%, with race, Hispanic ancestry, and marital status expected to be especially variable. Nonetheless, if record linkage is correct, then when patient counts for NAACCR and EMR are tabulated against each of the above variables, most of the values should agree.

I confirmed that this is the case for marital status (table VII), sex (table VIII), race (table IX), and Hispanic ancestry (table X). Furthermore, there are 0 eligible patients lacking a 0240 Date of Birth and only 15 with a mismatch between 0240 Date of Birth and birth_date. Independent evidence for correct linkage is that EMR ICD9/10 codes for primary kidney cancer rarely precede 0390 Date of Diagnosis (fig. 5), EMR surgical history of nephrectomy and ICD9/10 codes for acquired absence of a kidney rarely precede 1200 RX Date--Surgery or 3170 RX Date--Most Defin Surg (fig. 6), and death dates from non-NAACCR sources (Death, i2b2, Deceased per SSA, and Expired) rarely precede 1760 Vital Status (fig. 10).
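As a worked example of what "most of the values should agree" means, the sketch below computes the proportion of concordant patients from the sex cross-tabulation in table VIII, restricted to patients who have both NAACCR and EMR values. It is illustrative Python with a hypothetical function name, not the code that generated the tables.

```python
def agreement_rate(crosstab):
    """Proportion of patients whose NAACCR and EMR values match.

    crosstab maps (naaccr_value, emr_value) -> patient count and should
    contain only patients that have a value from both sources.
    """
    total = sum(crosstab.values())
    concordant = sum(n for (naaccr, emr), n in crosstab.items() if naaccr == emr)
    return concordant / total

# Counts from table VIII (rows: NAACCR, columns: EMR), excluding the NA row
sex_crosstab = {
    ("m", "m"): 428, ("m", "f"): 1,
    ("f", "m"): 9,   ("f", "f"): 235,
}
sex_agreement = agreement_rate(sex_crosstab)  # 663 concordant out of 673
```

For sex this works out to 663 of 673 patients, or roughly 98.5% agreement.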

2.2 Required NAACCR data elements

The primary outcome variables I need are date of initial diagnosis, date of surgery (if any), date of recurrence (if any), and date of death (if any). The primary predictor variable is whether or not a patient is Hispanic. There are many covariates of interest, but these five values are the scaffolding on which the rest of the analysis will be built.

I found the following NAACCR elements sufficient for deriving all the above analytic variables: 0190 Spanish/Hispanic Origin, 1880 Recurrence Type--1st, 3170 RX Date--Most Defin Surg, 1340 Reason for No Surgery, 0390 Date of Diagnosis, 1200 RX Date--Surgery, 1750 Date of Last Contact, 1760 Vital Status, 1770 Cancer Status, 1860 Recurrence Date--1st, Kidney and Renal Pelvis, and Kidney, NOS. More details about how these were selected can be found in Appendix 3.2. In addition the following will almost certainly be needed for covariates or mediators: 0220 Sex, 0240 Date of Birth, 0150 Marital Status at DX, 0250 Birthplace, and any field whose name contains Race, Comorbid/Complication, AJCC, or TNM. For crosschecking it will also be useful to have 2850 CS Mets at DX, 0580 Date of 1st Contact, and 0446 Multiplicity Counter. Additional items are likely to be needed as this project evolves, but the elements listed so far should be sufficient to replicate my analysis on de-identified State or National NAACCR data.

2.3 Merging NAACCR and EMR variables

EMR records can not only enrich the data with additional elements unavailable in NAACCR alone, but might also make it possible to fill in missing 0390 Date of Diagnosis, 3170 RX Date--Most Defin Surg / 1200 RX Date--Surgery, 1860 Recurrence Date--1st, and 1750 Date of Last Contact values. It may even be possible to reconstruct entire records for the 1841 kidney cancer patients in the EMR lacking NAACCR records. However, this depends on how much the EMR and NAACCR versions of a variable agree when neither is missing.

Data elements representing date of death and Hispanic ethnicity are in sufficient agreement (table X and Appendix 3.2.4) to justify merging information from the EMR and NAACCR. The process for combining them is described in the Death, Hispanic (strict), and Hispanic (broad) sections of Appendix 4. At this time I cannot merge diagnosis, surgery, or recurrence: where data from both sources is available, EMR dates lag considerably behind NAACCR dates (Appendix 3.2.1-Appendix 3.2.3) and their variability is probably larger than the effect size. The surgery and recurrence lags might arise because those actual visits are not yet available in the data warehouse, so I am only seeing them as reflected in the patient history at visits long after the fact. The diagnosis lag may be due to the decision to proceed with surgery often being made based on imaging data [4], with definitive pathology results only available after surgery (Appendix 3.2.2). Attempting to merge these elements would bias the data and obscure the actual differences. However, there are several ways forward that I will discuss in sec. 5 below.

EMR data can still be used to flag records for exclusion pending verification by chart review in cases where EMR codes for kidney cancer or secondary tumors precede Diagnosis or Recurrence respectively. This can also apply to nephrectomy EMR codes and Surgery, but I will need to distinguish between the prior nephrectomy being due to cancer versus other indications.

For now I am analyzing the data as if I only have access to NAACCR, except for mortality, which I analyze both without (fig. 3) and with (fig. 4) the EMR.

3 Plots of test data

The point of this section is solely to test whether my scripts succeeded in turning the raw data elements into time-to-event (TTE) variables to which Kaplan-Meier curves can be fit without numeric errors or grossly implausible results. All the plots below are from a small random sample of the data: N=127 (82 Hispanic and 45 non-Hispanic white, with 5 unknown excluded). This is further reduced in some cases as described in the figure captions. These sample sizes are not sufficient to detect clinically significant differences and, again, this is not the goal yet. The intent is only to ensure that my software performs correctly while keeping myself blinded to the hold-out data on which the hypothesis testing will ultimately be done.
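To make the notion of a time-to-event variable concrete, here is a minimal Python sketch of how one could be derived from a start date, an optional event date, and a last-contact date. It is not the R code used for this report; the function name, its arguments, and the choice to express the follow-up cap in weeks are assumptions made for illustration.

```python
from datetime import date

def time_to_event(start, event_date, last_contact, max_weeks=None):
    """Return (weeks, observed) for one patient.

    start        -- time 0 (e.g. date of diagnosis or surgery)
    event_date   -- date of the outcome, or None if never observed
    last_contact -- last follow-up date, used when the outcome is missing
    max_weeks    -- optional follow-up cap; later events become censored
    """
    if event_date is not None:
        weeks, observed = (event_date - start).days / 7, True
    else:
        weeks, observed = (last_contact - start).days / 7, False
    if max_weeks is not None and weeks > max_weeks:
        # Administrative censoring at the end of the follow-up period
        weeks, observed = max_weeks, False
    return weeks, observed

# Surgery three weeks after diagnosis: an observed event at week 3
example = time_to_event(date(2010, 1, 1), date(2010, 1, 22), date(2012, 1, 1))
```

The `max_weeks` cap corresponds to treating events beyond the follow-up period as censored, as is done for fig. 1.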

Furthermore, these survival curves are not yet adjusted for covariates such as age or stage at diagnosis. There are also refinements planned to the exclusion criteria which I discuss below in sec. 5.

In all the plots below, the time is expressed in weeks and + signs denote censored events (the last follow-up of patients for whom the respective outcomes were never observed). The lightly-shaded regions around each line are 95% confidence intervals.
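For readers unfamiliar with how censored observations enter a Kaplan-Meier curve, the sketch below implements the product-limit estimator in plain Python. It is an illustration only: the plots in this report were not produced by this code, the function name is hypothetical, and confidence intervals are omitted.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up times (here, weeks)
    events -- True if the outcome was observed, False if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without lowering the curve
        at_risk -= n_leaving
    return curve
```

Censored patients (the + signs in the plots) never cause a step down; they only shrink the number at risk, which makes later steps larger.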

Typically 2-4 weeks elapse between diagnosis and surgery, and providers try not to exceed 4 weeks. Nevertheless, years may sometimes elapse due to factors such as indolent tumors or loss of contact with the patient. About 15% of patients never undergo surgery [4]. Fig. 1 is in agreement with this. It can also be seen in fig. 1 that 34 surgeries seem to happen on the day of diagnosis. This is plausible if NAACCR diagnosis is based on pathology rather than clinical examination, where a positive result is usually coded as a renal mass, not a cancer. In my next data update I intend to also include all ICD9/10 codes for renal mass, at which point I will revisit the question of using EMR data to fill in missing diagnosis dates (see sec. 5).


Figure 1: Number of weeks elapsed from Diagnosis (time 0) to Surgery for 82 Hispanic and 45 non-Hispanic white patients with a 3-year follow-up period (any surgeries occurring more than 3 years post-diagnosis are treated as censored).

Figure 2: Number of weeks elapsed from Surgery (time 0) to Recurrence for 67 Hispanic and 34 non-Hispanic white patients. The numbers are lower than for fig. 1 because patients not undergoing surgery are excluded. Here the follow-up period is six years.


Figure 3: Like fig. 2 except now the outcome is 1760 Vital Status for 67 Hispanic and 34 non-Hispanic white patients. Six-year follow-up.

Figure 4: Like fig. 3 but now supplemented with EMR information to see how much of a difference it makes. For the predictor, Hispanic (broad) is used instead of Hispanic (NAACCR), and for the outcome, Death is used instead of 1760 Vital Status. There were 68 Hispanic and 33 non-Hispanic white patients. There were 10 fewer censored events than in fig. 3, which may improve sensitivity in the actual analysis.


4 Cohort Characterization

The variables below are subject to change as the data validation and preparation processes evolve.

Table I: Summary of all the variables in the combined i2b2/NAACCR set broken up by Recurrence Status. Disease-free and Never disease-free have the same meanings as codes 00 and 70 in the NAACCR definition for 1880 Recurrence Type--1st. Recurred is any code other than 00, 70, or 99, and Unknown if recurred or was ever gone is 99. Not in NAACCR means there is an EMR diagnosis of kidney cancer; in some cases there may also be a record for that patient in NAACCR, but it does not indicate kidney as the principal site.
  Disease-free Never disease-free Recurred Unknown if recurred or was ever gone Not in NAACCR
n 160 211 95 20 1841
Age at Last Contact, combined (mean (sd)) 54.32 (20.42) 63.43 (13.76) 62.51 (15.23) 55.59 (23.01) 61.34 (14.18)
a_hsp_broad (%)
  Hispanic 106 ( 66.2) 116 ( 55.0) 50 ( 52.6) 8 ( 40.0) 857 (46.6)
  non-Hispanic white 47 ( 29.4) 75 ( 35.5) 42 ( 44.2) 10 ( 50.0) 525 (28.5)
  Other 3 ( 1.9) 17 ( 8.1) 3 ( 3.2) 1 ( 5.0) 13 ( 0.7)
  Unknown 4 ( 2.5) 3 ( 1.4) 0 1 ( 5.0) 364 (19.8)
  NA 0 0 0 0 82 ( 4.5)
a_hsp_naaccr (%)
  Hispanic 100 ( 62.5) 114 ( 54.0) 46 ( 48.4) 8 ( 40.0) 86 ( 4.7)
  non-Hispanic white 50 ( 31.2) 74 ( 35.1) 45 ( 47.4) 10 ( 50.0) 84 ( 4.6)
  Other 4 ( 2.5) 18 ( 8.5) 2 ( 2.1) 1 ( 5.0) 14 ( 0.8)
  Unknown 6 ( 3.8) 5 ( 2.4) 2 ( 2.1) 1 ( 5.0) 3 ( 0.2)
  NA 0 0 0 0 1654 (89.8)
a_hsp_strict (%)
  Hispanic 62 ( 38.8) 68 ( 32.2) 27 ( 28.4) 6 ( 30.0) 562 (30.5)
  non-Hispanic white 29 ( 18.1) 64 ( 30.3) 35 ( 36.8) 9 ( 45.0) 53 ( 2.9)
  Other 4 ( 2.5) 12 ( 5.7) 2 ( 2.1) 1 ( 5.0) 84 ( 4.6)
  Unknown 65 ( 40.6) 67 ( 31.8) 31 ( 32.6) 4 ( 20.0) 702 (38.1)
  NA 0 0 0 0 440 (23.9)
a_tdeath (%) 8 ( 5.0) 99 ( 46.9) 30 ( 31.6) 3 ( 15.0) 305 (16.6)
a_tdiag (%) 160 (100.0) 211 (100.0) 95 (100.0) 20 (100.0) 0
a_trecur (%) 0 1 ( 0.5) 83 ( 87.4) 0 41 ( 2.2)
a_tsurg (%) 157 ( 98.1) 113 ( 53.6) 94 ( 98.9) 13 ( 65.0) 113 ( 6.1)
BMI (mean (sd)) 31.19 (8.34) 27.77 (7.26) 29.32 (7.11) 29.66 (9.92) 30.63 (9.31)
Deceased, EMR (%) 7 ( 4.4) 90 ( 42.7) 22 ( 23.2) 3 ( 15.0) 298 (16.2)
Deceased, Registry (%) 1 ( 0.6) 71 ( 33.6) 18 ( 18.9) 3 ( 15.0) 43 ( 2.3)
Deceased, SSN (%) 1 ( 0.6) 12 ( 5.7) 5 ( 5.3) 0 89 ( 4.8)
Diabetes, i2b2 (%) 56 ( 35.0) 54 ( 25.6) 27 ( 28.4) 1 ( 5.0) 585 (31.8)
Diabetes, Registry (%) 31 ( 19.4) 26 ( 12.3) 8 ( 8.4) 0 26 ( 1.4)
Hispanic, i2b2 (%) 92 ( 57.5) 96 ( 45.5) 43 ( 45.3) 7 ( 35.0) 746 (40.5)
Hispanic, Registry (%)
  Non_Hispanic 54 ( 33.8) 92 ( 43.6) 47 ( 49.5) 11 ( 55.0) 98 ( 5.3)
  Unknown 6 ( 3.8) 5 ( 2.4) 2 ( 2.1) 1 ( 5.0) 3 ( 0.2)
  Hispanic_NOS 86 ( 53.8) 96 ( 45.5) 43 ( 45.3) 8 ( 40.0) 67 ( 3.6)
  Mexican 13 ( 8.1) 17 ( 8.1) 1 ( 1.1) 0 17 ( 0.9)
  Spanish_Surname 0 1 ( 0.5) 1 ( 1.1) 0 2 ( 0.1)
  Cuban 1 ( 0.6) 0 0 0 0
  S_Ctr_America 0 0 1 ( 1.1) 0 0
  NA 0 0 0 0 1654 (89.8)
Insurance, Registry (%)
  Not Insured 17 ( 10.6) 21 ( 10.0) 7 ( 7.4) 2 ( 10.0) 17 ( 0.9)
  Self-Pay 22 ( 13.8) 21 ( 10.0) 15 ( 15.8) 0 14 ( 0.8)
  Insurance NOS 1 ( 0.6) 5 ( 2.4) 0 0 1 ( 0.1)
  Managed Care HMO / PPO 56 ( 35.0) 53 ( 25.1) 28 ( 29.5) 10 ( 50.0) 40 ( 2.2)
  Private Fee-for-Svc 0 1 ( 0.5) 0 0 0
  Medicaid 10 ( 6.2) 14 ( 6.6) 1 ( 1.1) 0 10 ( 0.5)
  Medicaid Mgd. Care Pln. 14 ( 8.8) 6 ( 2.8) 6 ( 6.3) 3 ( 15.0) 10 ( 0.5)
  Medicare/Medicaid NOS 13 ( 8.1) 30 ( 14.2) 12 ( 12.6) 1 ( 5.0) 36 ( 2.0)
  Medicare w Suppl. NOS 3 ( 1.9) 2 ( 0.9) 2 ( 2.1) 0 6 ( 0.3)
  Medicare Mgd. Care Pln. 9 ( 5.6) 16 ( 7.6) 7 ( 7.4) 3 ( 15.0) 13 ( 0.7)
  Medicare w Private Suppl. 5 ( 3.1) 22 ( 10.4) 9 ( 9.5) 0 20 ( 1.1)
  Medicare w Medicaid 3 ( 1.9) 5 ( 2.4) 2 ( 2.1) 0 7 ( 0.4)
  TriCare 3 ( 1.9) 1 ( 0.5) 0 0 4 ( 0.2)
  VA 1 ( 0.6) 7 ( 3.3) 1 ( 1.1) 0 3 ( 0.2)
  Unknown 3 ( 1.9) 7 ( 3.3) 5 ( 5.3) 1 ( 5.0) 6 ( 0.3)
  NA 0 0 0 0 1654 (89.8)
Kidney Cancer, i2b2 (%) 152 ( 95.0) 193 ( 91.5) 85 ( 89.5) 17 ( 85.0) 1729 (93.9)
Kidney Cancer, Registry (%) 156 ( 97.5) 204 ( 96.7) 87 ( 91.6) 19 ( 95.0) 20 ( 1.1)
Language, i2b2 (%)
  English 128 ( 80.0) 173 ( 82.0) 84 ( 88.4) 19 ( 95.0) 1588 (86.3)
  Spanish 31 ( 19.4) 29 ( 13.7) 7 ( 7.4) 1 ( 5.0) 213 (11.6)
  Other 0 3 ( 1.4) 0 0 4 ( 0.2)
  Unknown 1 ( 0.6) 6 ( 2.8) 4 ( 4.2) 0 36 ( 2.0)
Marital Status, Registry (%)
  Divorced 13 ( 8.1) 16 ( 7.6) 11 ( 11.6) 0 16 ( 0.9)
  Separated 8 ( 5.0) 2 ( 0.9) 1 ( 1.1) 2 ( 10.0) 6 ( 0.3)
  Married 79 ( 49.4) 125 ( 59.2) 56 ( 58.9) 7 ( 35.0) 102 ( 5.5)
  Domestic Partner 0 0 0 0 0
  Single 39 ( 24.4) 30 ( 14.2) 16 ( 16.8) 9 ( 45.0) 32 ( 1.7)
  Unknown 15 ( 9.4) 24 ( 11.4) 8 ( 8.4) 2 ( 10.0) 17 ( 0.9)
  Widowed 6 ( 3.8) 14 ( 6.6) 3 ( 3.2) 0 14 ( 0.8)
  NA 0 0 0 0 1654 (89.8)
n_cstatus (%)
  Tumor_Free 160 (100.0) 1 ( 0.5) 7 ( 7.4) 0 58 ( 3.2)
  Tumor 0 210 ( 99.5) 81 ( 85.3) 0 114 ( 6.2)
  Unknown 0 0 7 ( 7.4) 20 (100.0) 15 ( 0.8)
  NA 0 0 0 0 1654 (89.8)
Race, i2b2 (%)
  White 149 ( 93.1) 185 ( 87.7) 87 ( 91.6) 19 ( 95.0) 1566 (85.1)
  Black 3 ( 1.9) 10 ( 4.7) 3 ( 3.2) 1 ( 5.0) 95 ( 5.2)
  Asian 3 ( 1.9) 6 ( 2.8) 0 0 13 ( 0.7)
  Pac Islander 0 0 0 0 1 ( 0.1)
  Other 0 3 ( 1.4) 0 0 46 ( 2.5)
  Unknown 5 ( 3.1) 7 ( 3.3) 5 ( 5.3) 0 120 ( 6.5)
Race, Registry (%)
  White 153 ( 95.6) 188 ( 89.1) 91 ( 95.8) 18 ( 90.0) 170 ( 9.2)
  Black 3 ( 1.9) 10 ( 4.7) 2 ( 2.1) 1 ( 5.0) 11 ( 0.6)
  Asian 1 ( 0.6) 3 ( 1.4) 0 0 2 ( 0.1)
  Pac Islander 0 1 ( 0.5) 0 0 0
  Other 0 4 ( 1.9) 0 0 0
  Unknown 3 ( 1.9) 5 ( 2.4) 2 ( 2.1) 1 ( 5.0) 4 ( 0.2)
  NA 0 0 0 0 1654 (89.8)
Sex, i2b2 (%)
  m 100 ( 62.5) 151 ( 71.6) 63 ( 66.3) 13 ( 65.0) 1047 (56.9)
  f 60 ( 37.5) 60 ( 28.4) 32 ( 33.7) 7 ( 35.0) 793 (43.1)
  u 0 0 0 0 1 ( 0.1)
Sex, Registry (%)
  m 98 ( 61.3) 149 ( 70.6) 63 ( 66.3) 13 ( 65.0) 106 ( 5.8)
  f 62 ( 38.8) 62 ( 29.4) 32 ( 33.7) 7 ( 35.0) 81 ( 4.4)
  NA 0 0 0 0 1654 (89.8)
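The column grouping used in table I follows directly from the 1880 Recurrence Type--1st codes described in the caption. The Python sketch below shows that mapping; the function name is hypothetical, and representing a patient without a qualifying NAACCR kidney record as `None` is a simplifying assumption.

```python
def recurrence_status(rectype_code):
    """Map an 1880 Recurrence Type--1st code to the table I column label.

    rectype_code is a two-character string, or None (an assumption) for
    patients without a qualifying NAACCR kidney record.
    """
    if rectype_code is None:
        return "Not in NAACCR"
    if rectype_code == "00":
        return "Disease-free"
    if rectype_code == "70":
        return "Never disease-free"
    if rectype_code == "99":
        return "Unknown if recurred or was ever gone"
    # Any other code counts as a recurrence
    return "Recurred"

# Column sizes from the "n" row of table I
column_n = {
    "Disease-free": 160,
    "Never disease-free": 211,
    "Recurred": 95,
    "Unknown if recurred or was ever gone": 20,
    "Not in NAACCR": 1841,
}
total_patients = sum(column_n.values())
```

The column sizes add up to the 2327 patients returned by the first-pass query, and the four NAACCR columns add up to the 486 second-pass patients.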

5 Conclusion and next steps

This detailed investigation of the available data elements and development of analysis scripts opens four priority directions: more data, external data, more covariates, and improved pre-processing at the i2b2 end (Aim 1).

More data can be acquired by reclaiming values that are currently inconsistent or missing. There are various ad-hoc consistency checks described in Appendix 3.1, Appendix 3.2.1, and Appendix 3.2.2. I need to gather these checks in one place and systematically run them on every patient to get a total count of records that need manual chart review (Dr. Rodriguez’s protocol) and, for each record, a list of issues to resolve.
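One possible shape for such a consolidated check is sketched below in Python. It covers only the three date-order checks named in sec. 2.3 (EMR kidney cancer, secondary tumor, and nephrectomy codes preceding the corresponding NAACCR dates). All field names are hypothetical, and this is not the planned implementation.

```python
# (EMR field, NAACCR field, issue description); all field names are hypothetical
DATE_ORDER_CHECKS = [
    ("emr_first_kidney_code", "naaccr_diagnosis",
     "EMR kidney cancer code precedes NAACCR diagnosis"),
    ("emr_first_secondary_tumor", "naaccr_recurrence",
     "EMR secondary tumor code precedes NAACCR recurrence"),
    ("emr_first_nephrectomy", "naaccr_surgery",
     "EMR nephrectomy code precedes NAACCR surgery"),
]

def find_issues(record):
    """Return the list of issues for one patient record (a dict of dates)."""
    issues = []
    for emr_key, naaccr_key, message in DATE_ORDER_CHECKS:
        emr_date = record.get(emr_key)
        naaccr_date = record.get(naaccr_key)
        # Missing dates are not discrepancies; only compare when both exist
        if emr_date is not None and naaccr_date is not None and emr_date < naaccr_date:
            issues.append(message)
    return issues

def review_queue(records):
    """Map patient_num -> issues, keeping only patients needing chart review."""
    queue = {}
    for patient_num, record in records.items():
        issues = find_issues(record)
        if issues:
            queue[patient_num] = issues
    return queue
```

The length of the returned dictionary gives the total count of records needing review, and each value is the per-record list of issues to resolve.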

To reclaim missing values I will need to solve the problem of lag and disagreement between the EMR and NAACCR (sec. 2.3). I will meet with the MCC NAACCR registrar and learn where exactly in the EMR and other sources she looks to abstract 1880 Recurrence Type--1st, 3170 RX Date--Most Defin Surg, 1340 Reason for No Surgery, 0390 Date of Diagnosis, 1200 RX Date--Surgery, 1750 Date of Last Contact, 1760 Vital Status, 1770 Cancer Status, 1860 Recurrence Date--1st, Kidney and Renal Pelvis, and Kidney, NOS. I will also meet with personnel experienced in Urology chart review to learn their methods. This may lead to improvements in the CIRD ETL process. I also plan on adding all ICD codes for ‘renal mass’ [4] to my i2b2 query (Appendix 3.2.1). Meanwhile, in response to researcher questions including my own, CIRD staff have identified thousands of NAACCR entries and surgery billing records that got excluded from i2b2 because they are not associated with visits to UT Health clinics. After the next i2b2 refresh we expect an increased number of patients and possibly improved agreement of event dates between EMR and NAACCR.

For external data I will request non-aggregated limited/deidentified records from the Texas Cancer Registry. I will also look at the NCDB dataset obtained by Urology to see if it has the elements listed in sec. 2.2.

In the remainder of Aim 2 and Aim 3 I will need the following additional variables: (NAACCR only) stage and grade; (EMR only) analgesics, smoking and alcohol, family history of cancer or diabetes, lab results, vital signs, Miperamine (as per Dr. Michalek), frequency of lab and image orders, frequency and duration of visits, and participation in adjuvant trials; (both) birthplace, language, and diabetes; and (census data in i2b2) income and education. Each of these will require a workup similar to that reported in sec. 2 and Appendix 3. I can work independently on many of these but I will need guidance from experts in Urology on interpreting the stage and grade data. If genomic data from the Urology biorepository becomes available for these patients in the course of this study it also will become an important variable for Aim 2.

The use of TCR or NCDB data is not a substitute for UT Health and MGH i2b2 data. The registries allow me to test the replicability of high-level findings to State and National populations but they will not have the detailed additional variables I will need to investigate the causes of disparate patient outcomes.

Nor are the R scripts I wrote for this project a substitute for the DataFinisher [5] development planned for Aim 1. On the contrary, the reason I was able to make this much progress in one month is that the data linkage and de-identification was done by the CIRD i2b2 ETL, the data selection was simplified by the i2b2 web client, and an enormous amount of post-processing was done by my DataFinisher app that is integrated into our local i2b2. During the work I present here I found several additional post-processing steps that generalize to other studies, and I will integrate those into DataFinisher so that the data it outputs is even more analysis-ready. This will, in turn, simplify the logistics of Aim 3.

While I am incorporating the new methods into DataFinisher, I will also reorganize and document the code so I can present it to Dr. Murphy and his informatics team for review and input.

6 References

 

1. Pinheiro, P. S. et al. High cancer mortality for US-born Latinos: Evidence from California and Texas. BMC Cancer 17, (2017).

2. Murphy, S. et al. Instrumenting the health care enterprise for discovery research in the genomic era. Genome Research 19, 1675–1681 (2009).

3. Adagarla, B. et al. SEINE: Methods for Electronic Data Capture and Integrated Data Repository Synthesis with Patient Registry Use Cases. (2015).

4. Rodriguez, R. personal communication (2018).

5. Bokov, A., Manuel, L., Cheng, C., Bos, A. & Tirado-Ramos, A. Denormalize and Delimit: How not to Make Data Extraction for Analysis More Complex than Necessary. Procedia Computer Science 80, 1033–1041 (2016).

 

Appendix 1 Example of stage/grade data

Need to tabulate the frequencies of various combinations of TNM values

Appendix 1.1 Observations about NAACCR staging

3400 Derived AJCC-7 T, 3410 Derived AJCC-7 N, 3420 Derived AJCC-7 M, 2940 Derived AJCC-6 T, 2960 Derived AJCC-6 N, and 2980 Derived AJCC-6 M are missing if and only if 3402 Derived AJCC-7 T Descript, 3412 Derived AJCC-7 N Descript, 3422 Derived AJCC-7 M Descript, 2950 Derived AJCC-6 T Descript, 2970 Derived AJCC-6 N Descript, and 2990 Derived AJCC-6 M Descript are also missing, respectively. For the tables in this section, the counts are by visit rather than by unique patient since the question of interest is how often do the stages assigned to the same case agree with each other. Each of the tables shows the 20 most common combinations of values.

Table II: Frequency of various combinations of 3430 Derived AJCC-7 Stage Grp, 3000 Derived AJCC-6 Stage Grp, 0970 TNM Clin Stage Group, and 0910 TNM Path Stage Group
3430 Derived AJCC-7 Stage Grp 3000 Derived AJCC-6 Stage Grp 0970 TNM Clin Stage Group 0910 TNM Path Stage Group N
- - - - 3810
IV IV 99 99 65
III III 99 3 57
- - 88 88 57
I I 99 99 56
UNK UNK 99 99 55
I I 99 1 43
IV IV 4 99 42
III III 99 99 23
- UNK 99 99 23
IV IV 99 4 22
- - 99 99 17
II II 99 2 13
II II 99 99 13
IV IV 4 4 12
- I 99 1 9
IV IV 99 3 8
I I 1 99 7
- - 4 99 6
- I 99 99 6
Table III: Frequency of various combinations of 3400 Derived AJCC-7 T, 2940 Derived AJCC-6 T, 0940 TNM Clin T, and 0880 TNM Path T
3400 Derived AJCC-7 T 2940 Derived AJCC-6 T 0940 TNM Clin T 0880 TNM Path T N
- - - - 3824
N- N- 88 88 64
cX cX - - 50
p3a p3b - 3A 33
p1a p1a - 1A 30
p1b p1b - 1B 24
p3a p3a - 3A 21
c1a c1a - - 20
pX pX - - 14
- pX - - 13
c4 c4 - - 12
p3b p3b - 3B 10
c1 c1 - - 10
p3 p3 - 3 9
p2a p2 - 2A 8
p3a p3a - 3 8
p1a p1a - - 8
c1b c1b - - 6
c3a c3b - - 6
cX cX X X 5
Table IV: Frequency of various combinations of 3410 Derived AJCC-7 N, 2960 Derived AJCC-6 N, 0950 TNM Clin N, and 0890 TNM Path N
3410 Derived AJCC-7 N 2960 Derived AJCC-6 N 0950 TNM Clin N 0890 TNM Path N N
- - - - 3825
c0 c0 - - 130
N- N- 88 88 64
p0 p0 - 0 54
cX cX - - 46
c0 c0 - X 44
c0 c0 - 0 31
c1 c1 - - 29
cX cX - X 25
- c0 - - 21
- cX - - 16
c0 c0 X X 15
c0 c0 0 - 15
- c0 - 0 14
p1 p1 - 1 8
c0 c0 c0 - 8
c0 c0 c0 c0 7
c0 c0 - pX 7
c0 c0 0 X 7
y0 y0 - 0 5
Table V: Frequency of various combinations of 3420 Derived AJCC-7 M, 2980 Derived AJCC-6 M, 0960 TNM Clin M, and 0900 TNM Path M
3420 Derived AJCC-7 M 2980 Derived AJCC-6 M 0960 TNM Clin M 0900 TNM Path M N
- - - - 3827
c0 c0 - - 310
c1 c1 - - 67
N- N- 88 88 64
- c0 - - 50
c0 c0 0 - 36
c0 c0 c0 c0 24
c1 c1 1 - 24
p1 p1 - - 13
c0 c0 c0 - 9
c0 cX - - 9
c1 c1 - 1 8
p1 p1 - 1 8
c0 c0 - c0 7
- c0 - 0 6
- - c0 - 6
c1 c1 c1 - 6
- - c0 c0 5
- c0 0 - 5
- - c1 - 5

In tables II, III, IV, and V, when both the AJCC-7 and AJCC-6 values are non-missing they agree with each other 92.4%, 77.3%, 94.3%, and 94.7% of the time for stage group, T, N, and M respectively. There are 31.6%, 22.9%, 22.8%, and 22.6% AJCC-7 values missing, but 6.9%, 10.3%, 10.2%, and 10.3% can be filled in from AJCC-6 for stage group, T, N, and M respectively.
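The two quantities reported above (agreement when both editions are present, and AJCC-7 values that can be filled in from AJCC-6) can be computed as in the Python sketch below. It is an illustration, not the code behind the reported percentages. It assumes that `-` denotes a missing value and that codes such as `pX` or `UNK` count as non-missing, and it uses only five rows copied from table III, so its output does not reproduce the percentages for the full data.

```python
MISSING = {"-", None}  # assumption: "-" marks a missing value in the tables

def fill_from_ajcc6(ajcc7, ajcc6):
    """Use the AJCC-6 value only when AJCC-7 is missing and AJCC-6 is not."""
    if ajcc7 in MISSING and ajcc6 not in MISSING:
        return ajcc6
    return ajcc7

def summarize(rows):
    """rows: iterable of (ajcc7, ajcc6, count).

    Returns (both_present, agreeing, fillable) as record counts.
    """
    both_present = agreeing = fillable = 0
    for ajcc7, ajcc6, n in rows:
        if ajcc7 not in MISSING and ajcc6 not in MISSING:
            both_present += n
            if ajcc7 == ajcc6:
                agreeing += n
        elif ajcc7 in MISSING and ajcc6 not in MISSING:
            fillable += n
    return both_present, agreeing, fillable

# Five rows copied from table III (3400 Derived AJCC-7 T, 2940 Derived AJCC-6 T, N)
subset_of_table_iii = [
    ("-", "-", 3824),
    ("p3a", "p3b", 33),
    ("p1a", "p1a", 30),
    ("-", "pX", 13),
    ("p2a", "p2", 8),
]
both_present, agreeing, fillable = summarize(subset_of_table_iii)
```

Whether a value such as `pX` should really be used to fill in a missing AJCC-7 T is a judgment call that this sketch does not settle.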

Table VI: This is proof of feasibility for extracting stage and grade at diagnosis for each NAACCR patient for import into the EMR system (e.g. Epic/Beacon). Clinical and pathology stage descriptors are also available in NAACCR. Here the patient_num and start_date are de-identified but with proper authorization they can be mapped to MRNs or internal database index keys.
patient_num start_date 3400 Derived AJCC-7 T 3410 Derived AJCC-7 N 3420 Derived AJCC-7 M 3430 Derived AJCC-7 Stage Grp
350 2014-05-10 X 0 0 UNK
3442 2014-09-17 is 0 0 0
3442 2015-03-01 1a 0 0 I
9006 2009-09-02 1b 0 0 I
9006 2009-11-18 1b 0 0 I
18576 2011-08-03 1a 0 0 I
18584 2011-06-04 3a 0 0 III
19421 2011-05-12 1b 0 0 I
35354 2010-04-02 3 2NOS 0 IIINOS
35354 2010-04-10 1a 0 0 I
41377 2012-01-05 3a 0 0 III
43065 2013-06-06 3c 1 1 IV
62619 2010-04-17 X 0 0 UNK
89902 2010-01-17 3a 0 0 III
93443 2012-08-21 X 1a 0 UNK
93443 2012-09-09 1a 0 0 I
97742 2010-11-02 3a 0 1 IV
111335 2013-01-19 1 0 0 I
114314 2015-10-27 3b 0 0 III
117341 2011-03-04 X X 0 UNK

 

Appendix 2 Next steps

All the TODO items are now tracked on GitHub as well as linked from their respective yellow-highlighted text throughout the document.

 

Appendix 3 Supplementary results

Appendix 3.1 Consistency checks

This section contains patient counts for all 2327 patients in the overall set, broken down by various NAACCR variables (rows) and equivalent EMR variables (columns). The bold values are counts of patients for whom NAACCR and EMR are in agreement. Patients in the NA row are the ones with only EMR and no NAACCR records, so they count as missing rather than discrepant.

Table VII: Marital status has good agreement between NAACCR and EMR.
                   (blank)  divorced  legally sepa  married  other  significant  single  unknown  widowed   Sum
Divorced                 0        47             0        2      0            0       5        2        0    56
Separated                0         0            15        3      0            0       1        0        0    19
Married                  0         5             3      336      0            0      13        5        7   369
Domestic Partner         0         0             0        0      0            0       0        0        0     0
Single                   0         1             2        3      0            0     119        0        1   126
Unknown                  0         3             0        8      0            0      32       22        1    66
Widowed                  0         0             0        1      0            0       1        0       35    37
NA                       1       150            35      887      1            2     423       66       89  1654
Sum                      1       206            55     1240      1            2     594       95      133  2327
Table VIII: Sex has good agreement between NAACCR and EMR.
  m f u Sum
m 428 1 0 429
f 9 235 0 244
NA 937 716 1 1654
Sum 1374 952 1 2327
Table IX: Race has good agreement between NAACCR and EMR.
  White Black Asian Pac Islander Other Unknown Sum
White 591 2 2 0 2 23 620
Black 1 26 0 0 0 0 27
Asian 0 0 6 0 0 0 6
Pac Islander 0 0 1 0 0 0 1
Other 1 0 2 0 1 0 4
Unknown 13 1 0 0 0 1 15
NA 1400 83 11 1 46 113 1654
Sum 2006 112 22 1 49 137 2327
Table X: Hispanic designation has good agreement between NAACCR and EMR. Here the 0190 Spanish/Hispanic Origin variable was simplified by binning into Hispanic and non-Hispanic.
  Non_Hispanic Hispanic Sum
Non_Hispanic 304 15 319
Hispanic 56 298 354
NA 983 671 1654
Sum 1343 984 2327
Table XI: As table X but with all the different levels of 0190 Spanish/Hispanic Origin shown.
  Non_Hispanic Hispanic Sum
Non_Hispanic 291 11 302
Unknown 13 4 17
Hispanic_NOS 44 256 300
Mexican 9 39 48
Spanish_Surname 2 2 4
Cuban 1 0 1
S_Ctr_America 0 1 1
NA 983 671 1654
Sum 1343 984 2327
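The binning that turns table XI into table X can be reproduced from the counts shown. The Python sketch below is illustrative and uses hypothetical names. Note that the table margins only add up if the Unknown level is grouped with Non_Hispanic; that grouping is inferred from the arithmetic rather than stated in the text.

```python
# Table XI rows (NAACCR level) -> (EMR Non_Hispanic count, EMR Hispanic count)
table_xi = {
    "Non_Hispanic": (291, 11),
    "Unknown": (13, 4),
    "Hispanic_NOS": (44, 256),
    "Mexican": (9, 39),
    "Spanish_Surname": (2, 2),
    "Cuban": (1, 0),
    "S_Ctr_America": (0, 1),
}

# Assumption inferred from the table sums: Unknown is binned as Non_Hispanic
NON_HISPANIC_LEVELS = {"Non_Hispanic", "Unknown"}

def bin_hispanic(table):
    """Collapse the detailed NAACCR levels into Non_Hispanic and Hispanic."""
    binned = {"Non_Hispanic": [0, 0], "Hispanic": [0, 0]}
    for level, (emr_non_hispanic, emr_hispanic) in table.items():
        key = "Non_Hispanic" if level in NON_HISPANIC_LEVELS else "Hispanic"
        binned[key][0] += emr_non_hispanic
        binned[key][1] += emr_hispanic
    return {key: tuple(counts) for key, counts in binned.items()}

table_x = bin_hispanic(table_xi)
```

The result is (304, 15) for Non_Hispanic and (56, 298) for Hispanic, matching the first two rows of table X.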
Table XII: Below is a summary of birth_date - 0240 Date of Birth (in years) for the patients with non-matching dates of birth mentioned in sec. 2.1. Though there are only 15 of them, those few deviate by multiple years from the EMR records.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-12 -6.5 -3.162 -3.186 -0.7064 9.999

The tables of patients with discrepant birthdates have been removed because they only apply to 15 patients and are mostly empty. They can still be viewed in the 181009 archival version of this document for marital, sex, race, hisp, and surg.


Appendix 3.2 Which EMR and NAACCR variables are reliable event indicators?

For each of the main event variables Diagnosis, Surgery, Recurrence, and Death / 1760 Vital Status there were multiple candidate data elements in the raw data. If such a family of elements is in good agreement overall then individual missing dates can be filled in with the earliest non-missing dates from other data elements in that family (except for mortality where the latest non-missing date would make more sense). But to do this I needed not only to establish qualitative agreement as I did for demographic variables in sec. 2.1 and Appendix 3.1 but also determine how often these dates lag or lead each other and by how much. The plots in this section use the y-axis to represent time for patient records arranged along the x-axis. They are arranged in an order that varies from one plot to another, chosen for visual interpretability. Each vertical slice of a plot represents one patient’s history, with different colors representing events as documented by different data elements. The goal is to see the frequency, magnitude, and direction of divergence for several variables at the same time.
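The fill-in rule described above (earliest non-missing date within a family of elements, but latest non-missing date for mortality) can be sketched as follows. This is illustrative Python with a hypothetical function name, and it presumes the family of elements has already been judged to be in good agreement.

```python
from datetime import date

def combine_dates(candidates, use_latest=False):
    """Combine candidate dates for one event from several data elements.

    candidates -- dates from the data elements in one family (None = missing)
    use_latest -- False picks the earliest date (diagnosis, surgery,
                  recurrence); True picks the latest (mortality)
    Returns None when every candidate is missing.
    """
    present = [d for d in candidates if d is not None]
    if not present:
        return None
    return max(present) if use_latest else min(present)

# Earliest of two non-missing candidates, ignoring the missing one
first_event = combine_dates([None, date(2011, 3, 4), date(2011, 2, 1)])
```

Here `first_event` is 2011-02-01; calling the function with `use_latest=True` would return 2011-03-04 instead.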

Appendix 3.2.1 Initial diagnosis

At this time only 0390 Date of Diagnosis is usable for calculating Diagnosis. Initially 0580 Date of 1st Contact was considered as an additional NAACCR source along with the earliest EMR records of 189.0 Malignant neoplasm of kidney, except pelvis and C64 Malignant neoplasm of kidney, except renal pelvis. 0443 Date Conclusive DX is never used by our NAACCR. All other NAACCR data elements containing the word ‘date’ seem to be retired or related to events after initial diagnosis. 0580 Date of 1st Contact was disqualified because it never precedes 0390 Date of Diagnosis but often trails behind 1200 RX Date--Surgery, see fig. 11. I will need to consult with a NAACCR registrar about what 0580 Date of 1st Contact actually means, but it does not appear to be a first visit or a first diagnosis. As can be seen in fig. 5 and table XIII, the first ICD9 or ICD10 code most often occurs after initial diagnosis, sometimes before the date of diagnosis, and coinciding with the date of diagnosis rarest of all. Several of the ICD9/10 first observed dates lead or trail the 0390 Date of Diagnosis by multiple years.


Figure 5: Here is a plot centered on 0390 Date of Diagnosis (blue horizontal line at 0) with black lines indicating ICD10 codes for primary kidney cancer from the EMR and dashed red lines indicating ICD9 codes. The dashed horizontal blue lines indicate ±3 months from 0390 Date of Diagnosis.

Table XIII: For patients with NAACCR records, how often do ICD9 or ICD10 codes for kidney cancer in the EMR lead or trail 0390 Date of Diagnosis and by how much?
  before +/- 2 weeks after NA Sum
before 29 2 15 1 47
+/- 2 weeks 0 38 34 1 73
after 0 1 316 3 320
NA 0 0 7 39 46
Sum 29 41 372 44 486

For most patients (291), the first EMR code is recorded within 3 months of first diagnosis as recorded by NAACCR. Of those with a larger time difference, the majority (143) have their first EMR code after 0390 Date of Diagnosis. Only 13 patients have ICD9/10 diagnoses that precede their 0390 Date of Diagnosis by more than 3 months. An additional 54 patients have first EMR diagnoses that precede 0390 Date of Diagnosis by less than three months. These might need to be eliminated from the sample on the grounds of not being first occurrences of kidney cancer. However, we cannot back-fill missing NAACCR records or NAACCR records lacking a diagnosis date because the two sources disagree too often, and the EMR records are currently biased toward later dates.
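The lead/lag binning used in table XIII can be sketched as follows. This is only a minimal Python illustration, not the code that produced the report; the function name `classify_lag`, the inclusive 14-day window, and the example dates are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def classify_lag(reference, other, window_days=14):
    """Bin `other` relative to `reference` as 'before', '+/- 2 weeks', 'after', or 'NA'."""
    if reference is None or other is None:
        return "NA"
    delta = (other - reference).days
    if delta < -window_days:
        return "before"
    if delta > window_days:
        return "after"
    # Assumption: the window boundaries themselves count as agreement
    return "+/- 2 weeks"


# Hypothetical (date of diagnosis, first EMR code) pairs, one per patient
records = [
    (date(2015, 3, 1), date(2015, 3, 5)),    # within two weeks
    (date(2015, 3, 1), date(2016, 1, 10)),   # EMR code trails diagnosis
    (date(2015, 3, 1), date(2014, 11, 2)),   # EMR code leads diagnosis
    (date(2015, 3, 1), None),                # no EMR code at all
]
lag_counts = Counter(classify_lag(dx, emr) for dx, emr in records)
print(lag_counts)
```

Applying the same function once to the first ICD9 date and once to the first ICD10 date, and cross-tabulating the two results, would give a table with the same row and column labels as table XIII.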

I will need to meet with the MCC NAACCR registrar to see how she obtains her dates of initial diagnosis and I will need to do a chart review of a sample of NAACCR patients to understand what information visible in Epic sets them apart from kidney cancer patients without NAACCR records. I will also need to do a chart review of the patients with ICD9/10 codes for kidney cancer that seemingly pre-date their 0390 Date of Diagnosis. There are 75 patients with multiple NAACCR records. I will need to learn how NAACCR distinguishes their first occurrences and see if restricting the NAACCR data to just first occurrences will diminish the number of EMR diagnoses preceding those in NAACCR. It will also be helpful to learn whether there is anything in the EMR that distinguishes first kidney cancer occurrences besides lack of previous diagnosis.

Appendix 3.2.2 Surgery

To construct the Surgery analytic variable I considered 1200 RX Date--Surgery, 1260 Date of Initial RX--SEER, 1270 Date of 1st Crs RX--CoC, and 3170 RX Date--Most Defin Surg from NAACCR as well as earliest occurrences of V45.73 Acquired absence of kidney, Z90.5 Acquired absence of kidney, or HX NEPHRECTOMY from the EMR. In the plots and tables below I show why I decided to use 3170 RX Date--Most Defin Surg as the surgery date and when that is unavailable, to fall back on 1200 RX Date--Surgery. The other data elements are not used except to flag potentially incorrect records if they occur earlier than the date of diagnosis.
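The fallback rule described above can be sketched in a few lines. This is a hedged Python illustration rather than the project's actual code: the keys `n_rx3170`, `n_dsurg`, `n_rx1260`, `n_rx1270`, and `n_ddiag` are the short names from the glossary, while the function names and the example patient are hypothetical.

```python
from datetime import date


def surgery_date(rx3170, rx1200):
    """Prefer 3170 RX Date--Most Defin Surg; fall back on 1200 RX Date--Surgery."""
    return rx3170 if rx3170 is not None else rx1200


def precedes_diagnosis(diagnosis, candidate_dates):
    """True if any non-missing candidate date is earlier than the date of diagnosis."""
    if diagnosis is None:
        return False
    return any(d is not None and d < diagnosis for d in candidate_dates)


# Hypothetical patient record keyed by the glossary short names
patient = {
    "n_ddiag": date(2014, 6, 1),
    "n_rx3170": None,                 # most definitive surgery date missing
    "n_dsurg": date(2014, 6, 20),
    "n_rx1260": date(2014, 5, 15),    # earlier than diagnosis, so suspect
    "n_rx1270": None,
}
a_tsurg = surgery_date(patient["n_rx3170"], patient["n_dsurg"])
suspect = precedes_diagnosis(
    patient["n_ddiag"], [patient["n_rx1260"], patient["n_rx1270"]]
)
print(a_tsurg, suspect)
```

Keeping the flag separate from the surgery date means the unused elements never alter the analytic variable; they only mark records for review.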


Figure 6: Above is a plot of all patients sorted by 1200 RX Date--Surgery (black line). On the same axis is 3170 RX Date--Most Defin Surg (red line), which is almost identical to 1200 RX Date--Surgery except for a small number of cases where it occurs later than 1200 RX Date--Surgery. It never occurs earlier. The violet lines indicate for each patient the earliest EMR code implying that a surgery had taken place (acquired absence of kidney ICD V/Z codes or surgical history of nephrectomy). The blue horizontal line is 0390 Date of Diagnosis with the dashed lines representing a 3-month window in both directions.


Figure 7: In the above plot the 1270 Date of 1st Crs RX--CoC (green) and 1260 Date of Initial RX--SEER (cyan) events are superimposed on time until 1200 RX Date--Surgery, as in fig. 6 (but violet lines for nephrectomy EMR codes are omitted for readability). The 1270 Date of 1st Crs RX--CoC and 1260 Date of Initial RX--SEER variables trend earlier than 1200 RX Date--Surgery.


In fig. 6 the 5 patients for whom the earliest EMR nephrectomy code occurs before the earliest possible NAACCR record of surgery are highlighted in yellow. Among the remaining 181 patients who have an EMR code for nephrectomy, there are 129 for whom it happens more than 3 months after 1200 RX Date--Surgery, and those lags have a median of 14.3 months. This level of discrepancy disqualifies V45.73 Acquired absence of kidney, Z90.5 Acquired absence of kidney, and HX NEPHRECTOMY from being used to fill in missing NAACCR dates. This may change after the next i2b2 update in which the fix to the “visit-less patient” problem will be implemented (sec. 5).


Figure 8: Above is a plot equivalent to fig. 7 but for patients who do not have a 1340 Reason for No Surgery code equal to Surgery Performed. There are many 1270 Date of 1st Crs RX--CoC and 1260 Date of Initial RX--SEER events but only a small number of 1200 RX Date--Surgery (black) and 3170 RX Date--Most Defin Surg (red). The 1200 RX Date--Surgery and 3170 RX Date--Most Defin Surg that do occur track each other perfectly. Together with the NAACCR data dictionary’s description, this suggests that 3170 RX Date--Most Defin Surg is the correct principal surgery date, in close agreement with 1200 RX Date--Surgery, so perhaps missing 3170 RX Date--Most Defin Surg values can be filled from 1200 RX Date--Surgery. However, 1270 Date of 1st Crs RX--CoC and 1260 Date of Initial RX--SEER seem to be non-primary surgeries or other events and cannot be used to fill in missing values.

Table XIV: As can be seen in the table below, the variables V45.73 Acquired absence of kidney, HX NEPHRECTOMY, Surgical Oncology, and Z90.5 Acquired absence of kidney sometimes precede 0390 Date of Diagnosis by many weeks but they usually follow 0390 Date of Diagnosis by more weeks than do 3180 RX Date--Surgical Disch and 1200 RX Date--Surgery. Those two NAACCR variables never occur before 0390 Date of Diagnosis and usually occur within 2-8 weeks after it. This is another way of summarizing how much the EMR variables lag behind NAACCR variables.
  Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
3170 RX Date--Most Defin Surg 0 0 3 8.461 9.643 215.1 119
1270 Date of 1st Crs RX--CoC 0 0 2.929 6.431 6.964 318.3 28
1260 Date of Initial RX--SEER 0 0 3.857 8.213 8.571 270.9 198
1200 RX Date--Surgery 0 0 2.857 7.83 9 215.1 109
V45.73 Acquired absence of kidney -361.1 8.143 31.43 69.5 82.71 957.4 261
HX NEPHRECTOMY -91.86 10.11 37.07 77.85 93.96 758.1 318
Surgical Oncology -194.9 0.2143 4.714 23.58 46 236.6 455
Z90.5 Acquired absence of kidney -20.14 9.607 37.86 85.12 111.2 957.4 226
1860 Recurrence Date--1st 0 40.04 73.71 137.2 205.3 935.9 402

It makes sense that the Epic EMR lags behind NAACCR. As an outpatient system, it’s probably recording visits after the original surgery, and perhaps we are not yet importing the right elements from Sunrise EMR. In sec. 5 I outline possible remedies to that. For now, V45.73 Acquired absence of kidney, HX NEPHRECTOMY, Surgical Oncology, and Z90.5 Acquired absence of kidney can still be used to exclude cases as not first-time occurrences if they precede diagnosis. Would I lose a lot of cases to such a criterion?

Table XV: How often ICD9/10 or surgical history codes for nephrectomy precede diagnosis and by how much
  before same-day after NA
3170 RX Date--Most Defin Surg 0 138 229 119
1270 Date of 1st Crs RX--CoC 0 149 309 28
1260 Date of Initial RX--SEER 0 83 205 198
1200 RX Date--Surgery 0 146 231 109
V45.73 Acquired absence of kidney 3 0 222 261
HX NEPHRECTOMY 3 2 163 318
Surgical Oncology 7 1 23 455
Z90.5 Acquired absence of kidney 1 0 259 226

Only a small number of cases would be disqualified. Another important question is the level of agreement between 1340 Reason for No Surgery and the NAACCR data elements that are candidates for comprising the surgery variable.

Table XVI: Every NAACCR candidate data element (columns) tabulated against 1340 Reason for No Surgery (rows). The bold cells are ones consistent with their respective data elements indicating the primary surgery. The second row is italicized because surgery may still occur as a non-primary course of treatment. Nevertheless the counts in the FALSE columns should be greater than the counts in the TRUE columns for every row except the first. 3170 RX Date--Most Defin Surg and 1200 RX Date--Surgery are in close agreement with each other and have the fewest deviations from expected behavior of a primary surgery data element
  n_rx3170 = FALSE n_rx3170 = TRUE n_rx1270 = FALSE n_rx1270 = TRUE n_rx1260 = FALSE n_rx1260 = TRUE n_dsurg = FALSE n_dsurg = TRUE
Surgery Performed 15 457 13 459 170 302 14 458
Surgery Not First Course 136 10 20 126 82 64 122 24
No Surgery, Contra Indicated 17 1 3 15 10 8 16 2
No Surgery, Deceased 4 0 1 3 2 2 4 0
No Surgery, No Reason Given 5 0 2 3 2 3 5 0
No Surgery, Refused 5 3 2 6 4 4 4 4
Unknown Whether Surgery Done 16 1 11 6 13 4 15 2
Unknown Whether Surgery Recommended or Done 3 0 2 1 2 1 3 0

In summary, based on fig. 6 and table XIII, V45.73 Acquired absence of kidney, HX NEPHRECTOMY, Surgical Oncology, and Z90.5 Acquired absence of kidney can only be used to disqualify patients for having erroneous records or previous history of kidney cancer but cannot fill in missing diagnosis dates. Based on figs. 7, 8, and table XVII, 1270 Date of 1st Crs RX--CoC and 1260 Date of Initial RX--SEER are not necessarily always surgery events. This leaves 3170 RX Date--Most Defin Surg with 0390 Date of Diagnosis as a fallback. When I meet with the NAACCR registrar I will seek their feedback about this approach and I will ask them about the most reliable way to identify the first kidney cancer occurrence for a patient if they have several (overlapping?) NAACCR entries. I also need to ask a chart abstraction expert about the best way to find in Epic and in Sunrise the date of a patient’s first nephrectomy.

Appendix 3.2.3 Re-occurrence

Candidate data elements for constructing the Recurrence variable were 1770 Cancer Status, 1880 Recurrence Type--1st, and 1860 Recurrence Date--1st from NAACCR. Our site is on NAACCR v16, not v18, so we do not have 1772 Date of Last Cancer Status. According to the v16 standard, 1750 Date of Last Contact should be used instead. From the EMR the candidates were 14 ICD9/10 codes for secondary tumors. In table XVII I reconcile 1770 Cancer Status and 1880 Recurrence Type--1st.

Table XVII: 1770 Cancer Status is in good agreement with 1880 Recurrence Type--1st. Almost all 1770 Cancer Status Tumor_Free patients also have Disease-free in their 1880 Recurrence Type--1st column, the Tumor ones have a variety of values, and the Unknown ones are mostly Unknown if recurred or was ever gone.
  Tumor_Free Tumor Unknown
Disease-free 201 0 0
In situ invasive 0 2 0
In situ original 0 3 0
Local, insufficient info 1 8 0
Local invasive 2 15 0
Regional, insufficient info 0 3 1
Invasive adjacent tissue only 0 3 0
Invasive regional lymph nodes only 0 3 0
Invasive adjacent tissue and regional lymph nodes 0 2 0
Regional in situ, NOS 0 1 0
Multiple true for invasive tumor 0 2 0
Distant, insufficient info 1 16 0
Distant invasive lung only 1 22 1
Distant invasive pleura only 0 1 0
Distant invasive liver only 0 3 0
Distant invasive bone only 1 7 0
Distant invasive CNS only 0 5 0
Distant invasive lymph node only 0 3 0
Distant invasive single site and local/trocar/regional 0 4 0
Distant invasive multiple sites 1 4 0
Never disease-free 0 246 0
Recurred but no other info 0 2 0
Unknown if recurred or was ever gone 0 2 31

1880 Recurrence Type--1st can be simplified by leaving values of Disease-free (0), Never disease-free (70), and Unknown if recurred or was ever gone (99) as they are; if there were multiple values for the same case and one of those values was 70 then defaulting to Never disease-free; and recoding all other values as simply Recurred. I named this analytic variable Recurrence Status.
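As a rough illustration of the recoding rule above, here is a minimal Python sketch. The function name and the integer representation of the codes are hypothetical. The text only specifies precedence for code 70, so the handling of other mixed cases (for example a case with both 0 and 99) is an assumption of this sketch, not a documented rule.

```python
def recurrence_status(codes):
    """Collapse one case's 1880 Recurrence Type--1st codes into a single status."""
    codes = set(codes)
    if not codes:
        return None                      # no recurrence type recorded
    if 70 in codes:
        return "Never disease-free"      # 70 takes precedence over any other value
    if codes - {0, 99}:
        return "Recurred"                # any remaining code signifies a recurrence
    if 0 in codes:
        return "Disease-free"            # assumption: 0 outranks 99 when both occur
    return "Unknown if recurred or was ever gone"


examples = {
    "single disease-free": recurrence_status([0]),
    "multiple with 70": recurrence_status([70, 15]),
    "local recurrence": recurrence_status([15]),
    "unknown": recurrence_status([99]),
}
print(examples)
```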

Table XVIII: Here is the condensed version after having followed the above rules. It looks like the only ones who have a 1860 Recurrence Date--1st are the ones which also have a Recurred status for Recurrence Status (with 19 missing a 1860 Recurrence Date--1st). The only exception is 1 Never disease-free patient with a 1860 Recurrence Date--1st.
  Recur Date=FALSE Recur Date=TRUE
1654 0
Disease-free 215 0
Never disease-free 281 1
Recurred 19 124
Unknown if recurred or was ever gone 33 0

This explains why 1860 Recurrence Date--1st values are relatively rare in the data: they are specific to actual recurrences, which are not a majority of the cases. This is good from the standpoint of data consistency. Now we need to see to what extent the EMR codes agree with this.


Figure 9: In the above plot, the black line represents months elapsed between surgery and the first occurrence of an EMR code for secondary tumors, if any. The horizontal red line segments indicate individual 1860 Recurrence Date--1st. The dotted vertical red lines denote Recurred patients who are missing a 1860 Recurrence Date--1st. The blue horizontal line is the date of surgery and the dotted horizontal lines above and below it are ±3 months. Patients whose 1880 Recurrence Type--1st is Disease-free are highlighted in green, Never disease-free in yellow, and Recurred in red. There are 75 patients with multiple NAACCR records, and all records for these patients have been excluded from this plot.


The green highlights in fig. 9 are mostly where one would expect, but why are there 38 patients on the left side of the plot labeled Disease-free that have EMR codes for secondary tumors? Also, there are 32 patients with metastatic tumor codes earlier than 1200 RX Date--Surgery and of those 5 occur more than 3 months prior to 1200 RX Date--Surgery. Did they present with secondary tumors to begin with but remain disease-free after surgery? These are questions to ask the NAACCR registrar. The EMR codes are in better agreement with 1860 Recurrence Date--1st than the data elements in Appendix 3.2.1 and Appendix 3.2.2, so it might make sense to back-fill the few 1860 Recurrence Date--1st that are missing, but first I want to make sure I understand how to reliably distinguish on the EMR side genuine recurrences from secondary tumors that existed at presentation. The small number of cases affected either way lowers the priority of this issue. For now I will rely only on 1860 Recurrence Date--1st in constructing the analytical variable Recurrence.

Appendix 3.2.4 Death

Unlike diagnosis (Appendix 3.2.1), surgery (Appendix 3.2.2), and recurrence (Appendix 3.2.3), death dates exhibit good agreement between the various sources and can be used to supplement the data available from NAACCR.


Figure 10: Above are plotted times of death (if any) relative to 0390 Date of Diagnosis (horizontal blue line). The four data sources are Death, i2b2; Deceased per SSA; Expired; and 1760 Vital Status.

Table XIX: Date associated with 1760 Vital Status compared to death dates from each source (rows). The first five columns represent the number of patients falling into each of the time-bins (in days) relative to 1760 Vital Status. The last four columns indicate the number of patients for each possible combination of missing values (Left means the variable indicated in the row name is missing and Right means 1760 Vital Status is missing). Within each cell, the count is followed by the percentage in parentheses (of the total number of patients with both variables non-missing for the first five columns and of the total number of patients for the last four columns) and, where available, by the median difference in days. This table has only the 486 patients having a kidney cancer diagnosis in NAACCR. The last two rows represent the earliest and latest documentation of death, respectively, from Deceased per SSA, Expired, Death, i2b2, Earliest Death, and Latest Death.

| | Below -30 | -30 to 0 | same | 0 to 30 | Above 30 | Neither missing | Left missing | Right missing | Both missing |
|---|---|---|---|---|---|---|---|---|---|
| Deceased per SSA | 1 (10.0%) -31.0 | 0 (0.0%) | 9 (90.0%) 0.0 | 0 (0.0%) | 0 (0.0%) | 10 (2.1%) 0.0 | 83 (17.1%) | 8 (1.6%) | 385 (79.2%) |
| Expired | 1 (11.1%) -34.0 | 7 (77.8%) -5.0 | 1 (11.1%) 0.0 | 0 (0.0%) | 0 (0.0%) | 9 (1.9%) -5.0 | 84 (17.3%) | 8 (1.6%) | 385 (79.2%) |
| Death, i2b2 | 1 (1.3%) -31.0 | 0 (0.0%) | 73 (96.1%) 0.0 | 2 (2.6%) 5.5 | 0 (0.0%) | 76 (15.6%) 0.0 | 17 (3.5%) | 46 (9.5%) | 347 (71.4%) |
| Earliest Death | 1 (1.1%) -34.0 | 7 (7.5%) -5.0 | 85 (91.4%) 0.0 | 0 (0.0%) | 0 (0.0%) | 93 (19.1%) 0.0 | 0 (0.0%) | 47 (9.7%) | 346 (71.2%) |
| Latest Death | 0 (0.0%) | 0 (0.0%) | 91 (97.8%) 0.0 | 2 (2.2%) 5.5 | 0 (0.0%) | 93 (19.1%) 0.0 | 0 (0.0%) | 47 (9.7%) | 346 (71.2%) |

In table XIX the sum of the Neither missing and Left missing columns is always 93, which is the number of deceased patients according to NAACCR records alone. The Right missing column is the number of patients whose deceased status is recorded in the external source but not in NAACCR. For the last two rows Right missing means the total number of deceased patients not recorded in NAACCR but which can be filled in from one or more of the other sources. There are 47 such patients. Finally the last column, Both missing, is the number of patients presumed to be alive because none of the sources have any evidence for being deceased. The Left missing column indicates how many patients are reported deceased in NAACCR but not the other source. Though there are some missing for each individual data source, NAACCR is never the only source reporting them deceased: the Left missing values in the bottom two rows are both 0.

The left-side columns of table XIX show the prevalence and magnitude of discrepancies in death dates of the 93 patients that NAACCR and at least one other source agree are deceased. There are at most 10 such patients and for 9 of them the discrepancy is less than one month, with a median difference ranging from -5 to 5.5 days. The small number of discrepancies and the small magnitude of the ones that do occur justify filling in missing NAACCR death dates from the other sources.
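One way to derive Earliest Death and Latest Death and to fill in a missing NAACCR death date is sketched below in Python. It is an illustration under stated assumptions, not the project's code: the function name is hypothetical, the keys `s_death`, `e_dscdeath`, and `e_death` are the glossary short names, and the choice of the latest external date as the fill-in value follows the preference for the latest non-missing date for mortality stated at the start of Appendix 3.2.

```python
from datetime import date


def reconcile_death(naaccr_date, external_dates):
    """Summarize death dates for one patient across NAACCR and the external sources."""
    external = [d for d in external_dates.values() if d is not None]
    known = external + ([naaccr_date] if naaccr_date is not None else [])
    if not known:
        # No source reports a death: the patient is presumed alive
        return {"earliest": None, "latest": None, "filled": None}
    # Keep the NAACCR date when present; otherwise use the latest external date
    filled = naaccr_date if naaccr_date is not None else max(external)
    return {"earliest": min(known), "latest": max(known), "filled": filled}


# Hypothetical patient whose death is missing from NAACCR but present elsewhere
patient_a = reconcile_death(
    None,
    {"s_death": date(2016, 2, 10), "e_dscdeath": date(2016, 2, 5), "e_death": None},
)
# Hypothetical patient reported deceased by NAACCR only
patient_b = reconcile_death(
    date(2015, 7, 1), {"s_death": None, "e_dscdeath": None, "e_death": None}
)
print(patient_a["filled"], patient_b["filled"])
```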

Appendix 3.2.5 Whether or not the patient is Hispanic

Despite the overall agreement between 0190 Spanish/Hispanic Origin and Hispanic or Latino, there needs to be some way to adjudicate the minority of cases where the sources disagree. The following additional data elements can provide relevant information to form a final consensus variable for analysis: language_cd, Language, Ethnicity, race_cd, and Race (NAACCR 0160-0164). First, each of these variables is re-coded to Hispanic, non-Hispanic, and Unknown.

language_cd and Language are interpreted as being evidence in favor of Hispanic ethnicity if the language includes Spanish. English, ASL, and unknown values are all treated as Unknown ethnicity. However, a language other than the above (e.g. German) is interpreted as evidence for being non-Hispanic.

0190 Spanish/Hispanic Origin already has explicit designations of non-Hispanic and Unknown and all other values are interpreted as Hispanic. Hispanic or Latino is interpreted as Hispanic if TRUE and Unknown if FALSE (in contrast with most of the other elements, there is no way to distinguish a genuinely FALSE value of Hispanic or Latino from a missing one).
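The recoding rules for the language elements and for Hispanic or Latino can be sketched as below. This is a hedged Python illustration: the function names are hypothetical, and the raw strings such as "english" and "asl" are assumed spellings, since the actual values stored in language_cd and Language are not listed here.

```python
def recode_language(value):
    """Map a language value onto an ethnicity vote, following the rules above."""
    if value is None:
        return "Unknown"
    v = value.strip().lower()
    if "spanish" in v:
        return "Hispanic"          # any language entry that includes Spanish
    if v in {"", "english", "asl", "unknown"}:
        return "Unknown"           # uninformative about ethnicity
    return "non-Hispanic"          # any other language, e.g. German


def recode_hispanic_flag(flag):
    """Hispanic or Latino: TRUE is Hispanic; FALSE cannot be told apart from missing."""
    return "Hispanic" if flag else "Unknown"


votes = [recode_language(x) for x in ["Spanish", "English", "German", None]]
print(votes, recode_hispanic_flag(True), recode_hispanic_flag(False))
```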

Ethnicity is the whole ethnicity variable from i2b2 OBSERVATION_FACT and surprisingly it sometimes disagrees with Hispanic or Latino. A value of hispanic is interpreted directly. The values other, unknown, unknown/othe, i choose not, and @ are all interpreted as Unknown and any other value (at our site, arab-amer and non-hispanic) is interpreted as non-Hispanic. Rules are then applied to create unified variables from all these data elements. I have three such variables: Hispanic (NAACCR), Hispanic (broad), and Hispanic (strict).

Hispanic (NAACCR) only uses information from NAACCR.

Hispanic (broad) errs on the side of assigning Hispanic ethnicity if there is any evidence for it at all, then non-Hispanic, and Unknown only if there is truly no information from any source about the patient’s ethnicity. In particular, Hispanic is assigned if any non-missing values of language_cd, Language, 0190 Spanish/Hispanic Origin, Hispanic or Latino, and Ethnicity have a value of Hispanic; Unknown if all non-missing values of language_cd, Language, 0190 Spanish/Hispanic Origin, Hispanic or Latino, and Ethnicity are unanimous for Unknown ; and non-Hispanic otherwise.

Finally, Hispanic (strict) only assigns Hispanic if all non-missing values of 0190 Spanish/Hispanic Origin, Hispanic or Latino, and Ethnicity are unanimous for Hispanic. non-Hispanic is assigned if all non-missing values of 0190 Spanish/Hispanic Origin and Ethnicity are unanimous for non-Hispanic (the Hispanic or Latino element is not used for the reasons explained above) and neither Language nor language_cd votes for Hispanic. If neither of these conditions is met, Unknown is assigned.
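The broad and strict rules can be written down as two small functions. The Python below is a sketch under assumptions, not the code used for this report: each patient is represented as a dictionary of already re-coded votes, `None` stands for a missing value while "Unknown" counts as a non-missing vote, and the keys `language` and `ethnicity` are hypothetical names (`n_hisp`, `e_hisp`, and `language_cd` come from the text and glossary).

```python
def hispanic_broad(votes):
    """Hispanic if any source says so; Unknown only if no source is informative."""
    present = [v for v in votes.values() if v is not None]
    if "Hispanic" in present:
        return "Hispanic"
    if all(v == "Unknown" for v in present):
        return "Unknown"
    return "non-Hispanic"


def hispanic_strict(votes):
    """Assign a definite value only when the relevant sources are unanimous."""
    core = [votes.get(k) for k in ("n_hisp", "e_hisp", "ethnicity")]
    core = [v for v in core if v is not None]
    if core and all(v == "Hispanic" for v in core):
        return "Hispanic"
    # Hispanic or Latino (e_hisp) is left out: FALSE is indistinguishable from missing
    non_hisp = [votes.get(k) for k in ("n_hisp", "ethnicity")]
    non_hisp = [v for v in non_hisp if v is not None]
    language_votes = [votes.get(k) for k in ("language_cd", "language")]
    if (non_hisp and all(v == "non-Hispanic" for v in non_hisp)
            and "Hispanic" not in language_votes):
        return "non-Hispanic"
    return "Unknown"


# Hypothetical patient: registry and EMR say non-Hispanic, but the language is Spanish
patient = {"language_cd": "Hispanic", "language": "Unknown",
           "n_hisp": "non-Hispanic", "e_hisp": "Unknown", "ethnicity": "non-Hispanic"}
print(hispanic_broad(patient), hispanic_strict(patient))
```

Under these assumptions the example patient is Hispanic by the broad rule and Unknown by the strict rule, which is the kind of disagreement the two variables are designed to expose.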

There is an additional step for patients coded as non-Hispanic where they are further classified into non-Hispanic white and Other. For Hispanic (NAACCR) this is determined by whether Race (NAACCR 0160-0164) is White. For Hispanic (broad) the criterion is whether at least one of Race (NAACCR 0160-0164) or race_cd is White. For Hispanic (strict) it’s whether both Race (NAACCR 0160-0164) and race_cd are White.

In the end, Hispanic (NAACCR), Hispanic (broad), and Hispanic (strict) all have the same levels, but differ in the proportion of patients assigned to each.

Table XX: The agreement and disagreement between Hispanic (NAACCR), Hispanic (broad), and Hispanic (strict). The bottom 7 rows represent the kidney cancer patients currently without NAACCR records, so for them Hispanic (NAACCR) does not exist.
Hispanic (NAACCR) Hispanic (broad) Hispanic (strict) N Patients
Hispanic Hispanic Hispanic 213
Hispanic Hispanic Unknown 141
non-Hispanic white non-Hispanic white non-Hispanic white 190
non-Hispanic white non-Hispanic white Unknown 59
non-Hispanic white Hispanic Unknown 11
non-Hispanic white non-Hispanic white Other 3
Other Other Other 23
Other Other Unknown 13
Other Hispanic Unknown 2
Other non-Hispanic white Other 1
Unknown Unknown Unknown 9
Unknown Hispanic Unknown 4
Unknown non-Hispanic white Unknown 3
Unknown Other Unknown 1
- Hispanic Hispanic 512
- non-Hispanic white - 440
- Unknown Unknown 363
- Hispanic Unknown 254
- - Other 76
- - Unknown 6
- non-Hispanic white Unknown 3

Of the 673 with NAACCR records (all, not just the 486 meeting the current criteria, see sec. 1) only 22 have differences between Hispanic (NAACCR) and Hispanic (broad) but 229 have differences between Hispanic (NAACCR) and Hispanic (strict).

According to Hispanic (NAACCR), Hispanic (broad), and Hispanic (strict) respectively, 52.6%, 55.1%, and 31.6% of the NAACCR patients are Hispanic. At 55.1% Hispanic (broad) comes the closest to the 2016 Census estimates for San Antonio. Also, anecdotal evidence suggests that Hispanic ethnicity is under-reported. This argues for using Hispanic (broad) when possible, but I will keep Hispanic (strict) available for sensitivity analysis.
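The counts and percentages quoted above can be re-derived from the first 14 rows of table XX (the patients with NAACCR records). The short Python check below copies those rows from the table; the variable and function names are only for illustration.

```python
# (Hispanic (NAACCR), Hispanic (broad), Hispanic (strict), N patients), copied from table XX
rows = [
    ("Hispanic", "Hispanic", "Hispanic", 213),
    ("Hispanic", "Hispanic", "Unknown", 141),
    ("non-Hispanic white", "non-Hispanic white", "non-Hispanic white", 190),
    ("non-Hispanic white", "non-Hispanic white", "Unknown", 59),
    ("non-Hispanic white", "Hispanic", "Unknown", 11),
    ("non-Hispanic white", "non-Hispanic white", "Other", 3),
    ("Other", "Other", "Other", 23),
    ("Other", "Other", "Unknown", 13),
    ("Other", "Hispanic", "Unknown", 2),
    ("Other", "non-Hispanic white", "Other", 1),
    ("Unknown", "Unknown", "Unknown", 9),
    ("Unknown", "Hispanic", "Unknown", 4),
    ("Unknown", "non-Hispanic white", "Unknown", 3),
    ("Unknown", "Other", "Unknown", 1),
]

total = sum(n for _, _, _, n in rows)
differs_broad = sum(n for naaccr, broad, _, n in rows if naaccr != broad)
differs_strict = sum(n for naaccr, _, strict, n in rows if naaccr != strict)


def percent_hispanic(column):
    """Percentage of NAACCR patients coded Hispanic in the given column (0, 1, or 2)."""
    hispanic = sum(row[3] for row in rows if row[column] == "Hispanic")
    return round(100 * hispanic / total, 1)


print(total, differs_broad, differs_strict)               # 673 22 229
print([percent_hispanic(column) for column in range(3)])  # [52.6, 55.1, 31.6]
```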

 

Appendix 3.3 What is going on with the first contact variable?


Figure 11: Weird observation: 0580 Date of 1st Contact (red) is almost always between 1750 Date of Last Contact (black) and 0390 Date of Diagnosis (blue), although, as we understood it, diagnosis is usually made on a biopsy sample and is therefore dated during or after surgery. If first contact is some kind of event after first diagnosis, what is it?


Surgery (1200 RX Date--Surgery) frequently occurs both before and after first contact (0580 Date of 1st Contact).

Appendix 3.4 What is the coverage of valid records in each data source?

This section is no longer relevant but is still available for reference in the kidneycancer_181009 snapshot of this document

Appendix 3.5 Which variables are near-synonymous?

This section is no longer relevant but is still available for reference in the kidneycancer_181009 snapshot of this document

 

Appendix 4 Variable descriptions

Here are descriptions of the variables referenced in this document.


patient_num
patient_num :

patient_num


n_rectype
1880 Recurrence Type–1st :

1880 Recurrence Type–1st

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1880


n_rx3170
3170 RX Date–Most Defin Surg :

3170 RX Date–Most Defin Surg; Date of most definitive surgery.

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3170


n_surgreason
1340 Reason for No Surgery :

1340 Reason for No Surgery

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1340


n_ddiag
0390 Date of Diagnosis :

0390 Date of Diagnosis

Link: http://datadictionary.naaccr.org/default.aspx?c=10#390


n_dsurg
1200 RX Date–Surgery :

1200 RX Date–Surgery

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1200


n_lc
1750 Date of Last Contact :

1750 Date of Last Contact; Last Contact

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1750


n_vtstat
1760 Vital Status :

1760 Vital Status; Vital Status, Registry; This gets individually converted to a TTE variable by data.R

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1760


n_cstatus
1770 Cancer Status :

1770 Cancer Status; Cancer Status, Registry

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1770


n_drecur
1860 Recurrence Date–1st :

1860 Recurrence Date–1st

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1860


n_seer_kcancer
Kidney and Renal Pelvis :

Kidney and Renal Pelvis; SEER site


n_kcancer
Kidney, NOS :

Kidney, NOS; KC, Registry


e_surgonc
Surgical Oncology :

Surgical Oncology; Visit to Surgical Oncology; Visit to Surgical Oncology (UT Health)


n_dsdisc
3180 RX Date–Surgical Disch :

3180 RX Date–Surgical Disch

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3180


v037_tnm_pth_dscrptr
0920 TNM Path Descriptor :

0920 TNM Path Descriptor

Link: http://datadictionary.naaccr.org/default.aspx?c=10#920


v055_tnm_cln_dscrptr
0980 TNM Clin Descriptor :

0980 TNM Clin Descriptor

Link: http://datadictionary.naaccr.org/default.aspx?c=10#980


n_a7sg
3430 Derived AJCC-7 Stage Grp :

3430 Derived AJCC-7 Stage Grp

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3430


n_a7md
3422 Derived AJCC-7 M Descript :

3422 Derived AJCC-7 M Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3422


n_a7m
3420 Derived AJCC-7 M :

3420 Derived AJCC-7 M

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3420


n_a7nd
3412 Derived AJCC-7 N Descript :

3412 Derived AJCC-7 N Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3412


n_a7n
3410 Derived AJCC-7 N :

3410 Derived AJCC-7 N

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3410


n_a7td
3402 Derived AJCC-7 T Descript :

3402 Derived AJCC-7 T Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3402


n_a7t
3400 Derived AJCC-7 T :

3400 Derived AJCC-7 T

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3400


n_a6sg
3000 Derived AJCC-6 Stage Grp :

3000 Derived AJCC-6 Stage Grp

Link: http://datadictionary.naaccr.org/default.aspx?c=10#3000


n_a6md
2990 Derived AJCC-6 M Descript :

2990 Derived AJCC-6 M Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2990


n_a6m
2980 Derived AJCC-6 M :

2980 Derived AJCC-6 M

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2980


n_a6nd
2970 Derived AJCC-6 N Descript :

2970 Derived AJCC-6 N Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2970


n_a6n
2960 Derived AJCC-6 N :

2960 Derived AJCC-6 N

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2960


n_a6td
2950 Derived AJCC-6 T Descript :

2950 Derived AJCC-6 T Descript

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2950


n_a6t
2940 Derived AJCC-6 T :

2940 Derived AJCC-6 T

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2940


n_ct
0940 TNM Clin T :

0940 TNM Clin T

Link: http://datadictionary.naaccr.org/default.aspx?c=10#940


n_cn
0950 TNM Clin N :

0950 TNM Clin N

Link: http://datadictionary.naaccr.org/default.aspx?c=10#950


n_cm
0960 TNM Clin M :

0960 TNM Clin M

Link: http://datadictionary.naaccr.org/default.aspx?c=10#960


n_csg
0970 TNM Clin Stage Group :

0970 TNM Clin Stage Group

Link: http://datadictionary.naaccr.org/default.aspx?c=10#970


n_psg
0910 TNM Path Stage Group :

0910 TNM Path Stage Group

Link: http://datadictionary.naaccr.org/default.aspx?c=10#910


n_pm
0900 TNM Path M :

0900 TNM Path M

Link: http://datadictionary.naaccr.org/default.aspx?c=10#900


n_pn
0890 TNM Path N :

0890 TNM Path N

Link: http://datadictionary.naaccr.org/default.aspx?c=10#890


n_pt
0880 TNM Path T :

0880 TNM Path T

Link: http://datadictionary.naaccr.org/default.aspx?c=10#880


n_dob
0240 Date of Birth :

0240 Date of Birth

Link: http://datadictionary.naaccr.org/default.aspx?c=10#240


birth_date
birth_date :

birth_date


n_marital
0150 Marital Status at DX :

0150 Marital Status at DX; Marital Status, Registry

Link: http://datadictionary.naaccr.org/default.aspx?c=10#150


e_marital
Marital Status :

Marital Status; Marital Status, i2b2


n_sex
0220 Sex :

0220 Sex; Sex, Registry

Link: http://datadictionary.naaccr.org/default.aspx?c=10#220


sex_cd
sex_cd :

sex_cd; Sex, i2b2


a_n_race
Race (NAACCR 0160-0164) :

Race (NAACCR 0160-0164); Race, registry; To obtain a combined NAACCR race code for analysis, it is necessary to combine NAACCR variables 0160 Race - 0164 Race into one and then recode it to the closest match among White, Black Asian, Pac Islander, Other, and Unknown


race_cd
race_cd :

race_cd; Race, i2b2


n_hisp
0190 Spanish/Hispanic Origin :

0190 Spanish/Hispanic Origin; Hispanic Origin, Registry

Link: http://datadictionary.naaccr.org/default.aspx?c=10#190


e_hisp
Hispanic or Latino :

Hispanic or Latino; Hispanic Origin, i2b2


e_death
Death, i2b2 :

Death, i2b2; Death, i2b2; Death according to the combined i2b2 records from all sources


s_death
Deceased per SSA :

Deceased per SSA; Death, SSN


e_dscdeath
Expired :

Expired; Discharge Disposition


n_brthplc
0250 Birthplace :

0250 Birthplace

Link: http://datadictionary.naaccr.org/default.aspx?c=10#250


n_mets
2850 CS Mets at DX :

2850 CS Mets at DX

Link: http://datadictionary.naaccr.org/default.aspx?c=10#2850


n_fc
0580 Date of 1st Contact :

0580 Date of 1st Contact; Can also be date of clinical (as opposed to path) diagnosis

Link: http://datadictionary.naaccr.org/default.aspx?c=10#580


n_mult
0446 Multiplicity Counter :

0446 Multiplicity Counter

Link: http://datadictionary.naaccr.org/default.aspx?c=10#446


a_tdeath
Death :

Death; Death


a_hsp_strict
Hispanic (strict) :

Hispanic (strict); Hispanic (strict); Code patients as Hispanic or non-Hispanic only if all available evidence is unanimous, otherwise err on the side of Unknown


a_hsp_broad
Hispanic (broad) :

Hispanic (broad); Hispanic (broad); Code patients as Hispanic if there is even the slightest evidence they are, otherwise assume they are non-Hispanic, and only if there is really zero evidence either way return Unknown


a_tdiag
Diagnosis :

Diagnosis; Diagnosis


a_trecur
Recurrence :

Recurrence; Recurrence; Analytic master variable for time to recurrence. Based on n_drecur


a_tsurg
Surgery :

Surgery; Surgery


a_hsp_naaccr
Hispanic (NAACCR) :

Hispanic (NAACCR); Hispanic, registry; The n_hisp variable binned to Hispanic, non-Hispanic, and Unknown


a_n_recur
Recurrence Status :

Recurrence Status; Recurrence Status; This is the main analytic variable for recurrence. This is based on n_rectype but with all values that signify recurrence binned together, leaving Unknown if recurred or was ever gone, Never disease-free, Disease-free, and Recurred.


start_date
start_date :

start_date


e_kc_i9
189.0 Malignant neoplasm of kidney, except pelvis :

189.0 Malignant neoplasm of kidney, except pelvis; KC ICD9, i2b2; 189.0 Malignant neoplasm of kidney, except pelvis


e_kc_i10
C64 Malignant neoplasm of kidney, except renal pelvis :

C64 Malignant neoplasm of kidney, except renal pelvis; KC ICD10, i2b2; C64 Malignant neoplasm of kidney, except renal pelvis


n_rx1260
1260 Date of Initial RX–SEER :

1260 Date of Initial RX–SEER; Date of initiation of the first course therapy for the tumor being reported, using the SEER definition of first course. See also Date 1st Crs RX CoC [1270].

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1260


n_rx1270
1270 Date of 1st Crs RX–CoC :

1270 Date of 1st Crs RX–CoC; Date of initiation of the first therapy for the cancer being reported, using the CoC definition of first course. The date of first treatment includes the date a decision was made not to treat the patient.

Link: http://datadictionary.naaccr.org/default.aspx?c=10#1270


e_i9neph
V45.73 Acquired absence of kidney :

V45.73 Acquired absence of kidney; V45.73 Acquired absence of kidney


e_i10neph
Z90.5 Acquired absence of kidney :

Z90.5 Acquired absence of kidney


e_hstneph
HX NEPHRECTOMY :

HX NEPHRECTOMY; Surgical history


v008_scndr_nrndcrn_inactive
C7B-C7B Secondary neuroendocrine tumors (C7B) :

C7B-C7B Secondary neuroendocrine tumors (C7B); C7B-C7B Secondary neuroendocrine tumors (C7B)


v009_mlgnt_unspcfd
C79 Secondary malignant neoplasm of other and unspecified sites :

C79 Secondary malignant neoplasm of other and unspecified sites; C79 Secondary malignant neoplasm of other and unspecified sites


v009_mlgnt_unspcfd_inactive
C79 Secondary malignant neoplasm of other and unspecified sites :

C79 Secondary malignant neoplasm of other and unspecified sites; C79 Secondary malignant neoplasm of other and unspecified sites


v010_rsprtr_dgstv
C78 Secondary malignant neoplasm of respiratory and digestive organs :

C78 Secondary malignant neoplasm of respiratory and digestive organs; C78 Secondary malignant neoplasm of respiratory and digestive organs


v010_rsprtr_dgstv_inactive
C78 Secondary malignant neoplasm of respiratory and digestive organs :

C78 Secondary malignant neoplasm of respiratory and digestive organs; C78 Secondary malignant neoplasm of respiratory and digestive organs


v011_unspcfd_mlgnt
C77 Secondary and unspecified malignant neoplasm of lymph nodes :

C77 Secondary and unspecified malignant neoplasm of lymph nodes; C77 Secondary and unspecified malignant neoplasm of lymph nodes


v011_unspcfd_mlgnt_inactive
C77 Secondary and unspecified malignant neoplasm of lymph nodes :

C77 Secondary and unspecified malignant neoplasm of lymph nodes; C77 Secondary and unspecified malignant neoplasm of lymph nodes


v012_unspcfd_mlgnt
196 Secondary and unspecified malignant neoplasm of lymph nodes :

196 Secondary and unspecified malignant neoplasm of lymph nodes; 196 Secondary and unspecified malignant neoplasm of lymph nodes


v012_unspcfd_mlgnt_inactive
196 Secondary and unspecified malignant neoplasm of lymph nodes :

196 Secondary and unspecified malignant neoplasm of lymph nodes; 196 Secondary and unspecified malignant neoplasm of lymph nodes


v013_rsprtr_dgstv
197 Secondary malignant neoplasm of respiratory and digestive systems :

197 Secondary malignant neoplasm of respiratory and digestive systems; 197 Secondary malignant neoplasm of respiratory and digestive systems


v013_rsprtr_dgstv_inactive
197 Secondary malignant neoplasm of respiratory and digestive systems :

197 Secondary malignant neoplasm of respiratory and digestive systems; 197 Secondary malignant neoplasm of respiratory and digestive systems


v014_mlgnt_spcfd
198 Secondary malignant neoplasm of other specified sites :

198 Secondary malignant neoplasm of other specified sites; 198 Secondary malignant neoplasm of other specified sites


v014_mlgnt_spcfd_inactive
198 Secondary malignant neoplasm of other specified sites :

198 Secondary malignant neoplasm of other specified sites; 198 Secondary malignant neoplasm of other specified sites


language_cd
language_cd :

language_cd; Language, i2b2


e_lng
Language :

Language


e_eth
Ethnicity :

Ethnicity; EMR demographics


 

v055_tnm_cln_dscrptr

Test section

Appendix 5 Audit trail

| sequence | time | type | name | hash |
|---|---|---|---|---|
| 0001 | 2018-10-16 17:22:29 | info | sessionInfo | - |
| 0002 | 2018-10-16 17:22:29 | this_script | exploration.spin.Rmd | 4dff158 |
| 0003 | 2018-10-16 17:23:10 | rdata | .depdata[ii] = “dictionary.R.rdata” | dbb49fe969d73218eddfdbe85670344e |
| 0004 | 2018-10-16 17:26:03 | rdata | .depdata[ii] = “data.R.rdata” | b9233974e7a29b4c5d27a1603013438d |
| 0003.0001 | 2018-10-16 17:22:34 | info | sessionInfo | - |
| 0003.0002 | 2018-10-16 17:22:34 | this_script | dictionary.R | 4dff158 |
| 0003.0003 | 2018-10-16 17:22:35 | file | inputdata = “local/in/HSC20170563N_kc_v200.int.csv” | caa0a30bd87cd77659b118986cab73a4 |
| 0003.0004 | 2018-10-16 17:22:46 | file | inputdata = “local/in/HSC20170563N_kc_v200.int.csv” | caa0a30bd87cd77659b118986cab73a4 |
| 0003.0005 | 2018-10-16 17:22:46 | file | rawdct = “local/in/meta_HSC20170563N_kc_v200.int.csv” | 77226290495672d030798e64327fe10a |
| 0003.0006 | 2018-10-16 17:22:46 | file | tpldct = “datadictionary_static.csv” | dc40ce6053d4edc459cb6a240f1cf8c6 |
| 0003.0007 | 2018-10-16 17:22:49 | info | sessionInfo | - |
| 0003.0008 | 2018-10-16 17:22:49 | save | save | - |
| 0004.0001 | 2018-10-16 17:23:15 | info | sessionInfo | - |
| 0004.0002 | 2018-10-16 17:23:15 | this_script | data.R | 4dff158 |
| 0004.0003 | 2018-10-16 17:23:26 | rdata | .depdata = “dictionary.R.rdata” | dbb49fe969d73218eddfdbe85670344e |
| 0004.0004 | 2018-10-16 17:23:26 | file | levels_map_file = “levels_map.csv” | dade16a6df40d86457f024f52781e3b2 |
| 0004.0005 | 2018-10-16 17:24:07 | seed | project_seed | - |
| 0004.0006 | 2018-10-16 17:25:35 | info | sessionInfo | - |
| 0004.0007 | 2018-10-16 17:25:37 | save | save | - |
| 0004.0003.0001 | 2018-10-16 17:22:34 | info | sessionInfo | - |
| 0004.0003.0002 | 2018-10-16 17:22:34 | this_script | dictionary.R | 4dff158 |
| 0004.0003.0003 | 2018-10-16 17:22:35 | file | inputdata = “local/in/HSC20170563N_kc_v200.int.csv” | caa0a30bd87cd77659b118986cab73a4 |
| 0004.0003.0004 | 2018-10-16 17:22:46 | file | inputdata = “local/in/HSC20170563N_kc_v200.int.csv” | caa0a30bd87cd77659b118986cab73a4 |
| 0004.0003.0005 | 2018-10-16 17:22:46 | file | rawdct = “local/in/meta_HSC20170563N_kc_v200.int.csv” | 77226290495672d030798e64327fe10a |
| 0004.0003.0006 | 2018-10-16 17:22:46 | file | tpldct = “datadictionary_static.csv” | dc40ce6053d4edc459cb6a240f1cf8c6 |
| 0004.0003.0007 | 2018-10-16 17:22:49 | info | sessionInfo | - |
| 0004.0003.0008 | 2018-10-16 17:22:49 | save | save | - |
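
For readers who receive a copy of the input files, the hash column of the audit trail can in principle be used to confirm that a file matches the one used for this report. The sketch below assumes the 32-character values are MD5 digests, which their length suggests but which the audit trail does not state. The function name `verify_hash` is hypothetical and not part of the project scripts. The shorter values in the `this_script` rows are not file digests of this kind and are not covered here.

```r
# Hypothetical check; `verify_hash` is an illustrative name, not a project function.
# Assumes the audit-trail hash is an MD5 digest of the file contents.
verify_hash <- function(path, expected) {
  # tools::md5sum() returns NA when the file cannot be read
  actual <- unname(tools::md5sum(path))
  !is.na(actual) && identical(actual, expected)
}

# File name and hash copied from the audit trail above; returns FALSE if the
# file is absent from the working directory or its contents differ.
verify_hash("levels_map.csv", "dade16a6df40d86457f024f52781e3b2")
```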

  1. UT Health San Antonio