In a move that reverberated through much of the cancer research community, the National Cancer Institute (NCI) recently announced that it had removed all prostate-specific antigen (PSA) data from its current Surveillance, Epidemiology, and End Results (SEER) data submission and associated SEER-Medicare programs. This action was in response to problems that included the reporting of inaccurate PSA values and misinterpretation of PSA variables.1
Speaking with The ASCO Post, David F. Penson, MD, Director of the Center for Surgical Quality and Outcomes, Department of Urologic Surgery, at Vanderbilt University, Nashville, cautioned, “Withdrawal of these data from SEER will have major impacts on the field of prostate cancer research and for administrators within the health-care community.”
Researchers Need to Rethink Large Databases
Initiated in 1973 by the NCI, the SEER program is one of the oldest and most trusted registries in the world. The SEER program is legislatively mandated to collect population-based cancer data from 17 regions across the United States, representing approximately 28% of the nation’s population. According to Dr. Penson, before researchers pursue further analyses using the SEER and SEER-Medicare programs, the data will have to be redesigned in light of the detected problems.
“Journals will not be able to accept SEER studies that rely on the PSA data as a primary variable of interest, including those that use PSA in risk-stratification systems to adjust for confounding or in cohort identification,” said Dr. Penson.
The NCI is currently reviewing the entire data set and implementing protocols to ensure the quality of the data in the future. But Dr. Penson pointed out that the greater problem is the impact that the flawed PSA data have on the existing urologic literature.
“SEER and SEER-Medicare data have been used to address a variety of clinical issues in prostate cancer, and many of the papers written on subjects ranging from comparative effectiveness of treatments, to what patients can expect for outcomes, to issues surrounding PSA screening, and so on, have been based on SEER data. So, now their results come into question,” stressed Dr. Penson.
Loss of Trust Needs to Be Addressed
Dr. Penson raised a larger issue for health-care analyses beyond PSA data: If the data from SEER, one of the most highly regarded registries in the world, are problematic, that calls into question all other large data sets, such as those used by Medicare or Medicaid, the nation’s largest insurers. Despite this recent discovery, he noted, the SEER databases offer valuable information for addressing difficult clinical and health-care policy questions. “These data banks have real-world longitudinal data from large numbers of patients that are highly generalizable, giving us answers that we could never get in prospective studies of prostate cancer patients,” said Dr. Penson.
That said, Dr. Penson called into question the data-collection process. “I know that SEER and SEER-Medicare collect breast cancer data on hormone status, so what are the ramifications if those data are not collected properly? We live in this era of large powerful data sets, and they have a tendency to seduce us by their size and sheer amount of information. But we also forget some of the basics of clinical research and how these data are collected,” he said.
“For instance, in the case of SEER PSA, the program is mandated to collect data (such as Gleason scores) at the individual sites by registrars, and I think that the administrative data on the utilization of tests is very good,” he said. “The problem with these studies is that they rely on clinical characteristics, such as PSA values and comorbidity indices, which these data sets are not truly designed to collect. After all, these data points are not necessary for payment, and the law does not mandate their collection.”
Primary Data Needed
Dr. Penson continued, “These large administrative data sets have tremendous value for our field if we use them properly, and that is the key. We have to stop publishing secondary data analyses from these large administrative data sets just because the data are relatively easy to obtain and analyze.”
He noted that researchers should not expect these data sets to answer questions that they were not designed to address. “What happened with SEER is not NCI’s fault, nor is it the registries’ fault. They started out on a mandated task to study the incidence, prevalence, and general outcomes in cancer. But as a clinical research community, we’ve taken these data and expected them to answer tough clinical questions. And as time passed, we’ve actually become quite glib about the data,” he commented.
“There are researchers in the community who buy a data set and run a single model with 20 variables and write 20 separate papers around that one data set,” Dr. Penson said. “This is done without first questioning whether these data have the power to answer that many questions. We need to reserve these data sets for research questions that they can answer in a valid and reliable manner.”
Dr. Penson ended with a cautionary comment: “As a community, we have to do the really hard work and collect primary data. It’s time for us to stop doing big data fishing expeditions and taking the easy way out.”
Disclosure: Dr. Penson reported no potential conflicts of interest.
1. National Cancer Institute: PSA values and SEER data: SEER data, 1973–2012 (November 2014 submission). Available at seer.cancer.gov/data/psa-values.html. Accessed June 8, 2015.