Last quarter, for my Advanced AI class, I performed some machine learning experiments on the Surveillance, Epidemiology, and End Results (SEER) database. It was my first in-depth study using machine learning, and I was particularly primed for the topic, having just read The Signal and the Noise by Nate Silver. While Nate does not specifically address machine learning, he is a clear supporter of Bayesian-based statistics, so the topic was apropos.
My biggest takeaway was perhaps that applying a complex machine-learning algorithm is the easy part, thanks to established software libraries and toolkits. Preparing the data for analysis and understanding the results are the most time-consuming and complicated aspects of the task. I worked closely with an oncologist, who did most of the heavy lifting with the clinical analysis.
I used the Weka software to run the algorithms, and I created a small Python script (available on GitHub) to prepare the data. It should, at minimum, allow somebody to reproduce my results. I had grandiose plans of writing a true SEER library, but unfortunately the script is tailored to the problem I was researching, so it isn’t packaged as a library at all. Feel free to improve it 🙂
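For anyone unfamiliar with Weka, its native input is the ARFF format: a header that declares each attribute, followed by comma-separated data rows. Here is a minimal sketch of what writing such a file looks like in Python. It is not my actual script; the `write_arff` helper, the attribute names, and the rows are all made up for illustration and are not real SEER fields or records.

```python
import io


def write_arff(relation, attributes, rows, out):
    """Write rows to `out` in Weka's ARFF format.

    `attributes` is a list of (name, spec) pairs, where spec is either the
    string "numeric" or a list of the allowed nominal values.
    """
    out.write(f"@relation {relation}\n\n")
    for name, spec in attributes:
        if isinstance(spec, str):
            # Numeric (or other built-in) attribute type.
            out.write(f"@attribute {name} {spec}\n")
        else:
            # Nominal attribute: allowed values listed inside braces.
            out.write(f"@attribute {name} {{{','.join(spec)}}}\n")
    out.write("\n@data\n")
    for row in rows:
        # ARFF marks a missing value with "?".
        out.write(",".join("?" if v is None else str(v) for v in row) + "\n")


# Hypothetical attributes and invented rows (assumptions, not SEER data).
attributes = [
    ("age_at_diagnosis", "numeric"),
    ("grade", ["1", "2", "3", "4"]),
    ("survival_class", ["short", "medium", "long"]),
]
rows = [
    (54, "3", "short"),
    (61, None, "long"),
]

buf = io.StringIO()
write_arff("toy_example", attributes, rows, buf)
arff_text = buf.getvalue()
print(arff_text)
```

In a real conversion, most of the effort goes into deciding which fields to keep and how to bin them into nominal values, which is exactly the data preparation work that ate up my time.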
I had at one point hoped to publish this paper, but it doesn’t look like I’ll have time in the foreseeable future. So, I’m self-publishing 🙂 But seriously, this study has not been formally peer-reviewed by a journal. I’m posting it here because: 1) I spent a lot of time on this project and 2) hopefully someone will come across it and be inspired to look further into the problem.
Here is the abstract:
Prognosis for stage IV (metastatic) breast cancer is difficult for clinicians to predict. This study examines the SEER data set from 1988-2003 and selects patients who were initially diagnosed with stage IV breast cancer and who have died as a direct result of the cancer. After developing a SEER conversion utility, seer2arff, we create three predictive models that use a supervised, passive, offline technique to classify prognosis (survival time). The results of the algorithms from the Weka toolkit were: Bayes Network, 64.2% accurate; J4.8 Decision Tree, 63.5% accurate; and an Artificial Neural Network, 62.9% accurate. The J4.8 Decision Tree selected attributes that confirm the rationale of ongoing clinical studies. This study is the first to apply machine learning techniques to this category of patients with the SEER data set.