Dr. Harrell is the founding chair of the Vanderbilt Biostatistics department. Since 2003 he has been Professor of Biostatistics at the Vanderbilt University School of Medicine, and he served as department chair from 2003 to 2017. He is Expert Statistical Advisor for the Office of Biostatistics at FDA CDER. He is an Associate Editor of Statistics in Medicine, a member of the Scientific Advisory Board for Science Translational Medicine, and a member of the Faculty of 1000 Medicine. He is a Fellow of the American Statistical Association and the 2014 winner of the Association's W. J. Dixon Award for Excellence in Statistical Consulting. His specialties are the development of accurate prognostic and diagnostic models, model validation, clinical trials, observational clinical research, cardiovascular research, technology evaluation, pharmaceutical safety, Bayesian methods, quantifying predictive accuracy, missing data imputation, and statistical graphics and reporting.
Full Transcript
John Bailer: At some point in your life you or someone you love will have an illness that you hope will be addressed with a medical treatment. The decision to use this treatment will be based on medical research, you hope. The reliability of medical research will be the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I’m John Bailer. Stats and Stories is a production of Miami University’s Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me in the studio is regular panelist Richard Campbell of Media, Journalism and Film. Our regular moderator Rosemary Pennington is not available today. Our guest today is Frank Harrell. Harrell is the founding chair of the Vanderbilt Biostatistics department, as well as Expert Statistical Advisor for the Office of Biostatistics at the FDA's Center for Drug Evaluation and Research. He has written impactful books on modeling, developed software packages to facilitate analyses, and authored hundreds of scientific papers, and we are delighted to have him join us today. Frank, thank you for being here.
Frank Harrell: Thanks for having me.
Bailer: Let me start by asking what first attracted you to working in biomedical research?
Harrell: Well, that’s a good question, because like so many things in our lives, you stumble upon something, or things happen by random chance. So, I was bored after finishing my sophomore year in high school, it was summer, and I was looking for something to do, and my mother suggested I volunteer at the Veterans Hospital in Birmingham, Alabama. I started out helping cart patients around in wheelchairs. But the group I was working with, which was Gastroenterology, also had a research lab, so they introduced me to what their research was about, which was the esophagus and the stomach, especially looking at pressure waves to study the peristalsis and contractions in the esophagus. And they had a lot of data they needed help with, computing the standard deviations, so I got interested in biomedical research because of that accidental stumbling upon gastroenterology in Birmingham.
Bailer: And you were doing data analysis even then?
Harrell: Even then, as a high school student. I didn’t know what I was doing, but I started looking at Snedecor and Cochran’s book, learning just enough to be very dangerous. Then something remarkable happened, which is that the head of biostatistics at UAB, David Hearst, had an open-door policy, and this student shows up in his office and he’s willing to help. I ended up going to college at UAB and taking several graduate biostatistics courses there when I was an undergrad. Then I got interested in physiology because I was interested in physics. So, I started combining physiology and biology with statistics and really enjoyed that combination.
Bailer: Sounds like the perfect jumping off point to go into biostatistics.
Harrell: Yes, and the advice that David Hearst gave me when I was thinking about graduate school was the best advice I ever took. I’ve had some good advice that I didn’t take, but I took his advice, and he was way ahead of his time. Dave said, go to the University of North Carolina, get a PhD in biostatistics, and do a supporting program in computer science and biomedical engineering. It was unbelievable advice. What I did was biostatistics at UNC, and my supporting program was biomedical engineering and physiology. So, the advice David Hearst gave me in Birmingham was just so good I can hardly even talk about it.
Richard Campbell: So, what would you say today to a student who wanted to do your job? If you got to them early, what would you tell them to do to prepare for the career you have?
Harrell: Well, I think your world view has something to do with it, because if you have a world view where you don’t take things at face value, but have a degree of skepticism about everything you encounter, and I was always that way, I think that bodes well for a life as a statistician. But of course you need a good math background. You need to really like math, but you also really need to like asking questions. So, skepticism and curiosity in equal doses, with math and quantitative skill, is a great combination.
Campbell: It’s interesting you say that. Those are the things I think we want from our journalism students as well. Skepticism and curiosity, just without the math part.
Bailer: I was going to say, we have skills too here, Richard, come on.
Campbell: This is partly why we’re doing this podcast: to encourage our students in journalism to know a little bit more about statistics and numbers, because they’re the ones who are going to write the stories.
Harrell: Yeah.
Bailer: So, Frank, a lot of your work has been in medical contexts, in departments that are either associated with schools of medicine or are departments of biostatistics at different places. What has that been like, and how did that lead to your engagement with the FDA?
Harrell: Well, I’ve always liked working with physician researchers and have found them great to work with. Medicine has a lot of data. Even before the data expansion you see now with the internet and everything, there was already a lot of data in medical research. So, having data around was a great attraction for me, because I always like to have useful applications of statistics that really matter, and medicine matters because it involves making people live longer or feel better. So, I think those who work in medical statistics have a little extra spring in their step. Public health, environmental health, and many other fields are good and worthwhile, but medicine is a field I can feel very good about. So, it’s a combination of the type and quantity of data, the cause, and then the physician investigators, who are just exceptional collaborators.
Campbell: I read something on your blog, which certainly wasn’t written for someone with my background, but there was a sentence that struck me. You talked about the failure to adjust for variables that are available in medical practice, and I wondered what that was. And you said that sometimes that’s intentional, and I think this has to do with randomized clinical trials, but I didn’t know what those variables were that you would adjust for in medical practice. Can you talk a little bit about that?
Harrell: Richard, I think the context you’re speaking of is one where the people doing the research are biased in favor of a certain technology, and they make themselves not know something, so that what they’re evaluating as a new technology appears to have a lot of new information value. We see this very often in genetics research and in biomarker research, where somebody will have a new biomarker, or it could be a molecular signal from genes, or proteins, or the microbiome, and they are trying to show that this new technology has signals in it that are informative for diagnosing, treating, or following patients for their prognosis, and they refuse to adjust for, or take into account, the information that a clinician would already have. One review article by [] gave an egregious example where cancer biomarker studies were trying to show that the markers added new value for prognosis in certain cancers, and these were cancers such as lung cancer where the treatment is usually surgery, and a key variable is, when a surgeon takes out part of a lung, how much of the tumor could not be removed. It might be invading the spinal cord or something else, so you can’t remove it. So the residual tumor volume is a key predictor of the patient's ultimate outcome, and many studies refuse to adjust for that even though it was measured, because if you include the residual tumor volume it is such a powerful prognostic factor that these new high-tech measurements would not really add to it. So, they dumbed down the analysis to make the new measurements look good.
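As an aside for readers, here is a minimal simulation sketch of the problem Harrell describes; it is not from the episode, the variable names (a residual-tumor-volume measure and a correlated biomarker) are hypothetical, and the data are simulated. The point is that a biomarker can show a respectable apparent performance on its own while adding almost nothing once the clinical variable it partly recycles is included.

```python
# A minimal, hypothetical sketch: a "new" biomarker can look impressive when the
# analysis ignores a strong clinical variable, yet add little once it is adjusted for.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
residual_tumor = rng.normal(size=n)                      # strong clinical predictor
biomarker = 0.8 * residual_tumor + rng.normal(size=n)    # mostly recycles the same signal
logit = 2.0 * residual_tumor                             # outcome driven by the clinical variable
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def cstat(X):
    """Apparent c-statistic (AUC) of a logistic model on the given predictors."""
    model = LogisticRegression().fit(X, y)
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

print("biomarker alone:        ", round(cstat(biomarker.reshape(-1, 1)), 3))
print("clinical variable alone:", round(cstat(residual_tumor.reshape(-1, 1)), 3))
print("clinical + biomarker:   ", round(cstat(np.column_stack([residual_tumor, biomarker])), 3))
```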
Campbell: Very good. That’s a good example that I understood. Thank you very much.
[Laughter]
Bailer: So, is that one of those examples of the concerns you voice about the reliability of biomedical research?
Harrell: That one has a little bit to do with reliability, and a little more to do with bias and overstating the value of new measurements. But there are other ways that reliability, or unreliability, manifests itself. I think many of us are getting familiar with the term p-hacking, which you could have called analysis to a foregone conclusion, or finding analyses that justify a certain belief. It sounds like the way politicians tend to operate: they say, this is what I want to say, now get me the data to back it up.
[Laughter]
Harrell: So, the idea of doing analyses until you find an answer that’s publishable, or something that will get a press release and make you famous, is all too common in research. There are other kinds of unreliability that are not quite so obvious, where someone will rationalize away their original plan. They thought this drug was going to lower blood pressure; it didn’t do that, so they say, we really wanted to see if it lowers heart attack risk. So, moving the goalposts, which is what Andrew Gelman calls the garden of forking paths, is just another form of overaggressive analysis, or too many investigator degrees of freedom. If you give an investigator too much freedom, and you don’t have a really specific statistical plan, human nature being what it is, people will say, oh, I made a mistake in the original plan, so I want to change the plan to do this other analysis. And of course, if the statistician wanted to get the attention of the investigator, the statistician could say, well, you know this analysis that you published six months ago with a p-value of 0.03? I think maybe there was a problem with that plan and we should reevaluate it. You wouldn’t see the same acceptance of a reevaluation when the original result was a positive finding.
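A small Monte Carlo sketch of why unconstrained investigator degrees of freedom matter; this is an illustration constructed for this transcript, with purely simulated, effect-free data.

```python
# With no true treatment effect at all, testing many endpoints and reporting the
# best one yields a "significant" p-value far more often than the nominal 5%.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_trials, n_endpoints, n_per_arm = 2000, 10, 50
false_positive_somewhere = 0

for _ in range(n_trials):
    treated = rng.normal(size=(n_endpoints, n_per_arm))   # pure noise: no real effect
    control = rng.normal(size=(n_endpoints, n_per_arm))
    pvals = [ttest_ind(treated[k], control[k]).pvalue for k in range(n_endpoints)]
    if min(pvals) < 0.05:
        false_positive_somewhere += 1

print("Chance at least one endpoint 'works':",
      round(false_positive_somewhere / n_trials, 2))       # roughly 0.40, not 0.05
```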
Campbell: I listened to the webinar you did earlier this year, and you talked about the uses of interactive graphics in your work, and I looked at some of that and it was really fascinating. But you’re also critical of statisticians, I think, and maybe scientists, who throw out the old methods that work. So, I’m kind of interested in a combination here: what older stuff is working, and what’s the promise of the interactive graphics you’re interested in? It was just fascinating to see what was available there in your webinar.
Harrell: That’s actually a very difficult question, Richard, in that I think there are some old-style graphics, static graphics, that are really informative and beautiful, and of course some of the greatest examples are in Edward Tufte’s books. He has an example of a beautiful graphic that has a little flap of paper in one of the corners, and you open the flap and you get another level of detail. That’s the paper version of the sort of interactive graphic I was showing. You can do that with older technology, but the newer technology gives you more options and it’s easier to execute. I think static graphics have a lot to offer. Also, where the idea of throwing out old methods has come to the fore even more is the current excitement about machine learning, where every time someone studies the older methods of regression analysis, going back to [] and Pearson and so on, the regression analysis looks pretty darn good. So somebody will say, we need to throw some machine learning algorithm at some data, and if it’s not the kind of data that’s really primed for machine learning (image recognition applications, for example, are great for machine learning), then often the new methods don’t have any advantage over the older ones, and sometimes the newer methods actually work worse.
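A brief illustration of the comparison described above, using simulated tabular data with a simple additive signal; the particular models and settings are arbitrary choices made for this sketch, not a claim about any specific study.

```python
# On ordinary tabular data with a smooth, additive signal, plain logistic
# regression often matches or beats a flexible machine-learning method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, p = 500, 10
X = rng.normal(size=(n, p))
logit = X[:, 0] - 0.5 * X[:, 1] + 0.25 * X[:, 2]          # linear, additive truth
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:20s} cross-validated AUC: {auc:.3f}")
```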
Bailer: I’d like to just follow up quickly on your comment about secondary endpoints that might all of a sudden be elevated to primary interest after looking at the data. Don’t these trials require some registration of what’s being done and what’s going to be the focus of the study?
Harrell: Yes, to have scientific validity you need to make the study not look like a fishing expedition. So, there needs to be a plan, there needs to be some level of pre-specification. It can go too far though. When you listen to the discussion among investigators of what’s going to be the primary endpoint, co-primary endpoint, secondary endpoint, co-secondary... they make all these designations, and they don’t always make any sense, because obviously the degree of primacy of an endpoint is chosen by what the statistical power is for that particular endpoint instead of making what’s most important to the patient be the primary endpoint. And the other thing that drives this discussion, to get more technical, is what’s called alpha spending, or control of the type one error probability. In traditional designs, unlike the newer Bayesian designs, there’s a wish to control some sort of overall family-wise type one error probability, so you start allocating the alpha, so that your total probability of making a claim of some treatment effect is 0.05 no matter how many comparisons you did or how many endpoints you examined. The Bayesians would look at this in a completely different way, which is that we want to assess the evidence for each question, and the evidence for one question is not tilted by the evidence for another question. So, it’s a dramatically different way to think about evidence, and the attempt to preserve the overall false-alarm rate, the type one error, is at the heart of a lot of confusion, and in some cases some very arbitrary designations of primary and secondary analyses.
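A toy sketch of the alpha-spending idea, assuming a hypothetical trial with four comparisons and a simple Bonferroni split of the 0.05 budget; real trials use more elaborate allocation schemes, and the endpoint labels here are made up.

```python
# With k endpoints, a traditional design splits the overall 0.05 false-alarm
# budget among them (Bonferroni here for simplicity).
k_endpoints = {"primary": 1, "co-primary": 1, "secondary": 2}   # hypothetical design
total_alpha = 0.05
k = sum(k_endpoints.values())

per_test_alpha = total_alpha / k            # 0.0125 each for k = 4
print(f"{k} comparisons, each tested at alpha = {per_test_alpha:.4f}")

# Family-wise error if each test were instead run at the full 0.05
# (assuming independence): 1 - 0.95**k, well above the intended 0.05.
fwer_unadjusted = 1 - (1 - total_alpha) ** k
print(f"Unadjusted family-wise error with {k} tests: {fwer_unadjusted:.3f}")
```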
Bailer: You’re listening to Stats and Stories, and today we’re talking with Frank Harrell of Vanderbilt University about biomedical research. Frank, I’d like to change gears just a little bit and ask you about your very impactful social media presence. You’re a blogger and you have an active Twitter account, and lots of people are following your insights. I like your statistical thought of the day; you had one recently that said machine learning is to statistical models as precision medicine is to standard clinical information. So, I’ve got a two-part question. First, what led you to think about engaging with social media to this degree, and second, can you help unpack that statistical thought I just quoted?
Harrell: Great questions. The thing about social media is that it has become a complete shock to me, because I’m one of those few people who boycott Facebook and won’t use it for any purpose.
Campbell: Hey join me, I’m there with you.
Harrell: Alright!
Campbell: And John is too.
Harrell: I personally think Facebook has harmed our society in ways that very few foreign powers would be able to accomplish, but that’s for another day. I thought I’d never be on Twitter or have a blog. I got the urge to have a blog mainly because I’m frustrated with the publication model, how slow journals are, and how slow peer review is, and I’m also very alarmed at the predatory, profit-oriented publishing houses. With blogs, you can be very informal and you can get the message out very quickly, and then when someone points out a mistake you can fix the darn thing; you don’t have to declare it a final copy. So, I like the idea of having something that’s more dynamic. And then I found out you can’t have a blog without being on Twitter, because nobody will know about the blog unless you tweet about it. So, Twitter was a total surprise. I’m still shocked. But it’s really made the blog more successful. What’s really been a shock to me is the educational environment: I have probably learned ten times more from Twitter than I ever thought I would. People are always pointing out to me either flaws in my logic, or here’s a paper you missed, here’s a handout from somebody’s course that you missed, here’s a preprint or an archive you didn’t know about, here’s an upcoming talk, watch out for this because it’s going to be relevant to your research. The amount of information that gets taught to me, and the number of other web sources and publications I get alerted to on Twitter, has been stunning.
Campbell: I’ve also seen you criticize Twitter though, and I think it led to the blog, right? Or am I wrong? You had some criticisms of Twitter; do you want to talk about that a little bit?
Harrell: I don’t remember criticizing Twitter, per se.
Campbell: I think it was about the length of the tweets, I read on your blog.
Harrell: Yeah, there were some technical things I was criticizing. Originally the length was so restricted I had a hard time getting a thought in there. But I also think sometimes even gifted educators don’t use Twitter very effectively when they start Tweetorials. Each chunk is well thought out and informative, but because it takes multiple tweets, if you come back to it a day later it gets interrupted by a lot of other things. So I’ve been urging people to post Tweetorials as a cohesive topic in a single stream on our new discussion board, datamethods.org, and then you can tweet that out to say, look there and also comment there, because the comments will be recorded in one place right after your topic. So, the main concern I had about tweets was just people breaking them up into little pieces that are hard to connect to each other.
Bailer: I see. Well, I don’t think that Richard or I expected to be doing podcasts ten years ago either. It’s the kind of thing you get drawn into. And part of what I’ve been seeing, and I think part of the reason you joined Twitter, is that in the things you’ve been tweeting and blogging about there’s a great deal of exposition. I mean, you’re talking about the important ideas that you want others to try to understand and follow. What’s one of the most important ideas recently that you’ve thought about and wanted to weigh in on?
Harrell: Well, I’m always weighing in about machine learning that’s not done well. That’s been kind of a theme for several months. Also, a constant theme is the overhype of precision medicine. And then a lot of my tweets relate to trying to get people to understand evidence on a scale that’s more relevant, and to understand what a reversed conditional is. The rules of probability, and what it is you’re actually interested in calculating, dictate your choice of method: frequentist versus Bayesian statistics. So, everybody knows the probability of a US Senator being a woman; I think 21 out of 100 Senators are women. The probability of a woman being a Senator is much less than that: the probability of a randomly chosen woman in the US population being a Senator is about 21 out of 160 million. Those two probabilities are just the reverse conditionals of each other, and they have very different interpretations and very different values. And people are not realizing, in the realm of statistical evidence, that the probability of getting data that are surprising if a certain hypothesis is true, which is a p-value, is completely different from the probability that something is true given the data. Those are reverse conditionals. So I do a lot of tweets that in one way or another relate to this idea of transposed conditionals, and how your choice of what you’re calculating the probability of, versus what you want to take for granted or condition on, is all-important.
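The Senator example can be worked out directly; the figures below are the rough ones quoted above, not exact counts.

```python
# Arithmetic sketch of the transposed-conditional point, using the rough numbers
# quoted in the episode (about 21 of 100 Senators are women; about 160 million US women).
women_senators = 21
senators = 100
us_women = 160_000_000                    # rough figure used in the episode

p_woman_given_senator = women_senators / senators      # 0.21
p_senator_given_woman = women_senators / us_women      # about 1.3e-7

print(f"P(woman | senator) = {p_woman_given_senator:.2f}")
print(f"P(senator | woman) = {p_senator_given_woman:.2e}")
# Same two events, reversed conditioning, wildly different probabilities --
# the same distinction as P(data | hypothesis), a p-value, versus
# P(hypothesis | data), which is what a Bayesian posterior addresses.
```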
Bailer: So, the analogy you gave, machine learning is to stat models as precision medicine is to standard clinical information, touches on two of the general themes you just mentioned. Can you talk a little bit about what precision medicine is, and what the hype associated with it is? And then give a quick summary of machine learning and what characterizes it when it’s not done well?
Harrell: Great questions. So, precision medicine has multiple forms, and some forms are a little bit more biologically well reasoned than others. An example of that would be: you might take a biopsy of a tumor and genotype the tumor, and knowing that might dictate that certain drug pathways are more pertinent to destroying that tumor. So there is genetics-guided chemotherapy and immunotherapy, and that is a kind of precision medicine where there’s been some success. The kind that is less biologically directed is where a treatment comparison is done and the study is barely big enough to estimate the average treatment effect, much less a differential treatment effect, or what’s called heterogeneity of treatment effect. It’s very common for clinical researchers to say, let’s take this treatment comparison and subdivide it by the sex of the patient, the race of the patient, age, various symptoms at presentation, and try to see if the treatment effect is different in these subgroups. These subgroups are usually not very biologically well thought out, and since the clinical trial is barely large enough to even estimate the average treatment effect, it’s not going to be large enough to estimate the subgroup-specific treatment effects, or the differential treatment effect, as we usually like to assess interactions in the statistical model. And a lot of the precision medicine of that second type has not even been attempted to be reproduced, or when reproduction has been attempted it has not reproduced. It’s really a grand fishing expedition, and it really doesn’t align with biology, it doesn’t align with pharmacology and how drugs are metabolized. There are a lot of problems there, so sometimes I tweet that precision medicine has really turned out to be precision capitalism, because it’s really pouring resources into trying to refine small effects that don’t matter to public health, and it ends up costing patients more because they’re paying higher amounts of money for targeted therapies using high-technology molecular markers, and it’s just redirecting resources with no demonstration of benefit to the public health.
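For readers who want to see what "assessing interactions in the statistical model" can look like, here is a minimal sketch with hypothetical simulated data; it is not an analysis from any trial Harrell discusses, and the subgroup variable is made up.

```python
# Differential treatment effect is usually assessed as a treatment-by-subgroup
# interaction in one regression model, not by separate underpowered subgroup analyses.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400                                          # a trial "barely big enough" for the main effect
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "sex": rng.integers(0, 2, n),                # 0/1 coding; hypothetical subgroup
})
# True model: a modest treatment benefit, no real treatment-by-sex interaction
lin = -0.2 - 0.5 * df.treatment + 0.1 * df.sex
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-lin)))

fit = smf.logit("outcome ~ treatment * sex", data=df).fit(disp=False)
print(fit.summary().tables[1])
# The 'treatment:sex' coefficient is the differential effect; note how wide its
# confidence interval is relative to the main treatment effect -- the trial carries
# far less information about interactions than about the average effect.
```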
Campbell: In your webinar you talked about, and you mentioned this a little bit here, the reproducibility crisis in science. Could you talk a little more about that, and about allowing others to reproduce your work? That was something you talked about too that I found very interesting.
Harrell: Right, so there’s a lack of reproducibility for many reasons. Like we’ve touched on with machine learning: you could have a machine learning algorithm that’s really overlearning, really learning noise, which is called overfitting, and it doesn’t generalize to other settings, so that is not reproducible. A very common cause of nonreproducible research is just having a poor experimental design: it was sort of hopeless from the get-go, or the sample size just wasn’t big enough, so the results are unreliable and imprecise. And then we talked about p-hacking, or the garden of forking paths, analysis to a foregone conclusion; that will get a publication and pad somebody’s CV, but when you try to reproduce it, it very seldom is reproducible. So good experimental design, not having an inadequate information base from an inadequate sample size, and not overanalyzing or massaging or torturing the data until they confess are all very important. And then you touched, Richard, on the idea of technical reproducibility, using the right tools. There has been a revolution in statistical computing in the last ten years, where a number of tools have come into common use that make it easier for statisticians to script an entire analysis by embedding the analysis code into the report. So, you run the code and the code will reproduce all of the figures and graphics in the report, and also reproduce some of the sentences, or some of the numbers inserted into sentences, like confidence intervals and p-values, and so on. So, the idea is making a report that is reproducible because it has the full script of all the analysis steps; you might even have a command in the script that says here’s where the data are found, download them from the internet, then feed them to the analysis. By having all of this scripted, anyone can re-run it and get the same result that you got. Whether it is right or wrong, you could have made a mistake, but at least they’ll be able to get the same results you got because they have the code there. So that’s scripted analysis and reproducible statistical reports, and with Regression Modeling Strategies, the entire book is reproducible. I can reproduce all the tables and figures with a single command and regenerate the PDF for the book.
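A toy end-to-end version of the scripted-analysis idea, written in Python purely as an illustration (Harrell's own tooling is R-based); the data here are simulated with a fixed seed, standing in for the "download the data" step, so re-running the script reproduces the same figure and the same numbers in the report text.

```python
# One script regenerates the figure and the numbers quoted in the report,
# so anyone can re-run it and obtain identical output.
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # no display needed; write files only
import matplotlib.pyplot as plt

rng = np.random.default_rng(2024)        # fixed seed => identical results on re-run
control = rng.normal(loc=120, scale=15, size=200)   # e.g. a blood-pressure-like measure
treated = rng.normal(loc=115, scale=15, size=200)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

# Figure regenerated from code, not pasted in by hand
plt.hist(control, bins=30, alpha=0.6, label="control")
plt.hist(treated, bins=30, alpha=0.6, label="treated")
plt.legend()
plt.xlabel("measurement")
plt.savefig("figure1.png")

# Report text with the computed numbers inserted, never typed manually
with open("report.txt", "w") as f:
    f.write(f"Mean difference {diff:.1f} (95% CI {ci[0]:.1f} to {ci[1]:.1f}); "
            f"see figure1.png.\n")
```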
Bailer: Well I’m afraid that’s all the time we have for this episode of Stats and Stories. Frank, thank you so much for being here.
Harrell: John and Richard it was a great pleasure, thanks for having me.
Bailer: Stats and Stories is a partnership between Miami University’s Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple Podcasts, or other places where you find podcasts. If you’d like to share your thoughts on our program, send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.