David Banks is a statistician at Duke University. He is a fellow of the American Statistical Association and the Institute of Mathematical Statistics, and a former editor of the Journal of the American Statistical Association and Statistics and Public Policy. His major areas of research include risk analysis, syndromic surveillance, agent-based models, dynamic text networks, and computational advertising.
Full Transcript
Rosemary Pennington: There is a universe of data available for researchers and journalists to access in order to help people better understand their world. And sometimes that data can lead to big stories with major policy implications. That possibility is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film and the American Statistical Association. Joining me in the studio are regular panelists John Bailer, Chair of Miami's Statistics Department, and Richard Campbell, Chair of Media, Journalism and Film. Today's guest is David Banks, Professor of the Practice of Statistics at Duke University. Banks is also a Fellow of the Institute of Mathematical Statistics and the Royal Statistical Society. He's a former Editor of the Journal of the American Statistical Association and founding Editor of Statistics and Public Policy. David, thank you so much for being here today.
David Banks: Thank you.
Pennington: David, your career has spanned decades, with you moving between academia and government agencies. How did statistics become a calling for you?
Banks: Oh gosh, well, the glib answer is that I've always wanted to prove other people wrong at the .001 level, but the larger story is that on the day I graduated from college I was queued up to receive my diploma, standing behind a woman I knew who was a math major. I was an anthropology major, and I asked her what she intended to do next, and she told me that she was going to go to North Carolina State University to get a degree in statistics. And I was astonished. I didn't realize that was a thing you could do; I didn't know that it was a career. But I'd always enjoyed the statistical arguments in the courses I had taken, and so in the fullness of time I decided that I would try to get a degree in statistics, and it just kept going.
John Bailer: Hey David, you formed the Statistics and Public Policy journal. Why was such a journal needed? What was the gap out there that this addressed?
Banks: Oh, it's a huge gap. If you believe in evidence-based public policy, and all good-hearted, right-thinking people have to think that, then yes, we need to have a way to assess the quality of the evidence in support of or against some public policy initiative. And up until recently the American Statistical Association had no such organ that would officially do that. Presumably statisticians working for public policy groups would occasionally weigh in and do some sort of analysis that would be pertinent, but there wasn't an organized, formal mechanism to do this and put it in a peer-review setting. The new journal, I hope, goes a long way to address that deficiency.
Bailer: In looking at the journal, you had mentioned this series of articles that the journal had done looking at possible cancer clusters in Florida. Can you give us a little bit of background behind that story?
Banks: In late 2013 I was giving a talk at the University of West Florida. I'd been invited to do that by a friend of mine, Raid Amin, with whom I had gone to graduate school back at Virginia Tech. During the course of the day he mentioned that he had done an analysis of Florida pediatric cancer data and had found what he thought were hot spots: places where there were more cancer cases than would be expected statistically. He had contacted the Florida State Department of Health about this, and he had been told that it was a very interesting study, but since it had not been replicated, there would be no further investigation on the part of the Florida State Department of Health. And he was wondering how one might proceed. And I suggested that we replicate the analysis: take his data, update it with the more recent data that had come in over the intervening three years, and give that data set to five other groups of statisticians who all work in epidemiology, syndromic surveillance, and related areas. We would have each of them analyze the data separately, using their own models and their own methodologies, and see if there was a general consensus on the results.
Bailer: How often has this been done in the past?
Banks: As nearly as I can see, not at all. Lance Waller, the Chair of the Department of Biostatistics at Emory University, said in his discussion of the five analyses that there had been attempts to compare methodologies, but those were methodological comparisons rather than attempts to see if different methodologies agreed on the same findings, which is a different perspective. Instead of trying to determine which test is the most powerful, which test is the most sensitive, which test makes the most reasonable assumptions, the question is: "Do many different minds, many different analyses, converge to a common conclusion?"
Richard Campbell: In reading some of the articles about these Florida pediatric cancer studies, as a non-statistician I come across the phrase "cleaning data," which is something that had to be done here. Can you talk, for my benefit and for the general audience that listens to us, about what that means, and what it meant in terms of this study?
Banks: Almost all data sets need to be cleaned and prepped and otherwise adjusted before you can make a sensible analysis. For example, the data that was available from the Florida Association of Pediatric Tumor Programs, which was the source of the counts of cancer cases, was tabulated according to zip code tabulation areas. And that's not exactly the same thing as a zip code; it's an artificial construct in which you try to get a batch of census blocks that mostly have the same zip code, and you use that as your unit of analysis.

Zip code areas are sometimes a little weird. For example, a large building may be the only building in the entire zip code, and it really wouldn't make sense to expect a huge office building to contain lots of pediatric cancer patients. Similarly, a zip code can be a long rural route that threads through the countryside, corresponding to what the postal service uses to deliver mail. That again is probably not the right thing to use. So a zip code tabulation area is an attempt to find census blocks that meaningfully correspond to clusters of geographically nearby people.

Sometimes you wind up finding reports in the Florida Association of Pediatric Tumor Programs data for zip codes that don't exist in Florida. That's a strange thing that probably needs to be cleaned out. There are certainly plausible ways that could arise: for example, you could imagine that a family just moved to Florida from somewhere else, and when they took their child for examination they listed their old zip code, and that was what got entered into the system. That type of record needs to be removed, because it's certainly misleading. Occasionally you'll find situations in which there are multiple zip codes, or zip code tabulation areas, for the same patient. That can arise, for example, when a divorced couple is sharing custody.

There are lots of other ways in which data cleaning might be necessary. For example, there's a complex numerical classification of cancer types. If you're talking about lymphomas, there are actually many sub-categories of lymphoma, and they are assigned different classification numbers. Occasionally, probably through some minor data-entry error, you wind up with a child who has a diagnosis that isn't reasonable; it may not be a cancer at all, but it winds up in the Florida data set. So that's some of the data cleaning that goes on. It can get much more complicated than that, but does that give you a sense of things, Richard?
Campbell: Yes, that's very good. Thank you.
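As a concrete illustration of the steps Banks describes, here is a minimal sketch in Python with pandas. The file names, column names, and the diagnosis-code range are hypothetical placeholders, not the actual FAPTP schema:

```python
import pandas as pd

# Hypothetical case-level extract; file and column names are illustrative.
cases = pd.read_csv("fl_pediatric_cancer.csv")
valid_fl_zctas = set(pd.read_csv("fl_zctas.csv")["zcta"].astype(str))

# 1. Drop records whose ZCTA does not exist in Florida (e.g., a family
#    that reported an out-of-state zip code at intake).
cases["zcta"] = cases["zcta"].astype(str).str.zfill(5)
cases = cases[cases["zcta"].isin(valid_fl_zctas)]

# 2. Resolve patients listed under more than one ZCTA (e.g., shared
#    custody): keep one record per patient so no child is counted twice.
cases = (cases.sort_values("diagnosis_date")
              .drop_duplicates(subset="patient_id", keep="first"))

# 3. Drop diagnosis codes outside the pediatric cancer classification,
#    which usually signal data-entry errors; the numeric range here is
#    a placeholder for the real classification scheme.
cases = cases[cases["diagnosis_code"].between(1, 12)]

# Aggregate to counts per ZCTA for the spatial analysis.
counts = cases.groupby("zcta").size().rename("n_cases").reset_index()
```

Each rule is a judgment call as much as code; the point of cleaning is that the downstream counts mean what the models assume they mean.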
Pennington: You're listening to Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics. Today's topic is stats and public policy. Our special guest is David Banks, Professor of the Practice of Statistics at Duke University. Now, David, you were talking about how you had these five studies that were attempting to replicate the findings of Amin's work, which showed that there were these cancer clusters, and the Department of Health in Florida said, "Great, interesting, but we can't do anything because you need to replicate it." You have these five studies go out and see if they can replicate it. So my question for you is: these researchers, all using their own statistical models, did they come to a consensus that, yes, these cancer clusters existed? And how much overlap did you see between these various articles?
Banks: There wasn't consensus on all of the hot spots. Each of these studies identified some that were in common and some that were different. Part of it depends upon the level of resolution of the analysis, and part upon the types of assumptions that were made. Some of the papers controlled for race, some did not, and there are good reasons to take account of that in different ways. Some looked at spatio-temporal models, where you're looking at space and time together and finding that there may be certain years when there is a higher rate of cancer in particular areas than you'd expect. Other papers only looked at the spatial distribution of disease, accumulated over time. But all five papers did identify the same regions, southwest of Jacksonville and in the Miami area, that looked like hot spots according to all five different methodologies. And this is not necessarily a complete surprise. It's well known that many types of cancer are more prevalent in urban areas; somehow urban areas are just more toxic, and so for many different kinds of cancer you have increased rates in metropolitan areas. Nonetheless these rates were all about 10% greater than the baseline rate for pediatric cancers in those areas, which is medically significant, and all five methods agreed. So that seemed compelling.
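To make "more cases than expected" concrete, here is a toy, one-area version of a hot-spot test: a simple Poisson comparison of an area's observed count against a baseline rate. This is not any of the five papers' actual methods, and all of the numbers are invented:

```python
from scipy.stats import poisson

def hotspot_pvalue(observed: int, population: int, baseline_rate: float) -> float:
    """One-sided Poisson test: how surprising is this area's count
    if its children get cancer at exactly the baseline rate?"""
    expected = population * baseline_rate
    # P(X >= observed) under the null; sf(k) = P(X > k).
    return poisson.sf(observed - 1, expected)

# Hypothetical numbers: an area with 50,000 children and a baseline
# pediatric cancer rate of 17 cases per 100,000 children per period.
p = hotspot_pvalue(observed=14, population=50_000, baseline_rate=17 / 100_000)
print(f"expected ~8.5 cases, observed 14, p = {p:.3f}")
```

The published analyses do much more than this, adjusting for race, modeling space and time jointly, and accounting for the hundreds of areas tested simultaneously, which is exactly where the five methodologies diverge.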
Bailer: What were the baseline rates for these types of cancers?
Banks: I don't know that. It sometimes is broken out… some of these cancers differ in their rates by race and by gender, and so there's a more complicated answer than just giving you one number. Additionally, some people looked at all of the cancers pooled together, whereas others looked at lymphomas, blastomas, and other types of cancers separately, and so that also makes it hard to give a one-number answer.
Pennington: I was going to say, as I was reading Amin's explanation of why this reanalysis was happening, he mentioned in the story's set-up that the health officials in Florida made no policy decisions based upon his original study, which included data from 2000 to 2007 showing these clusters. Did anything substantive develop after the publication of these five articles that did suggest that there were these clusters in these places, or at least some of these clusters?
Banks: No. Nothing happened. And that's not necessarily the wrong thing, although I do think that a more constructive conversation could be had. The Florida State Department of Health released a press release saying that their scientists had studied the matter raised in these five papers and concluded that there was no public health problem. I did invite their scientists to write a paper and publish it in Statistics and Public Policy explaining their reasoning, but I have not heard back from them.
Bailer: It's in the mail. I find this to be such a remarkable idea; it's such a cool study to think about doing this type of replication. And with your colleague's work being criticized for lack of replication, this certainly does feel like a slam dunk in terms of having done that. Do you think part of the problem is the challenge of identifying the exposure?
Banks: That is certainly true, John, but let me also emphasize that I don't think that the Florida State Department of Health's decision was necessarily wrong. I think their rationalization was maybe a little bogus, but you want to proceed slowly and carefully in this type of matter, and it's often very difficult to draw a straight-line causal inference between this exposure and this disease. If you think back on the cases in the past where that actually has been done, it took a pretty rare set of circumstances. For example, the link between asbestos workers and mesothelioma is something in which you had a group of people working in a very rare, specialized occupation coming down with a rare, very specific disease, and there you can make a pretty clean conclusion. Similarly, the people who painted the radium dials on watches were being exposed to radiation and developing strange cancers; that again was a fairly small, distinct group having a special outcome. Homosexual men and AIDS is another case where it was very clear: a rare Kaposi's sarcoma associated with a very specific lifestyle. In general we're not in a situation like John Snow's Broad Street pump, where you make a little map and it's totally obvious where the source of the problem is. To say that all of Miami has an elevated rate of pediatric cancers is probably not specific enough to be actionable. I do think that the Florida State Department of Health probably had some obligation to look more deeply. They have better data than we do on the people who actually contracted these diseases, so, in principle, they have a better microscope: they can dig deeper and find out about the types of water and the exposure these children might have had to atmospheric carcinogens from potential incinerators in the area. They could ask better questions than we can. That's something I think they ought to do, and unfortunately they chose not to.
Campbell: How much of a problem do you run into… You mention, in writing about this, being a statistician and not a doctor, and the kind of clash between the statistics profession, with the data and evidence that you provide, and the medical profession. You talk about the difficulty sometimes in getting access to confidential medical records, and how not being a doctor may have been a problem there.
Banks: Well, even if I were a doctor I would have to pass muster under the HIPAA laws; a general physician just can't walk in and get access to this type of data under any circumstances. Nonetheless, the medical community has a way of looking at the world, and it's different from the way statisticians look at the world. Part of that may be a result of medical school training, in which a statistician is not generally a research leader but simply somebody who runs code that the chief investigator, who's a physician, thinks ought to be run. That's probably a little unfair; there are lots of good partnerships between statisticians and physicians. But there are also many stories about people who just weren't able to make it work. So there is a handicap there. The Florida State Department of Health certainly rejected Dr. Amin's findings in the first round on the grounds that he wasn't an epidemiologist, which could describe a statistician, although I imagine that they were interpreting it as meaning an M.D./Ph.D. type. Nonetheless, the onus is probably on the statistician to go out and make friends and be persuasive and have such clear data that it's hard to say no.
Bailer: You know, one thing that seemed like a natural follow-up… I think your point that identifying the source and the exposure for some of these outcomes is really tricky is well taken, but certainly these clusters, these hot spots that were identified, would suggest a follow-up study. I mean a case-control study, as a way to really dive in and see if the source could be identified. That would have been a natural next step.
Banks: I completely agree, and it's even more attractive now, because late in November of last year the E.P.A. released a whole batch of data on atmospheric and waterborne pollutants. They're called fate-and-transport models, and they say: if an incinerator at this location releases so many tons of some pollutant per year, then, given the prevailing wind patterns and the weight of the pollutant and its chemical characteristics, where will that pollutant wind up? What's the distribution of the amount of pollution from that point source across the neighboring counties? And with that information for both water and air pollution, one could really build a very detailed pollution map, say for Florida or the entire United States, and then look to see whether or not one can tease out a causal relation between the cancer rates in those areas and the amount and kind of pollution in those areas.
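A rough sketch of the kind of calculation such a model performs, using the textbook Gaussian plume formula for an air pollutant. The dispersion coefficients and source parameters below are illustrative assumptions, not the EPA's actual fate-and-transport models:

```python
import numpy as np

def ground_concentration(Q, u, x, y, H):
    """Ground-level concentration (g/m^3) downwind of a point source,
    from the textbook Gaussian plume formula with ground reflection.
    Q: emission rate (g/s); u: wind speed (m/s); x: downwind distance (m);
    y: crosswind offset (m); H: effective stack height (m)."""
    # Briggs-style dispersion widths for roughly neutral conditions;
    # real fate-and-transport models fit these to local meteorology.
    sigma_y = 0.08 * x / np.sqrt(1 + 0.0001 * x)
    sigma_z = 0.06 * x / np.sqrt(1 + 0.0015 * x)
    return (Q / (np.pi * u * sigma_y * sigma_z)
            * np.exp(-y**2 / (2 * sigma_y**2))
            * np.exp(-H**2 / (2 * sigma_z**2)))

# Hypothetical incinerator: 10 g/s of pollutant, 5 m/s wind, 40 m stack.
for km in (1, 5, 10):
    c = ground_concentration(Q=10.0, u=5.0, x=1000.0 * km, y=0.0, H=40.0)
    print(f"{km:>2} km downwind: {c:.2e} g/m^3")
```

Summing surfaces like this over every known point source, for water as well as air, is what would produce the detailed pollution map Banks describes.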
Pennington: You're listening to Stats and Stories, and our discussion today focuses on stats and public policy. Our guest is Duke University's David Banks. David, going back to the five articles you published, what was the media coverage, the news coverage, of the publication of these articles like?
Banks: There were a number of people in Florida who picked up on this, a number of television stations and print media that called me and other people who authored these papers to get our sense of things. I thought most of the media coverage tended to be a little hyperbolic and to exaggerate the actual concern. I would like to go cautiously: as John was saying, this is a strong argument for reexamining the problem and looking at better data. I don't think it's a basis for public warnings. I don't think it's much of a basis for criticizing the Florida State Department of Health; I think they're acting fairly responsibly, aside from claiming that their scientists have reviewed it secretly. So I don't object to the news coverage. I recognize that this is a process, and one of the ways that statistics makes the world better is by working with media, and media will have its own obligations, its own code of ethics, and its own interest in trying to pump a story out. If you write a story saying that a couple of statisticians think there is some evidence of a mild increase in carcinogenicity in certain areas of the state, and these are often pretty big areas, that's not going to get a lot of eyeballs, and it's not going to serve the public good by focusing attention on a problem. So perhaps they have to exaggerate a little bit.
Campbell: I was going to say, you mentioned hyperbole. Is this a problem that you see in terms of… I mean, I think you're identifying that the journalist has to tell a story, right? And the story has to have some drama, and it has to have some conflict. I think you're getting at a kind of central tension here: how do you do that and still honor what the evidence and the data actually show?
Banks: Yes. And if I might discuss a moral parable involving Gregor Mendel, I think that might frame the issue for us. Gregor Mendel was the Augustinian monk who discovered the laws by which inheritance works. He did famous experiments on pea plants: he looked at the color of the peas, he looked at wrinkled pods, he looked at a number of traits that were conveyed by dominant and recessive genes. And in the scientific paper that he published on this, he reported the results of many, many cross-breeding experiments, and in every single one his results were too good to be true. Under his theory you would expect 25% of the plants to have green peas and 75% of the plants to have yellow peas. And so he'd cross 100 pairs, look at the offspring, and report that he had 76 yellow plants and 24 green plants. And statisticians realize that that's too good to be true. You would expect to get 80-20 some of the time. You'd never land as close to the called shot, experiment after experiment, as Gregor claimed. So Gregor Mendel was in a bind. At that time there was no such thing as a goodness-of-fit test, so basically he had to fudge the numbers in order to come up with outcomes that were so compelling that the scientific community would listen to him and respect his ideas. He had to lie in order to tell the truth. But he was a man of God, and God punished him by ensuring that his paper languished unread for decades after his death, until the particulate theory of inheritance was independently rediscovered by three other people.
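Banks's "too good to be true" point can be made precise with exactly the goodness-of-fit test that did not yet exist in Mendel's day. A toy version with the hypothetical 76/24 counts from the story (Fisher's famous 1936 reanalysis pooled many experiments; this checks just one):

```python
from scipy.stats import chisquare, chi2

# One hypothetical experiment: 100 offspring, 3:1 yellow:green expected.
observed = [76, 24]
expected = [75, 25]
stat, _ = chisquare(observed, f_exp=expected)
print(f"chi-square statistic = {stat:.4f}")  # ~0.053: nearly perfect fit

# Turn the suspicion around: how often would chance alone land this
# close, or closer, to the 3:1 ratio? With 1 degree of freedom:
print(f"P(fit this good or better) = {chi2.cdf(stat, df=1):.3f}")  # ~0.18
```

A single experiment landing this close is unremarkable; every experiment landing this close is what statisticians find too good to be true.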
Campbell : That's a great story.
Banks: But the implication, of course, is that news media may have to exaggerate things in order to shine the spotlight of attention on an issue that needs it.
Bailer: So David, how are we going to prepare statisticians to work with journalism, given this moral imperative that you've given us?
Banks: I'm not sure. I think it has to be a partnership, and I think we both have to be forgiving and understanding of the different roles in which we work. A journalistic article is not a peer-reviewed publication; it serves other purposes.
Bailer: Great point. You mentioned the idea of partnership. We certainly have an example of that with the three of us sitting around the table here in Oxford. But what about the idea of doing this for the next generation of students, who will be entering the field both as journalists and as statisticians? Do you have any thoughts about trying to nurture that collaboration?
Banks: Well, I hope that the American Statistical Association might be able to broker some of those partnerships and create a platform under which statisticians and journalists can meet and cooperate. For many years, the American Statistical Association was reluctant to weigh in on matters of public policy, for fear that it would be perceived as partisan. Almost any sort of public policy debate has multiple sides, and if statisticians weighed in and said, "Yes, there is evidence that smoking causes lung cancer," then that might be perceived as being in some sense biased, and for decades we didn't want to do that. Instead we left that type of evidence-based policy to statisticians within the federal statistical agencies.

So you had some amazing people in those federal statistical agencies. Janet Norwood, you know, would speak truth to power no matter what. Monroe Sirken was a rock. Tom Jabine was amazing. Each of these statisticians led important official statistics agencies, and they were uncompromising with their data. That causes problems: uncompromising statisticians are not appreciated in the corridors of power.

And so over time things began to evolve. I would say that between the detonation of the atomic bomb at the end of World War II and the landing of a man on the Moon, scientists in general had a lot of cachet in Washington, and they had seats at the tables of power. John von Neumann was a member of the Atomic Energy Commission. John Tukey advised six American Presidents. But when a guy in a lab coat shows up at a Senator's office and says, "Sir, you must reverse your position on this, the numbers say so," that is not a popular position. And so, to continue the parable, politicians learned various ways to avoid that type of statistical mandate. For example, if a federal statistician produces a study that shows that leaded gasoline is problematic, it takes a lot of political courage, as Jimmy Carter showed, to issue rules that would remove lead from gasoline, and to do so at a time of an oil shortage crisis, because it raised the cost of gasoline over and above the already high level. Following on from that, other politicians weren't so brave. If such a report arrived, they would raise ten questions about the validity of that report. Each of those ten questions might be trivial, irrelevant, or easily answered, but the fact that there are ten of them means that the politician can call for a new report, and that kicks the issue down the road for a couple of years.

Similarly, many of the heads of federal statistical agencies used to be career bureaucrats who rose to the position through many decades of service. Now, most of the heads of statistical agencies are Presidential appointees. And in fact, it is usually the custom now that a statistician in a federal agency is not supposed to interpret the data: they are supposed to analyze the data and present their findings to the political appointee in charge of the agency, who then interprets the findings for Congress and the Executive Branch. That puts a lot of barriers in the way of having meaningful statistical impact on public policy. And for that reason, I think that the American Statistical Association has become braver and bolder in terms of being a voice for statisticians doing analyses that are policy relevant. The establishment of this youngest ASA journal, Statistics and Public Policy, is, I hope, one vehicle for that sort of empowerment.
Bailer: David, that's a great point, and I think it's exciting work. Thanks for your leadership in that and for spending the time with us today.
Pennington: That's all the time we have for this episode of Stats and Stories. Stats and Stories is a partnership between Miami University's Departments of Statistics and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter or iTunes. If you'd like to share your thoughts on the program, email statsandstories@miamioh.edu, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.