Katie Harron is an Associate Professor in quantitative methods at the UCL Great Ormond Street Institute of Child Health and the recipient of the 2021 Wood Medal for "her outstanding methodological work on record linkage." Her research focuses on the development of statistical methods and synthetic data for data linkage, and particularly on evaluating the quality of linkage. She aims to develop methods to exploit the rich data that are collected about populations as we interact with services throughout our lives. Her work facilitates the wider use of these population-based administrative and electronic data sources for epidemiological research, to support clinical trials, and to inform policy. Harron's applied research focuses on maximizing the use of existing data sources to improve services for vulnerable mothers and families. Her current research links data from health, education, and social care at a national level, in order to improve our understanding of the health of individuals from birth to young adulthood.
Episode Description
Our lives are framed by numbers tracking our performance in school, our financial health, and our physical and emotional wellbeing. While this information can help us figure out what we might do to improve a situation, it’s only part of the statistical story. There’s other information, other data, that might be useful as well. The importance of linking data is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics with guest Katie Harron.
+Full Transcript
Rosemary Pennington
Our lives are framed by numbers tracking our performance in school, our financial health, and our physical and emotional wellbeing. While this data can help us figure out what we might do to improve a situation, it's only part of the statistical story. There's other information, other data, that might be useful as well. The importance of linking data is the focus of this episode of Stats and Stories, where we explore the statistics behind the stories and the stories behind the statistics. I'm Rosemary Pennington. Stats and Stories is a production of Miami University's Departments of Statistics and Media, Journalism and Film, as well as the American Statistical Association. Joining me are our regular panelists, John Bailer, chair of Miami's Statistics Department, and Richard Campbell, Professor Emeritus in Media, Journalism and Film. Our guest today is Katie Harron. Harron is an associate professor in quantitative methods at the UCL Great Ormond Street Institute of Child Health. Her methodological research focuses on the development of statistical methods and synthetic data for data linkage, and particularly on evaluating the quality of linkage. She aims to develop methods to exploit the rich data that are collected about populations as we interact with services throughout our lives. Harron's current research links data from health, education and social care at a national level, in order to improve our understanding of the health of individuals from birth to young adulthood, and her work in this area was recognized by the Royal Statistical Society, which awarded her the 2021 Wood Medal. Katie, thank you so much for being here today, and congratulations on this honor.
Katie Harron
Thank you. Thanks for inviting me.
Rosemary Pennington
Could you explain to us what data linkage is?
Katie Harron
Data linkage is about bringing together different pieces of information about the same individual that may be captured in different data sources. When we bring together data from different data sources, we start to build a picture of someone's life. It's like bringing together different pieces of a jigsaw to try and create a bigger picture, and every additional piece of information that we can gather from different places helps. So what we often do is link data from different services, from government departments, for example the National Health Service in the UK or the Department for Education, to understand how different parts of people's lives fit together, and hopefully, ultimately, that enables people in government to make better-informed decisions.
John Bailer
So could you flesh this out a little bit? I like the jigsaw puzzle image, and all of a sudden I found myself thinking about the number of times I've had missing pieces. Certainly some of the bias comments that you make in your work probably address that issue, or maybe some of those pieces that might be missing. But I'm just asking for a specific example that you think would really help flesh out the use of data linkage.
Katie Harron
Yeah, okay. So when we interact with services, so you go to the doctor, or you go to school, or start a new job, a record is created that helps those organizations fulfill their roles. But it turns out that those data sources are incredibly valuable for research, because they capture so much of the population. So, for example, linking data from hospitals for babies who were born too early, or preterm, with education data that's captured later on in childhood can help us to understand the needs of preterm babies, and how we might be able to best support them during the early years in order to help prepare them for school. This example is quite close to my heart. My daughter was born at 28 weeks, so almost three months before my due date. And obviously at the start I was terrified about the immediate future, but as the immediate danger passed, I was more uncertain about the long term. So linking together data across the life course, from cradle to grave as we talk about it, is incredibly informative and valuable. And I think bringing together information from lots of different places helps us to build that jigsaw picture.
Richard Campbell
You talk in your research about missed links. Give us an example of a missed link.
Katie Harron
Yeah, so okay. Ideally, we would have a unique identifier that would capture the same individual in different datasets. That's very rarely available. In some countries, in the Nordic countries, for example, they've got a long history of population registries and personal identification numbers, so it's a little bit more straightforward, but definitely not in the UK. We don't have the National Health Service number in our education data, and it's not on employment records or tax records. So we use a set of partial identifiers, like name, date of birth and address, to try and find records that belong to the same people. But because those identifiers might not be completely unique to a person, or they might contain messy, missing or incorrect information, it means that we don't always get the linkage exactly right. That's particularly the case for the types of data that I'm talking about, routinely collected data that's collected not for research purposes, so it's not necessarily completely accurate. And when you're linking data from different time periods as well, things change: women change their surnames, people change their addresses, they move around, month and day of birth might get transposed, those kinds of things. So there are different approaches for finding the best links. A straightforward method is deterministic linkage, or rule-based linkage, where you come up with a set of rules to decide whether these records are linked, whether they belong to the same person or not. Or there's a probabilistic approach, which incorporates probabilities and takes into account the fact that not all identifiers in the records might agree; the probabilities reflect the likelihood that records really do belong to the same people. So when we have a missed link, that means that we're unable to find the correct record for somebody. And when we have a false link, that means that we have erroneously linked together records that belong to two different people.
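The contrast Harron draws between rule-based and probabilistic linkage can be sketched in a few lines of code. This is only an illustration: the identifier fields, the m- and u-probabilities, the example records, and the acceptance threshold below are all invented for the sketch, not taken from her work.

```python
import math

# Illustrative m- and u-probabilities (assumed values, not from the episode):
# m = P(identifier agrees | records truly belong to the same person)
# u = P(identifier agrees | records belong to different people)
WEIGHTS = {
    "name":     {"m": 0.95, "u": 0.01},
    "dob":      {"m": 0.97, "u": 0.002},
    "postcode": {"m": 0.80, "u": 0.05},
}

def deterministic_link(a, b):
    """Rule-based linkage: accept only if every identifier agrees exactly."""
    return all(a[f] == b[f] for f in WEIGHTS)

def probabilistic_score(a, b):
    """Fellegi-Sunter-style match weight: sum of log-likelihood ratios.
    Agreement on a field adds log2(m/u); disagreement adds log2((1-m)/(1-u))."""
    score = 0.0
    for field, p in WEIGHTS.items():
        if a[field] == b[field]:
            score += math.log2(p["m"] / p["u"])
        else:
            score += math.log2((1 - p["m"]) / (1 - p["u"]))
    return score

rec1 = {"name": "smith", "dob": "1990-03-07", "postcode": "N1 9GU"}
rec2 = {"name": "smith", "dob": "1990-03-07", "postcode": "E8 2LX"}  # moved house

# The deterministic rule rejects the pair because one identifier disagrees;
# the probabilistic score can still accept it above a chosen threshold.
print(deterministic_link(rec1, rec2))        # False
print(probabilistic_score(rec1, rec2) > 10)  # True with these weights
```

Raising or lowering the acceptance threshold is exactly the missed-link versus false-link trade-off discussed next: a low threshold catches more true matches but admits more false ones.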
And those different types of errors have their own implications and can cause different challenges for research. So sometimes you might want to prioritize minimizing missed links, and sometimes you might want to prioritize minimizing false links, and the two are usually traded off against each other. A very simple example: a study that I worked on was looking at infection rates for children in pediatric intensive care. We have a very good registry dataset that captures information on children admitted to pediatric intensive care, but it doesn't accurately capture infection status; that's recorded separately in a laboratory dataset. So we try to link these two datasets together. If we have missed links, we fail to identify that somebody in pediatric intensive care had an infection, and we would underestimate the infection rate. And if we have false links, we link infection records to a child that really didn't have the infection, and that can cause us to overestimate infection rates. And if you take that a step further, if we're thinking about trends over time, you might want to prioritize errors that cancel each other out, so that overall you might have the same number of missed matches as false matches, and overall you get the correct infection rate. But it's a little bit more nuanced than that, because if you have different data quality over time, for example identifiers improving in more recent data collection, then you could have more accurate linkage at the end of the study period and less accurate linkage at the start, which might make it seem like infection rates are increasing, when really it's just an artifact of the data quality.
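The under- and over-estimation Harron describes comes down to simple arithmetic, which a short sketch makes concrete. The admission and infection counts below are invented for illustration; they are not figures from the study.

```python
# Illustrative numbers (not from the study): 1,000 PICU admissions,
# 100 of whom truly had a laboratory-confirmed infection.
admissions = 1000
true_infections = 100

def estimated_rate(missed_links, false_links):
    """Missed links hide true infections; false links attach someone
    else's infection record to a child who didn't have one."""
    observed = true_infections - missed_links + false_links
    return observed / admissions

print(estimated_rate(0, 0))    # 0.10  true rate
print(estimated_rate(20, 0))   # 0.08  missed links -> underestimate
print(estimated_rate(0, 20))   # 0.12  false links -> overestimate
print(estimated_rate(20, 20))  # 0.10  errors cancel out overall
```

The last line shows why equal error rates can give the right overall answer, and also why that breaks down for trends: if missed links are concentrated early in the study period and shrink over time, the estimated rate drifts upward even when the true rate is flat.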
John Bailer
So when you were introducing that topic, you were saying that sometimes you may be willing to make one error more than another. Can you give an example of when you might prefer to minimize missed links, versus another scenario when you might want to minimize false links?
Katie Harron
So an example would be if you were trying to do some sort of fraud detection. You would want to capture as many candidate links as possible; you could then follow them up at a later date to try and discard the ones that you're not interested in. But you want to start out by casting the net really wide and getting as many candidate records as possible. So that would be a case of maximizing the sensitivity of the linkage. If you want to think of an example where specificity is more important, or minimizing the false matches, you can think about something that's used for operational purposes. So if we are administering drugs to somebody, we want to be absolutely certain that we've got the right records for that person, in case of any contraindications.
John Bailer
So how hard is it to do this? As you were talking about the rule-based approach and the probabilistic approaches, I found myself wondering: would you be delighted if you get half of them on the first pass, and then you have to struggle with the other half? What's a good target? It seems like you'd have to process this in waves.
Katie Harron
Yeah, it really depends. It really depends on the question that you're asking from the data, on the research question, and the sorts of data that you're using, and how important it is to capture the whole population or not. So for example, some colleagues of mine at UCL were linking data for cohorts of homeless people, and the data was so difficult to obtain, and it was so messy, and, you know, people move around so much; addresses, of course, are really difficult for that population. I think they got around 60% linkage, they matched 60% of their records to hospital records, and that was considered to be quite good, really, for that population. But in other datasets you would expect much higher linkage rates. In a study that I'm involved in that's linking hospital data with education data for all children in England, the ECHILD study, we've linked 99% of records for schoolchildren to their hospital records. So that's much better. But the important thing is who you missed in that 1%, or that 40%. If those errors, those missed links, are random, then that's a missing data problem: you obviously have a smaller sample size, which has implications for power, although that's sometimes less of a problem in the sorts of administrative data that we're talking about, because the sample sizes are so large and they cover so much of the population. But if the errors are non-random, which is actually what we often find, it's usually the more vulnerable groups, the ones that we're particularly interested in, who are more highly mobile and change addresses more frequently. And if the errors are more likely to happen for those groups, then you have the potential to introduce selection bias in your analysis, which is really problematic. And we do see that; it's all about the underlying data quality.
So whether it's healthy men who don't go to see the doctor very frequently, there might be differences in how their data are captured compared with women; or women are more likely to change their names, so they might be less likely to link. But we also see huge differences according to ethnicity: ethnic groups with more complicated name structures, or less familiar sorts of names, might be more prone to typographical errors when someone's recording those names. And we often see differential linkage rates according to sex, and also according to social status, health status, and those sorts of things.
Rosemary Pennington
You're listening to Stats and Stories, and today we're talking with Katie Harron, the RSS's 2021 Wood Medal awardee, about data linkage. Katie, as you were talking about this, you mentioned the issue of bias and selection bias. This sounds super complicated, but it also makes so much sense to try to get this very holistic, broad view of how someone is doing, since we've been talking about health a lot. But I wonder, for researchers who want to do this kind of linking, what should they keep in mind to try to mitigate bias that might creep into their work, whether it's selection bias or other kinds of biases?
Katie Harron
Yeah, that's a really good question. I think there are two elements to this. The first is optimizing the linkage strategy and getting the linkage as accurate as possible, which is to some extent constrained by the underlying data quality, and also by time and resource; you can spend forever trying to tweak the algorithm to make sure that you're capturing as many people as possible. So there's a lot of optimization and design that's important for linkage algorithms, especially considering what you know about the data, for example if you know that certain groups are less likely to link. Then, once you've accepted that you're unlikely to ever get perfect linkage, especially with messy administrative data where there's always some human error involved, it's about understanding where the errors are, trying to describe them, trying to account for them in analysis, and understanding what the implications might be in terms of bias. And there's lots of things we can do. Very simply, we can compare the characteristics of the records that we have linked with the records that we haven't linked. That, again, is analogous to the missing data problem: you want to know whether the people who dropped out of your study are like the ones who stayed in it, and it's very similar for linkage problems. You can try to estimate the linkage error rates, the rates of missed matches and false matches, using different approaches. So you might have a subset of the data where you really are sure that you have accurate linkage; it might be a certain group of records where you do have a unique identifier completed. Based on that, you can estimate the linkage quality and then apply it to the rest of the dataset. So you use a gold standard, or a reference dataset, basically, to estimate error rates, and try to work out how the errors are distributed amongst the groups or subgroups that you're interested in.
You can also compare with external data sources. If we're thinking about linking to mortality records, do the mortality rates that we end up with in the linked data make sense compared with what we would expect for the population? And we can think about positive and negative controls. If there is a group of records that you think definitely should have a match, how many of those do you find? And for a group of records that you definitely think shouldn't match, for example linking men's health records to hospital records for a birth, something that definitely shouldn't happen in the data, how many of those do you come up with? That could be indicative of false matches. So there's lots of things that you can do to try and describe and understand the error rates. And then in the analysis, it's about reflecting the uncertainty that you have in the data, so not being overly certain about the inferences that you make, and properly accounting for the uncertainty within your analysis. You can think about multiple imputation approaches, like you would with missing data problems again, or quantitative bias analysis.
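The first diagnostic Harron mentions, comparing linked with unlinked records, is easy to sketch. The records, field names, and group labels below are hypothetical, made up purely to show the shape of the check.

```python
from collections import Counter

# Hypothetical cohort records, flagged with whether each one
# linked successfully to hospital data.
records = [
    {"linked": True,  "deprivation": "low"},
    {"linked": True,  "deprivation": "low"},
    {"linked": True,  "deprivation": "high"},
    {"linked": False, "deprivation": "high"},
    {"linked": False, "deprivation": "high"},
]

def linkage_rate_by_group(records, field):
    """Linkage rate within each subgroup of `field`. A gap between
    groups warns that linkage error is non-random, i.e. a potential
    source of selection bias in any analysis of the linked data."""
    totals, linked = Counter(), Counter()
    for r in records:
        totals[r[field]] += 1
        linked[r[field]] += r["linked"]  # True counts as 1
    return {group: linked[group] / totals[group] for group in totals}

print(linkage_rate_by_group(records, "deprivation"))
# with these toy records: 'low' links at 100%, 'high' at about 33%
```

In practice the same comparison would be run over many characteristics at once (age, sex, ethnicity, health status), exactly as one would profile dropouts against completers in a cohort study.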
Richard Campbell
So I'm going to put myself in the position of being a journalist who's just read your study on the associations between pre-pregnancy psychosocial risk factors and infant outcomes. I'm reading this, and I'm trying to make sense of it, and of course I get to the numbers and I don't understand them, because I'm not a statistician. So what I have to do as a journalist is translate this for the general public. What I want to know from you is, what do I need to tell them about this? I understand which groups are most at risk, which is really interesting; this is a really important study, and I want to get it right. So how do I talk about the women who are more at risk, for the reasons that you list: mental health issues, being in an environment where there's probably a lot of abuse going on, living in high-deprivation areas? How do I report how much more at risk these women are? How do I say that, when I don't understand the numbers that you've used? Now, if I'm really good, I'm going to come and talk to you about this, right? Because I want to get this right.
Katie Harron
Yeah. The study that I did looked at outcomes for different groups of mothers. We looked at teenage mothers, mothers living in more deprived areas, and mothers with a history of mental health conditions or admissions for self-harm, violence or substance misuse. And as we expected, we saw that teenage mothers had some of the worst outcomes. The outcomes that we looked at were preterm birth, birth weight, and infant mortality. But we also found that for previous teenage mothers, so irrespective of current age, if a mother had given birth for the first time as a teenager, they also had children with worse outcomes; and for those with mental health conditions or a history of adversity-related admissions, we see that these groups also stand to benefit from additional support, irrespective of their age. Lots of the interventions that we have developed, or let me put this another way, the primary focus of interventions for additional support during and post pregnancy in the UK is teenage mothers. But what we see in the data, by looking at maternal history, is that it's not as simple as just age; we can observe other groups who are equally as likely as teenage mothers to have low birth weight babies, for example. And actually, what we found is that the biggest group in terms of numbers, although their risk was not necessarily as high as some of the other mothers', is mothers living in the most deprived areas. So in terms of population attributable risk, if we were going to intervene, then tackling the underlying problems of poverty could improve outcomes for the largest numbers of families.
John Bailer
I enjoyed this work; I thought it was really interesting. And I saw that it had that nice connection to where we started the program, which is the linkage: you had this component of linkage of deliveries and live births within the study. But then in the recommendations, I thought the idea of proportionate universalism was really interesting. That's something I had never seen or heard about. Could you talk just a little bit about what that means? It's tied to what you were just mentioning: how do you intervene upstream in terms of trying to improve outcomes?
Katie Harron
Yeah. So proportionate universalism is something that underpins the home visiting, or health visiting, that happens in the UK and in many other countries' preventative programs. The idea is that you have a universal approach, where everybody has the opportunity to be contacted and to be reached and supported, but the intensity or the level of that contact varies according to need. So the most vulnerable groups would have more contacts, more visits from a health visitor after birth. The idea is that the universal approach means that everybody has an opportunity to be recognized as being in need of some support, but that you focus your attention on the groups that stand to benefit the most, according to the outcomes expected for whichever group you fall into.
John Bailer
So it seems like there's a really interesting follow-up here. I'm picturing talking to you in 2023 or 2025, after some of the suggestions from this work start to be implemented. If you were going to think about a future study, to say, okay, based on what we've done here, here are things that might be done, and if someone were to design an intervention, how would you think about following up to assess whether or not your insights here would bear fruit?
Katie Harron
Yeah, that's a really good question. I think the intervention development part is the thing that's really needed, and that needs to be focused on the types of mothers and families who need the most support. For interventions to work, they need to work for the groups of people that we're talking about, and they're likely to look very different according to whether we're looking at teenage mothers, or older mothers, or mothers with a history of drug misuse, for example. I think the really exciting thing about data linkage is that once we have implemented these interventions, we've got a way of evaluating them across the whole population, who gets them and who doesn't. We know that the gold standard for evidence is randomized controlled trials, but they are incredibly difficult and expensive to run, especially for something like health visiting that's so well established and is universal already; it's very difficult to have a control group in that setting. But we can use these population-based approaches, using linkage across health, education and social care, to really see the benefits, or otherwise, of particular interventions. So much of the hard work is done at the start, in intervention development. And then one of the advantages of using population-based linkage studies is that we capture such a high proportion of the overall population, including those hard-to-reach groups who might fall through the cracks in terms of traditional research studies.
Richard Campbell
Do you get to play any role in the intervention? I mean, how much influence do you have? You do the study, you do the work, we know what needs to be done. What do you feel is your responsibility for that next step? We ask this of journalists all the time: they go out, they report stuff, and then they walk away and do another story, and what they've reported is very revealing, and something should be done about it. Are you in that situation as somebody doing this kind of work?
Katie Harron
Yeah, I think what's really important for this sort of work is having good engagement with stakeholders, so with government departments, the Department of Health and Social Care in the UK, to understand their priorities, but also to inform their work and their focus and where the funding is going to be. That's always a really important part of our work: trying to translate what we see in the data into something that's going to be meaningful, to help governments make decisions about where resources should be focused. So I think that ongoing engagement is incredibly important. And it's not just with organizations and government departments, but also with the public, because we're using data about the public, and we really need that sort of social contract so that we can continue to use these data for public benefit. So a lot of the work we do involves patient and public involvement. We try to get feedback on the work that we're doing, we try to understand the priorities of different groups to inform the studies that we do, and also to help us disseminate in ways that aren't just academic.
Rosemary Pennington
Well, that's all the time we have for this episode of stats and stories. Katie, thank you so much for being here today. Stats and Stories is a partnership between Miami University’s Departments of Statistics, and Media, Journalism and Film, and the American Statistical Association. You can follow us on Twitter, Apple podcasts, or other places you can find podcasts. If you’d like to share your thoughts on the program send your email to statsandstories@miamioh.edu or check us out at statsandstories.net, and be sure to listen for future editions of Stats and Stories, where we discuss the statistics behind the stories and the stories behind the statistics.