ASK THE EXPERT: AI AND MACHINE LEARNING IN BIOLOGY
October 14, 2020
PercayAI team, Dr. Dan Kuster
Today, we’re doing a question and answer session with Dr. Dan Kuster, the CEO and founder of Cambrio, a digital R&D lab focused on life sciences. Dan holds a PhD in biomedical engineering from Washington University in St. Louis, and is now based in the Boston, MA area.
Q: Dan, could you introduce us to the work you do?
Cambrio helps teams build machine learning features into software that automates workflows currently performed by human experts. Sometimes that means rapid prototyping; other times the problem justifies a more strategic approach. For scientific workflows, you’ve got some complicated data, maybe multiple sources of complicated data, and you’re trying to figure out how to centralize those data and use them to make predictions in a very contextually sensitive, domain-aware way, such that the scientist who’s going to use this actually embraces it and it helps them solve their problems. We’re a small team and we do this across a variety of domains, but we have a particular focus on pharma, biotech, and life sciences in general.
Q: For those who are unfamiliar, how can AI/machine learning accelerate the process of rapidly understanding all of the human data that we have?
“AI” is artificial intelligence, “ML” is machine learning. Artificial intelligence is a highly loaded word, kind of like magic. Once you start to understand it, it’s no longer magic, it’s just engineering. I actually think this is very true in AI and a lot of the things happening today where the underlying “magic” is computation – using computers to do computation more efficiently, more reproducibly, and at a far bigger scale than a human can do them. That’s actually the core mechanism we exploit.
When people say AI/ML, I’m actually thinking about how we use scientific computing to solve a problem, to solve some bottleneck that somebody is experiencing. Particularly in life sciences, we’re looking at bottlenecks where a domain expert, who really understands some area of biology, chemistry, or disease needs more leverage. Those people are rare, and we need to give them more leverage to make sense of these very contextualized, domain-specific, scientific data so that they can have a bigger impact. Scientific computation is a great way to do that.
Historically, if you want to bring scientific computation or AI or machine learning to bear on a scientific problem and solve some bottleneck, the process follows a pretty predictable pattern:
1. Find some dataset that you understand, work through it, make notes. Then find another dataset that you understand, work through it, etc. Do this half a dozen times, for a variety of cases (easy/hard, fast/slow, big/small, etc.).
2. Go back and look for common patterns. Where did we spend the most time? What steps are slow, frustrating, or too sequential? Where did our attention fan out to search for information? How important is it to centralize information?
3. Keeping the common patterns in mind, abstract and generalize the observations to frame very specific “automation hypotheses” that we might be able to automate with software.
4. Test it on the examples we already vetted above. How much of the workflow can be automated? At what points does the expert need to make a decision, and what input do they want to see? We get it working in backtest mode, on examples where we know the answers. Then we expand to include new, novel inputs where the answers are not known beforehand.
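As a rough illustration of the last step, here is a minimal Python sketch of running one “automation hypothesis” in backtest mode. Everything in it is an assumption made up for illustration and does not come from Dan’s actual work: the example notes, the dose-extraction task, and the names VETTED, DOSE_PATTERN, automate, and backtest are all hypothetical. The hypothesis is a simple regular expression, and any note it cannot resolve unambiguously is handed back to the expert rather than guessed.

```python
import re

# Hypothetical vetted examples: a free-text note plus the dose (in mg) that an
# expert recorded by hand. None means the expert found no single clear dose.
VETTED = [
    ("Patient received 50 mg of compound A daily.", 50.0),
    ("Dose: 2.5 mg twice a day", 2.5),
    ("Administered 100mg once", 100.0),
    ("No dose recorded for this visit", None),
    ("Titrated from 10 mg to 20 mg", None),
]

# The automation hypothesis: a dose is a number followed by "mg".
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mg\b", re.IGNORECASE)


def automate(note):
    """Return (prediction, needs_expert) for one note."""
    matches = DOSE_PATTERN.findall(note)
    if len(matches) == 1:
        return float(matches[0]), False
    # Zero or several candidate doses: defer to the expert instead of guessing.
    return None, True


def backtest(examples):
    """Count how many vetted examples were automated, correct, or deferred."""
    counts = {"automated": 0, "correct": 0, "deferred": 0}
    for note, expected in examples:
        prediction, needs_expert = automate(note)
        if needs_expert:
            counts["deferred"] += 1
            continue
        counts["automated"] += 1
        if prediction == expected:
            counts["correct"] += 1
    return counts


if __name__ == "__main__":
    print(backtest(VETTED))
```

On these five made-up notes, the rule handles three automatically and defers two to the expert. The deferred cases are the decision points mentioned in the step above: the places where the expert still needs to look at the input.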
This core working loop is pretty consistent, from general software (e.g., regular expressions) to “old school” scientific computing to modern AI and machine learning. What is changing today with AI/ML is the way we “abstract and generalize.” Where a mathematician might sit down with a pencil and piece of paper, or a computer scientist sits in front of a computer and thinks really hard...machine learning gives us extra leverage to use computational power and data to generate and test reasonable hypotheses. That turns out to be one of the common slow steps. So instead of sitting down and thinking really hard, we can engineer that process as well, using models that read directly from the data, find patterns, compose patterns of patterns directly from the data, and compare to what we observe.
Doing this manually or interactively at scale on large datasets ranges from tedious to impossible in a human lifetime. But using computation, we can automate learning loops like this to bootstrap our own (human) scientific thinking process. For example, an observation might be statistically obvious but easily overlooked by someone who is expert in a particular therapeutic area. Great, encode that pattern, test it, and ask the expert if it makes sense in this context. We can also find things that are perhaps negative examples or counterfactuals, then see how an observation doesn’t fit the pattern. These kinds of hypothesis-test loops can be especially important where testing is risky or expensive, as we often find in biology and medicine.
Q: You talked a bit about domain experts having more tools to help them achieve a larger impact. How key is domain knowledge when it comes to incorporating ML and AI in a biological context?
It’s exactly as important as your physician going to medical school versus not!
Why do you trust your physician when she recommends that you have a blood test? Or when contextualizing your blood test results, if she says, “Nothing to be worried about here, this is typical for you,” what makes that judgment trustworthy? Or when you get a scary diagnosis, where do you go looking for information and how do you assimilate what you find? You can keep pulling this thread all the way back to the beginning and it comes down to a scientist observing something in a dataset and thinking, “Huh, that’s unexpected, I need to understand how this observation happened in this context,” and that leads to more experiments to gather more data about the context around that experiment and the interpretation around it. So, having an understanding of “what is known” and being able to contextualize, interpret, and explain observations is domain expertise.
Experts are also able to project their observations into the space of what is known and make assertions to guide uncertain investigations; they can assimilate noisy information, and generalize their contextual observations out of that knowledge and make predictions about novel behaviors or observations in that domain: “In the context of X, if Y is true, then we should expect to observe...” When you know what is known and you see something else, you should be able to say, “Oh that fits the pattern of what is known, this makes sense to me; therefore, I’m going to predict the known outcome,” or, “Hey, I have this observation, that’s weird, that doesn’t exist in what is known, I need to understand the context better and test my assumptions.”
Maybe some parts are known, so you think through the analysis in parts: “I’m less confident about this, but I have a reasonable hypothesis about how this part happens in a different context, so let’s start with that observation and project it into this new context, and check our understanding with the following test.” Or it might be something that’s completely unknown. When that happens, a domain expert needs to have the confidence and scientific awareness to be able to say, “I know what is known and this isn’t in there.” One can make assertions about what is unknown when they understand the known space. But of course real-world problems are not so clean-cut. So many of these complex data analysis problems start with developing a properly calibrated estimate for what is known, and then we home in on the answer by methodically testing away the degrees of uncertainty.
Domain experts can have a big impact on life sciences today because the data and contexts are so complex. So there is a big gap – domain experts can contextualize, create, and interpret data that a non-expert cannot. In general, creating and testing against high-quality datasets is a really important part of executing an automation hypothesis. And specifically in the context of life sciences where there is a flood of complex, high-dimensional data from multi-omics, sensors, electronic records, and more, you still have to contextualize and interpret these results and build understanding of what is known versus what is unknown. So in life sciences there is really no question that domain expertise is required!
The more timely question to be asking is: “How do we amplify the power of a domain expert?” And our answer is to help them focus on the observations they can interpret but a non-expert cannot.
That’s all for the first part of our ask the ML expert series. A big thank you to Dan for joining us to share some of his expertise on the obstacles and opportunities AI and machine learning present in the biology domain.
Next, we’ll have Dan back to talk through the challenges and opportunities of using machine learning on different types of data, and what makes applying AI to drug discovery so different from other domains.