[Originally published on PL Perspectives, 11/4/2019]

Most PL research is concerned with the engineering of software. Programs, in the narrative driving this work, are human-built artifacts. The PL research challenge is to make sure that they can be engineered, with as little effort as possible, to be reliable and have high performance.

But programming languages can also be used for modeling natural phenomena. This is the central goal in the discipline of computational science, which develops tools for programmatically modeling real-world processes ranging from nuclear reactions to climate change to protein synthesis.

PL research has always played a critical role in computational science. It is impossible to simulate complex natural processes without high-performance computing (HPC), and advances in HPC are often, really, advances in compilers and runtime systems.

However, an intriguing new path for PL ideas to shape our understanding of the natural world has opened up recently. This path concerns not just the representation of the natural world as programs, but also the automatic, data-driven discovery of these programs through program synthesis.

Background: AI for Science

Scientific discovery is inherently data-driven: scientists pore over experimental data, come up with hypotheses, and then validate, reject, or refine these hypotheses using new experiments. Recent progress in machine learning has opened up the possibility of automating this process through data-driven algorithms that can propose new scientific hypotheses and guide the design of experiments.

Machine learning for science has already produced some impressive results. The Large Hadron Collider (LHC) uses neural networks to identify interesting events in large amounts of data produced by the accelerator. Large-scale climate simulations are now beginning to use neural networks to model processes that happen below the resolution of the simulation grid. Deep learning algorithms have recently been used to synthesize organic molecules such as drug compounds. The US National Science Foundation’s call for National AI Research Institutes identifies “AI for discovery in physics” and “AI for accelerating molecular synthesis” as core themes.

At the same time, deep networks (and, more generally, black-box machine learning methods) have not yet been successful at explaining natural phenomena or at designing scientific experiments. One reason for this is that science is about systematic comprehension of the world, something that an opaque deep network cannot directly provide. Also, science requires reasoning about chains of cause and effect, and it expects learned models to be consistent with existing scientific knowledge. Deep networks tend not to do well on either count. Finally, the process of deep learning is often unreliable: it either requires unreasonable amounts of data or exhibits high variance across different trials, and this behavior runs counter to the objective of a robust process of scientific discovery.

Program Synthesis for Science

A PL-based alternative to data-driven, black-box learning is to design a high-level programming language that can describe models of scientific processes (for example, a series of experiments that must be performed to achieve a certain goal) and of the world. The task of data-driven discovery can then be formulated as program synthesis: the generation of a program that optimally fits experimental data. One advantage of such an approach is that, unlike a traditional statistical model, a symbolic language can express constraints that capture known facts about the world, and one can assert such constraints during synthesis. Also, a high-level programming language can use abstract, compositional primitives to describe complex time-dependent relationships and natural hierarchical structures in the world.
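
To make this concrete, here is a minimal sketch (my own illustration; the grammar and the example model are invented, not taken from any particular system) of what a tiny modeling language might look like: a candidate model is built from compositional primitives, and a known fact is asserted as a check on candidates rather than learned from data.

```python
# A toy modeling language: expressions are nested tuples, and evaluate()
# is the interpreter. All names and the example model are made up.
def evaluate(expr, env):
    """Interpret a model expression on a given input assignment."""
    op, *args = expr
    if op == "var":
        return env[args[0]]
    if op == "const":
        return args[0]
    if op == "add":
        return evaluate(args[0], env) + evaluate(args[1], env)
    if op == "mul":
        return evaluate(args[0], env) * evaluate(args[1], env)
    raise ValueError(op)

# Candidate model: rate = 3 * x + 2, built compositionally from primitives.
model = ("add", ("mul", ("const", 3.0), ("var", "x")), ("const", 2.0))

# A known fact ("rate is non-negative on the inputs we care about"),
# asserted as a constraint on candidates rather than inferred from data.
assert all(evaluate(model, {"x": x}) >= 0 for x in [0.0, 1.0, 2.0])
print(evaluate(model, {"x": 2.0}))   # prints 8.0
```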

Program synthesis under such a formulation can leverage the substantial recent progress on the problem in the PL/formal methods community. For example, a search for programs can exploit the compositional structure of programmatic models and use SMT solvers to search for program parameters.
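
As a hedged illustration of the second point, the sketch below (not drawn from any of the papers discussed here; the model template, data, and "known fact" constraint are all invented, and it reuses the toy rate = a*x + b model from above) uses the Z3 SMT solver's Python bindings to search for parameters of a fixed template that match a handful of observations:

```python
# A toy sketch of SMT-based parameter search; assumes the z3-solver package.
from z3 import Real, Solver, sat

a, b = Real("a"), Real("b")            # unknown parameters of rate(x) = a*x + b
solver = Solver()
solver.add(a >= 0)                     # known fact, asserted as a hard constraint

# Experimental observations (input, output), also treated as hard constraints.
for x, y in [(0, 2), (1, 5), (2, 8)]:
    solver.add(a * x + b == y)

if solver.check() == sat:
    m = solver.model()
    print("a =", m[a], ", b =", m[b])  # a = 3, b = 2
else:
    print("No parameter values fit the data; revise the template.")
```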

Program Synthesis in Executable Biology

The most prominent existing application of program synthesis in the natural sciences is in executable biology. Here, one models cellular behavior using stateful, concurrent programs that represent biological entities like proteins and genes. Each “component”, or process, of such a program interacts with neighboring processes and changes state when certain events take place (e.g., when a molecular signal is received). Because of the inherent complexity of concurrency, even a small number of components can collectively describe highly nontrivial system behaviors.
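
As a rough sketch of the flavor of such models (my own simplification; the component and its transition rules are hypothetical and not taken from an actual biological model), a component can be viewed as a small state machine that updates its state when it receives a molecular signal:

```python
# A toy, hypothetical component in the "executable biology" style.
class Component:
    """A biological entity modeled as a stateful, reactive process."""
    def __init__(self, name, state, rules):
        self.name = name
        self.state = state
        self.rules = rules              # (current_state, signal) -> next_state

    def receive(self, signal):
        """Update state when a signal arrives, if a transition rule applies."""
        self.state = self.rules.get((self.state, signal), self.state)

# A hypothetical receptor that activates on ligand binding and is later
# deactivated by degradation; real models compose many such components.
receptor = Component(
    "receptor", "inactive",
    {("inactive", "ligand_bound"): "active",
     ("active", "degraded"): "inactive"},
)
receptor.receive("ligand_bound")
print(receptor.state)                   # prints "active"
```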

Program synthesis allows the automatic extraction of such programmatic models from prior knowledge and data. A 2014 position paper by Fisher, Piterman, and Bodik (building on a POPL 2013 paper by Koksal et al.) lays out a concrete approach of this sort. Here, the process of modeling a biological system follows three steps:

  • Choose a level of abstraction at which the biological system is to be modeled.
  • Construct a partial programmatic model that describes known biological entities and known facts about their states and interactions at the chosen level of abstraction. In particular, some of these facts come from scientific experiments, which tell us what the model should output on a given set of inputs.
  • Use a program synthesis algorithm to automatically complete the partial model, enumerate alternative models, and search for new experiments to conduct (a minimal sketch of this last step appears after this list).
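
The sketch below is a toy, hedged illustration of that last step (the model space, states, signals, and experiments are all invented and far smaller than in the real C. elegans models): it treats a component's unknown transition rules as holes, enumerates all completions, and keeps the ones consistent with the experimental observations.

```python
# Toy enumerative synthesis: complete a partial state-machine model so
# that it agrees with experimental data. All specifics are made up.
from itertools import product

STATES = ["inactive", "active"]
SIGNALS = ["ligand_bound", "degraded"]

def run(rules, state, signals):
    """Execute a candidate model: apply each signal's transition rule in order."""
    for sig in signals:
        state = rules[(state, sig)]
    return state

# Each experiment: (initial state, signals applied, observed final state).
experiments = [
    ("inactive", ["ligand_bound"], "active"),
    ("active", ["degraded"], "inactive"),
]

# The "holes" in the partial model: the next state for every (state, signal) pair.
keys = list(product(STATES, SIGNALS))
consistent = []
for choice in product(STATES, repeat=len(keys)):
    candidate = dict(zip(keys, choice))
    if all(run(candidate, s0, sigs) == final for s0, sigs, final in experiments):
        consistent.append(candidate)

print(len(consistent), "candidate models are consistent with the experiments")
```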

The authors instantiate these ideas in the task of modeling a set of cells in the C. elegans worm — see the papers for more details. Although the papers developed these ideas in the context of systems biology, the ideas conceivably apply to other natural science settings as well.

The Road Ahead

Fisher et al.’s work happened before the current wave of research on machine learning for science. A challenge for such symbolic approaches, compared with machine learning, is that they do not offer a first-class treatment of uncertainty in experimental data; instead, they use data to define hard constraints. Also, automatic symbolic reasoning has high computational complexity, making purely symbolic approaches hard to scale to the massive datasets that increasingly drive many areas of science. At the same time, PL methods make it easy for an AI assistant to work with human-specified scientific knowledge and to reason at a high, symbolic level, so their advantages are complementary to those of statistical learning. It is natural to wonder about ways to combine the best of the two worlds.

The area of probabilistic programming, which develops high-level programming languages that allow the definition of complex probability distributions and support probabilistic inference as a first-class operation, offers one path towards such a combination. Differentiable programming, the study of languages whose programs can be trained using gradient descent, offers another. Differentiable programming platforms such as PyTorch are increasingly ubiquitous, and a number of mature probabilistic programming platforms exist as well. However, much more needs to be done on probabilistic/differentiable languages (or languages that generalize these paradigms) that make use of advanced PL abstractions and sophisticated symbolic reasoning techniques. Also, to discover models of the natural world, one needs to go beyond estimating the parameters of a given program and develop methods for synthesizing programs from data.
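
To illustrate that last distinction, here is a minimal, hedged sketch (the data and model are made up) of the differentiable-programming side: PyTorch's autograd can estimate the parameters of a fixed symbolic model by gradient descent, but discovering the model's structure itself is the harder synthesis problem.

```python
# A toy sketch of differentiable parameter estimation; assumes PyTorch.
import torch

xs = torch.tensor([0.0, 1.0, 2.0, 3.0])
ys = torch.tensor([2.0, 5.1, 7.9, 11.0])        # noisy observations

# The program's *structure* (a * x + b) is fixed; only its parameters are learned.
a = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([a, b], lr=0.02)

for _ in range(2000):
    opt.zero_grad()
    loss = ((a * xs + b - ys) ** 2).mean()      # mean squared error on the data
    loss.backward()
    opt.step()

print(a.item(), b.item())                       # roughly a ≈ 3, b ≈ 2
```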

Solving this synthesis problem will likely require fundamental innovations at the intersection of machine learning and PL. In the year 2019, it is hard to imagine an attempt at large-scale data-driven discovery that doesn’t use deep neural networks. The PL/formal methods challenge will be to develop synthesis algorithms that generalize deep learning with symbolic modules, for example, by using neural modules for lower-level pattern recognition, and abstraction-guided search and automated theorem proving for higher-level reasoning. There is an emerging literature on such “neurosymbolic” methods [Ellis et al., 2019; Mao et al., 2019; Valkov et al., 2018; Verma et al., 2019], but the area is in its infancy and much more needs to be done.

I believe that progress on this synthesis problem can have tremendous implications for our understanding of our world. Modern programming languages offer a far richer way of describing the world than traditional calculus-based notations. Methods for algorithmically discovering programs that describe the world can potentially enable massive leaps in science, just as manual methods for solving differential equations did in a previous century. I hope more of you take on this challenge!