Curr Opin Sys Biol 17

Program synthesis meets deep learning for decoding regulatory networks

Jasmin Fisher 1,2 and Steven Woodhouse 2

Abstract
With ever-growing data sets spanning DNA sequencing all the way to single-cell transcriptomics, we are now facing the question of how we can turn this vast amount of information into knowledge. How do we integrate these large data sets into a coherent whole to help understand biological programs? The last few years have seen a growing interest in machine learning methods to analyse patterns in high-throughput data sets and an increasing interest in using program synthesis techniques to reconstruct and analyse executable models of gene regulatory networks. In this review, we discuss the synergies between the two methods and share our views on how they can be combined to reconstruct executable mechanistic programs directly from large-scale genomic data.

Addresses
1 Department of Biochemistry, University of Cambridge, Cambridge CB2 1QW, UK
2 Microsoft Research, Cambridge CB1 2FB, UK

Corresponding author: Fisher, Jasmin, Department of Biochemistry, University of Cambridge, Hopkins Building, Downing Site, Cambridge, CB2 1QW, UK. Fax: +44 1223 479 999. (jf416@cam.ac.uk)

Current Opinion in Systems Biology 2017, 4:64-70

This review comes from a themed issue on Big data acquisition and analysis (2017)
Edited by Pascal Falter-Braun and Michael A. Calderwood
For a complete overview see the Issue and the Editorial
Available online 20 July 2017
http://dx.doi.org/10.1016/j.coisb.2017.07.006
2452-3100/© 2017 Elsevier Ltd. All rights reserved.

Introduction
Uncovering and understanding the programs that underlie the behaviour of cells is one of the major challenges in biology today. Understanding these programs will help us determine the molecular mechanisms of disease, and ultimately impact translational research.
A central goal of executable biology [1] is the construction of executable mechanistic models of such cellular behaviour programs, and the development of computational techniques for automated analysis and inference of these models. In this review, we discuss two fields of computer science, machine learning and program synthesis, that are focused on learning predictive models from data and on the automated construction of computer programs from desired behaviours, respectively. These two fields can be seen as two sides of the same coin, particularly in the context of executable biology, where we want to learn from large, complex datasets, and where the artefact we ideally want to learn is a mechanistic model of the cell's behaviour, which is essentially a program.

Recent research has begun to blur the boundaries between these two fields, as machine learning researchers have begun to develop methods that can learn algorithmic patterns in data and as programming language researchers have begun to investigate the methods of deep learning for program synthesis. Here, we explain the differences between these two approaches, discuss the growing connections between the two fields, and give our projections for how they can be combined to extract comprehensive cell signalling programs from the tsunami of genomic data.

Machine learning for uncovering patterns in large data sets
Machine learning allows us to approach problems that have no clear solution as a traditional program authored in human-readable source code [2,3]. For example, how would you program a computer to recognise images of cats? There is no direct way to do this. Instead, one can use a machine learning algorithm to train a model on many images of cats, and have the algorithm learn the underlying patterns in the data in a way that generalises to new, previously unseen images.
Then, when presented with a new image, the trained model is able to correctly predict whether it contains a cat or not.

Machine learning can be used for classification problems (for example, object recognition: detect a face in an image), for regression (i.e., predict a continuous variable given some input), or for sample generation (i.e., generate new objects that are similar to previously seen objects). A machine learning algorithm takes training data as an input and uses it to estimate a function f. The algorithm typically does this by optimising a metric that measures how well f fits the training data. Machine learning is distinguished from a regular optimisation problem by the requirement for generalisation to new, previously unseen data (in order not to overfit the training data). To test how well the trained model generalises, we evaluate the performance measure on test data, which is disjoint from the training data.

Classical approaches to machine learning
'Classical' machine learning approaches are based upon the careful extraction of features of a data set. These features are then plugged into a standard regression or classification algorithm, such as linear regression, naïve Bayes, or the k-nearest neighbours classifier (see Ref. [3] for an overview of these classic algorithms). The success or failure of the machine learning algorithm to make accurate predictions is largely dependent on the features it is presented with. In areas such as speech recognition and image classification, features of interest can be highly complex and must be thoroughly designed by hand by a domain expert, a process known as feature engineering.
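As a concrete sketch of this classical pipeline, the following toy example trains a k-nearest neighbours classifier on hand-crafted features and evaluates it on disjoint test data. The two-dimensional feature vectors and labels are invented purely for illustration; in a real application they would be engineered features of biological data.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two-dimensional hand-crafted features, two classes.
train = [((0.1, 0.2), "A"), ((0.0, 0.3), "A"), ((0.2, 0.1), "A"),
         ((0.9, 0.8), "B"), ((1.0, 0.7), "B"), ((0.8, 0.9), "B")]

# Generalisation is judged on test data disjoint from the training data.
test = [((0.15, 0.25), "A"), ((0.85, 0.75), "B")]
accuracy = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(accuracy)
```

The point of the sketch is the division of labour: a human chooses the feature representation, and a generic algorithm does the rest; held-out test accuracy, not training fit, is what measures success.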
An example of using classical machine learning techniques in biology is the prediction of gene expression levels from transcription factor binding profiles using linear regression [4]. Yu et al. map genotype to phenotype in yeast using random forests with gene ontology terms as features [5]. Deng et al. built a predictive system for classifying genes as essential or non-essential in bacteria by integrating gene expression, protein-protein interaction and genomic data, averaging the predictions of an ensemble of different models [6]. Applications of machine learning techniques to gene regulatory network reconstruction have focused on detection of statistical signals in gene expression data using clustering [7-9], correlation [10,11], mutual information [12], Bayesian networks [13] or random forests [14].

Deep neural networks
Deep learning, the application of deep neural networks to machine learning, is the current state of the art in supervised learning [2]. While the explosion of deep learning research is recent, researchers have been working on the underlying models, artificial neural networks, since the 1940s, initially as computational models of the brain. Early applications of deep learning in biology include the use of neural networks to decipher the complex tissue-specific splicing regulatory code and to predict DNA-protein binding [15,16].

There are two major features that distinguish deep learning from classical approaches to machine learning. The first is that neural networks can represent essentially any (continuous) function, rather than simple functions of a specific form [17]. This property is true of shallow neural networks, as well as deep ones. The second major difference between deep learning and classical machine learning is that deep neural networks perform representation learning. Representation learning solves the feature engineering problem faced by classical approaches to machine learning, mentioned above.
In a deep neural network, the features themselves can also be learnt from the raw data, automatically. Figure 1 shows how a deep neural network can learn to represent the concept of an image of a person by building a hierarchy of representations of increasing levels of abstraction [2].

Figure 2 shows an illustration of a deep neural network architecture for predicting the sequence specificities of DNA- and RNA-binding proteins [15]. The neural network is trained on sequence specificities measured by a range of experimental methodologies. The network then learns to generalise the patterns it finds in this data to discover both motifs and an associated score predicting their binding affinity. The resulting trained model can then be used to identify new binding sequences or predict the effect of DNA or RNA mutations.

In a neural network, components called neurons are connected into a graph. A neuron receives a signal from each of its input neighbours, takes a weighted combination of these inputs (according to learned weights) and passes the result through an activation function to determine its output signal. In a deep neural network, these outputs are in turn fed into other neurons, and many layers can be stacked, with the input to the system arriving at the first layer and the output of the system emerging from the final layer.

To train a neural network, we define an objective function that measures how well the network outputs fit our training data. In modern machine learning, the objective function and the activation functions are differentiable, meaning that a small change in the weights of a neuron results in a small change to its output. We can track the effect of changing each neuron weight on the resulting objective value, and use a local optimisation algorithm to iteratively update all the weights to optimise the objective.
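The neuron and training loop just described can be sketched in a few lines of Python. This is a didactic single-neuron example: the sigmoid activation, squared-error objective, training example and learning rate are all arbitrary choices for illustration, not a production architecture.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    # Weighted combination of the inputs, passed through the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# One training example with target output 1.0; arbitrary starting weights.
inputs, target = [1.0, 0.5], 1.0
weights, bias, lr = [0.2, -0.1], 0.0, 0.5

for _ in range(100):
    out = neuron(weights, bias, inputs)
    # Chain rule for the squared-error objective (out - target)**2:
    # d(objective)/dz = 2*(out - target) * out*(1 - out)  [sigmoid derivative]
    grad_z = 2 * (out - target) * out * (1 - out)
    # Local optimisation: nudge each weight against its gradient.
    weights = [w - lr * grad_z * x for w, x in zip(weights, inputs)]
    bias -= lr * grad_z

print(neuron(weights, bias, inputs))  # output has moved towards the target
```

Because every step is differentiable, the same gradient bookkeeping extends mechanically to many stacked layers, which is what makes deep networks trainable at all.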
Because the objective function associated with a neural network is in general highly non-convex, a local optimisation algorithm may get stuck at a locally optimal, but not globally optimal, choice of weights. It is therefore somewhat surprising that deep learning works so well. For image and speech recognition tasks, local optima which are far from the value of the global optimum do not seem to be a problem in practice. There has been some recent theoretical work on trying to understand this [18-21]. However, the presence of local optima seems to be a much larger issue for using deep learning approaches to synthesise programs, as we discuss next.

Program synthesis for reconstructing gene regulatory networks
Program synthesis is a method for automatically constructing a program that satisfies a given set of desired behaviours [22-25]. The set of behaviours can be given as a logical formula, as a set of input-output examples that the program should reproduce, or as some combination of the two. For example, we may want a program that can sort a list of integers. Rather than directly writing such a program, we may ask the computer to automatically find one for us. Unlike deep learning, program synthesis generally leads to discrete problems which can be exactly solved to obtain a globally optimal solution, using algorithms that leverage SAT, SMT, or integer linear programming solvers.

Figure 1: Representation learning. Deep neural networks learn to represent concepts in terms of progressively simpler ones. At the lowest level of the network, raw pixel values are input into the model. The next layer of the model identifies edges by comparing the brightness of neighbouring pixels. The third layer takes the representation of edges and uses it to represent corners and contours. The fourth layer is able to detect entire parts of specific objects by combining together contours and corners. Finally, the model outputs a classification of the image, which it determines based upon the object parts fed from the fourth layer. Crucially, none of these abstract concepts is provided a priori by the programmer. Instead, they are directly learnt from the raw data. Reproduced, with permission, from Ref. [2].

Figure 2: Deep learning in biology. A deep neural network architecture for predicting the sequence specificities of DNA- and RNA-binding proteins. Reproduced, with permission, from Ref. [15].

The Single Cell Network Synthesis toolkit (SCNS) is a method for synthesising executable models of gene regulatory networks in the form of Boolean networks from single-cell gene expression data [26,27]. SCNS is based upon viewing single-cell gene expression profiles as though they were states of an asynchronous Boolean network, and then solving the problem of reconstructing a Boolean network from its state space. The algorithm uses a combination of enumerative search, graph reachability and Boolean satisfiability solving to extract a gene regulatory network model that best matches the state space data (Figure 3A). Before SCNS can be used, gene expression data first must be discretised to binary data, where continuous gene expression values are converted to Boolean on/off values.

The SCNS approach can be applied to study developmental processes, and requires measurement of sufficient single cells to get reasonable coverage of a system across a time course. We applied this methodology to study early blood development in the mouse embryo, capturing nearly 4000 cells with blood-forming potential across four sequential time points.
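The discretisation and asynchronous-update view underlying this approach can be illustrated with a toy two-gene network. This sketch is our own illustrative re-implementation of the idea, not the SCNS code; the threshold, gene names and update rules are invented for the example.

```python
# Toy version of the discretise-then-model idea: continuous expression
# values are thresholded to Boolean on/off states, and an asynchronous
# Boolean network moves between states by updating one gene at a time.

def discretise(profile, threshold=1.0):
    """Convert continuous gene expression values to Boolean on/off states."""
    return tuple(value > threshold for value in profile)

def async_successors(state, rules):
    """All states reachable by applying a single gene's update rule."""
    successors = set()
    for i, rule in enumerate(rules):
        new = list(state)
        new[i] = rule(state)
        successors.add(tuple(new))
    return successors

# Hypothetical 2-gene network: gene 0 is on when gene 1 is off (repression),
# and gene 1 simply copies gene 0 (activation).
rules = [lambda s: not s[1],
         lambda s: s[0]]

state = discretise([2.3, 0.4])   # an observed expression profile
print(async_successors(state, rules))
```

A synthesis tool then searches, over candidate rule sets, for rules whose asynchronous transitions can reproduce the observed single-cell states; the sketch above only shows the forward direction of that check.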
We designed this experiment so that approximately one embryo equivalent of cells was collected at each time point, giving a comprehensive single-cell resolution picture of the developmental process and allowing us to find a model that can explain transitions from early cell states to late cell states. Once a model has been found, it can be used to make predictions which can be validated experimentally. If the model predicts that transcription factors A and B are both required for activation of C, experiments can be designed that mutate binding sites for A and B individually and in combination and assess the effects on the expression of C. If the model predicts that overexpression or knockout of a specific gene eliminates or adds model states, this can be tested via corresponding overexpression/knockout studies.

The Reasoning Engine for Interaction Networks (RE:IN) synthesises Boolean networks from prior knowledge of the gene regulatory network connections together with a set of desired stable states that the constructed model should have [28] (Figure 3B). The search for a compatible model is encoded as a logical formula and solved using an SMT solver. Again, this specification is discrete, given by binary stable states and the presence of network edges. Dunn et al. used this method to reconstruct a minimal Boolean network model that can explain embryonic stem cell pluripotency.

Connections between machine learning and program synthesis for learning programs from data
Recently, machine learning researchers have begun to extend their deep neural network models so that they can learn algorithmic patterns in data.
At the same time, researchers working in program synthesis have begun to investigate the methods of deep learning for program synthesis, and so these two fields have begun to overlap. Below we survey the latest developments in these fields, and in the next section we discuss how these advances could be used to improve methods for synthesising biological models.

Deep learning researchers have found that by augmenting networks with an external data structure such as a tape, stack or list, they can train models to learn simple algorithms [29-38]. Compared to regular programs, there is no interpretable source code representation for these trained models. They are black boxes given by a huge number of parameters on a neural network, and they can only be understood by their actions on given inputs [39].

On the other side, in the programming languages community, there has been work on applying the methods of differentiable models and deep learning to the problem of synthesising (the source code of) programs. The tool TerpreT has been developed to understand the capabilities of machine learning techniques relative to

Figure 3: Executable gene regulatory network model synthesis. Boolean network models synthesised by SCNS (A) and RE:IN (B), from single-cell gene expression data and from relevance networks plus stable state specifications, respectively. Reproduced, with permission, from Refs. [27] and [28].
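To make the black-box versus source-code contrast concrete, here is a minimal synthesis-from-examples sketch: brute-force enumeration over a tiny, invented expression grammar, returning the first human-readable expression consistent with every input-output example. Real synthesis tools replace this brute force with SAT/SMT-backed search over far richer languages.

```python
import itertools

def candidates(depth):
    """Yield (source, function) pairs for expressions of nesting depth
    <= depth, over a tiny grammar: the variable x, small constants, +, *."""
    yield ("x", lambda x: x)
    for c in (0, 1, 2):
        yield (str(c), lambda x, c=c: c)
    if depth == 0:
        return
    subexprs = list(candidates(depth - 1))
    for (sa, fa), (sb, fb) in itertools.product(subexprs, repeat=2):
        yield (f"({sa} + {sb})", lambda x, fa=fa, fb=fb: fa(x) + fb(x))
        yield (f"({sa} * {sb})", lambda x, fa=fa, fb=fb: fa(x) * fb(x))

def synthesise(examples, max_depth=2):
    """Return the source of the first expression matching all examples."""
    for depth in range(max_depth + 1):
        for source, fn in candidates(depth):
            if all(fn(x) == y for x, y in examples):
                return source
    return None

# Find an expression matching f(x) = 2x + 1 on three input-output examples.
prog = synthesise([(0, 1), (1, 3), (2, 5)])
print(prog)
```

Unlike a trained neural model, the artefact returned here is source code: it can be read, audited and edited, which is exactly the property that makes synthesised Boolean network models attractive as mechanistic hypotheses.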