This is a draft of the introduction to my student Soroush Samidian’s Ph.D. thesis:

Reproducibility is a cornerstone of science. To be truly reproducible, an experiment should be explicit and thorough in describing every stage of the analysis, starting with the initial question or hypothesis, continuing through the methodology by which candidate data were selected and analyzed, and finishing with a fully documented result, including all provenance information (which resource, which version, when, and why). As modern biology becomes increasingly in silico-based, many of these best practices in reproducibility are being managed with much higher efficiency. The emergence of analytical workflows as first-class referenceable and shareable objects in bioinformatics has led to a high level of precision in describing in silico “materials and methods”, as well as the ability to automate the collection of highly detailed provenance information. However, the earlier stages in the scientific process – the posing of the hypothesis and the selection of candidate data – are still largely limited to human cognition; we pose our hypotheses in the form of sentences, and we often select and screen candidate data based on expert knowledge or intuition. This limitation is particularly acute at the interface between the clinical sciences and the molecular sciences, where clinicians are the ultimate arbiters of patient phenotypic classification, often based entirely on their personal expert opinion, while molecular association studies depend on a deep understanding of these classifications in order to make statistical links between phenotypic traits and molecular traits.