Dan Chen - Building Reproducible and Replicable Projects
Building Reproducible and Replicable Projects
By Dan Chen

Abstract: An organized project makes a happy data scientist and data science team. How should your projects be organized? What about the reports your project creates? We hear a lot about pipelines, but how do you make them? How do you start to automate your pipelines, and how do you make your datasets, figures, and reports replicable and reproducible as new or updated datasets come in? We talk about the various ways you can organize your project, whether you are a new R user or a seasoned one, and how to use build systems to keep track of the various parts of your pipeline.

Bio: Daniel Chen is a doctoral candidate in the interdisciplinary PhD program in Genetics, Bioinformatics & Computational Biology (GBCB) at Virginia Tech. He uses data and statistical models to study the spread of disease and the efficacy of medicine and treatments. He holds a master's degree in Epidemiology from Columbia University and earned his bachelor's degree in Psychology from the Macaulay Honors College at CUNY Hunter College, with a concentration in behavioral neuroscience and minors in computer science and biology. Daniel specializes in research design, analysis, and teaching scientific computing, with an emphasis on R, Git, Python, and Linux. He is currently focused on the study of population health using a data science approach through agent-based modeling. Daniel is the author of Pandas for Everyone, an expansion in the Pearson series that serves as the Python/Pandas complement to R for Everyone.

Twitter: @chendaniely

Presented at the 2019 New York Conference (May 10th, 2019)