Best Practices in Data Science

Readings

Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., … Wilson, P. (2014). Best Practices for Scientific Computing. PLOS Biology, 12(1), e1001745. https://doi.org/10.1371/journal.pbio.1001745
R for Reproducible Scientific Analysis by Software Carpentry

Best practices for data science

These are the best practices for data science recommended by Wilson et al. (2014):

Write programs for people, not computers
1. a program should not require its readers to hold more than a handful of facts in memory at once
2. make names consistent, distinctive, and meaningful
3. make code style and formatting consistent
Let the computer do the work
1. make the computer repeat tasks
2. save recent commands in a file for re-use
3. use a build tool to automate workflows
Make incremental changes
1. work in small steps with frequent feedback and course correction
2. use a version control system
3. put everything that has been created manually in version control
Don’t repeat yourself (or others)
1. every piece of data must have a single authoritative representation in the system
2. modularize code rather than copying and pasting
3. re-use code instead of rewriting it
Plan for mistakes
1. add assertions to programs to check their operation
2. use an off-the-shelf unit testing library
3. turn bugs into test cases
4. use a symbolic debugger
Optimize software only after it works correctly
1. use a profiler to identify bottlenecks
2. write code in the highest-level language possible
Document design and purpose, not mechanics
1. document interfaces and reasons, not implementations
2. refactor code in preference to explaining how it works
3. embed the documentation for a piece of software in that software
Collaborate
1. use pre-merge code reviews
2. use pair programming when bringing someone new up to speed and when tackling particularly tricky problems
3. use an issue tracking tool
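
Several of the practices above can be shown in a few lines of code. The following is a minimal Python sketch and is not taken from Wilson et al.; the function `normalize`, its behavior for constant input, and the test cases are hypothetical and chosen only for illustration. It shows a docstring that documents the interface rather than the implementation, an assertion that checks a precondition, and tests written with the standard-library `unittest` module, including one that pins down a previously observed bug.

```python
import unittest


def normalize(values):
    """Rescale a non-empty sequence of numbers to the range [0, 1].

    Returns a new list; the input is not modified. If all values are
    equal, every output is 0.0.
    """
    # Assertion: state the precondition and check it at run time.
    # (Assertions are skipped when Python runs with the -O flag.)
    assert len(values) > 0, "values must not be empty"

    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Hypothetical earlier bug: constant input caused a division by
        # zero. The regression test below keeps it from coming back.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


class TestNormalize(unittest.TestCase):
    """Tests written with an off-the-shelf unit testing library."""

    def test_endpoints(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_regression(self):
        # A bug turned into a test case.
        self.assertEqual(normalize([3, 3, 3]), [0.0, 0.0, 0.0])

    def test_empty_input_fails_assertion(self):
        with self.assertRaises(AssertionError):
            normalize([])
```

If this were saved in a file such as `test_normalize.py` (a hypothetical name), the tests could be run with `python -m unittest test_normalize`. Keeping the function small, with a single well-named purpose, also reflects the first practice: a reader needs to hold only a few facts in memory to understand it.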

In small groups, pick (or be assigned) a practice and discuss:

What is the current practice? Why is it not a good practice, and why is the best practice better?
How can we move from the current practice to the best practice?