Best practices for data science
These are the best practices for data science recommended by Wilson et al.:
- Write programs for people, not computers
  - a program should not require its readers to hold more than a handful of facts in memory at once
  - make names consistent, distinctive, and meaningful
  - make code style and formatting consistent
- Let the computer do the work
  - make the computer repeat tasks
  - save recent commands in a file for re-use
  - use a build tool to automate workflows
- Make incremental changes
  - work in small steps with frequent feedback and course correction
  - use a version control system
  - put everything that has been created manually in version control
- Don’t repeat yourself (or others)
  - every piece of data must have a single authoritative representation in the system
  - modularize code rather than copying and pasting
  - re-use code instead of rewriting it
- Plan for mistakes
  - add assertions to programs to check their operation
  - use an off-the-shelf unit testing library
  - turn bugs into test cases
  - use a symbolic debugger
- Optimize software only after it works correctly
  - use a profiler to identify bottlenecks
  - write code in the highest-level language possible
- Document design and purpose, not mechanics
  - document interfaces and reasons, not implementations
  - refactor code in preference to explaining how it works
  - embed the documentation for a piece of software in that software
- Collaborate
  - use pre-merge code reviews
  - use pair programming when bringing someone new up to speed and when tackling particularly tricky problems
  - use an issue tracking tool
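
As a minimal sketch of a few of the practices above (meaningful names, an assertion that checks the program's operation, a docstring that documents the interface and the reason rather than the mechanics, and a bug turned into a test case), the following Python example uses the standard-library `unittest` module. The function `mean_temperature`, the missing-value marker `MISSING = -999.0`, the sample readings, and the "earlier bug" are all hypothetical and chosen only for illustration.

```python
import unittest

# Hypothetical marker that an instrument writes for a missing reading.
MISSING = -999.0


def mean_temperature(readings):
    """Return the mean of the valid temperature readings.

    Readings equal to MISSING are skipped, because averaging in the
    marker value would distort the result.
    """
    valid = [r for r in readings if r != MISSING]
    # Assertion: fail early with a clear message instead of dividing by zero.
    assert valid, "need at least one valid reading"
    return sum(valid) / len(valid)


class TestMeanTemperature(unittest.TestCase):
    def test_simple_mean(self):
        self.assertAlmostEqual(mean_temperature([10.0, 20.0]), 15.0)

    def test_missing_marker_is_skipped(self):
        # A bug turned into a test case: a hypothetical earlier version
        # averaged the marker in; this test keeps that from coming back.
        self.assertAlmostEqual(mean_temperature([10.0, MISSING, 20.0]), 15.0)

    def test_no_valid_readings_fails(self):
        with self.assertRaises(AssertionError):
            mean_temperature([MISSING])
```

Saved under a name such as `test_mean_temperature.py` (a hypothetical file name), the tests can be run with `python -m unittest test_mean_temperature`.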
In small groups, each group picks (or is assigned) a practice and discusses:
- What is the current practice? Why is it not a good practice, and why is the best practice better?
- How can we move from the current practice to the best practice?