Dan Mazur, PhD

Personal website of Dan Mazur, PhD. Dan is a machine learning engineer in Vancouver, BC.

Data Science Modeling Collaboration

Guidelines for effective collaboration when multiple data scientists are developing a single model.

Common anti-pattern: Copy-and-pasting code between notebooks

Some data science teams end up developing models collaboratively by maintaining separate branches or repos and copy-and-pasting bits of shared code from their collaborators. Ideas that work are then offered to the other collaborators, who copy and paste them back into their own code. There are a number of problems with this strategy:

To avoid these problems, we would like a solution for collaborating on model development that uses some best practices from software engineering, but still allows the researchers to be nimble and try out their ideas quickly.

Better Collaboration

Here’s a suggested strategy that avoids many of the above-listed problems:

Using this approach:

Advice for Kaggle

Kaggle is lousy with the problem of copy-and-pasting code between notebooks. Instead of a competition between individual scripts, what if data science competitions required this kind of collaboration on a large, shared code base, with a search through capabilities deciding on the best final model? Competitors could be evaluated by the combined lift provided by the features they implemented or improved, as measured in the optimization search. This might produce better results for the contest sponsors with a fraction of the total computing power required. If not, at least it would stop training Kagglers in a questionable model of software collaboration.
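
As a rough illustration of scoring competitors by the lift of their features, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the feature names, the contributors, the made-up gains in TOY_GAINS, and the choice to define lift as the drop in the best score when one feature is removed from the winning subset. A real competition would replace evaluate() with actual model training and validation, and would likely need a smarter search than trying every subset.

```python
from itertools import combinations

# Hypothetical registry in the shared code base: feature name -> contributor.
FEATURE_OWNERS = {
    "age_bucket": "alice",
    "days_since_signup": "bob",
    "text_length": "alice",
}

# Made-up score changes (in basis points) standing in for real model training.
TOY_GAINS = {"age_bucket": 300, "days_since_signup": 500, "text_length": -100}
BASELINE_SCORE = 7000  # assumed score of the model with no optional features


def evaluate(features):
    """Toy stand-in for 'train and validate a model using these features'."""
    return BASELINE_SCORE + sum(TOY_GAINS[f] for f in features)


def search(features):
    """Score every subset of features and return (best_subset, best_score)."""
    best_subset, best_score = (), evaluate(())
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


def contributor_lift(owners):
    """Credit each contributor with the leave-one-out lift of their features.

    Lift of a feature is defined here (an assumption) as the best score minus
    the score of the best subset with that one feature removed. Features the
    search left out of the best subset earn no credit.
    """
    best_subset, best_score = search(sorted(owners))
    lift = {owner: 0 for owner in owners.values()}
    for feature in best_subset:
        without = tuple(f for f in best_subset if f != feature)
        lift[owners[feature]] += best_score - evaluate(without)
    return best_subset, lift


if __name__ == "__main__":
    subset, lift = contributor_lift(FEATURE_OWNERS)
    print("best feature subset:", subset)
    print("lift per contributor (basis points):", lift)
```

In this toy setup the search drops text_length because it lowers the score, so its author earns credit only for age_bucket. Leave-one-out lift is just one possible attribution rule; when features interact, an approach such as averaging over many subsets might be fairer, at a higher computing cost.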