
I've worked as a data scientist for quite a while now in IC, lead, and manager roles, and the biggest thing I've found is that data scientists cannot be allowed to live exclusively in notebooks.

Notebooks are essential for the EDA and early prototyping stages, but all data scientists should be enough of a "software engineer" to get their code out of their notebooks and into a reusable library/package of tools shared with engineering.

On the best teams I've worked on, the hand-off between DS and engineering is not a notebook, it's a pull request, with code review from engineers. Data scientists must put their models in a standard format in a library used by engineering, they must write their own unit tests, and they must be subject to the same code review an engineer would be. This last step is important: my experience is that many data scientists, especially those coming from academic research, are scared of writing real code. However, after a few rounds of helpful feedback from engineers, they quickly learn to write much better code.
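
To make that concrete, here's roughly what the hand-off artifact can look like. The module name, interface, and test are all made up for illustration; the point is a stable entry point plus a test that ships in the same PR:

    # churn_model/model.py -- refactored out of the notebook
    import pickle
    from pathlib import Path

    import numpy as np

    class ChurnModel:
        """Thin wrapper so engineering consumes one stable interface."""

        def __init__(self, estimator):
            self._estimator = estimator

        @classmethod
        def load(cls, path: Path) -> "ChurnModel":
            # Load whatever artifact the training run produced.
            with open(path, "rb") as f:
                return cls(pickle.load(f))

        def predict(self, features: np.ndarray) -> np.ndarray:
            # Input validation lives here, not scattered across notebook cells.
            if features.ndim != 2:
                raise ValueError("expected a 2-D feature matrix")
            return self._estimator.predict(features)

    # tests/test_model.py -- the unit test that ships with the PR
    import pytest

    def test_predict_rejects_1d_input():
        model = ChurnModel(estimator=None)
        with pytest.raises(ValueError):
            model.predict(np.array([1.0, 2.0]))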

This process is also essential because if you are shipping models to production, you will encounter bugs that require a data scientist to fix, bugs an engineer cannot solve alone. If the data scientists aren't familiar with the model part of the code base, this process is a nightmare, as you have to ask them to dust off questionable notebooks from months or years ago.

There are lots of parts of the process of shipping a model to production that data scientists don't need to worry about, but they absolutely should be working as engineers at the final stage of the hand-off.



I agree with everything you said above, and that is exactly how we have always done things at my place of employment (a small ML/algorithm/software development shop). That being said, the one thing I really don't understand is why notebooks are considered essential even for EDA. I guess if you were doing things in Notepad++ or a pure REPL shell they are handy, but using a powerful IDE like PyCharm makes notebooks feel very, very limiting in comparison.

Browsing code, underlying library imports and associated code, type hinting, error checking, etc. are so vastly superior in something like PyCharm that it is really hard to see why one would give it all up to work in a notebook, unless they never matured their skillset enough to see the benefits afforded by a more powerful IDE. I think notebooks have their place: they are certainly great for documenting things with a mix of Markdown, LaTeX, and code, as well as for tutorials that someone else can directly execute. And some of the interactive widgets can make for nice demos when needed.

Notebooks also often encourage poor habits, and as you mentioned, having data scientists and ML engineers write code as modules and commit it via pull requests helps them grow into better software engineers, which in my experience is almost a necessity for being a good and effective data scientist or ML engineer.

And lastly, version-controlling notebooks is such a nightmare, and they aren't conducive to code reviews either.


There's an advantage to long-lived interpreters/REPLs on remote machines for the kind of work done in notebooks. Significant amounts of data may have to be read, expensive computation performed, etc. before the work can begin. Notebooks are an ergonomic interface to that sort of environment if one isn't comfortable with ssh/screen/X-forwarding/etc, and frankly nice for some tasks even if one is.

There's also a tacit advantage to notebooks specifically for Python, as the interface encourages the user to write all of their definitions in a single namespace. So the user can define and re-define things at their leisure within a single REPL/interpreter lifetime. A user developing against imported modules can quickly get stuck behind Python's inability to cleanly re-import a module, or be forced to rely on flaky hacks to the import system.
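
To make that concrete: a plain re-import is a no-op, and importlib.reload only gets you part of the way there (mymodule below is a stand-in for whatever you're developing):

    import importlib
    import mymodule  # stand-in for a module under active development

    # After editing mymodule.py, repeating `import mymodule` does nothing:
    # Python just returns the cached entry from sys.modules.
    import mymodule  # no-op

    # reload() re-executes the module's source in place...
    importlib.reload(mymodule)

    # ...but any objects created before the reload still reference the old
    # class definitions, so isinstance() checks and cached instances quietly
    # go stale. (IPython's %autoreload extension papers over some of this,
    # with its own quirks.) A notebook sidesteps the problem by letting you
    # re-run a cell that redefines the name directly in the shared namespace.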

It pains me a bit to make the argument _for_ notebooks, but it's important to understand the attractions.


Thanks for sharing that perspective! It was helpful to get that POV. I agree that a requirement for long-lived interpreters and a simpler UX to get up and running probably makes it an attractive option.

With VS Code having such excellent remote development capabilities now, however, it feels like a nicer option these days, though I guess only if you really care about the benefits that brings. Agreed that re-importing libraries is still a major pain point in Python, but that "advantage" of Jupyter notebooks is also unfortunately what leads to terrible practices and bad engineering, as most undisciplined engineers end up treating a notebook as one giant script of spaghetti code to get the job done.


When EDA involves rendering tables or graphics, notebooks provide a faster default feedback loop. Part of this comes from the fact that the kernel holds state, so data loading, transformations, and viz can be run incrementally and without switching context. That's not to say it's impossible with a Python REPL and a terminal with image support, but that's essentially the value prop of notebooks. Terrible for other things, like shipping code, but very good for interactive sessions like EDA work.
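
As a rough sketch of that loop (percent-format cells; the file and column names are made up):

    # %% Cell 1 -- expensive load, runs once per session
    import pandas as pd
    df = pd.read_parquet("events.parquet")  # may take minutes on real data

    # %% Cell 2 -- cheap transform, safe to iterate on
    daily = df.groupby(df["ts"].dt.date)["user_id"].nunique()

    # %% Cell 3 -- re-run just this cell while tweaking the plot
    daily.plot(kind="line", title="Daily active users")

Since the kernel keeps df in memory, only the cell being tweaked gets re-run.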

Personally, I find myself prototyping in notebooks and then refactoring into scripts very often and productively.


I've found myself in a data science group by merger, and this (what type of artifact to ship) is a current team discussion point. Would you be willing to let me pick your brain on this topic in depth?


This is how my lab works. We do a lot of prototyping, exploring, making sure everything seems to be working, etc. and then pack it all into reasonably well documented standard code.

Learned this the hard way after working for a while in a group with a single shared notebook I had nicknamed "The wall of madness".



