
I've worked as a data scientist for quite a while now in IC, lead, and manager roles, and the biggest thing I've found is that data scientists cannot be allowed to live exclusively in notebooks.

Notebooks are essential for the EDA and early prototyping stages, but all data scientists should be enough of a "software engineer" to get their code out of their notebooks and into a reusable library/package of tools shared with engineering.

On the best teams I've worked on, the hand-off between DS and engineering is not a notebook, it's a pull request, with code review from engineers. Data scientists must put their models in a standard format in a library used by engineering, they must write their own unit tests, and they must be subject to the same code review an engineer would be. This last step is important: my experience is that many data scientists, especially those coming from academic research, are scared of writing real code. However, after a few rounds of helpful feedback from engineers, they quickly learn to write much better code.
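
To make that concrete, here's roughly what the hand-off artifact can look like. The module name, interface, and test are all made up for illustration; the point is a stable entry point plus a test that ships in the same PR:

    # churn_model/model.py -- refactored out of the notebook
    import pickle
    from pathlib import Path

    import numpy as np

    class ChurnModel:
        """Thin wrapper so engineering consumes one stable interface."""

        def __init__(self, estimator):
            self._estimator = estimator

        @classmethod
        def load(cls, path: Path) -> "ChurnModel":
            # Load whatever artifact the training run produced.
            with open(path, "rb") as f:
                return cls(pickle.load(f))

        def predict(self, features: np.ndarray) -> np.ndarray:
            # Input validation lives here, not scattered across notebook cells.
            if features.ndim != 2:
                raise ValueError("expected a 2-D feature matrix")
            return self._estimator.predict(features)

    # tests/test_model.py -- the unit test that ships with the PR
    import pytest

    def test_predict_rejects_1d_input():
        model = ChurnModel(estimator=None)
        with pytest.raises(ValueError):
            model.predict(np.array([1.0, 2.0]))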

This process is also essential because if you are shipping models to production, you will encounter bugs that require a data scientist to fix, bugs an engineer cannot solve alone. If the data scientists aren't familiar with the model part of the code base, this process is a nightmare, as you have to ask them to dust off questionable notebooks from months or years ago.

There are lots of parts of the process of shipping a model to production that data scientists don't need to worry about, but they absolutely should be working as engineers at the final stage of the hand-off.



I agree with everything you said above, and that is exactly how we have always done things at my place of employment (a small ML/algorithm/software development shop). That being said, the one thing I really don't understand is why notebooks are considered essential even for EDA. I guess if you were doing things in Notepad++ or a pure REPL shell they are handy, but using a powerful IDE like PyCharm makes notebooks feel very, very limiting in comparison.

Browsing code, underlying library imports and associated code, type hinting, error checking, etc. are so vastly superior in something like PyCharm that it is really hard to see why one would give it all up to work in a notebook, unless they never matured their skillset enough to see the benefits afforded by a more powerful IDE. I think notebooks have their place: they are certainly great for documenting things with a mix of Markdown, LaTeX, and code, as well as for tutorials that someone else can directly execute. And some of the interactive widgets can make for nice demos when needed.

Notebooks also often encourage poor habits, and as you mentioned, having data scientists and ML engineers write code as modules and commit it via pull requests helps them grow into better software engineers, which in my experience is almost a necessity for being a good and effective data scientist or ML engineer.

And lastly, version-controlling notebooks is such a nightmare, and they aren't conducive to code reviews either.


There's an advantage to long-lived interpreters/REPLs on remote machines for the kind of work done in notebooks. Significant amounts of data may have to be read, expensive computation performed, etc. before the work can begin. Notebooks are an ergonomic interface to that sort of environment if one isn't comfortable with ssh/screen/X-forwarding/etc, and frankly nice for some tasks even if one is.

There's also a tacit advantage to notebooks specifically for Python, as the interface encourages the user to write all of their definitions in a single namespace. So the user can define and re-define things at their leisure within a single REPL/interpreter lifetime. A user developing against imported modules can quickly get stuck behind Python's inability to cleanly re-import a module, or be forced to rely on flaky hacks to the import system.
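
To make that concrete: a plain re-import is a no-op, and importlib.reload only gets you part of the way there (mymodule below is a stand-in for whatever you're developing):

    import importlib
    import mymodule  # stand-in for a module under active development

    # After editing mymodule.py, repeating `import mymodule` does nothing:
    # Python just returns the cached entry from sys.modules.
    import mymodule  # no-op

    # reload() re-executes the module's source in place...
    importlib.reload(mymodule)

    # ...but any objects created before the reload still reference the old
    # class definitions, so isinstance() checks and cached instances quietly
    # go stale. (IPython's %autoreload extension papers over some of this,
    # with its own quirks.) A notebook sidesteps the problem by letting you
    # re-run a cell that redefines the name directly in the shared namespace.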

It pains me a bit to make the argument _for_ notebooks, but it's important to understand the attractions.


Thanks for sharing that perspective! It was helpful to get that POV. I agree that a requirement for long-lived interpreters and a simpler UX to get up and running probably makes it an attractive option.

With VS Code having such excellent remote development capabilities now, however, it feels like a nicer option these days, though I guess only if you really care about the benefits that brings. Agreed that re-importing libraries is still a major pain point in Python, but that "advantage" of Jupyter notebooks is also unfortunately what leads to terrible practices and bad engineering, as most undisciplined engineers end up treating a notebook as one giant script of spaghetti code to get the job done.


When EDA involves rendering tables or graphics, notebooks provide a faster default feedback loop. Part of this comes from the fact that the kernel holds state, so data loading, transformations, and viz can be run incrementally and without switching context. That's not to say it's impossible with a Python REPL and a terminal with image support, but that's essentially the value prop of notebooks. Terrible for other things, like shipping code, but very good for interactive sessions like EDA work.
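
As a rough sketch of that loop (percent-format cells; the file and column names are made up):

    # %% Cell 1 -- expensive load, runs once per session
    import pandas as pd
    df = pd.read_parquet("events.parquet")  # may take minutes on real data

    # %% Cell 2 -- cheap transform, safe to iterate on
    daily = df.groupby(df["ts"].dt.date)["user_id"].nunique()

    # %% Cell 3 -- re-run just this cell while tweaking the plot
    daily.plot(kind="line", title="Daily active users")

Since the kernel keeps df in memory, only the cell being tweaked gets re-run.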

Personally, I find myself prototyping in notebooks and then refactoring into scripts very often and productively.


I've found myself in a data science group by merger, and this (what type of artifact to ship) is a current team discussion point. Would you be willing to let me pick your brain on this topic in depth?


This is how my lab works. We do a lot of prototyping, exploring, making sure everything seems to be working, etc. and then pack it all into reasonably well documented standard code.

Learned this the hard way after working for a while in a group with a single shared notebook I had nicknamed "The wall of madness".



