Joel Grus - The case against the jupyter notebook

Published: July 16, 2019, 2:58 a.m.

To most data scientists, the Jupyter notebook is a staple tool: it’s where they learned the ropes, and it’s where they go to prototype models or explore their data. Basically, it’s the default arena for all their data science work.


But Joel Grus isn’t like most data scientists: he’s a former hedge fund manager and former Googler, and the author of Data Science from Scratch. He currently works as a research engineer at the Allen Institute for Artificial Intelligence, and maintains a very active Twitter account.


Oh, and he thinks you should stop using Jupyter notebooks. Now.


When you ask him why, he’ll provide many reasons, but a handful really stand out:

  • Hidden state: let’s say you define a variable like a = 1 in the first cell of your notebook. In a later cell, you assign it a new value, say a = 3. This results in fairly predictable behavior as long as you run your notebook in order, from top to bottom. But if you don’t (or worse still, if you run the a = 3 cell and delete it later), it can be hard or even impossible to tell from a simple inspection of the notebook what the true state of your variables is.
  • Replicability: one of the most important ways to make your data science experiments repeatable is to write robust, modular code. Jupyter notebooks implicitly discourage this, because they’re not designed to be modularized (awkward hacks do allow you to import one notebook into another, but they’re, well, awkward). What’s more, to reproduce another person’s results, you first need to reproduce the environment in which their code was run. Vanilla notebooks don’t give you a good way to do that.
  • Bad for teaching: Jupyter notebooks make it very easy to write terrible tutorials. You know, the kind where you mindlessly hit “shift-enter” a whole bunch of times and make your computer do a bunch of stuff that you don’t actually understand. That leads to a lot of frustrated learners or, even worse, a lot of beginners who think they understand how to code but actually don’t.
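The hidden-state problem from the first bullet can be sketched in plain Python. This is only an illustration, not how Jupyter is implemented: the cell names and the run_cells helper are hypothetical, and a single dictionary stands in for the kernel's shared namespace.

```python
# Each "cell" is a snippet of source code. The cell names are hypothetical.
CELLS = {
    "define_a": "a = 1",
    "reassign_a": "a = 3",
    "use_a": "result = a * 10",
}


def run_cells(order):
    """Run the named cells, in the given order, against one shared namespace."""
    namespace = {}  # stands in for the kernel's state
    for name in order:
        exec(CELLS[name], namespace)
    return namespace["result"]


# Running top to bottom: a is 3 by the time the last cell runs.
top_to_bottom = run_cells(["define_a", "reassign_a", "use_a"])

# Running the first two cells in the opposite order leaves a at 1.
out_of_order = run_cells(["reassign_a", "define_a", "use_a"])

# If the reassignment cell was run and then deleted, the kernel still holds
# the top-to-bottom result, but a reader who reruns only the visible cells
# gets something else.
visible_cells_only = run_cells(["define_a", "use_a"])

print(top_to_bottom, out_of_order, visible_cells_only)
```

The same three cells give different answers depending on execution history, and none of that history is recorded on the page.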

Overall, Joel’s objections to Jupyter notebooks seem to come in large part from his somewhat philosophical view that data scientists should follow the same best practices that any good software engineer would. For instance, Joel stresses the importance of writing unit tests (even for data science code), and is a strong proponent of using type annotations (if you aren’t familiar with them, you should definitely learn about them).
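For readers who haven't seen these practices applied to data science code, here is a minimal sketch using only the Python standard library. The mean function and its tests are made up for illustration and are not taken from Joel's own code.

```python
import unittest
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_known_values(self) -> None:
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_raises(self) -> None:
        with self.assertRaises(ValueError):
            mean([])


# Run the tests explicitly so the snippet also works when pasted into a
# script or an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The annotations tell a reader (and a type checker such as mypy, if you use one) what the function expects and returns, while the tests pin down its behavior, including the empty-input edge case.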


But even Joel thinks Jupyter notebooks have a place in data science: if you’re poking around at a pandas DataFrame to do some basic exploratory data analysis, it’s hard to think of a better way to produce helpful plots on the fly than the trusty ol’ Jupyter notebook.


Whatever side of the Jupyter debate you’re on, it’s hard to deny that Joel makes some compelling points. I’m not personally shutting down my Jupyter kernel just yet, but I’m guessing I’ll be firing up my favorite IDE a bit more often in the future.