On Jupyter
The more I work with undergrads and younger grad students, the more I realize that kids these days really love Jupyter. This trend has been picking up for a few years now; even when I was an undergrad there was a noticeable uptick in Jupyter usage amongst people I knew with every passing year. But here’s the thing: I hate Jupyter. There’s just something about it that keeps me from grokking it.
I think my personal dislike of using Jupyter partially comes down to an age thing. My first serious foray into programming was learning C in 2010, Jupyter was released in 2015, and it wasn’t until 2017, when I started working on CLASS, that I was really exposed to it. So by the time I first interacted with a notebook interface I was fairly set in my ways. On top of that, the more traditional non-interactive model just makes more sense to me than an interactive notebook interface. Data enters the system, it passes through blocks that map input data to some output data, and then the system outputs data. Of course data in → data out doesn’t make sense for all systems, but for someone like me who mostly writes code for data reduction and analysis pipelines it’s a very sensible model. I think this is also why functional programming has always held a certain charm for me.
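To make the data in → data out model concrete, here is a minimal sketch of what I mean by blocks that map input data to output data. The stage names (remove_offset, clip) and the pipeline helper are made up for illustration, not taken from any real reduction code.

```python
from functools import reduce


def pipeline(*stages):
    """Chain stages so the output of each one feeds the next."""
    def run(data):
        return reduce(lambda acc, stage: stage(acc), stages, data)
    return run


def remove_offset(samples):
    """Subtract the mean so the samples are centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def clip(samples, limit=2.0):
    """Clamp every sample into the range [-limit, limit]."""
    return [max(-limit, min(limit, s)) for s in samples]


# Hypothetical two-stage reduction: data enters, passes through each
# block in order, and comes out the other side. No state is shared
# between stages other than what is passed along.
reduce_data = pipeline(remove_offset, clip)
result = reduce_data([1.0, 2.0, 3.0, 10.0])
```

Each stage only sees its input and only produces an output, which is what makes this style easy to reason about and easy to rerun from scratch.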
But my dislike of Jupyter isn’t really that important; after all, I can (mostly) just not use it. So why am I generally so grumpy about Jupyter? Well, that’s because I also dislike the way many people use it.
What Jupyter is Good At
Before we jump into how I think people misuse Jupyter, let’s talk about the things that I think Jupyter is a good tool for.
Teaching
I think that Jupyter notebooks have pretty good pedagogical value. The ability to mark up code with text that renders, rather than just comments, is rather nice. Comments have the problem that many text editors and IDEs color them in a way that doesn’t draw the eye, making it easy for someone to skim over them when they are reading code. Additionally, the block-based format encourages having a narrative structure to your code that can help one understand what is going on. But perhaps the most valuable feature Jupyter offers for teaching purposes is the way that it saves inline plots and printed values. This lets the people you send your notebooks to see your outputs and how you generated them at the same time, which I find to be fairly valuable, especially if there are enough intermediate values shown that you can see how the code transforms the inputs at each stage.
Debugging
Jupyter can be a useful tool for debugging code. Because you run things cell by cell anyway, you get the functionality of breakpoints without any extra work. Plus, since Jupyter cells all use global scope and variables don’t go away unless you manually overwrite or del them, you can inspect the state very easily. Jupyter also doesn’t clear state when you encounter an error, so you are free to modify and rerun the erroneous cell until things work. Of course you could just use pdb for debugging, but I have found that many people have a strong aversion to using an actual debugger.
Prototyping
When developing, it can be useful to have a place to write and run snippets of code. This can help you work out the behavior of a function from a library you are unfamiliar with, test your own functions with some dummy data before you drop them in a script that runs on a larger dataset, or even just serve as a place to pull up docstrings for reference. I personally prefer using IPython for these tasks, but Jupyter is perfectly capable of doing all of these things.
Plotting
A lot of people use Jupyter to make plots. This can be nice since you can run Jupyter on a remote server and view the plots inline. I don’t really see anything wrong with this, but in my opinion it requires much more overhead than simple X forwarding, so I tend to avoid it. If you are already running Jupyter on the server then I suppose this is not a problem.
How People Misuse Jupyter
I think my problems with the way people use Jupyter boil down to people using it for things that are beyond its scope. In particular, I often see people do full-blown data analysis in it. There are a number of problems with this, but for me the biggest one is the way that Jupyter keeps things in scope basically forever. I mentioned earlier that this is useful for debugging purposes, but in production code it can be a blight. While this is usually fine for short test snippets, when you are working on actual data sets it can result in you hogging memory on what are often shared compute resources. Additionally, Jupyter doesn’t release that memory on its own: short of deleting variables by hand, it stays allocated until you actually kill the kernel for a given notebook or shut down the server entirely, so valuable resources are often taken up by notebooks that are sitting idle.
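Here is a rough sketch of the scoping difference I am talking about. The Buffer class is a made-up stand-in for a large intermediate array, and the behavior shown relies on CPython’s reference counting, so treat it as an illustration rather than a guarantee.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a large intermediate array."""
    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


def summarize(n_bytes):
    # Script style: the big intermediate is a local variable, so the
    # last reference to it disappears when the function returns.
    buf = Buffer(n_bytes)
    probe = weakref.ref(buf)  # lets us check later whether buf still exists
    return len(buf.data), probe


size, probe = summarize(1_000_000)
gc.collect()
function_scope_freed = probe() is None

# Notebook style: the same object bound to a top-level name stays
# alive for as long as that name does.
kept = Buffer(1_000_000)
kept_probe = weakref.ref(kept)
gc.collect()
global_scope_alive = kept_probe() is not None

# Only an explicit del (or overwriting the name) lets it go.
del kept
gc.collect()
freed_after_del = kept_probe() is None
```

In a notebook, every cell behaves like the second half of this sketch, which is why an idle notebook can sit on memory long after anyone cares about its results.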
Another thing that makes Jupyter annoying in production environments is the additional friction that running it adds. When someone writes a Jupyter notebook to do analysis and sends it to me to run, I now need to start up a Jupyter server, load the notebook, and then click the run all button (as far as I know there is no default keyboard shortcut for this). Compare this to a Python script, where I just need to run python script.py directly from the terminal.
But moving the scope from data analysis pipelines out to programming in general, I think that Jupyter makes it far too easy to build bad programming habits and write messy code. One thing that I see in many Jupyter notebooks is people using code as configuration. Many people (myself included) are guilty of using global variables at the top of a file as configuration. But with plain code, I at least find that when those variables are touched often, the friction of switching context to my editor is enough that I am motivated to set them up as command line arguments. And when there are enough of them, the command line arguments become cumbersome enough that I am motivated to set up some sort of config file system. Compare this to Jupyter, where editing and running code happen in the same context, so editing variables isn’t really inconvenient at all. I hope I don’t have to explain why using global variables to set behavior is a bad habit, but on top of being a bad habit it’s just a worse UX than the alternatives. With command line arguments I can send someone a line to paste into their terminal to get the intended behavior, and with config files I can have multiple git-tracked configurations for different cases. With global variables I force the user to edit source code that they may or may not understand.
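As a sketch of the progression from globals to command line arguments to a config file, something like the following is what I have in mind. The option names (threshold, output), the default values, and the JSON config format are all assumptions made up for this example; only the argparse and json calls are standard library.

```python
import argparse
import json
from pathlib import Path

# Hypothetical defaults that would otherwise live as globals at the
# top of the file.
DEFAULTS = {"threshold": 3.0, "output": "reduced.dat"}


def load_settings(argv=None):
    """Merge defaults, an optional JSON config file, and CLI flags.

    Precedence, lowest to highest: DEFAULTS, config file, command line.
    """
    parser = argparse.ArgumentParser(description="Example reduction script")
    parser.add_argument("input", help="path to the input data")
    parser.add_argument("--config", type=Path, help="JSON file of settings")
    parser.add_argument("--threshold", type=float)
    parser.add_argument("--output")
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if args.config is not None:
        # A git-tracked config file overrides the built-in defaults.
        settings.update(json.loads(args.config.read_text()))
    for key in DEFAULTS:
        # Flags left unset parse as None and do not override anything.
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    settings["input"] = args.input
    return settings
```

With this in a script (call it reduce.py, another made-up name), the line I send someone is just python reduce.py input.dat --threshold 5, and nobody has to open the source to change a setting.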
Another bad habit that I think Jupyter encourages is messy code. The rapid prototyping that makes Jupyter so useful has a fatal flaw when compared to IPython: it’s not ephemeral. By this I mean that when you close your Jupyter notebook, you can pick up right where you left off at a later date. This makes it really easy to build off of code that started as a crappy snippet not intended for prime time. Compare this to IPython, which is ephemeral by default (and when you do run the %save command the output is not generally in a usable state from the get-go), so you are usually forced to rewrite whatever code you were testing in IPython in a separate file, which I find improves code quality. On top of this, the cell-based environment of Jupyter allows you to run things out of order and copy cells for edits. I find that this can lead to nightmarish notebooks with cells that need to run in some non-linear order to work properly and multiple cells that are iterating on the same chunk of code.
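The out-of-order problem is easy to reproduce outside of a notebook. The sketch below fakes three cells as strings and runs them against one shared namespace with exec, which is only a rough imitation of what a kernel does; the cell contents are invented for the example.

```python
# Three hypothetical notebook cells, keyed by their position on the page.
CELLS = {
    1: "scale = 2",
    2: "data = [1, 2, 3]",
    3: "data = [x * scale for x in data]",
}


def run_cells(order):
    """Run the fake cells in the given order against one shared namespace."""
    namespace = {}
    for index in order:
        exec(CELLS[index], namespace)
    return namespace["data"]


top_to_bottom = run_cells([1, 2, 3])     # what a reader of the notebook assumes
rerun_last = run_cells([1, 2, 3, 3])     # cell 3 clicked one extra time
```

The saved notebook looks identical in both cases, but the two runs leave different values in data, and nothing on the page tells you which history you are looking at.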
So What Should People Use Instead?
I think that if you keep things within the correct scope Jupyter can be a valuable tool. For teaching purposes I really don’t think there is a great alternative.
Using it to prototype and test code, and then transferring things over to a script or a library as you go, is a great strategy. But as I stated previously, I think that Jupyter makes it too easy for you to slip into bad habits when you do this, so I prefer IPython for prototyping and testing.
For debugging, just learn how to use pdb; a real debugger is well worth the effort to learn. Or just use an IDE that has debugger support baked in; most popular options like VSCode and PyCharm can do this.
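If the thing you like about notebooks is poking at state after a crash, pdb can do that too. Here is a minimal sketch: running_mean and debug_call are hypothetical helpers written for this post, while pdb.post_mortem is the standard library call that opens the debugger on the exception currently being handled.

```python
import pdb


def running_mean(values):
    """Return the cumulative mean after each element."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


def debug_call(func, *args, interactive=False):
    """Call func; on failure optionally open pdb at the failing frame."""
    try:
        return func(*args)
    except Exception:
        if interactive:
            # Inspect locals at the point of the crash, much like
            # looking at leftover state after a failed cell.
            pdb.post_mortem()
        raise
```

Calling debug_call(running_mean, [1, "a"], interactive=True) from a terminal should drop you into a pdb prompt inside running_mean, where you can print total and value. For code you don’t want to wrap, running python -m pdb script.py or dropping a breakpoint() call at the spot you care about gets you to the same place.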
For plotting I think that Jupyter is harmless, but again you can just X forward most of the time. On servers that give you a folder on the webserver for your user, I like to make a static site for all my plots with a tool like sigal. A benefit of this approach is that you can send someone a link where they can find all of the relevant plots, which will be updated as you refine things.
Also, if you are a heavy notebook user, please use something like VSCode or vim to interact with them instead of using Jupyter directly. The text editing experience in Jupyter is abhorrent.