
Why Jupyter Notebooks are the Future of Data Science

by Rick Lamers, August 7th, 2020

Too Long; Didn't Read

Rick Lamers explains why Jupyter Notebooks played an important role in the rise in popularity of Data Science and why they are its future. He explains why he decided to build Orchest to leverage and further contribute to their success. He says the benefits of immediate feedback have been instrumental to Jupyter Notebooks' rise in popularity. Data science depends on running many experiments, and the value of tools lies in making this as frictionless as possible. The iterative and experimental nature of data science makes it fundamentally different from regular software engineering.


How Jupyter Notebooks played an important role in the incredible rise in popularity of Data Science and why they are its future.

Nowadays, many individuals and teams are flocking to the tools and techniques that enable them to leverage large amounts of data. What makes Jupyter Notebooks so appealing to data scientists?

In this article I will dive into some of the underlying trends that have contributed to the success of Jupyter Notebooks and why I decided to build Orchest to leverage and further contribute to their success.

Underlying technologies of data science

Something that is less talked about is the connection between the many advances of machine learning and data science, and the underlying technologies that have been developed over the past decades. Specifically, I'm talking about programming languages such as Python, operating systems like Linux, compiler infrastructure like LLVM, and version control systems such as Git, to name just a few. It's important to realize that fundamental projects like these have enabled the vast growth and advances in machine learning and data science.

These technologies have, among others, created fertile ground for individuals and companies to start leveraging data science tools and techniques. However, in order to leverage them, data scientists need a way to use them without a significant time investment.

Hiding complexity

Technological building blocks are crucial when it comes to dealing with complexity. The modern computing stack has done an outstanding job of layering systems to make sure that whenever you want to perform a task, you are not encumbered with the many lower-level implementation details. Take for example the seemingly simple task of interacting with files. A simple Python snippet

file = open('hello.txt', 'w')

executes many low-level operations under the hood in order to give the engineer a high-level, easy-to-use abstraction for interacting with files. Having the high-level concept and implementation of files available increases programming productivity by orders of magnitude.
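To make the layering a bit more concrete, here is a minimal sketch that writes the same text twice: once through the high-level `open` call and once through the lower-level file-descriptor functions in Python's `os` module. The file names and the message are placeholders picked for illustration, and the real stack beneath `open` (buffering, encoding, system calls, the file system itself) goes much deeper than this example shows.

```python
import os
import tempfile

message = "hello, world"
workdir = tempfile.mkdtemp()  # scratch directory so the example leaves no stray files

# High-level: open() takes care of buffering, text encoding, and closing the file.
high_path = os.path.join(workdir, "hello.txt")
with open(high_path, "w", encoding="utf-8") as f:
    f.write(message)

# Lower-level: the same result with a raw file descriptor and explicit bytes.
low_path = os.path.join(workdir, "hello_low.txt")
fd = os.open(low_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    os.write(fd, message.encode("utf-8"))
finally:
    os.close(fd)


def read_bytes(path):
    """Return the raw bytes stored at path."""
    with open(path, "rb") as f:
        return f.read()


# Both routes should leave identical bytes on disk.
same_content = read_bytes(high_path) == read_bytes(low_path)
print(same_content)
```

Even the "low-level" half still leans on the operating system for everything below the file descriptor, which is exactly the point: each layer lets you ignore the ones beneath it.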

For data science, high-level frameworks such as TensorFlow let you define complex layered neural networks with just a few lines of code:

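import tensorflow as tf

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),  # 28x28 image -> flat vector
  tf.keras.layers.Dense(128, activation='relu'),  # fully connected hidden layer
  tf.keras.layers.Dropout(0.2),                   # randomly drop 20% of units during training
  tf.keras.layers.Dense(10)                       # one output per class
])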

Jupyter Notebooks are great for hiding complexity by allowing you to interactively run high-level code in a contextual environment, centered around the specific task you are trying to solve in the notebook.

With ever-increasing levels of abstraction, data scientists become more productive and can do more in less time. When the cost of trying something is reduced to almost zero, you automatically become more experimental, leading to better results that are difficult to achieve otherwise.

Experimentation driven development

According to Wikipedia [1], Data Science is defined as:

"an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data."

It is exactly the application of the scientific method that requires you to be able to run many experiments in order to validate your hypotheses. The value of tools therefore lies in making this as frictionless as possible.

When data science is compared to traditional software engineering, this point is often overlooked. The iterative and experimental nature of data science makes it fundamentally different from regular software engineering, where the development process is more planned out and less explorative.

Interactive computing

It is incredibly powerful to get immediate feedback when programming. Computing the outcome of your actions in real time and presenting it in an easily digestible way enables you to quickly draw conclusions about what works and what doesn't.

In a great talk by Bret Victor, this principle is demonstrated through a clever example of how interactive feedback can help you find the best solution to a game design problem:

Power of immediate feedback: visualizing the consequences of your changes. Let the character make the jump by finding the correct y-velocity. [2]

I believe that the benefits of immediate feedback have been instrumental to Jupyter Notebooks' rise in popularity. Notebooks enable you to rapidly try ideas and experiment by providing you with immediate feedback when executing snippets of code. Through their cell-based structure and markdown support, they provide a scratchpad for your ideas which facilitates exploratory work even further.
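As a rough illustration of what that feedback loop can look like in a single notebook cell, the sketch below mimics the jump problem from the talk with a deliberately simplified physics model. It is not the code from the demo: the gravity constant, the platform height, the candidate velocities, and the rule that a jump succeeds when its peak reaches the platform are all assumptions made for this example.

```python
GRAVITY = 9.81          # m/s^2, assumed
PLATFORM_HEIGHT = 2.0   # metres, assumed height the character must reach


def peak_height(y_velocity, gravity=GRAVITY):
    """Maximum height of a jump with the given initial upward velocity."""
    return y_velocity ** 2 / (2 * gravity)


def makes_jump(y_velocity, target=PLATFORM_HEIGHT):
    """True if the jump's peak reaches at least the target height."""
    return peak_height(y_velocity) >= target


# Try a handful of candidate velocities and look at the outcome of each one.
candidates = [4.0, 5.0, 6.0, 7.0, 8.0]
for v in candidates:
    print(f"y-velocity {v:4.1f} m/s -> peak {peak_height(v):.2f} m, "
          f"clears platform: {makes_jump(v)}")

# Smallest candidate that clears the platform.
smallest_working = min(v for v in candidates if makes_jump(v))
```

In a notebook you would tweak `candidates` or `PLATFORM_HEIGHT`, re-run the cell, and see the effect straight away, which is the kind of tight loop the paragraph above describes.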

The Jupyter open source project has pioneered many of the concepts around interactive programming for data science and has built a great community around its ecosystem. To guarantee that Jupyter Notebooks keep improving, and to ensure that they are indeed the future of Data Science, it's important to collaborate and rally around standardized and open source solutions.

How we're contributing with Orchest

To contribute to the collection of open source tools in the data science ecosystem, my co-founder Yannick Perrenet and I decided to start Orchest. Orchest is an open source tool to supercharge your Jupyter workflow. It allows you to create data science pipelines that consist of individual Jupyter Notebooks as pipeline steps, combining the advantages of interactive notebooks with those of data pipelines.
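To give a feel for the general idea of notebooks as pipeline steps, here is a small conceptual sketch. It is not Orchest's API or file format: the notebook names, the dependency dictionary, and the `run_step` placeholder are all hypothetical, and the ordering comes from Python's standard `graphlib` module rather than from Orchest itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a notebook step,
# each value is the set of steps that must finish before it.
pipeline = {
    "load_data.ipynb": set(),
    "clean_data.ipynb": {"load_data.ipynb"},
    "train_model.ipynb": {"clean_data.ipynb"},
    "evaluate_model.ipynb": {"train_model.ipynb", "clean_data.ipynb"},
}


def run_step(name):
    """Placeholder: a real tool would execute the notebook here."""
    return f"ran {name}"


# Resolve an order in which every step runs after its dependencies.
execution_order = list(TopologicalSorter(pipeline).static_order())
log = [run_step(step) for step in execution_order]
print(execution_order)
```

Each notebook stays interactive on its own, while the dependency graph captures how the steps fit together as a pipeline.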

Through our personal experience as data scientists, we have discovered that significant technical complexity arises when doing large scale data science projects. Our mission is to make it painless and simple to leverage Jupyter Notebooks in cloud based environments while collaborating with others. By integrating Jupyter Notebooks in Orchest, we believe we can leverage the strengths of notebooks to make them an even better tool for modern data science.

As of today we are still at the very beginning of this journey, with our project just starting to take shape. We very much welcome contributions and suggestions from the wider community to further develop the software for a broad and diverse data science audience.

In another article we will dive into what exactly our vision is for Orchest. We will give concrete examples of current pain points for data scientists, how we are solving them with Orchest today, and how we are planning to address more challenges in the future. Stay tuned!

[1] Wikipedia - Data science
[2] Bret Victor - Inventing on Principle