Marco Gorelli
Marco is a Data Scientist at the Samsung R&D Institute UK. Outside of work, he is a maintainer of pandas (a data wrangling library for Python, widely adopted in the scientific computing community) and the author of nbQA (a code quality tool for Jupyter Notebooks). He holds an MSc in Mathematics and Foundations of Computer Science from the University of Oxford.
Samsung R&D Institute UK, UK
MarcoGorelli
Never Have an Unmaintainable Jupyter Notebook Again!
ML conf EU 2020
26 min
Data visualisation is a fundamental part of Data Science. The talk will start with a practical demonstration (using pandas, scikit-learn, and matplotlib) of how relying on summary statistics and predictions alone can leave you blind to the true nature of your datasets. I will argue that visualisations are crucial at every step of the Data Science process, and therefore that Jupyter Notebooks belong in Data Science. We will then look at why maintainability is a real challenge for Jupyter Notebooks, especially when keeping them under version control with git. Although there are many code quality tools for Python scripts (flake8, black, mypy, etc.), most of them don't work on Jupyter Notebooks. To address this, I will present nbQA, which allows any standard Python code quality tool to be run on a Jupyter Notebook. Finally, I will demonstrate how to use it within a workflow that lets practitioners keep the interactivity of their Jupyter Notebooks without sacrificing maintainability.
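To give a rough sense of why script-oriented tools don't work on notebooks out of the box, and what running them anyway involves, here is a minimal sketch. A notebook file is JSON rather than Python source, so one possible approach is to pull out the code cells and join them into a temporary script that a linter or formatter can read. This is an illustration only and not a description of how nbQA is implemented; the function name `code_cells_to_script` and the example notebook are hypothetical, and the only assumption about the file is the nbformat 4 layout (a top-level `cells` list whose entries have `cell_type` and `source`).

```python
import json


def code_cells_to_script(notebook_json: str) -> str:
    """Join the code cells of an nbformat-4 notebook into one Python script.

    Hypothetical helper for illustration; not part of nbQA.
    """
    notebook = json.loads(notebook_json)
    chunks = []
    for index, cell in enumerate(notebook.get("cells", [])):
        # Markdown and raw cells are not Python, so skip them.
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat allows "source" to be a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Mark where each cell started so tool output can be traced back.
        chunks.append(f"# %% cell {index}\n{source.rstrip()}\n")
    return "\n".join(chunks)


# A made-up notebook with one markdown cell and two code cells.
example_notebook = json.dumps(
    {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
            {
                "cell_type": "code",
                "metadata": {},
                "outputs": [],
                "execution_count": None,
                "source": ["import os\n", "print(os.getcwd())"],
            },
            {
                "cell_type": "code",
                "metadata": {},
                "outputs": [],
                "execution_count": None,
                "source": "x = 1",
            },
        ],
    }
)

script = code_cells_to_script(example_notebook)
print(script)
```

The resulting string is ordinary Python that a standard tool could check. A real workflow also has to deal with things this sketch ignores, such as IPython magics (for example `%matplotlib inline`), mapping reported line numbers back to cells, and writing any fixes back into the notebook.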