
ETL in Airflow Aided by Jupyter Notebooks and Papermill

So my issue is that I build ETL pipelines in Airflow, but I really develop and test the Extract, Transform and Load functions in Jupyter notebooks first. So I end up copy-pasting code back and forth between the notebooks and the Airflow DAGs.

Solution 1:

[Disclaimer: I am one of the committers for the mentioned open source project.] We've created Elyra - a set of JupyterLab extensions - to streamline exactly this kind of work. We've just released version 2.1, which provides a visual editor that you can use to assemble pipelines from notebooks and Python scripts (R support should be available soon) and run them on Apache Airflow, Kubeflow Pipelines, or locally in JupyterLab. For Airflow (running on Kubernetes) we've created a custom operator that takes care of housekeeping and execution. I've written a summary article about it that you can find here, and we've got a couple of introductory tutorials if you are interested in trying this out.

Solution 2:

A single master Jupyter notebook, with any number of slave notebooks (used as templates), executed in sequence using papermill.execute_notebook, should be sufficient to automate any ML pipeline.

To pass information between pipeline stages (from one slave notebook to the next one(s)), it's possible to use another Netflix package, scrapbook, which allows us to record Python objects in slave notebooks (as they are processed by papermill) and then retrieve those objects from the slaves in the pipeline master (values are saved with scrapbook.glue and read back with scrapbook.read_notebook).
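As a rough illustration (the notebook names and the row_count scrap below are made up), a master notebook could pass a value from one slave to the next like this:

import papermill as pm
import scrapbook as sb

# Stage 1: run the first slave; somewhere inside extract.ipynb a cell calls
#   sb.glue("row_count", row_count)
pm.execute_notebook("extract.ipynb", "output/extract.ipynb",
                    parameters=dict(date="2021-01-01"))

# Read the glued value back from the executed copy of the slave
row_count = sb.read_notebook("output/extract.ipynb").scraps["row_count"].data

# Stage 2: feed it into the next slave as an ordinary papermill parameter
pm.execute_notebook("transform.ipynb", "output/transform.ipynb",
                    parameters=dict(row_count=row_count))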

Resuming from any completed stage is also possible, but it requires storing the inputs produced by previous stage(s) in a predictable place reachable from the master (e.g. in a local JSON file next to the master, or in MLflow).
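A minimal sketch of that idea with a local JSON file (the file name and the stage names are made up for illustration):

import json
import os
import papermill as pm

STATE_FILE = "pipeline_state.json"
state = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}

for stage in ["extract", "transform", "load"]:
    if stage in state:                      # stage finished in a previous run: skip it
        continue
    pm.execute_notebook(f"{stage}.ipynb", f"output/{stage}.ipynb")
    state[stage] = "done"
    with open(STATE_FILE, "w") as f:        # persist progress after each stage
        json.dump(state, f)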

The master notebook can also be scheduled with a cron job (e.g. from Kubernetes).

  • Alternatives

Airflow is probably overkill for most ML teams due to admin costs (5 containers, incl. 2 databases), while other (non-Netflix) Python packages would either require more boilerplate (Luigi) or extra privileges and custom Docker images for executors (Elyra), and Ploomber would expose you to the risk of having few maintainers.

Solution 3:

It is possible to use Jupyter Notebooks in your Airflow pipeline, as you suggest, via Papermill. However, one of the advantages of Airflow is that you can separate your pipeline into discrete steps that are independent of each other, so if you write the whole pipeline in one Jupyter Notebook, that defeats the purpose of using Airflow.

So, assuming that each one of your discrete ETL steps lives in a separate Jupyter Notebook, you could try the following:

  1. Create one Jupyter Notebook for each step. For example, copy_data_from_s3, cleanup_data, load_into_database (3 steps, one notebook for each).
  2. Ensure that each notebook is parametrized per the Papermill instructions; that is, add the parameters tag to the cell that declares the variables you want to be able to set from outside.
  3. Ensure these notebooks are findable by Airflow (e.g. in the same folder where the DAG lives).
  4. Write functions that will use Papermill to parametrize and run your notebooks, one for each step. For example:
import papermill as pm

# ...
# define DAG, etc.
# ...

def copy_data_from_s3(**context):
    pm.execute_notebook(
        "copy_data_from_s3_step.ipynb",
        "copy_data_from_s3_step.ipynb",  # Papermill writes the executed notebook here (in place)
        parameters=dict(date=context['execution_date']),  # pass some context parameter if you need to
    )
  5. Finally, set up the step, perhaps as a PythonOperator (although you can also use a BashOperator if you want to run Papermill from the command line). To match the function from above:
copy_data = PythonOperator(dag=dag,
                           task_id='copy_data_task',
                           provide_context=True,
                           python_callable=copy_data_from_s3)
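Putting these steps together, here is a minimal end-to-end sketch (the DAG name, schedule, and Airflow 1.x-style imports are assumptions; adjust them for your setup):

from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_notebook(name):
    """Return a callable that executes the given notebook via Papermill."""
    def _run(**context):
        pm.execute_notebook(
            f"{name}.ipynb",
            f"{name}.ipynb",  # executed in place, as in the snippet above
            parameters=dict(date=str(context['execution_date'])),
        )
    return _run

dag = DAG("etl_notebooks", start_date=datetime(2021, 1, 1), schedule_interval="@daily")

steps = ["copy_data_from_s3_step", "cleanup_data_step", "load_into_database_step"]
tasks = [PythonOperator(dag=dag,
                        task_id=name,
                        provide_context=True,
                        python_callable=run_notebook(name))
         for name in steps]

# chain the tasks: copy -> cleanup -> load
for upstream, downstream in zip(tasks, tasks[1:]):
    upstream >> downstream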

Solution 4:

Airflow has a Papermill operator, but the development experience isn't great. One of the main issues with Python-based DAGs in Airflow is that they are all executed in the same Python environment, which can cause dependency problems as soon as you have more than one DAG. See this for more details.

If you are willing to give a new tool a try, I'd recommend Ploomber (disclaimer: I'm the author), which can orchestrate notebook-based pipelines (it uses Papermill under the hood). You can develop locally and export to Kubernetes or Airflow.

If you want to see what a Ploomber project looks like, feel free to take a look at the examples repository.

Solution 5:

Why do you want the ETL jobs as Jupyter notebooks? What advantage do you see? Notebooks are generally meant for building a nice document with live data; ETL jobs are supposed to be scripts that run in the background, automated.

Why can't these jobs be plain Python code instead of notebooks?

Also, when you run a notebook using the PapermillOperator, the output of the run is another notebook saved somewhere, and it is not that convenient to keep checking these output files.

I would recommend writing the ETL job in plain Python and running it with the PythonOperator. This is much simpler and easier to maintain.
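For comparison, a minimal sketch of the plain-Python version (the DAG name, schedule, and the body of etl_job are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def etl_job(**context):
    # extract, transform and load logic lives here as ordinary Python
    print(f"running ETL for {context['execution_date']}")

dag = DAG("plain_python_etl", start_date=datetime(2021, 1, 1), schedule_interval="@daily")

etl_task = PythonOperator(dag=dag,
                          task_id="etl_job",
                          provide_context=True,
                          python_callable=etl_job)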

If you want to use the notebook for its fancy features, that is a different thing.
