Apache Airflow and Prefect

Big asset managers like BlackRock and Vanguard historically relied on custom-built schedulers, cron-based orchestration, enterprise ETL tools, and monolithic job-control systems to manage workflows. These setups were often brittle and difficult to manage at scale, but they were the standard in the 2000s and early 2010s.

Finance had complex, interdependent systems decades before open-source orchestration tools existed. Regulations required auditability, determinism, and log retention, which pushed firms toward enterprise solutions. Latency constraints and overnight batch cycles made orchestrating large volumes of data mission-critical.

But now they have Apache Airflow (created at Airbnb in 2014) and Prefect. They increasingly adopt hybrid setups, e.g., running Airflow on top of legacy systems, or using AWS Step Functions for cloud-native orchestration. Python's rise to dominance has made tools like Airflow and Dagster appealing, especially for quant/dev teams.

Here is a simple Airflow DAG, simple_dag.py:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Define Python functions
def start_task():
    print("Workflow started!")

def print_current_date():
    print(f"Current date and time is: {datetime.now()}")

def end_task():
    print("Workflow completed!")

# Define the DAG
with DAG(
    dag_id='simple_etl_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',  # runs once per day (renamed to schedule= in Airflow 2.4+)
    catchup=False,               # don't backfill runs between start_date and now
    tags=['example'],
) as dag:

    start = PythonOperator(
        task_id='start',
        python_callable=start_task
    )

    print_date = PythonOperator(
        task_id='print_date',
        python_callable=print_current_date
    )

    end = PythonOperator(
        task_id='end',
        python_callable=end_task
    )

    # Define task order
    start >> print_date >> end
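
To try this out, drop the file into your dags/ folder and trigger it from the web UI. Alternatively, assuming Airflow 2.5+, you can smoke-test the DAG in-process with dag.test(); a minimal sketch appended to the same file:

# Local smoke test (requires Airflow 2.5+): runs the whole DAG once
# in this process, with no scheduler or workers needed.
if __name__ == "__main__":
    dag.test()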

So what are the pros and cons when comparing Airflow to Prefect?

Capability | Airflow | Prefect 2.x
Programming model | Declarative (DAGs defined statically) | Imperative (flows written like Python code)
State handling | Tasks run independently; DAG defines flow | Flows and tasks manage state natively
Error handling | Requires explicit try/except, retries | Built-in retry, timeout, caching options (see sketch below)
Dynamic workflows | Harder (e.g., branching, looping) | Easy (can use native if, for, etc.)
UI / observability | Rich web UI with logs, DAG view | Simple local UI, full cloud dashboard
Scheduling | Built-in scheduler | External (Prefect Cloud or CLI)
Deployment | Airflow Scheduler + Workers + DB | Prefect Agent + Python environment
Extensibility | Plugins, Operators, Hooks | Python-first decorators, integrations
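
To illustrate the error-handling and dynamic-workflow rows above, here is a minimal Prefect 2.x sketch; the simulated failure rate and the branching rule are invented for the demo:

from random import random
from prefect import flow, task

@task(retries=3, retry_delay_seconds=5)  # built-in retries, no try/except needed
def flaky_fetch(symbol: str) -> float:
    # Simulate an unreliable data source; Prefect reruns the task on failure.
    if random() < 0.3:
        raise RuntimeError(f"transient error fetching {symbol}")
    return 100.0

@flow
def dynamic_fetch_flow(symbols: list[str]):
    # Dynamic workflow: plain Python control flow shapes the task graph at runtime.
    prices = {}
    for symbol in symbols:
        if symbol.startswith("G"):  # arbitrary branch, just for illustration
            continue
        prices[symbol] = flaky_fetch(symbol)
    print(prices)

if __name__ == "__main__":
    dynamic_fetch_flow(["AAPL", "MSFT", "GOOG"])

In Airflow, the equivalent retry behavior lives in operator arguments (retries=, retry_delay=), but the branching would need BranchPythonOperator or similar rather than a plain if.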

Summary

Use Case | Recommendation
You want a proven, enterprise-grade batch scheduler with lots of integrations | 🟢 Use Airflow
You want simple Python-native workflows, better dynamic behavior, and modern DX | 🟢 Use Prefect
You're managing legacy ETL pipelines | 🟢 Stick with or migrate to Airflow gradually
You're starting from scratch or want to move fast in Python | 🟢 Prefer Prefect

Hence, for a Python-first team starting fresh, Prefect fits well.

A simple prefect_flow.py:

from prefect import flow, task

# Define the extract task
@task
def extract():
    data = {"AAPL": 150, "MSFT": 300, "GOOG": 2800}
    print("Extracted data:", data)
    return data

# Define the transform task
@task
def transform(data):
    # Add 5% to each price
    transformed_data = {k: round(v * 1.05, 2) for k, v in data.items()}
    print("Transformed data:", transformed_data)
    return transformed_data

# Define the load task
@task
def load(data):
    print("Loading data...")
    for k, v in data.items():
        print(f"{k}: ${v}")

# Define the flow that connects the tasks
@flow
def simple_etl_flow():
    raw_data = extract()
    processed_data = transform(raw_data)
    load(processed_data)

# Run the flow directly
if __name__ == "__main__":
    simple_etl_flow()
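
Running python prefect_flow.py executes the flow once locally. To put it on a schedule without standing up extra infrastructure, recent Prefect 2.x releases (2.10.17+) offer flow.serve(); a minimal sketch, with the deployment name invented for the example:

# Hypothetical scheduled deployment: serves the flow from this process,
# running it every hour (interval is in seconds).
if __name__ == "__main__":
    simple_etl_flow.serve(name="simple-etl-hourly", interval=3600)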
