Data Engineering · ETL · Architecture

Building Data Pipelines That Actually Scale

From messy CSVs to production-grade ETL — how to design data pipelines using batch and stream processing, with architecture patterns for every stage.

Feb 20, 2025 · 14 min read


Every data pipeline starts the same way: a Python script reading a CSV, transforming it with Pandas, and dumping it into a database. Six months later, that script is a cron job processing 50GB nightly, and it's everyone's problem.

Here's how to build pipelines that grow with your data — from simple ETL scripts to production-grade architectures.

The ETL Pipeline: Where It All Begins

[Figure: Classic ETL Pipeline]

The three stages are always the same. What changes is the scale and the latency requirements.
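The three stages can be sketched as three small, composable functions. This is a minimal illustration, not a framework: the file path, table name, and column names (`id`, `email`) are hypothetical, and the target here is SQLite standing in for whatever database you load into.

```python
import csv
import sqlite3

def extract(path):
    """Read raw rows from a CSV file (path is a hypothetical input)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Clean and reshape: drop rows missing an id, normalize email casing."""
    for row in rows:
        if not row.get("id"):
            continue
        row["email"] = row["email"].strip().lower()
        yield row

def load(rows, conn):
    """Write transformed rows into the target table."""
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (:id, :email)", list(rows)
    )
    conn.commit()
```

Keeping extract, transform, and load as separate functions matters more than it looks: it lets you swap the CSV source for Kafka, or SQLite for a warehouse, without touching the transform logic.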

Batch vs Stream: The Core Decision

[Figure: Kappa Architecture]

Batch for when you need completeness and can tolerate latency. Stream for when you need freshness. The Kappa Architecture says: if your stream is replayable (Kafka), you only need one pipeline.
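The Kappa idea can be shown without any infrastructure: if the log is replayable, one processing function serves both modes. Here a plain Python list stands in for a Kafka topic (this is illustrative, not the Kafka API), and the same `apply_event` function handles "batch" (replay from offset 0) and "stream" (apply to each new event).

```python
from collections import defaultdict

# A replayable log standing in for a Kafka topic: events are appended once
# and can be re-read from any offset.
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def apply_event(state, event):
    """The single processing function shared by batch and stream modes."""
    state[event["user"]] += event["amount"]
    return state

# Batch = replay the whole log from the beginning.
batch_state = defaultdict(int)
for event in log:
    apply_event(batch_state, event)

# Stream = the same function applied to each event as it arrives.
stream_state = defaultdict(int)
def on_new_event(event):
    apply_event(stream_state, event)
```

Because both paths run identical code, there is no batch/stream drift to debug, which is exactly the maintenance win Kappa promises over running two parallel pipelines.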

The Modern Data Platform

[Figure: Modern Data Platform Architecture]

This is the architecture I've built and operated. The key layers:

  • Ingestion: Kafka Connect for streaming, CDC (Debezium) for database change capture
  • Storage: Data lake with zone-based organization — raw (untouched), staging (cleaned), curated (business-ready)
  • Processing: Spark for heavy transforms, dbt for SQL-based modeling, Python for custom ETL logic
  • Serving: Data warehouse for BI tools and dashboards

Data Quality: The Pipeline Nobody Builds (Until It's Too Late)

[Figure: Data Quality Pipeline]

Every record passes through a quality gate before reaching the curated layer. Failed records go to a dead letter queue for investigation — never silently dropped, never blocking the pipeline.
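A minimal sketch of that gate, assuming hypothetical rules on hypothetical fields (`order_id`, `amount`): each record is validated, and failures are routed to the DLQ with the reasons attached so an on-call engineer can triage them later.

```python
def quality_gate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if record.get("order_id") is None:
        errors.append("missing order_id")
    if not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0:
        errors.append("invalid amount")
    return errors

def run_gate(records):
    """Route records: passing ones continue to the curated layer, failing
    ones go to the dead letter queue with their errors -- never dropped,
    never blocking the rest of the batch."""
    curated, dead_letter = [], []
    for record in records:
        errors = quality_gate(record)
        if errors:
            dead_letter.append({"record": record, "errors": errors})
        else:
            curated.append(record)
    return curated, dead_letter
```

Attaching the violation list to the DLQ entry is the important part: a queue of bare failed records tells you nothing at 3am.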

File Formats Matter More Than You Think

[Figure: File Format Comparison]

Parquet for anything analytical — columnar compression means you read only the columns you query. Avro for streaming because it supports schema evolution (add fields without breaking consumers). CSV is for export only.
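The columnar advantage is easy to see with a toy model (plain Python dicts and lists standing in for on-disk layout): in a row layout, answering `SELECT sum(amount)` forces you to touch every row; in a columnar layout, you read exactly one column.

```python
# Row layout: every row must be touched even to read one field.
rows = [
    {"user_id": 1, "country": "DE", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 5.5},
    {"user_id": 3, "country": "DE", "amount": 7.25},
]

# Columnar layout (how Parquet organizes data on disk): each column is
# stored contiguously, so a one-column query reads only that column.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "DE"],
    "amount": [10.0, 5.5, 7.25],
}

# SELECT sum(amount): columnar reads one list; the row layout scans every record.
total_columnar = sum(columns["amount"])
total_rows = sum(r["amount"] for r in rows)
```

In practice you get this for free by passing a column list to the reader, e.g. `pd.read_parquet(path, columns=["amount"])` — the unqueried columns never leave disk.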

Lessons from Production

  • Idempotency: Every pipeline step must be safe to re-run. Use UPSERT, not INSERT; track progress with watermarks (the last successfully processed position), not wall-clock timestamps.
  • Backfill capability: You will need to reprocess historical data. Design for it from day one.
  • Schema evolution: Fields get added, types change, columns get deprecated. Use a schema registry.
  • Monitoring: Track row counts, processing times, null rates, and freshness. Alert on anomalies, not just failures.

The best data pipeline is the one your team can debug at 3am without you. Document the data flow, name things clearly, and make every failure visible.
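The idempotency point is worth making concrete. A sketch using SQLite's UPSERT syntax (`ON CONFLICT ... DO UPDATE`, available in SQLite 3.24+; the `daily_revenue` table is hypothetical): re-running the same batch overwrites rows instead of duplicating them, so a retry or backfill is always safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL)")

def load_batch(conn, rows):
    """Idempotent load: UPSERT keyed on the day, so re-running the same
    batch converges to the same state instead of inserting duplicates."""
    conn.executemany(
        """INSERT INTO daily_revenue (day, amount) VALUES (?, ?)
           ON CONFLICT(day) DO UPDATE SET amount = excluded.amount""",
        rows,
    )
    conn.commit()

batch = [("2025-02-19", 120.0), ("2025-02-20", 95.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # safe re-run: same result, no duplicates
```

The same pattern (`MERGE`, `INSERT ... ON CONFLICT`, or dbt incremental models with a unique key) applies in any warehouse; the key design choice is that the natural key of the data, not an auto-increment id, decides row identity.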
