English for Data Engineers: pipelines, ETL and data architecture
The pipeline failed overnight and a Data Scientist in Berlin is waiting for an explanation in English. Learn 28 Data Engineer terms in English, covering ETL, orchestration, data contracts and data architecture.
The pipeline failed at 03:14 and a Data Scientist in Berlin is waiting for an explanation in English
You come in in the morning and see an alert: ETL job failed at 03:14 UTC. In your inbox is a message from a Data Scientist in the Berlin office: "The dashboard shows no data for yesterday. What happened and which datasets are affected?" The stand-up with the Product Manager is in an hour. In English, you need to explain the root cause of the transformation error, describe the schema that changed without your knowledge, give an impact assessment and present a remediation plan.
Pipelines fail, schemas change and data arrives late, and communication about all of this has to be technically precise yet understandable to stakeholders who are not engineers.
This article is for Data Engineers, Senior Data Engineers, Analytics Engineers, Data Platform Engineers, DataOps Engineers and Big Data Engineers: anyone who builds and maintains data infrastructure in an English-speaking environment.
28 terms in 4 categories
Data pipeline & ETL: 8 terms
| EN term | PL translation | Example sentence |
|---|---|---|
| ETL (extract, transform, load) | wyodrębnianie, transformacja, ładowanie | "Our ETL pipeline extracts raw data from the CRM, applies business logic transformations and loads the results into the data warehouse." |
| ELT (extract, load, transform) | wyodrębnianie, ładowanie, transformacja | "We switched from ETL to ELT: we load raw data into the warehouse first and run transformations using dbt directly in the warehouse." |
| data pipeline | potok danych; pipeline danych | "The data pipeline runs every hour and consists of six stages: ingestion, validation, transformation, enrichment, aggregation and load." |
| batch processing | przetwarzanie wsadowe | "Batch processing runs overnight — all transactions from the previous day are processed in a single job starting at 02:00 UTC." |
| streaming | przetwarzanie strumieniowe | "We moved the fraud detection system to streaming — events are now processed within two seconds of the transaction occurring." |
| orchestration | orkiestracja; zarządzanie przepływem zadań | "Airflow handles all pipeline orchestration — each DAG defines the dependencies between tasks and the retry logic on failure." |
| DAG (directed acyclic graph) | skierowany graf acykliczny | "The DAG for the orders pipeline has 14 nodes — the schema validation task must complete successfully before transformation begins." |
| job scheduler | harmonogram zadań; scheduler | "The job scheduler triggers the pipeline at 06:00 UTC daily and sends an alert to the on-call engineer if the run exceeds 90 minutes." |
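The DAG and orchestration entries above describe tasks that may only start once their upstream tasks have finished. A minimal sketch of that idea, using Python's standard-library `graphlib` rather than Airflow itself; the task names are hypothetical and chosen only for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for an orders pipeline. Each key maps to the set of
# tasks that must finish before it can start (its upstream dependencies).
dependencies = {
    "ingest": set(),
    "validate_schema": {"ingest"},
    "transform": {"validate_schema"},
    "aggregate": {"transform"},
    "load": {"aggregate"},
}

def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, a valid order always exists, and schema validation is guaranteed to come before transformation. An orchestrator adds scheduling, retries and alerting on top of this ordering.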
Data storage & architecture: 8 terms
| EN term | PL translation | Example sentence |
|---|---|---|
| data warehouse | hurtownia danych | "Our data warehouse runs on BigQuery — all analytical queries from BI tools are executed against the gold layer." |
| data lake | jezioro danych | "Raw event data lands in the data lake first — it's stored in Parquet format and partitioned by date." |
| data lakehouse | architektura łącząca cechy hurtowni i jeziora danych | "We implemented a data lakehouse on Delta Lake — it gives us ACID transactions on top of object storage." |
| schema | schemat danych | "The pipeline broke because the upstream team added a NOT NULL column to the schema without updating the data contract." |
| table | tabela | "The orders table is the most critical table in the warehouse — it's queried by 14 downstream models." |
| partition | partycja | "The events table is partitioned by date — always filter on the partition column to avoid full table scans." |
| data mart | magazyn danych (dla określonej domeny) | "The finance data mart contains pre-aggregated revenue metrics and is refreshed daily at 07:00 UTC." |
| dimensional modelling | modelowanie wymiarowe | "We use dimensional modelling — fact tables for transactions and dimension tables for customers, products and time." |
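The advice on partitions above (always filter on the partition column) can be illustrated with a toy model. This is only a sketch in plain Python, assuming a hypothetical events table stored as one list of rows per date; a real warehouse does the equivalent pruning inside its query engine.

```python
from datetime import date

# Hypothetical events "table": one list of rows per date partition.
partitions = {
    date(2026, 6, 6): [{"event": "click"}, {"event": "view"}],
    date(2026, 6, 7): [{"event": "purchase"}],
    date(2026, 6, 8): [{"event": "click"}],
}

def scan(parts, day=None):
    """Return (rows, partitions_read). Filtering on day reads at most one partition."""
    if day is not None:
        return list(parts.get(day, [])), (1 if day in parts else 0)
    # No partition filter: every partition has to be read (a full table scan).
    rows = [row for part in parts.values() for row in part]
    return rows, len(parts)

filtered_rows, filtered_reads = scan(partitions, date(2026, 6, 7))
all_rows, full_reads = scan(partitions)
print(filtered_reads, full_reads)
```

The filtered query touches one partition, while the unfiltered one touches all three. On a table with years of daily partitions, that difference is what drives query time and cost.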
Data quality & governance: 6 terms
| EN term | PL translation | Example sentence |
|---|---|---|
| data quality | jakość danych | "We run automated data quality checks after every pipeline run — if any check fails, the downstream models are blocked." |
| data lineage | rodowód danych; linia danych | "Data lineage lets us trace the origin of any metric back through every transformation step to the raw source." |
| data catalogue | katalog danych | "The data catalogue documents every dataset in the warehouse — owners, freshness SLAs, column descriptions and sample queries." |
| data contract | kontrakt danych | "We've introduced data contracts between the backend engineering team and the data platform — schema changes now require a 14-day notice period." |
| SLA (data freshness) | SLA świeżości danych | "The SLA for the orders table is T+2 hours — if data is more than two hours stale, the on-call engineer is paged automatically." |
| data observability | obserwowalność danych | "Our data observability platform monitors volume, freshness, schema changes and distribution anomalies across all critical tables." |
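The freshness SLA entry above boils down to a comparison between the last refresh time and the current time. A minimal sketch, assuming the two-hour threshold from the example sentence; the function name and timestamps are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumed SLA, taken from the example sentence: data may be at most two hours old.
FRESHNESS_SLA = timedelta(hours=2)

def is_stale(last_refresh, now, sla=FRESHNESS_SLA):
    """Return True when the newest data is older than the SLA allows."""
    return now - last_refresh > sla

now = datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc)
refreshed_recently = datetime(2026, 6, 8, 7, 30, tzinfo=timezone.utc)
refreshed_long_ago = datetime(2026, 6, 8, 6, 0, tzinfo=timezone.utc)

print(is_stale(refreshed_recently, now))  # data is 1.5 hours old
print(is_stale(refreshed_long_ago, now))  # data is 3 hours old
```

In practice a check like this would run on a schedule and page the on-call engineer when it returns True. That alerting part is left out here.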
Tools & infrastructure: 6 terms
| EN term | PL translation | Example sentence |
|---|---|---|
| Apache Spark | Apache Spark | "We use Apache Spark for large-scale transformations — the daily aggregation job processes 800 million rows in under 12 minutes." |
| dbt (data build tool) | dbt — narzędzie do transformacji danych | "All our transformation logic lives in dbt — analysts can write SQL models and dbt handles dependency resolution, testing and documentation." |
| Airflow | Apache Airflow | "Airflow orchestrates all 47 production pipelines — the UI gives us a clear view of DAG runs, failures and SLA misses." |
| Kafka | Apache Kafka | "Real-time events are published to Kafka topics and consumed by the streaming pipeline within milliseconds." |
| cloud storage (S3/GCS/ADLS) | obiektowa pamięć masowa w chmurze | "Raw data lands in S3, is processed by Spark and the results are written to GCS for consumption by the warehouse." |
| query optimisation | optymalizacja zapytań | "Query optimisation reduced the average dashboard load time from 45 seconds to 3 seconds — the key fix was partitioning and clustering the events table." |
Communication scenarios
a) Pipeline incident report: 8 phrases
- "The ETL job failed at the transformation stage at 03:14 UTC. The root cause was a schema change in the upstream source table that broke the type casting."
- "The upstream team added a new column with a NOT NULL constraint — our pipeline was not notified of the change and failed on the first null value encountered."
- "The impact is limited to the orders pipeline — seven downstream dbt models are blocked, affecting the sales dashboard and the finance data mart."
- "Data for yesterday is currently unavailable in the warehouse. Historical data up to 2026-06-07 23:59 UTC is complete and unaffected."
- "The fix is a two-line schema update in our dbt model — I've already pushed the change to a feature branch and it's in review."
- "Estimated time to resolution is 90 minutes — pipeline will be re-run after the fix is merged and the backfill will cover the missed window."
- "As a prevention measure, I'm proposing we implement a schema contract test that fails loudly before the transformation stage if the source schema changes."
- "I'll send a postmortem by end of day covering root cause, impact, fix and the three process improvements we're implementing."
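The "schema contract test that fails loudly" proposed in the phrases above could look roughly like this. It is a sketch only: the expected schema, the column names and the `SchemaContractError` class are assumptions made for illustration, not part of any specific tool.

```python
# Hypothetical expected schema for the source table: column name -> type name.
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "timestamp"}

class SchemaContractError(Exception):
    """Raised when the source schema deviates from the agreed contract."""

def check_schema(actual, expected=EXPECTED_SCHEMA):
    """Compare the actual source schema with the expected one and fail on any difference."""
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(f"type changed: {column} {expected_type} -> {actual[column]}")
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    if problems:
        raise SchemaContractError("; ".join(problems))

# Run the check before the transformation stage starts.
changed_source = {**EXPECTED_SCHEMA, "coupon_code": "str"}
try:
    check_schema(changed_source)
except SchemaContractError as error:
    print(f"Schema contract violated: {error}")
```

Failing here, before any transformation runs, turns a confusing type casting error deep in the pipeline into a clear message that names the changed column.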
b) Data architecture review: 6 phrases
- "We're proposing a medallion architecture — bronze for raw ingestion, silver for cleaned and conformed data, gold for business-ready aggregations."
- "The current design has a single monolithic pipeline — we're refactoring into domain-oriented data products, each owned by the relevant engineering team."
- "The data lakehouse gives us ACID transactions and time travel on top of object storage — we get the flexibility of a data lake with the reliability of a data warehouse."
- "We've identified three bottlenecks in the current architecture: lack of partitioning on the events table, no incremental loading strategy and no data contract enforcement."
- "The proposed migration has three phases: schema standardisation in Q3, incremental load adoption in Q4 and full medallion deployment by Q1 2027."
- "Each domain team will own their data products and be accountable for freshness SLAs — the data platform team provides the tooling and the standards."
c) Escalating a data quality issue: 6 phrases
- "We detected a data quality issue in the customer dimension — approximately 3,400 records have null values in the email field that should be mandatory."
- "The root cause is an upstream API change that removed email from the payload for customers who opted out of marketing — the schema wasn't updated to reflect this."
- "The impacted downstream models are: the email campaign audience, the churn prediction feature store and the monthly active users report."
- "We've blocked the downstream models until the quality issue is resolved — sending incorrect data to the campaign tool would be worse than sending no data."
- "I recommend we add a NOT NULL check on the email field as a dbt test — if it fails, the model run stops before writing to the gold layer."
- "For immediate resolution: the data platform team can backfill the missing emails from the CRM for the affected records — I need 30 minutes and a data access approval."
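The NOT NULL check on the email field recommended above would normally be declared as a dbt test. The same logic can be sketched in plain Python, assuming hypothetical customer records with `customer_id` and `email` keys:

```python
def find_null_emails(rows):
    """Return the customer_id of every record whose email is missing or empty."""
    return [row["customer_id"] for row in rows if not row.get("email")]

def publish_if_clean(rows):
    """Return rows unchanged, or raise before anything is written downstream."""
    missing = find_null_emails(rows)
    if missing:
        raise ValueError(f"{len(missing)} record(s) with null email: {missing}")
    return rows

# Hypothetical sample of the customer dimension.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3},
]
print(find_null_emails(customers))
```

Raising an error instead of silently dropping the bad rows matches the reasoning in the phrases above: blocking the downstream models is preferable to sending incorrect data to the campaign tool.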
Short dialogue: a pipeline failure, with a Data Engineer, a Data Scientist and a Product Manager
Data Scientist: "Hey — our Berlin dashboard shows no sales data for yesterday. What's going on?"
Data Engineer: "The orders pipeline failed at 03:14 UTC. Root cause: the backend team pushed a schema change to the source table — they added a NOT NULL column without updating the data contract. Our type casting broke on the first null it hit."
Product Manager: "How much data is affected and when will it be back?"
Data Engineer: "Yesterday's transactions are missing from the warehouse. Historical data before that window is complete. The fix is already in review, and I estimate the pipeline will be green and the backfill complete within 90 minutes."
Data Scientist: "Can we prevent this happening again?"
Data Engineer: "Yes. I'm adding a schema contract test that runs before the transformation stage — if the source schema deviates from what we expect, the job fails fast and pages on-call immediately. I'll also work with the backend team to add data contract sign-off to their PR process for any table we depend on."
Data contract vocabulary: 6 phrases
- "The data contract specifies that the orders table will be refreshed by 06:00 UTC daily with a maximum latency of 30 minutes."
- "Any breaking schema change — adding a NOT NULL column, renaming a field, changing a data type — requires 14 days' notice to all downstream consumers."
- "The data contract is owned by the checkout engineering team — all schema change requests must go through their RFC process."
- "We define a breaking change as any modification that causes an existing downstream query or model to fail without code changes on the consumer side."
- "The SLA breach threshold is two consecutive missed refreshes — at that point, the data platform team is automatically paged and the downstream models are suspended."
- "Data contracts are versioned in our Git repository alongside the schema definitions — the change history is auditable and rollback is possible within 30 days."
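The breaking changes listed in the phrases above (a new NOT NULL column, a renamed field, a changed data type) can be detected by comparing two schema versions. A minimal sketch, assuming each schema is a hypothetical dictionary that maps a column name to a `(type, nullable)` pair:

```python
def breaking_changes(old, new):
    """List the changes between two schema versions that would break consumers."""
    changes = []
    for name, (old_type, _) in old.items():
        if name not in new:
            # A rename looks like a removal from the consumer's point of view.
            changes.append(f"removed or renamed: {name}")
        elif new[name][0] != old_type:
            changes.append(f"type changed: {name}")
    for name, (_, nullable) in new.items():
        if name not in old and not nullable:
            changes.append(f"new NOT NULL column: {name}")
    return changes

# Hypothetical schema versions: column name -> (type, nullable).
v1 = {"order_id": ("int", False), "note": ("str", True)}
v2 = {"order_id": ("str", False), "coupon": ("str", False), "tag": ("str", True)}
print(breaking_changes(v1, v2))
```

The new nullable column `tag` is not reported, since existing queries keep working without it. A check like this could run in the producer's CI so that the notice period starts before the change ships.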
Most common mistakes made by Polish speakers
1. "ETL": pronunciation. Say the letters separately: E-T-L /ˌiː tiː ˈel/. It is an initialism, not a word. The same goes for ELT (E-L-T). DAG is the exception: it is usually pronounced as a word, /dæɡ/, especially among Airflow users. ✅ "The ETL [/ˌiː tiː ˈel/] job runs at midnight."
2. "pipeline" vs "workflow". A pipeline is a sequence of strictly technical data transformation steps. A workflow is a broader concept that also covers human tasks and business processes. In data engineering conversations, default to pipeline. ✅ "The data pipeline failed", not "the workflow failed".
3. "schema": pronunciation. The correct pronunciation is /ˈskiːmə/, with a long first vowel as in "scheme". Polish speakers often say "shema". ❌ "shema change" → ✅ "schema [/ˈskiːmə/] change". Plural: schemas (more common in IT) or schemata (formal).
4. "data": uncountable in data engineering. In data engineering the uncountable form with a singular verb dominates: ✅ "the data is available", "the data is stale", "the data is partitioned by date". The form "the data are" is correct but sounds archaic in an IT context.
5. "load" vs "ingest". Ingest is the broader process of bringing data from external sources into the platform. Load means writing data into the target analytical system, which is the last step of ETL. ❌ "We load data from the API" → ✅ "We ingest data from the API." ✅ "We load the transformed data into the warehouse."
Quick Reference Table: 28 terms
| EN term | PL translation | Category |
|---|---|---|
| ETL | wyodrębnianie, transformacja, ładowanie | Pipeline |
| ELT | wyodrębnianie, ładowanie, transformacja | Pipeline |
| data pipeline | potok danych | Pipeline |
| batch processing | przetwarzanie wsadowe | Pipeline |
| streaming | przetwarzanie strumieniowe | Pipeline |
| orchestration | orkiestracja | Pipeline |
| DAG | skierowany graf acykliczny | Pipeline |
| job scheduler | harmonogram zadań | Pipeline |
| data warehouse | hurtownia danych | Storage |
| data lake | jezioro danych | Storage |
| data lakehouse | architektura lakehouse | Storage |
| schema | schemat danych | Storage |
| table | tabela | Storage |
| partition | partycja | Storage |
| data mart | magazyn danych dziedzinowy | Storage |
| dimensional modelling | modelowanie wymiarowe | Storage |
| data quality | jakość danych | Quality |
| data lineage | rodowód danych | Quality |
| data catalogue | katalog danych | Quality |
| data contract | kontrakt danych | Quality |
| SLA (data freshness) | SLA świeżości danych | Quality |
| data observability | obserwowalność danych | Quality |
| Apache Spark | Apache Spark | Tools |
| dbt | narzędzie do transformacji danych | Tools |
| Airflow | Apache Airflow | Tools |
| Kafka | Apache Kafka | Tools |
| cloud storage (S3/GCS/ADLS) | obiektowa pamięć masowa | Tools |
| query optimisation | optymalizacja zapytań | Tools |
Summary
Data Engineer English is a language of technical precision and incident communication. The difference between load and ingest, between pipeline and workflow, between schema pronounced correctly and schema pronounced wrongly: these are signals that tell the other person whether you have genuinely deep experience with data.
You can find the terminology from this article in the IT & Programowanie flashcards, in the Data Engineer category. Studying systematically for 10 to 15 minutes a day should let you communicate fluently in English in any data engineering environment after three weeks, from the morning stand-up to the postmortem after a critical pipeline failure.
Related articles: English for AI/ML Engineers, English for Data Scientists and English for DevOps Engineers.