The pipeline went down at 3:14 a.m., and a Data Scientist in Berlin is waiting for an explanation in English

You arrive in the morning and see an alert: ETL job failed at 03:14 UTC. A message from a Data Scientist in the Berlin office is waiting in your inbox: "The dashboard shows no data for yesterday. What happened, and which datasets are affected?" The stand-up with the Product Manager is in an hour. You have to explain, in English, the root cause of the transformation error, describe the data schema that changed without your knowledge, give an impact assessment and present a recovery plan.

Pipelines fail, schemas change, data arrives late, and communication about all of it has to be technically precise and still understandable to stakeholders who are not engineers.

This article is for Data Engineers, Senior Data Engineers, Analytics Engineers, Data Platform Engineers, DataOps Engineers and Big Data Engineers: anyone who builds and maintains data infrastructure in an English-speaking environment.

28 terms, 4 categories

Data pipeline & ETL: 8 terms

EN Term	PL Translation	Example sentence
ETL (extract, transform, load)	wyodrębnianie, transformacja, ładowanie	"Our ETL pipeline extracts raw data from the CRM, applies business logic transformations and loads the results into the data warehouse."
ELT (extract, load, transform)	wyodrębnianie, ładowanie, transformacja	"We switched from ETL to ELT: we load raw data into the warehouse first, then run the transformations there with dbt."
data pipeline	potok danych; pipeline danych	"The data pipeline runs every hour and consists of six stages: ingestion, validation, transformation, enrichment, aggregation and load."
batch processing	przetwarzanie wsadowe	"Batch processing runs overnight — all transactions from the previous day are processed in a single job starting at 02:00 UTC."
streaming	przetwarzanie strumieniowe	"We moved the fraud detection system to streaming — events are now processed within two seconds of the transaction occurring."
orchestration	orkiestracja; zarządzanie przepływem zadań	"Airflow handles all pipeline orchestration — each DAG defines the dependencies between tasks and the retry logic on failure."
DAG (directed acyclic graph)	skierowany graf acykliczny	"The DAG for the orders pipeline has 14 nodes — the schema validation task must complete successfully before transformation begins."
job scheduler	harmonogram zadań; scheduler	"The job scheduler triggers the pipeline at 06:00 UTC daily and sends an alert to the on-call engineer if the run exceeds 90 minutes."
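
To make the DAG and orchestration entries concrete, here is a minimal Python sketch of how task dependencies determine run order. It uses the standard-library graphlib module rather than Airflow, and the task names are hypothetical, chosen only to echo the stages named in the example sentences.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "validate_schema": {"ingest"},
    "transform": {"validate_schema"},
    "aggregate": {"transform"},
    "load": {"aggregate"},
}


def execution_order(deps):
    """Return one valid run order; graphlib raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(" -> ".join(order))
```

In this sketch the schema validation task always runs before transformation, which is the dependency described in the DAG example sentence. A real orchestrator adds scheduling, retries and alerting on top of this ordering.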

Data storage & architecture: 8 terms

EN Term	PL Translation	Example sentence
data warehouse	hurtownia danych	"Our data warehouse runs on BigQuery — all analytical queries from BI tools are executed against the gold layer."
data lake	jezioro danych	"Raw event data lands in the data lake first — it's stored in Parquet format and partitioned by date."
data lakehouse	architektura łącząca cechy hurtowni i jeziora danych	"We implemented a data lakehouse on Delta Lake — it gives us ACID transactions on top of object storage."
schema	schemat danych	"The pipeline broke because the upstream team added a NOT NULL column to the schema without updating the data contract."
table	tabela	"The orders table is the most critical table in the warehouse — it's queried by 14 downstream models."
partition	partycja	"The events table is partitioned by date — always filter on the partition column to avoid full table scans."
data mart	magazyn danych (dla określonej domeny)	"The finance data mart contains pre-aggregated revenue metrics and is refreshed daily at 07:00 UTC."
dimensional modelling	modelowanie wymiarowe	"We use dimensional modelling — fact tables for transactions and dimension tables for customers, products and time."
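
The advice in the partition entry (always filter on the partition column) can be illustrated with a small Python sketch. The dictionary below is a hypothetical in-memory stand-in for a date-partitioned events table, not how any particular warehouse stores data. The point is only that a filter on the partition key lets the scan skip every other partition.

```python
from datetime import date

# Hypothetical stand-in for an events table partitioned by date.
partitions = {
    date(2026, 6, 6): [{"event": "view"}, {"event": "click"}],
    date(2026, 6, 7): [{"event": "purchase"}],
    date(2026, 6, 8): [{"event": "view"}],
}


def scan(table, day=None):
    """Return (rows, partitions_read); a filter on the partition key prunes the scan."""
    selected = table if day is None else {day: table.get(day, [])}
    rows = [row for part in selected.values() for row in part]
    return rows, len(selected)


rows, partitions_read = scan(partitions, day=date(2026, 6, 7))
print(f"{len(rows)} row(s) from {partitions_read} partition(s)")
```

Calling scan(partitions) without a day reads all three partitions, which is the full table scan the example sentence warns against.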

Data quality & governance: 6 terms

EN Term	PL Translation	Example sentence
data quality	jakość danych	"We run automated data quality checks after every pipeline run — if any check fails, the downstream models are blocked."
data lineage	rodowód danych; linia danych	"Data lineage lets us trace the origin of any metric back through every transformation step to the raw source."
data catalogue	katalog danych	"The data catalogue documents every dataset in the warehouse — owners, freshness SLAs, column descriptions and sample queries."
data contract	kontrakt danych	"We've introduced data contracts between the backend engineering team and the data platform — schema changes now require a 14-day notice period."
SLA (data freshness)	SLA świeżości danych	"The SLA for the orders table is T+2 hours — if data is more than two hours stale, the on-call engineer is paged automatically."
data observability	obserwowalność danych	"Our data observability platform monitors volume, freshness, schema changes and distribution anomalies across all critical tables."
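
As an illustration of the freshness SLA entry, the following Python sketch checks whether a table is more than two hours stale. The function name and the timestamps are assumptions made for the example. A real observability tool would read the last load time from pipeline metadata and page the on-call engineer instead of printing.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)  # the "T+2 hours" from the example sentence


def is_stale(last_loaded_at, now, sla=FRESHNESS_SLA):
    """True when the newest load is older than the freshness SLA allows."""
    return now - last_loaded_at > sla


# Hypothetical timestamps, both timezone-aware and in UTC.
now = datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc)
last_load = datetime(2026, 6, 8, 6, 30, tzinfo=timezone.utc)
print(is_stale(last_load, now))  # 2h30m old, so this prints True
```

The comparison is strict on purpose: data that is exactly two hours old still meets the SLA, matching the wording "more than two hours stale".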

Tools & infrastructure: 6 terms

EN Term	PL Translation	Example sentence
Apache Spark	Apache Spark	"We use Apache Spark for large-scale transformations — the daily aggregation job processes 800 million rows in under 12 minutes."
dbt (data build tool)	dbt — narzędzie do transformacji danych	"All our transformation logic lives in dbt — analysts can write SQL models and dbt handles dependency resolution, testing and documentation."
Airflow	Apache Airflow	"Airflow orchestrates all 47 production pipelines — the UI gives us a clear view of DAG runs, failures and SLA misses."
Kafka	Apache Kafka	"Real-time events are published to Kafka topics and consumed by the streaming pipeline within milliseconds."
cloud storage (S3/GCS/ADLS)	obiektowa pamięć masowa w chmurze	"Raw data lands in S3, is processed by Spark, and the results are written back to S3 for consumption by the warehouse."
query optimisation	optymalizacja zapytań	"Query optimisation reduced the average dashboard load time from 45 seconds to 3 seconds — the key fix was partitioning and clustering the events table."

Communication scenarios

a) Pipeline failure report: 8 phrases

"The ETL job failed at the transformation stage at 03:14 UTC. The root cause was a schema change in the upstream source table that broke the type casting."
"The upstream team added a new column with a NOT NULL constraint — our pipeline was not notified of the change and failed on the first null value encountered."
"The impact is limited to the orders pipeline — seven downstream dbt models are blocked, affecting the sales dashboard and the finance data mart."
"Data for yesterday is currently unavailable in the warehouse. Historical data up to 2026-06-07 23:59 UTC is complete and unaffected."
"The fix is a two-line schema update in our dbt model — I've already pushed the change to a feature branch and it's in review."
"Estimated time to resolution is 90 minutes — pipeline will be re-run after the fix is merged and the backfill will cover the missed window."
"As a prevention measure, I'm proposing we implement a schema contract test that fails loudly before the transformation stage if the source schema changes."
"I'll send a postmortem by end of day covering root cause, impact, fix and the three process improvements we're implementing."

b) Data architecture review: 6 phrases

"We're proposing a medallion architecture — bronze for raw ingestion, silver for cleaned and conformed data, gold for business-ready aggregations."
"The current design has a single monolithic pipeline — we're refactoring into domain-oriented data products, each owned by the relevant engineering team."
"The data lakehouse gives us ACID transactions and time travel on top of object storage — we get the flexibility of a data lake with the reliability of a data warehouse."
"We've identified three bottlenecks in the current architecture: lack of partitioning on the events table, no incremental loading strategy and no data contract enforcement."
"The proposed migration has three phases: schema standardisation in Q3, incremental load adoption in Q4 and full medallion deployment by Q1 2027."
"Each domain team will own their data products and be accountable for freshness SLAs — the data platform team provides the tooling and the standards."

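The bronze, silver and gold layers named in the first phrase can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the rows, the field names and the cleaning rule (drop rows that cannot be parsed) are hypothetical, and a real lakehouse would persist each layer as tables rather than pass lists between functions.

```python
# Bronze: raw ingestion, kept exactly as received (hypothetical sample rows).
bronze = [
    {"order_id": "1", "amount": " 10.50 "},
    {"order_id": "2", "amount": "bad"},
    {"order_id": "3", "amount": "4.50"},
]


def to_silver(rows):
    """Silver: cleaned and typed rows; unparseable rows are dropped in this sketch."""
    clean = []
    for row in rows:
        try:
            clean.append({"order_id": int(row["order_id"]), "amount": float(row["amount"])})
        except ValueError:
            continue
    return clean


def to_gold(rows):
    """Gold: a business-ready aggregate built only from silver rows."""
    return {"order_count": len(rows), "revenue": sum(row["amount"] for row in rows)}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Dropping bad rows silently is a simplification. In practice they would usually be quarantined and reported, so that the data quality issue is visible.
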
c) Escalating a data quality issue: 6 phrases

"We detected a data quality issue in the customer dimension — approximately 3,400 records have null values in the email field that should be mandatory."
"The root cause is an upstream API change that removed email from the payload for customers who opted out of marketing — the schema wasn't updated to reflect this."
"The impacted downstream models are: the email campaign audience, the churn prediction feature store and the monthly active users report."
"We've blocked the downstream models until the quality issue is resolved — sending incorrect data to the campaign tool would be worse than sending no data."
"I recommend we add a NOT NULL check on the email field as a dbt test — if it fails, the model run stops before writing to the gold layer."
"For immediate resolution: the data platform team can backfill the missing emails from the CRM for the affected records — I need 30 minutes and a data access approval."

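The idea of stopping the model run before it writes to the gold layer can be sketched in plain Python. This is not dbt syntax. The function names and the sample records are hypothetical, and the check treats an empty string the same as a missing email, which is an assumption of this sketch.

```python
def null_email_ids(customers):
    """Return the ids of customers whose email is missing, None or empty."""
    return [c["customer_id"] for c in customers if not c.get("email")]


def publish_to_gold(customers):
    """Refuse to publish when the mandatory email field fails the check."""
    failing = null_email_ids(customers)
    if failing:
        raise ValueError(f"email is null for {len(failing)} record(s): {failing}")
    return list(customers)


# Hypothetical sample of the customer dimension.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3},
]

try:
    publish_to_gold(customers)
except ValueError as error:
    print(f"Blocked: {error}")
```

Raising an exception, instead of filtering the bad rows out, mirrors the reasoning in the fourth phrase: sending no data is preferable to sending incorrect data.
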
Short dialogue on a pipeline failure: Data Engineer, Data Scientist, Product Manager

Data Scientist: "Hey — our Berlin dashboard shows no sales data for yesterday. What's going on?"

Data Engineer: "The orders pipeline failed at 03:14 UTC. Root cause: the backend team pushed a schema change to the source table — they added a NOT NULL column without updating the data contract. Our type casting broke on the first null it hit."

Product Manager: "How much data is affected and when will it be back?"

Data Engineer: "Everything from midnight to 03:14 is missing — about three hours of transactions. Historical data before that window is complete. The fix is already in review — I estimate the pipeline will be green and the backfill complete within 90 minutes."

Data Scientist: "Can we prevent this happening again?"

Data Engineer: "Yes. I'm adding a schema contract test that runs before the transformation stage — if the source schema deviates from what we expect, the job fails fast and pages on-call immediately. I'll also work with the backend team to add data contract sign-off to their PR process for any table we depend on."
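
The schema contract test the Data Engineer describes can be sketched as follows. This is a minimal illustration, assuming the source schema is available as a mapping of column name to type name. The expected schema, the exception class and the function name are all hypothetical.

```python
# Hypothetical contract for the orders source table.
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "str"}


class SchemaContractError(Exception):
    """Raised before the transformation stage when the source schema deviates."""


def check_schema(actual, expected=EXPECTED_SCHEMA):
    """Fail fast on missing, unexpected or retyped columns."""
    missing = sorted(expected.keys() - actual.keys())
    unexpected = sorted(actual.keys() - expected.keys())
    retyped = sorted(
        col for col in expected.keys() & actual.keys() if expected[col] != actual[col]
    )
    if missing or unexpected or retyped:
        raise SchemaContractError(
            f"missing={missing} unexpected={unexpected} retyped={retyped}"
        )


# The upstream team adds a column without updating the contract.
source_schema = dict(EXPECTED_SCHEMA, discount_code="str")
try:
    check_schema(source_schema)
except SchemaContractError as error:
    print(f"Job stopped before transformation: {error}")
```

Treating any unexpected column as a failure is deliberately strict. A team could relax that rule for additive, nullable columns, depending on what the data contract defines as a breaking change.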

Data contract vocabulary: 6 phrases

"The data contract specifies that the orders table will be refreshed by 06:00 UTC daily with a maximum latency of 30 minutes."
"Any breaking schema change — adding a NOT NULL column, renaming a field, changing a data type — requires 14 days' notice to all downstream consumers."
"The data contract is owned by the checkout engineering team — all schema change requests must go through their RFC process."
"We define a breaking change as any modification that causes an existing downstream query or model to fail without code changes on the consumer side."
"The SLA breach threshold is two consecutive missed refreshes — at that point, the data platform team is automatically paged and the downstream models are suspended."
"Data contracts are versioned in our Git repository alongside the schema definitions — the change history is auditable and rollback is possible within 30 days."

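The three breaking changes listed in the second phrase (adding a NOT NULL column, renaming a field, changing a data type) can be detected by comparing two schema versions. The Python sketch below assumes each schema is a mapping of column name to a (type, nullable) pair, which is a format invented for this example. A rename shows up as a removed column, because the comparison alone cannot tell a rename from a deletion.

```python
def breaking_changes(old, new):
    """Compare two schema versions given as {column: (type, nullable)}."""
    problems = []
    for column, (column_type, _) in old.items():
        if column not in new:
            problems.append(f"removed or renamed: {column}")
        elif new[column][0] != column_type:
            problems.append(f"type changed: {column}")
    for column, (_, nullable) in new.items():
        if column not in old and not nullable:
            problems.append(f"new NOT NULL column: {column}")
    return problems


# Hypothetical schema versions for illustration.
old_schema = {"id": ("int", False), "email": ("str", True)}
new_schema = {"id": ("str", False), "mail": ("str", True), "tier": ("str", False)}

for problem in breaking_changes(old_schema, new_schema):
    print(problem)
```

A new nullable column, such as mail in the sketch, is not reported on its own, since existing downstream queries keep working without code changes on the consumer side.
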
Most common mistakes made by Polish speakers

1. "ETL": pronunciation. Say the letters separately: E-T-L /ˌiː tiː ˈel/. It is an initialism, not a word. The same goes for ELT = E-L-T. DAG, by contrast, is usually said as a word, /dæɡ/, although you may also hear D-A-G. ✅ "The ETL [/ˌiː tiː ˈel/] job runs at midnight."

2. "pipeline" vs "workflow". A pipeline is a sequence of strictly technical data transformation steps. A workflow is a broader concept that also covers human tasks and business processes. In data engineering conversations, default to pipeline. ✅ "The data pipeline failed" rather than "the workflow failed".

3. "schema": pronunciation. The correct pronunciation is /ˈskiːmə/, with a long first vowel as in "scheme". Polish speakers often say "shema". ❌ "shema change" → ✅ "schema [/ˈskiːmə/] change". Plural: schemas (more common in IT) or schemata (formal).

4. "data": uncountable in data engineering. In data engineering the uncountable form dominates: ✅ "the data is available", "the data is stale", "the data is partitioned by date". The form "the data are" is correct, but it sounds archaic in an IT context.

5. "load" vs "ingest". Ingest is the broader process of bringing data from external sources into the platform. Load means writing data into the target analytical system, the last step of ETL. ❌ "We load data from the API" → ✅ "We ingest data from the API." ✅ "We load the transformed data into the warehouse."

Quick Reference Table: 28 terms

EN Term	PL Translation	Category
ETL	wyodrębnianie, transformacja, ładowanie	Pipeline
ELT	wyodrębnianie, ładowanie, transformacja	Pipeline
data pipeline	potok danych	Pipeline
batch processing	przetwarzanie wsadowe	Pipeline
streaming	przetwarzanie strumieniowe	Pipeline
orchestration	orkiestracja	Pipeline
DAG	skierowany graf acykliczny	Pipeline
job scheduler	harmonogram zadań	Pipeline
data warehouse	hurtownia danych	Storage
data lake	jezioro danych	Storage
data lakehouse	architektura lakehouse	Storage
schema	schemat danych	Storage
table	tabela	Storage
partition	partycja	Storage
data mart	magazyn danych dziedzinowy	Storage
dimensional modelling	modelowanie wymiarowe	Storage
data quality	jakość danych	Quality
data lineage	rodowód danych	Quality
data catalogue	katalog danych	Quality
data contract	kontrakt danych	Quality
SLA (data freshness)	SLA świeżości danych	Quality
data observability	obserwowalność danych	Quality
Apache Spark	Apache Spark	Tools
dbt	narzędzie do transformacji danych	Tools
Airflow	Apache Airflow	Tools
Kafka	Apache Kafka	Tools
cloud storage (S3/GCS/ADLS)	obiektowa pamięć masowa	Tools
query optimisation	optymalizacja zapytań	Tools

Summary

A Data Engineer's English is a language of technical precision and incident communication. The difference between load and ingest, between pipeline and workflow, between schema pronounced correctly and schema pronounced wrongly: these are the signals that tell the other person whether you have genuinely deep experience with data.

You will find the terminology from this article in the IT & Programowanie flashcards, in the Data Engineer category. Studying systematically for 10–15 minutes a day should let you communicate fluently in English in any data engineering environment after three weeks, from the morning stand-up to the postmortem after a critical pipeline failure.

Related articles: English for AI/ML Engineers, English for Data Scientists and English for DevOps Engineers.

English for Data Engineers: pipelines, ETL and data architecture in English