Your model hits 94% accuracy, but you can't defend it in English

Picture this: a demo for an international stakeholder. The model works well, the results are impressive, and you get stuck on the very first question: "Can you walk us through the evaluation metrics?" This article prepares you for exactly that moment.

If you work as an ML Engineer, Data Scientist, AI Researcher, MLOps Engineer, or AI Product Manager, English is your everyday working tool: in code reviews, standups, results presentations, and papers. Below you will find the vocabulary you actually use on the job.

Model types and architectures: 8 terms

EN Term	PL	Example sentence
neural network	sieć neuronowa	"We're using a three-layer neural network with ReLU activations."
transformer	transformer	"The transformer architecture allows the model to attend to the full input sequence."
LLM (large language model)	duży model językowy	"We fine-tuned an open-source LLM on our internal documentation."
CNN (convolutional neural network)	splotowa sieć neuronowa	"The CNN extracts spatial features from the input images before classification."
RNN (recurrent neural network)	rekurencyjna sieć neuronowa	"We replaced the RNN with a transformer because of the vanishing gradient problem."
generative model	model generatywny	"The generative model was trained on 10,000 synthetic data samples."
foundation model	model bazowy	"We're building on top of a foundation model and adding domain-specific layers."
fine-tuning	dostrajanie modelu	"Fine-tuning on 2,000 labelled examples improved recall by 18 percentage points."

Model training: 10 terms

EN Term	PL	Example sentence
training data	dane treningowe	"The model was trained on 500,000 labelled examples from production logs."
validation set	zbiór walidacyjny	"We use the validation set to tune hyperparameters without touching the test data."
test set	zbiór testowy	"The test set is held out until the very end to give an unbiased performance estimate."
overfitting	przetrenowanie	"The model is overfitting — training accuracy is 99% but validation accuracy dropped to 72%."
underfitting	niedotrenowanie	"Training and validation accuracy are both stuck at 61%; the model is underfitting, so we're adding more layers."
hyperparameter	hiperparametr	"We ran a grid search over learning rate and batch size as the key hyperparameters."
epoch	epoka	"Training for 50 epochs gave diminishing returns after epoch 30."
batch size	wielkość batcha	"Reducing batch size from 256 to 64 improved generalization on the validation set."
learning rate	współczynnik uczenia	"The learning rate scheduler multiplies the rate by 0.1 every 10 epochs."
gradient descent	opadanie gradientowe	"We use stochastic gradient descent with momentum to speed up convergence."
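
A schedule like "multiply the learning rate by 0.1 every 10 epochs" can be written as a one-line formula. The sketch below is a plain-Python illustration, not the API of any particular framework; the function name `step_decay` and its parameters `step_size` and `gamma` are assumptions chosen for this example.

```python
def step_decay(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.1) -> float:
    """Return the learning rate at a given epoch (0-based).

    The rate is multiplied by `gamma` once every `step_size` epochs.
    """
    if epoch < 0 or step_size <= 0:
        raise ValueError("epoch must be >= 0 and step_size must be > 0")
    # Integer division counts how many decay steps have happened so far.
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical starting rate, used only to illustrate the schedule.
for epoch in (0, 9, 10, 25):
    print(f"epoch {epoch:>2}: lr = {step_decay(0.01, epoch):.6f}")
```

In a meeting you might describe this as "a step decay schedule with a factor of 0.1 and a step size of 10 epochs".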

Evaluation: 8 terms

EN Term	PL	Example sentence
accuracy	dokładność (odsetek poprawnych klasyfikacji)	"Overall accuracy is 94%, but it's misleading because the classes are imbalanced."
precision	precyzja	"Precision is 0.91, meaning 91% of positive predictions are actually correct."
recall	czułość / pokrycie	"Recall dropped to 0.67 — we're missing too many true positives."
F1 score	miara F1	"Our model achieves an F1 score of 0.89 on the test set, a 12% improvement over baseline."
confusion matrix	macierz pomyłek	"Looking at the confusion matrix, most errors are false negatives in class 3."
AUC-ROC	pole pod krzywą ROC	"The AUC-ROC of 0.96 shows strong discrimination across all threshold values."
benchmark	punkt odniesienia	"We evaluated against three public benchmarks: GLUE, SuperGLUE and HellaSwag."
baseline	wynik bazowy	"The baseline is a logistic regression achieving 78% accuracy — we need to beat that."
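
It is easier to explain accuracy, precision, recall, and F1 when you can derive them from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python; the class `BinaryCounts` and the example counts are hypothetical and chosen only so the numbers are easy to follow.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Cells of a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy(c: BinaryCounts) -> float:
    """Share of all predictions that are correct."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c: BinaryCounts) -> float:
    """Share of positive predictions that are correct."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: BinaryCounts) -> float:
    """Share of actual positives that the model found."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: BinaryCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 100 actual positives among 1,000 examples.
counts = BinaryCounts(tp=67, fp=7, fn=33, tn=893)
print(f"accuracy={accuracy(counts):.2f} precision={precision(counts):.2f} "
      f"recall={recall(counts):.2f} f1={f1(counts):.2f}")
```

With these counts accuracy looks high while recall is only 0.67, which is the situation the example sentences for "accuracy" and "recall" describe.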

Production and deployment: 6 terms

EN Term	PL	Example sentence
inference	inferencja / wnioskowanie	"Inference latency needs to be under 100ms for real-time use cases."
latency	opóźnienie	"We reduced inference latency by 40% by switching to ONNX runtime."
throughput	przepustowość	"The system handles 500 inference requests per second at peak throughput."
model drift	dryfowanie modelu	"We detected model drift after the input data distribution shifted in January."
deployment	wdrożenie	"We use a blue-green deployment strategy to roll out new model versions safely."
A/B testing	testy A/B	"We ran an A/B test comparing model v2 and v3 on 10% of production traffic."
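
Latency and throughput are often confused in conversation, so it helps to see how each is computed. The sketch below assumes you already have per-request latencies in milliseconds and a request count for a time window; the helper names `percentile` and `throughput_rps` and all sample values are hypothetical.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def throughput_rps(num_requests: int, window_seconds: float) -> float:
    """Average number of requests served per second over a time window."""
    return num_requests / window_seconds


# Hypothetical measurements for ten inference requests.
latencies_ms = [42, 38, 55, 120, 47, 51, 39, 44, 95, 41]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
rps = throughput_rps(num_requests=30_000, window_seconds=60)
print(f"p50={p50} ms, p95={p95} ms, throughput={rps:.0f} req/s")
```

When you report these numbers, say which percentile you mean: "p95 latency is under 100ms" is a much stronger claim than "average latency is under 100ms".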

Communication scenarios

a) Presenting results to stakeholders: 8 phrases

"Our model achieves an F1 score of 0.89 on the test set, which represents a 12% improvement over the baseline."
"The model was evaluated on a held-out test set of 50,000 examples it had never seen during training."
"Precision is particularly important here because false positives have a direct business cost."
"The confusion matrix shows that most errors are concentrated in the boundary cases between classes 2 and 3."
"We benchmarked against three competing approaches; our model outperforms all of them on recall."
"The AUC-ROC of 0.96 gives us confidence that the model separates the classes well across different operating thresholds."
"These results are consistent across all three validation folds, which suggests they are not an artifact of one particular split."
"We estimate that deploying this model will reduce manual review workload by approximately 35%."

b) Discussing limitations and risks: 6 phrases

"The model tends to underperform on edge cases where the input is significantly out-of-distribution."
"One known limitation is that the model was trained on data from 2021–2023, so recent events are underrepresented."
"We should monitor for model drift once the system is in production — I'd suggest a monthly retraining cadence."
"The training data has class imbalance: class A represents 80% of samples, which inflates overall accuracy."
"This is a black-box model, so explainability is limited — we may need SHAP values for regulatory compliance."
"The model is sensitive to input noise; we'll need robust preprocessing before inference in production."

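The class imbalance phrase above is easier to defend with a concrete number. As a rough illustration, the sketch below uses a made-up dataset with the same 80/20 split and a "model" that always predicts the majority class; the helper names are hypothetical.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(y_true: list[str], y_pred: list[str], label: str) -> float:
    """Share of examples with the given true label that were predicted correctly."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


# Made-up dataset: class A is 80% of the samples, class B is 20%.
y_true = ["A"] * 80 + ["B"] * 20
# A trivial classifier that always predicts the majority class.
y_pred = ["A"] * 100

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall_for(y_true, y_pred, "B")
print(f"accuracy={majority_accuracy:.2f}, recall for class B={minority_recall:.2f}")
```

The classifier never finds a single class B example, yet its accuracy equals the share of class A. That is why per-class recall is worth reporting next to overall accuracy.
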
c) Code review and PR discussion: 6 phrases

"I'd suggest we add dropout layers here to reduce overfitting on the training set."
"The batch size of 512 may be too large for this dataset — it could hurt generalization."
"Can we add early stopping based on validation loss? It'll prevent training for unnecessary epochs."
"This feature engineering step introduces data leakage — the validation set sees future information."
"The learning rate schedule looks aggressive after epoch 20; let's monitor convergence more carefully."
"We should log the confusion matrix to MLflow alongside accuracy so we have the full picture."

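When you ask for early stopping in a review, it helps to be able to describe the rule precisely. The following is a minimal sketch of patience-based early stopping on validation loss, written in plain Python rather than against any specific training framework; the function name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration.

```python
from typing import Optional


def early_stopping_epoch(val_losses: list[float], patience: int = 3,
                         min_delta: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Training stops once validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 4.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}")
```

In the PR you could then write: "Let's stop if validation loss hasn't improved for three epochs and keep the checkpoint from the best epoch."
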
Short dialogue: an ML Engineer presents results to a Product Manager

ML Engineer: "The model achieves an F1 score of 0.89 on the test set, which is a 12% improvement over our baseline."

Product Manager: "That sounds great — but what does F1 score actually mean for the business?"

ML Engineer: "F1 combines precision and recall into a single number. A score of 0.89 means we catch most of the cases that matter while raising few false alarms. To say exactly how many critical events still slip through, I'd look at recall on its own."

Product Manager: "And how confident are you this will hold up in production?"

ML Engineer: "We've tested across three validation sets and results are consistent. The main risk is model drift if the input data distribution changes significantly — I'd recommend monthly retraining."
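
When a stakeholder asks what an F1 score means, remember that F1 is the harmonic mean of precision and recall, so one F1 value can hide quite different miss rates. The sketch below illustrates this with two hypothetical precision/recall pairs that both round to 0.89.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical models with almost the same F1 but different recall.
balanced = f1_score(precision=0.89, recall=0.89)
skewed = f1_score(precision=0.97, recall=0.82)

for name, score, rec in (("balanced", balanced, 0.89), ("skewed", skewed, 0.82)):
    missed = 1 - rec  # share of actual positives the model fails to flag
    print(f"{name}: F1={score:.2f}, missed positives={missed:.0%}")
```

If the business cares about missed events, quote recall directly; if it cares about false alarms, quote precision.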

Vocabulary from publications and papers: 8 terms

EN Term	PL	Typical context
ablation study	badanie ablacyjne	"Our ablation study shows that removing the attention layer reduces F1 by 8 points."
state-of-the-art (SOTA)	wyniki najlepsze w danej klasie	"Our method achieves state-of-the-art performance on the ImageNet benchmark."
novel approach	nowatorska metoda	"We propose a novel approach that combines retrieval-augmented generation with fine-tuning."
we propose	proponujemy	"We propose a lightweight fine-tuning strategy that reduces compute costs by 60%."
outperforms	przewyższa	"Our model outperforms the baseline by 7.3% on the test split."
evaluated on	oceniany na	"The system was evaluated on three publicly available datasets."
findings suggest	wyniki sugerują	"Our findings suggest that data quality matters more than model size in this domain."
further work	dalsze badania	"Further work is needed to assess performance on low-resource languages."

Most common mistakes made by Polish speakers

1. "accuracy" ≠ "precyzja": these are different metrics. Accuracy is the share of all classifications that are correct, while precision ("precyzja") covers only the positive predictions. ❌ "Our precision is 94%" (when you mean overall accuracy) → ✅ "Our accuracy is 94%"

2. "model is learning" → "model is being trained". When you describe a training run, the model is not learning on its own; it is being trained. ❌ "The model is learning from new data." → ✅ "The model is being trained on new data."

3. "I make a model" → "I build / train a model". ❌ "I made a classification model." → ✅ "I built / trained a classification model."

4. "data are" vs "data is". Both are correct. In scientific writing "data are" is more formal; in business communication "data is" is more common and acceptable.

5. The article before "neural network". Polish has no articles, so they are easy to drop. ❌ "We implemented neural network for this task." → ✅ "We implemented a neural network for this task."

Quick Reference Table: all terms

EN Term	PL translation	Context
neural network	sieć neuronowa	architecture
transformer	transformer	architecture
LLM	duży model językowy	architecture
CNN	splotowa sieć neuronowa	architecture
RNN	rekurencyjna sieć neuronowa	architecture
generative model	model generatywny	architecture
foundation model	model bazowy	architecture
fine-tuning	dostrajanie modelu	training
training data	dane treningowe	training
validation set	zbiór walidacyjny	training
test set	zbiór testowy	training
overfitting	przetrenowanie	training
underfitting	niedotrenowanie	training
hyperparameter	hiperparametr	training
epoch	epoka	training
batch size	wielkość batcha	training
learning rate	współczynnik uczenia	training
gradient descent	opadanie gradientowe	training
accuracy	dokładność	evaluation
precision	precyzja	evaluation
recall	czułość / pokrycie	evaluation
F1 score	miara F1	evaluation
confusion matrix	macierz pomyłek	evaluation
AUC-ROC	pole pod krzywą ROC	evaluation
benchmark	punkt odniesienia	evaluation
baseline	wynik bazowy	evaluation
inference	inferencja / wnioskowanie	production
latency	opóźnienie	production
throughput	przepustowość	production
model drift	dryfowanie modelu	production
deployment	wdrożenie	production
A/B testing	testy A/B	production
ablation study	badanie ablacyjne	research
SOTA	wyniki najlepsze w danej klasie	research
novel approach	nowatorska metoda	research
we propose	proponujemy	research
outperforms	przewyższa	research
evaluated on	oceniany na	research
findings suggest	wyniki sugerują	research
further work	dalsze badania	research

Summary

Mastering ML and AI vocabulary in English is now a requirement, not a bonus. Whether you are presenting results to the board, discussing architecture in a code review, or reading the latest papers, these terms come up everywhere.

If you want to reinforce this vocabulary, see our article on IT vocabulary in English or the one on English for IT business analysts. When it is time to present to stakeholders, our article on giving a business presentation in English will also help.

Ready-made flashcards with AI/ML terminology are available in the AI/ML Engineer path in the IT & Programowanie section.
