Evaluation Pipeline
Introduction
The evaluation pipeline assesses trained model performance using standardized metrics and visualizations. It supports multiple evaluation modes and integrates with MLflow for loading models and logging results.
Like the training pipeline, evaluation is configuration-driven: you define what to evaluate, and how, in YAML files.
Pipeline Overview
graph TB
CONFIG[Configuration Files] --> SETUP[Setup Environment]
SETUP --> MLFLOW[Load from MLflow/Checkpoint]
MLFLOW --> DATA[Load Evaluation Dataset]
DATA --> MODEL[Load Model]
MODEL --> PREDICT[Generate Predictions]
PREDICT --> METRICS[Compute Metrics]
METRICS --> VIZ[Generate Visualizations]
VIZ --> SAVE[Save Results]
SAVE --> MLFLOW_LOG[Log to MLflow]
style CONFIG fill:#FF6B35
style METRICS fill:#0F596E,color:#fff
style MLFLOW_LOG fill:#0097B1,color:#fff
Evaluation Script Entry Point
from ainxt.scripts.evaluation import evaluate
from context import CONTEXT
# Evaluate using MLflow run
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
evaluation_config="config/evaluation.yaml",
mlflow_info=(experiment_id, experiment_name, run_id)
)
Evaluation Modes
aiNXT supports three evaluation modes:
Mode 1: Evaluate from MLflow Run
Load model and data from a completed training run:
# After training
model, checkpoint_dir, mlflow_info = train(...)
# Evaluate using MLflow info
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
evaluation_config="config/evaluation.yaml",
mlflow_info=mlflow_info # Tuple: (exp_id, exp_name, run_id)
)
What Happens:
1. Downloads model artifacts from the MLflow run
2. Downloads dataset artifacts from the MLflow run
3. Loads the model and generates predictions
4. Computes metrics and creates visualizations
5. Logs evaluation results back to the same MLflow run
Mode 2: Evaluate from Checkpoint Directory
Load model and data from local checkpoint:
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
model_config="config/model.yaml",
evaluation_config="config/evaluation.yaml",
checkpoint_dir="checkpoints/experiment_001"
)
What Happens:
1. Loads model from checkpoint_dir/model
2. Loads dataset from config or checkpoint cache
3. Generates predictions
4. Computes metrics and visualizations
5. Saves results to checkpoint_dir/evaluation
Mode 3: Evaluate from Cached Predictions
Use pre-generated predictions (for faster iteration on metrics):
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
evaluation_config="config/evaluation.yaml",
predictions_file="checkpoints/experiment_001/predictions.json"
)
What Happens:
1. Loads instances and predictions from file
2. Skips model loading and prediction generation
3. Directly computes metrics and visualizations
4. Useful for experimenting with different metrics/visualizations
Pipeline Steps
Step 1: Configuration Setup
Load and merge configuration files:
config = setup_configuration(
config=config,
data_config=data_config,
model_config=model_config,
evaluation_config=evaluation_config
)
config/evaluation.yaml:
evaluation:
# Metrics to compute
metrics:
- name: accuracy
- name: precision
params:
average: macro
- name: recall
params:
average: macro
- name: f1_score
params:
average: weighted
# Visualizations to generate
visualizations:
- name: confusion_matrix
params:
normalize: true
- name: roc_curve
- name: precision_recall_curve
- name: classification_report
# Optional: Subset of data to evaluate
split: 0.5 # Use 50% of dataset
mlflow:
tracking_uri: http://localhost:5000
run_id: abc123def456 # Optional: specific run to evaluate
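If the built-in metrics are backed by scikit-learn (an assumption; the backing library is not stated here), each params block maps onto keyword arguments of the corresponding metric function. For example, precision with average: macro corresponds to:
from sklearn.metrics import precision_score

# Placeholder labels; average="macro" mirrors the params block above.
y_true = ["cat", "dog", "cat"]
y_pred = ["cat", "dog", "dog"]
print(precision_score(y_true, y_pred, average="macro"))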
Step 2: Environment Preparation
Setup evaluation directory and logging:
checkpoint_dir, logger = setup_evaluation_environment(
checkpoint_dir=checkpoint_dir,
logger=logger,
run_id=run_id,
mlflow_info=mlflow_info
)
Evaluation Directory Structure:
checkpoint_dir/
├── evaluation/
│   ├── log_evaluation.txt
│   ├── predictions.json
│   ├── metrics.json
│   └── visualizations/
│       ├── confusion_matrix.png
│       ├── roc_curve.png
│       └── classification_report.txt
└── model/  # From training
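You can sanity-check what an evaluation produced by listing that layout with the standard library (the path below is a placeholder):
from pathlib import Path

eval_dir = Path("checkpoints/experiment_001/evaluation")  # placeholder path
for path in sorted(eval_dir.rglob("*")):
    print(path.relative_to(eval_dir))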
Step 3: MLflow Artifact Download
If evaluating from MLflow run, download artifacts:
# Downloads model, config, and datasets from MLflow
download_mlflow_artifacts(
run_id=run_id,
checkpoint_dir=checkpoint_dir,
logger=logger
)
Downloaded Artifacts:
- model/ - Trained model files
- config.yaml - Training configuration
- data/test.json - Test dataset (if logged)
- Training metrics and parameters
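These are the same artifacts you can fetch manually with MLflow's public download API if you need them outside the pipeline; a minimal sketch with a placeholder run ID:
import mlflow

# Download the logged model directory from a finished run
# (the run_id here is a placeholder).
local_model_dir = mlflow.artifacts.download_artifacts(
    run_id="abc123def456",
    artifact_path="model"
)
print(f"Model downloaded to: {local_model_dir}")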
Step 4: Load Evaluation Dataset
Load the dataset to evaluate on:
dataset = load_evaluation_dataset(
context=context,
config=config,
checkpoint_dir=checkpoint_dir,
split=split,
analyze=analyze,
logger=logger
)
Dataset Sources (in priority order):
1. From data_config if provided
2. From checkpoint directory cache (data/test.json)
3. From MLflow run artifacts
Optional Splitting:
# Evaluate on subset for faster iteration
dataset = load_evaluation_dataset(..., split=0.1) # Use 10%
Step 5: Model Setup
Load the trained model:
model = setup_model_for_evaluation(
context=context,
config=config,
checkpoint_dir=checkpoint_dir,
predictions_file=predictions_file,
logger=logger
)
Model Sources:
1. From checkpoint_dir/model if available
2. From model_config if provided
3. Skipped if predictions_file provided
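The selection amounts to a simple fallback chain; a sketch of that order with hypothetical helper semantics (not the actual ainxt internals):
from pathlib import Path

# Illustrative only: mirrors the priority order listed above.
def resolve_model_source(checkpoint_dir, model_config, predictions_file):
    if predictions_file is not None:
        return "skip"        # cached predictions: no model is loaded
    if checkpoint_dir and (Path(checkpoint_dir) / "model").exists():
        return "checkpoint"  # load from checkpoint_dir/model
    if model_config is not None:
        return "config"      # build the model from model_config
    raise ValueError("No model source available")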
Step 6: Generate Predictions
Create predictions for all instances:
if predictions_file:
# Load cached predictions
instances, predictions = load_predictions_from_file(
predictions_file, encoder, decoder
)
else:
# Generate fresh predictions
instances, predictions = generate_predictions_with_model(
model=model,
dataset=dataset,
batch_size=batch_size,
logger=logger
)
Prediction Generation:
predictions = []
for batch in batches(dataset, batch_size=32):
batch_predictions = model.predict_batch(batch)
predictions.extend(batch_predictions)
# Save for future use
save_predictions(instances, predictions, "predictions.json")
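The on-disk schema of predictions.json is an ainxt implementation detail; as a rough illustration of what such a file could contain, assuming the label and confidence attributes shown under "Return Values" below:
import json

# Illustrative serialization only, not the actual ainxt format.
def save_predictions_sketch(instances, predictions, path):
    records = [
        {
            "label": inst.label,
            "predictions": [
                {"label": p.label, "confidence": p.confidence}
                for p in preds
            ],
        }
        for inst, preds in zip(instances, predictions)
    ]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)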
Step 7: Compute Metrics
Calculate evaluation metrics:
metrics, metric_results = compute_metrics(
context=context,
config=config,
instances=instances,
predictions=predictions,
logger=logger
)
Metrics Computation:
from context import CONTEXT
# Load metric functions from config
metrics = CONTEXT.load_metrics(
config.evaluation.metrics,
task=config.task
)
# Compute each metric
metric_results = {}
for metric in metrics:
score = metric(instances, predictions)
metric_results[metric.name] = score
logger.info(f"{metric.name}: {score:.4f}")
Example Output:
[INFO] Computing metrics...
[INFO] accuracy: 0.9234
[INFO] precision: 0.9187
[INFO] recall: 0.9145
[INFO] f1_score: 0.9166
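Persisting these scores to the metrics.json shown in the evaluation directory layout is then a standard-library one-liner (a sketch; the actual ainxt serialization may differ, and the output path here is a placeholder):
import json

with open("metrics.json", "w") as f:
    json.dump(metric_results, f, indent=2)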
Step 8: Generate Visualizations
Create visualization artifacts:
visualizations = generate_visualizations(
context=context,
config=config,
instances=instances,
predictions=predictions,
checkpoint_dir=checkpoint_dir,
logger=logger
)
Visualization Generation:
# Load visualization functions
visualizations = CONTEXT.load_visualizations(
config.evaluation.visualizations,
task=config.task
)
# Generate each visualization
for viz in visualizations:
output_path = checkpoint_dir / "evaluation/visualizations" / f"{viz.name}.png"
viz(instances, predictions, save_path=output_path)
logger.info(f"Saved {viz.name} to {output_path}")
Common Visualizations:
- Confusion Matrix: True vs predicted labels
- ROC Curve: True positive vs false positive rate
- Precision-Recall Curve: Precision vs recall tradeoff
- Classification Report: Detailed per-class metrics
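For reference, a plot like the confusion matrix above can also be produced directly with scikit-learn outside the pipeline (illustrative only; labels are extracted as in the "Return Values" section below):
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Illustrative only: true/predicted labels as in the "Return Values" section.
y_true = [inst.label for inst in instances]
y_pred = [preds[0].label for preds in predictions]
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, normalize="true", cmap="Blues")
plt.savefig("confusion_matrix.png")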
Step 9: Save and Log Results
Save evaluation results and log to MLflow:
save_evaluation_artifacts(
checkpoint_dir=checkpoint_dir,
metric_results=metric_results,
instances=instances,
predictions=predictions,
mlflow_enabled=mlflow_enabled,
logger=logger
)
Saved Artifacts:
- Metrics JSON: checkpoint_dir/evaluation/metrics.json
- Predictions JSON: checkpoint_dir/evaluation/predictions.json
- MLflow Logging:
import mlflow
# Log metrics
mlflow.log_metrics(metric_results)
# Log visualizations
mlflow.log_artifacts(
    checkpoint_dir / "evaluation/visualizations",
    artifact_path="evaluation/visualizations"
)
# Log predictions
mlflow.log_artifact(
    checkpoint_dir / "evaluation/predictions.json",
    artifact_path="evaluation"
)
Usage Examples
Example 1: End-to-End Train + Evaluate
from ainxt.scripts.training import train
from ainxt.scripts.evaluation import evaluate
from context import CONTEXT
# Train model
model, checkpoint_dir, mlflow_info = train(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
model_config="config/model.yaml",
training_config="config/training.yaml"
)
# Evaluate on test set
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
evaluation_config="config/evaluation.yaml",
mlflow_info=mlflow_info # Uses same run
)
print(f"Evaluation results saved to: {eval_dir}")
Example 2: Evaluate Existing MLflow Run
# Evaluate a model from MLflow by run_id
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
evaluation_config={
"mlflow": {
"tracking_uri": "http://localhost:5000",
"run_id": "abc123def456"
},
"evaluation": {
"metrics": [{"name": "accuracy"}, {"name": "f1_score"}]
}
}
)
Example 3: Evaluate from Checkpoint
# Evaluate from local checkpoint directory
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
model_config="config/model.yaml",
evaluation_config="config/evaluation.yaml",
checkpoint_dir="checkpoints/experiment_042"
)
Example 4: Fast Iteration with Cached Predictions
# First: Generate and cache predictions
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
model_config="config/model.yaml",
evaluation_config="config/evaluation_v1.yaml",
checkpoint_dir="checkpoints/exp"
)
# Predictions saved to checkpoints/exp/evaluation/predictions.json
# Later: Experiment with different metrics (much faster)
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
evaluation_config="config/evaluation_v2.yaml", # Different metrics
predictions_file="checkpoints/exp/evaluation/predictions.json"
)
Example 5: Using model.evaluate() Method
Some models implement an evaluate() method for custom evaluation logic:
instances, predictions, eval_dir = evaluate(
context=CONTEXT,
config="config/base.yaml",
data_config="config/data.yaml",
model_config="config/model.yaml",
evaluation_config="config/evaluation.yaml",
use_model_evaluate=True  # Calls model.evaluate() instead of generating predictions
)
Configuration Options
Evaluation Configuration
evaluation:
# Metrics to compute
metrics:
- name: accuracy
- name: precision
params:
average: macro
zero_division: 0
- name: recall
params:
average: macro
- name: f1_score
params:
average: weighted
- name: confusion_matrix
- name: roc_auc_score
params:
multi_class: ovr
# Visualizations to generate
visualizations:
- name: confusion_matrix
params:
normalize: true
cmap: Blues
- name: roc_curve
- name: precision_recall_curve
- name: classification_report
params:
digits: 3
# Optional: evaluate on subset
split: 1.0 # Use 100% of dataset (default)
# Optional: batch size for predictions
batch_size: 32
mlflow:
tracking_uri: http://localhost:5000
# Optional: specific run to evaluate
run_id: abc123def456
Custom Metrics
Define custom metrics in your codebase:
from ainxt.factory import builder_name
from ainxt.serving.evaluation import Metric
@builder_name(task="classification", name="custom_metric")
def my_custom_metric(instances, predictions):
    """Custom evaluation metric: fraction of exact top-1 label matches."""
    # Replace with your own metric logic; label attributes as in "Return Values".
    matches = sum(
        1 for inst, preds in zip(instances, predictions)
        if preds and preds[0].label == inst.label
    )
    return matches / len(instances)
# Now available in config
# evaluation:
# metrics:
# - name: custom_metric
Custom Visualizations
Define custom visualizations:
from ainxt.factory import builder_name
from ainxt.serving.evaluation import Visualization
@builder_name(task="classification", name="custom_viz")
def my_custom_visualization(instances, predictions, save_path):
    """Custom visualization: bar chart of predicted label counts."""
    import matplotlib.pyplot as plt
    from collections import Counter
    # Replace with your own plotting logic.
    counts = Counter(preds[0].label for preds in predictions if preds)
    plt.figure()
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.savefig(save_path)
    plt.close()
# Now available in config
# evaluation:
# visualizations:
# - name: custom_viz
Return Values
The evaluate function returns a tuple:
- instances: Sequence of evaluated instances
- predictions: Sequence of prediction sequences (one per instance)
- eval_dir: Path to evaluation directory (str)
Using Return Values:
# Inspect specific predictions
for instance, preds in zip(instances, predictions):
print(f"True: {instance.label}")
print(f"Pred: {preds[0].label}")
print(f"Confidence: {preds[0].confidence}")
# Load saved metrics
import json
from pathlib import Path

with open(Path(eval_dir) / "metrics.json") as f:
    metrics = json.load(f)
print(f"Accuracy: {metrics['accuracy']}")
Integration with MLflow
Viewing Evaluation Results
import mlflow
# Set tracking URI
mlflow.set_tracking_uri("http://localhost:5000")
# Get run
run = mlflow.get_run(run_id)
# View metrics
print("Training metrics:")
print(run.data.metrics)
# Download evaluation artifacts
client = mlflow.tracking.MlflowClient()
local_dir = client.download_artifacts(run_id, "evaluation/visualizations")
print(f"Visualizations downloaded to: {local_dir}")
Comparing Multiple Runs
import mlflow
import pandas as pd
# Search for runs
runs = mlflow.search_runs(
experiment_ids=[experiment_id],
filter_string="metrics.accuracy > 0.9"
)
# Compare metrics
comparison = runs[["run_id", "metrics.accuracy", "metrics.f1_score"]]
print(comparison.sort_values("metrics.accuracy", ascending=False))
Best Practices
1. Always Evaluate on a Held-Out Test Set - report metrics only on data the model did not see during training
2. Use Multiple Metrics
evaluation:
metrics:
- name: accuracy # Overall correctness
- name: f1_score # Balance precision/recall
- name: confusion_matrix # Per-class breakdown
- name: roc_auc_score # Threshold-independent
3. Generate Visualizations
evaluation:
visualizations:
- name: confusion_matrix
- name: roc_curve
- name: classification_report
4. Cache Predictions for Iteration
# First run: generate predictions
evaluate(..., checkpoint_dir="checkpoints/exp")
# Later: iterate on metrics (faster)
evaluate(..., predictions_file="checkpoints/exp/evaluation/predictions.json")
5. Track Evaluation in MLflow
# Evaluation metrics automatically logged to same run
model, checkpoint_dir, mlflow_info = train(...)
evaluate(..., mlflow_info=mlflow_info)
Next Steps
- MLflow Integration - Deep dive into experiment tracking
- Training Pipeline - Training models for evaluation
- Core Abstractions - Understanding predictions and metrics