Core Concept: Metrics & Evaluations
Overview
Metrics and Evaluations are how you measure the quality of your Model predictions against ground truth Dataset annotations. aiNXT provides standardized metrics that work directly with Instance and Prediction objects.
Real-world analogy: Metrics are like grading a test:
- Ground truth (Instances): The correct answers
- Predictions: The student's answers
- Metrics: The grading rubric (accuracy, precision, recall, etc.)
As demonstrated in notebooks/model/SH_Model_Prediction_and_Metrics.ipynb, aiNXT metrics follow a consistent interface that makes evaluation straightforward.
1. The Metrics Interface
How Metrics Work
All aiNXT metrics follow the same pattern:
def metric_function(
instances: Sequence[Instance], # Ground truth data
predictions: Sequence[Sequence[Prediction]], # Model predictions
**kwargs # Metric-specific parameters
) -> float: # Returns a score
"""
Args:
instances: Dataset instances with ground truth annotations
predictions: Model predictions (one Sequence[Prediction] per instance)
**kwargs: Additional parameters (threshold, average, etc.)
Returns:
Metric score as a float
"""
Key insight: Metrics receive two parallel sequences:
1. instances - The ground truth labels from your dataset
2. predictions - The model's predictions for those same instances
The metric compares them and returns a score.
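The alignment is purely positional: instances[i] and predictions[i] describe the same example. The following minimal sketch shows what a metric does internally, reusing the annotation.label and classification attributes that appear in the custom-metric example later in this section (illustrative only, not ainxt's actual implementation):
# Illustrative sketch of how a metric walks the two parallel sequences.
# Assumes instance.annotation.label (ground truth) and prediction.classification
# (a label -> score dict), as in the custom-metric example below.
correct = 0
for instance, pred_list in zip(instances, predictions):
    top_label = max(pred_list[0].classification.items(), key=lambda x: x[1])[0]
    if top_label == instance.annotation.label:
        correct += 1
score = correct / len(instances)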
Basic Metrics Usage
from ainxt.evaluation.classification.metrics import accuracy, precision, recall
# Assuming you have:
# - test_dataset: Dataset with ground truth labels
# - model: Trained model
# - predictions: model(test_dataset)
# Calculate accuracy
acc = accuracy(test_dataset, predictions)
print(f"Accuracy: {acc:.4f}") # e.g., 0.9524
# Calculate precision
prec = precision(test_dataset, predictions)
print(f"Precision: {prec:.4f}") # e.g., 0.9456
# Calculate recall
rec = recall(test_dataset, predictions)
print(f"Recall: {rec:.4f}") # e.g., 0.9501
2. Classification Metrics
Accuracy
Measures the proportion of correct predictions:
from ainxt.evaluation.classification.metrics import accuracy
# Simple accuracy
acc = accuracy(test_dataset, predictions)
# For multi-label classification with strict matching
# (all labels must match exactly)
acc_strict = accuracy(test_dataset, predictions, strict=True)
# For multi-label with label-wise matching
# (each label contributes independently)
acc_relaxed = accuracy(test_dataset, predictions, strict=False)
# With confidence threshold
# (only predictions above threshold count as positive)
acc_threshold = accuracy(test_dataset, predictions, threshold=0.5)
Source: ainxt/evaluation/classification/metrics.py:13
Parameters:
- instances: Ground truth data
- predictions: Model predictions
- threshold (optional): Confidence threshold for positive predictions
- strict (optional): For multi-label, whether to require exact match
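To make the strict vs. label-wise distinction concrete, here is a small, library-independent sketch on toy multi-label data. It reflects one common interpretation of the two modes (exact set match vs. per-label presence/absence), not necessarily the exact formula ainxt uses internally:
# Toy multi-label example (illustrative only; ainxt's accuracy handles this for you)
all_labels = {"cat", "dog", "bird"}
ground_truth = [{"cat", "dog"}, {"bird"}]
predicted = [{"cat"}, {"bird"}]
# strict=True: an instance counts only if its full label set matches exactly
strict_acc = sum(gt == pred for gt, pred in zip(ground_truth, predicted)) / len(ground_truth)
print(strict_acc)  # 0.5 (the first instance is missing "dog")
# strict=False: each label's presence/absence is scored independently
per_label_hits = [
    sum((label in gt) == (label in pred) for label in all_labels) / len(all_labels)
    for gt, pred in zip(ground_truth, predicted)
]
print(sum(per_label_hits) / len(per_label_hits))  # ~0.83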
Precision
Measures the proportion of predicted positives that are actually correct:
$$\text{Precision} = \frac{TP}{TP + FP}$$
from ainxt.evaluation.classification.metrics import precision
# Overall precision (macro-averaged across classes)
prec = precision(test_dataset, predictions)
# Precision for specific label
prec_cat = precision(test_dataset, predictions, label="cat")
# With different averaging strategies
prec_macro = precision(test_dataset, predictions, average="macro")
prec_micro = precision(test_dataset, predictions, average="micro")
prec_weighted = precision(test_dataset, predictions, average="weighted")
# With confidence threshold
prec_threshold = precision(test_dataset, predictions, threshold=0.5)
Source: ainxt/evaluation/classification/metrics.py:86
Parameters:
- instances: Ground truth data
- predictions: Model predictions
- label (optional): Compute for specific label only
- threshold (optional): Confidence threshold
- average (optional): Averaging strategy ("macro", "micro", "weighted")
Recall
Measures the proportion of actual positives that were correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN}$$
from ainxt.evaluation.classification.metrics import recall
# Overall recall
rec = recall(test_dataset, predictions)
# Recall for specific label
rec_cat = recall(test_dataset, predictions, label="cat")
# With different averaging strategies
rec_macro = recall(test_dataset, predictions, average="macro")
rec_micro = recall(test_dataset, predictions, average="micro")
# With confidence threshold
rec_threshold = recall(test_dataset, predictions, threshold=0.5)
Source: ainxt/evaluation/classification/metrics.py
Parameters: Same as precision
F-Beta Score
Weighted harmonic mean of precision and recall:
$$F_\beta = (1 + \beta^2) \times \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$
from ainxt.evaluation.classification.metrics import f1, f2
# F1 score (β=1, equal weight to precision and recall)
f1_score = f1(test_dataset, predictions)
# F2 score (β=2, more weight on recall)
f2_score = f2(test_dataset, predictions)
# For specific label
f1_cat = f1(test_dataset, predictions, label="cat")
# With averaging
f1_macro = f1(test_dataset, predictions, average="macro")
When to use what:
- F1: Balanced importance of precision and recall
- F2: Recall more important (e.g., medical diagnosis)
- F0.5: Precision more important (e.g., spam detection)
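The import above only exposes f1 and f2. If you need F0.5 (or any other beta) and your ainxt version does not provide it directly, one option is to drop down to scikit-learn via to_arrays (a sketch, assuming to_arrays returns sklearn-compatible arrays, as in the balanced-accuracy example later in this section):
from sklearn.metrics import fbeta_score
from ainxt.evaluation.classification.utils import to_arrays
# Convert instances/predictions to sklearn-compatible arrays (see "Utility Functions" below)
y_true, y_pred = to_arrays(test_dataset, predictions, binary=True)
# beta=0.5 weights precision more heavily than recall
f05 = fbeta_score(y_true, y_pred, beta=0.5, average="macro")
print(f"F0.5: {f05:.4f}")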
Top-K Accuracy
Measures whether the correct label appears among the top-k predictions:
from ainxt.evaluation.classification.metrics import top_k_accuracy
# Check if correct label is in top 3 predictions
top3_acc = top_k_accuracy(test_dataset, predictions, k=3)
# Check if correct label is in top 5
top5_acc = top_k_accuracy(test_dataset, predictions, k=5)
# For specific label
top3_cat = top_k_accuracy(test_dataset, predictions, k=3, label="cat")
Use case: Multi-class problems where close alternatives are acceptable (e.g., ImageNet classification)
ROC AUC Score
Area under the Receiver Operating Characteristic curve:
from ainxt.evaluation.classification.metrics import roc_auc
# Binary classification
auc = roc_auc(test_dataset, predictions)
# Multi-class (one-vs-rest)
auc_ovr = roc_auc(test_dataset, predictions, multi_class="ovr")
# Multi-class (one-vs-one)
auc_ovo = roc_auc(test_dataset, predictions, multi_class="ovo")
# For specific label
auc_cat = roc_auc(test_dataset, predictions, label="cat")
Use case: Evaluating classifier performance across all thresholds
3. Complete Evaluation Example
From notebooks/model/SH_Model_Prediction_and_Metrics.ipynb:
from ainxt.data.split import train_test_split_dataset
from ainxt.evaluation.classification.metrics import accuracy, precision, recall
# 1. Split dataset
train_dataset, test_dataset, _ = train_test_split_dataset(
seeds_dataset,
test_size=0.1,
shuffle=True,
random_state=42
)
# 2. Train model
model = LogisticRegressionModel(labels=["1", "2", "3"])
model.fit(dataset=train_dataset)
# 3. Generate predictions
predictions = model(test_dataset)
# 4. Evaluate with multiple metrics
print(f"Accuracy: {accuracy(test_dataset, predictions):.4f}")
print(f"Precision: {precision(test_dataset, predictions):.4f}")
print(f"Recall: {recall(test_dataset, predictions):.4f}")
# Output:
# Accuracy: 0.9524
# Precision: 0.9456
# Recall: 0.9501
4. Understanding Metric Parameters
threshold: Confidence Filtering
The threshold parameter filters predictions based on confidence scores:
# Default: Use highest scoring class regardless of confidence
acc = accuracy(test_dataset, predictions)
# With threshold: Only predictions with score > 0.5 count as positive
acc_50 = accuracy(test_dataset, predictions, threshold=0.5)
# Higher threshold (more conservative)
acc_80 = accuracy(test_dataset, predictions, threshold=0.8)
Effect:
- No threshold: All predictions count (uses argmax)
- With threshold: Only confident predictions count (useful for multi-label)
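Conceptually, the two modes differ in how a prediction's label-to-score mapping is turned into positive labels. A minimal illustration on a single score dict (this mirrors the behaviour described above; the actual decision logic lives inside ainxt's metrics):
scores = {"cat": 0.45, "dog": 0.35, "bird": 0.20}
# No threshold: take the single highest-scoring label (argmax)
argmax_label = max(scores.items(), key=lambda x: x[1])[0]
print(argmax_label)  # "cat"
# threshold=0.5: only labels scoring above the threshold count as positive
positives = [label for label, score in scores.items() if score > 0.5]
print(positives)  # [] since nothing is confident enough (useful in multi-label settings)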
average: How to Aggregate
The average parameter controls multi-class aggregation:
# Macro: Average of per-class scores (treats all classes equally)
prec_macro = precision(test_dataset, predictions, average="macro")
# Micro: Global average (weights by class frequency)
prec_micro = precision(test_dataset, predictions, average="micro")
# Weighted: Average weighted by class support
prec_weighted = precision(test_dataset, predictions, average="weighted")
When to use:
- macro: Imbalanced datasets, care about all classes equally
- micro: Balanced datasets, care about overall performance
- weighted: Want to account for class imbalance in the average
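A quick way to build intuition is to compare the three strategies on a tiny, deliberately imbalanced toy example using scikit-learn directly (plain label lists here, independent of ainxt):
from sklearn.metrics import precision_score
# 8 examples of class "a", 2 of class "b"; the model is strong on "a", weak on "b"
y_true = ["a", "a", "a", "a", "a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "a", "a", "b", "b", "a", "b"]
print(precision_score(y_true, y_pred, average="macro"))     # ~0.595: both classes count equally
print(precision_score(y_true, y_pred, average="micro"))     # 0.700: dominated by the frequent class
print(precision_score(y_true, y_pred, average="weighted"))  # ~0.752: per-class scores weighted by support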
label: Single-Class Evaluation
Evaluate performance for one specific class:
# Overall precision across all classes
prec_all = precision(test_dataset, predictions)
# Precision only for "cat" class
prec_cat = precision(test_dataset, predictions, label="cat")
# Useful for per-class analysis
for label in ["cat", "dog", "bird"]:
prec = precision(test_dataset, predictions, label=label)
rec = recall(test_dataset, predictions, label=label)
print(f"{label}: Precision={prec:.3f}, Recall={rec:.3f}")
5. Custom Metrics
You can create custom metrics following the same pattern:
from collections.abc import Sequence
from ainxt.data import Instance
from ainxt.models import Prediction
def custom_metric(
instances: Sequence[Instance],
predictions: Sequence[Sequence[Prediction]],
**kwargs
) -> float:
"""Custom metric implementation.
Args:
instances: Ground truth data
predictions: Model predictions
**kwargs: Custom parameters
Returns:
Metric score
"""
# Extract ground truth labels
y_true = [inst.annotation.label for inst in instances]
# Extract predicted labels (highest scoring)
y_pred = [
max(pred_list[0].classification.items(), key=lambda x: x[1])[0]
for pred_list in predictions
]
# Compute your custom score
score = your_custom_calculation(y_true, y_pred)
return score
# Usage
score = custom_metric(test_dataset, predictions, param1=value1)
Example: Balanced Accuracy
from sklearn.metrics import balanced_accuracy_score
from ainxt.evaluation.classification.utils import to_arrays
def balanced_accuracy(
instances: Sequence[Instance],
predictions: Sequence[Sequence[Prediction]]
) -> float:
"""Balanced accuracy metric.
Useful for imbalanced datasets - averages recall per class.
"""
# Convert to arrays using ainxt utility
y_true, y_pred = to_arrays(instances, predictions, binary=True)
# Use sklearn's balanced accuracy
return balanced_accuracy_score(y_true, y_pred)
# Usage
bal_acc = balanced_accuracy(test_dataset, predictions)
6. Utility Functions
to_arrays: Convert to NumPy
Convert Instances and Predictions to sklearn-compatible arrays:
from ainxt.evaluation.classification.utils import to_arrays
# Convert to arrays
y_true, y_pred = to_arrays(
test_dataset,
predictions,
threshold=0.5, # Optional confidence threshold
binary=True # Return binary arrays (0/1)
)
# Now you can use any sklearn metric
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
Source: ainxt/evaluation/classification/utils.py
get_labels: Extract All Labels
Get all unique labels from instances and predictions:
from ainxt.evaluation.utils import get_labels
# Get all labels
labels = get_labels(test_dataset, predictions)
print(labels) # ['1', '2', '3']
# Useful for iterating over all classes
for label in labels:
prec = precision(test_dataset, predictions, label=label)
print(f"Precision for {label}: {prec:.3f}")
7. Visualizations
While visualizations are typically project-specific, aiNXT provides utilities to extract data for plotting:
import matplotlib.pyplot as plt
from ainxt.evaluation.classification.utils import to_arrays
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Get arrays
y_true, y_pred = to_arrays(test_dataset, predictions, binary=True)
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["1", "2", "3"])
disp.plot()
plt.title("Seeds Dataset Confusion Matrix")
plt.show()
# ROC curves
from sklearn.metrics import roc_curve, auc
plt.figure()
for label in ["1", "2", "3"]:
    # Binarize the ground truth for this label and collect the model's scores for it
    y_true_binary = [int(inst.annotation.label == label) for inst in test_dataset]
    y_scores = [pred[0].classification.get(label, 0) for pred in predictions]
    fpr, tpr, _ = roc_curve(y_true_binary, y_scores)
    class_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"Class {label} (AUC = {class_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves")
plt.legend()
plt.show()
Best Practices
1. Always Use Consistent Data
Ensure predictions align with instances:
# GOOD - predictions match test_dataset order
predictions = model(test_dataset)
acc = accuracy(test_dataset, predictions)
# AVOID - mismatched data
predictions = model(test_dataset)
acc = accuracy(train_dataset, predictions) # Wrong! Mismatch!
2. Choose Appropriate Metrics
Different tasks need different metrics:
# Binary classification - ROC AUC is good
auc = roc_auc(test_dataset, predictions)
# Imbalanced dataset - use macro-averaged metrics
prec = precision(test_dataset, predictions, average="macro")
# Multi-label classification - use label-wise metrics
for label in labels:
prec = precision(test_dataset, predictions, label=label)
3. Report Multiple Metrics
One metric is rarely enough:
# GOOD - comprehensive evaluation
results = {
"accuracy": accuracy(test_dataset, predictions),
"precision": precision(test_dataset, predictions),
"recall": recall(test_dataset, predictions),
"f1": f1(test_dataset, predictions),
"top3_accuracy": top_k_accuracy(test_dataset, predictions, k=3)
}
for metric, score in results.items():
print(f"{metric}: {score:.4f}")
# AVOID - relying on single metric
acc = accuracy(test_dataset, predictions) # Not enough information!
4. Use Thresholds for Confidence Filtering
For production systems, evaluate at your deployment threshold:
# Training evaluation (no threshold)
train_acc = accuracy(train_dataset, train_predictions)
# Production evaluation (with confidence threshold)
prod_acc = accuracy(test_dataset, predictions, threshold=0.8)
# This shows how the model performs when low-confidence predictions are rejected
Summary
Metrics and Evaluations measure model quality:
- Consistent Interface: All metrics take (instances, predictions, **kwargs)
- Classification Metrics: accuracy, precision, recall, f1, top-k, ROC AUC
- Flexible Parameters: threshold, average, label for fine-grained control
- Utility Functions: to_arrays, get_labels for custom analysis
- Visualization-Ready: Easy integration with matplotlib, sklearn
Key principles:
- Always evaluate on held-out test data
- Report multiple complementary metrics
- Choose metrics appropriate for your task
- Consider class imbalance and thresholds
- Visualize results for deeper understanding