Core Concept: Metrics & Evaluations

Overview

Metrics and Evaluations are how you measure the quality of your Model predictions against ground truth Dataset annotations. aiNXT provides standardized metrics that work directly with Instance and Prediction objects.

Real-world analogy: Metrics are like grading a test:

- Ground truth (Instances): The correct answers
- Predictions: The student's answers
- Metrics: The grading rubric (accuracy, precision, recall, etc.)

Based on notebooks/model/SH_Model_Prediction_and_Metrics.ipynb, aiNXT metrics follow a consistent interface that makes evaluation straightforward.


1. The Metrics Interface

How Metrics Work

All aiNXT metrics follow the same pattern:

def metric_function(
    instances: Sequence[Instance],      # Ground truth data
    predictions: Sequence[Sequence[Prediction]],  # Model predictions
    **kwargs  # Metric-specific parameters
) -> float:  # Returns a score
    """
    Args:
        instances: Dataset instances with ground truth annotations
        predictions: Model predictions (one Sequence[Prediction] per instance)
        **kwargs: Additional parameters (threshold, average, etc.)

    Returns:
        Metric score as a float
    """

Key insight: Metrics receive two parallel sequences:

1. instances - The ground truth labels from your dataset
2. predictions - The model's predictions for those same instances

The metric compares them and returns a score.
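
For intuition, a minimal sketch of that comparison might look like the following (illustrative only; it assumes each Instance exposes its ground-truth label as instance.annotation.label and each Prediction exposes a label-to-score mapping as prediction.classification, the same access pattern used in the custom-metric example later on this page):

def toy_accuracy(instances, predictions):
    """Fraction of instances whose top-scoring predicted label matches the ground truth."""
    correct = 0
    for instance, prediction_list in zip(instances, predictions):
        # Highest-scoring label from the first Prediction for this instance
        predicted_label = max(prediction_list[0].classification.items(), key=lambda x: x[1])[0]
        if predicted_label == instance.annotation.label:
            correct += 1
    return correct / len(instances)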

Basic Metrics Usage

from ainxt.evaluation.classification.metrics import accuracy, precision, recall

# Assuming you have:
# - test_dataset: Dataset with ground truth labels
# - model: Trained model
# - predictions: model(test_dataset)

# Calculate accuracy
acc = accuracy(test_dataset, predictions)
print(f"Accuracy: {acc:.4f}")  # e.g., 0.9524

# Calculate precision
prec = precision(test_dataset, predictions)
print(f"Precision: {prec:.4f}")  # e.g., 0.9456

# Calculate recall
rec = recall(test_dataset, predictions)
print(f"Recall: {rec:.4f}")  # e.g., 0.9501

2. Classification Metrics

Accuracy

Measures the proportion of correct predictions:

from ainxt.evaluation.classification.metrics import accuracy

# Simple accuracy
acc = accuracy(test_dataset, predictions)

# For multi-label classification with strict matching
# (all labels must match exactly)
acc_strict = accuracy(test_dataset, predictions, strict=True)

# For multi-label with label-wise matching
# (each label contributes independently)
acc_relaxed = accuracy(test_dataset, predictions, strict=False)

# With confidence threshold
# (only predictions above threshold count as positive)
acc_threshold = accuracy(test_dataset, predictions, threshold=0.5)

Source: ainxt/evaluation/classification/metrics.py:13

Parameters:

- instances: Ground truth data
- predictions: Model predictions
- threshold (optional): Confidence threshold for positive predictions
- strict (optional): For multi-label, whether to require exact match

Precision

Measures how many predicted positives are actually correct:

$$\text{Precision} = \frac{TP}{TP + FP}$$

from ainxt.evaluation.classification.metrics import precision

# Overall precision (macro-averaged across classes)
prec = precision(test_dataset, predictions)

# Precision for specific label
prec_cat = precision(test_dataset, predictions, label="cat")

# With different averaging strategies
prec_macro = precision(test_dataset, predictions, average="macro")
prec_micro = precision(test_dataset, predictions, average="micro")
prec_weighted = precision(test_dataset, predictions, average="weighted")

# With confidence threshold
prec_threshold = precision(test_dataset, predictions, threshold=0.5)

Source: ainxt/evaluation/classification/metrics.py:86

Parameters:

- instances: Ground truth data
- predictions: Model predictions
- label (optional): Compute for specific label only
- threshold (optional): Confidence threshold
- average (optional): Averaging strategy ("macro", "micro", "weighted")

Recall

Measures how many actual positives were correctly identified:

$$\text{Recall} = \frac{TP}{TP + FN}$$

from ainxt.evaluation.classification.metrics import recall

# Overall recall
rec = recall(test_dataset, predictions)

# Recall for specific label
rec_cat = recall(test_dataset, predictions, label="cat")

# With different averaging strategies
rec_macro = recall(test_dataset, predictions, average="macro")
rec_micro = recall(test_dataset, predictions, average="micro")

# With confidence threshold
rec_threshold = recall(test_dataset, predictions, threshold=0.5)

Source: ainxt/evaluation/classification/metrics.py

Parameters: Same as precision

F-Beta Score

Weighted harmonic mean of precision and recall:

$$F_\beta = (1 + \beta^2) \times \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$

from ainxt.evaluation.classification.metrics import f1, f2

# F1 score (β=1, equal weight to precision and recall)
f1_score = f1(test_dataset, predictions)

# F2 score (β=2, more weight on recall)
f2_score = f2(test_dataset, predictions)

# For specific label
f1_cat = f1(test_dataset, predictions, label="cat")

# With averaging
f1_macro = f1(test_dataset, predictions, average="macro")

When to use what:

- F1: Balanced importance of precision and recall
- F2: Recall more important (e.g., medical diagnosis)
- F0.5: Precision more important (e.g., spam detection); see the sketch below
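
The import above covers f1 and f2. If you need a different beta (such as F0.5) and a ready-made function is not available, one option is to combine precision and recall yourself using the formula above. A minimal sketch, reusing the precision and recall functions shown earlier; fbeta_from_scores is a hypothetical helper, not part of aiNXT:

from ainxt.evaluation.classification.metrics import precision, recall

def fbeta_from_scores(prec: float, rec: float, beta: float) -> float:
    """Combine precision and recall into an F-beta score using the formula above."""
    if prec == 0 and rec == 0:
        return 0.0
    return (1 + beta**2) * (prec * rec) / (beta**2 * prec + rec)

# F0.5 weights precision more heavily than recall
f0_5_score = fbeta_from_scores(
    precision(test_dataset, predictions),
    recall(test_dataset, predictions),
    beta=0.5,
)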

Top-K Accuracy

Measures whether the correct label appears among the top-k predictions:

from ainxt.evaluation.classification.metrics import top_k_accuracy

# Check if correct label is in top 3 predictions
top3_acc = top_k_accuracy(test_dataset, predictions, k=3)

# Check if correct label is in top 5
top5_acc = top_k_accuracy(test_dataset, predictions, k=5)

# For specific label
top3_cat = top_k_accuracy(test_dataset, predictions, k=3, label="cat")

Use case: Multi-class problems where close alternatives are acceptable (e.g., ImageNet classification)
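
To make the definition concrete, here is what a single top-3 check looks like on one prediction's label scores (a toy example with made-up scores; the label-to-score dictionary mirrors the prediction.classification mappings used later on this page):

scores = {"cat": 0.50, "dog": 0.30, "bird": 0.15, "fish": 0.05}  # one prediction's label scores
top_3 = sorted(scores, key=scores.get, reverse=True)[:3]         # ["cat", "dog", "bird"]
hit = "bird" in top_3                                            # True: counted as correct for k=3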

ROC AUC Score

Area under the Receiver Operating Characteristic curve:

from ainxt.evaluation.classification.metrics import roc_auc

# Binary classification
auc = roc_auc(test_dataset, predictions)

# Multi-class (one-vs-rest)
auc_ovr = roc_auc(test_dataset, predictions, multi_class="ovr")

# Multi-class (one-vs-one)
auc_ovo = roc_auc(test_dataset, predictions, multi_class="ovo")

# For specific label
auc_cat = roc_auc(test_dataset, predictions, label="cat")

Use case: Evaluating classifier performance across all thresholds


3. Complete Evaluation Example

From notebooks/model/SH_Model_Prediction_and_Metrics.ipynb:

from ainxt.data.split import train_test_split_dataset
from ainxt.evaluation.classification.metrics import accuracy, precision, recall

# 1. Split dataset
train_dataset, test_dataset, _ = train_test_split_dataset(
    seeds_dataset,
    test_size=0.1,
    shuffle=True,
    random_state=42
)

# 2. Train model
model = LogisticRegressionModel(labels=["1", "2", "3"])
model.fit(dataset=train_dataset)

# 3. Generate predictions
predictions = model(test_dataset)

# 4. Evaluate with multiple metrics
print(f"Accuracy:  {accuracy(test_dataset, predictions):.4f}")
print(f"Precision: {precision(test_dataset, predictions):.4f}")
print(f"Recall:    {recall(test_dataset, predictions):.4f}")

# Output:
# Accuracy:  0.9524
# Precision: 0.9456
# Recall:    0.9501

4. Understanding Metric Parameters

threshold: Confidence Filtering

The threshold parameter filters predictions based on confidence scores:

# Default: Use highest scoring class regardless of confidence
acc = accuracy(test_dataset, predictions)

# With threshold: Only predictions with score > 0.5 count as positive
acc_50 = accuracy(test_dataset, predictions, threshold=0.5)

# Higher threshold (more conservative)
acc_80 = accuracy(test_dataset, predictions, threshold=0.8)

Effect:

- No threshold: All predictions count (uses argmax)
- With threshold: Only confident predictions count (useful for multi-label); see the sketch below
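
On a single prediction's scores, the difference looks roughly like this (toy scores, plain Python rather than aiNXT code):

scores = {"cat": 0.45, "dog": 0.35, "bird": 0.20}  # one prediction's label scores

# No threshold: argmax always produces exactly one positive label
argmax_label = max(scores, key=scores.get)  # "cat"

# threshold=0.5: only labels scoring above 0.5 count as positive
positives = [label for label, score in scores.items() if score > 0.5]  # [] - nothing is confident enough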

average: How to Aggregate

The average parameter controls multi-class aggregation:

# Macro: Average of per-class scores (treats all classes equally)
prec_macro = precision(test_dataset, predictions, average="macro")

# Micro: Global average (weights by class frequency)
prec_micro = precision(test_dataset, predictions, average="micro")

# Weighted: Average weighted by class support
prec_weighted = precision(test_dataset, predictions, average="weighted")

When to use:

- macro: Imbalanced datasets, care about all classes equally
- micro: Balanced datasets, care about overall performance
- weighted: Want to account for class imbalance in average
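
To see how the strategies differ, here is a small worked example with two classes, using the standard precision definitions (plain Python, not aiNXT code):

# Per-class counts: class "A" is common, class "B" is rare
tp = {"A": 8, "B": 1}
fp = {"A": 2, "B": 1}

# Macro: average the per-class precisions, so the rare class counts as much as the common one
macro = ((tp["A"] / (tp["A"] + fp["A"])) + (tp["B"] / (tp["B"] + fp["B"]))) / 2  # (0.80 + 0.50) / 2 = 0.65

# Micro: pool all counts first, so the frequent class dominates
micro = (tp["A"] + tp["B"]) / (tp["A"] + tp["B"] + fp["A"] + fp["B"])  # 9 / 12 = 0.75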

label: Single-Class Evaluation

Evaluate performance for one specific class:

# Overall precision across all classes
prec_all = precision(test_dataset, predictions)

# Precision only for "cat" class
prec_cat = precision(test_dataset, predictions, label="cat")

# Useful for per-class analysis
for label in ["cat", "dog", "bird"]:
    prec = precision(test_dataset, predictions, label=label)
    rec = recall(test_dataset, predictions, label=label)
    print(f"{label}: Precision={prec:.3f}, Recall={rec:.3f}")

5. Custom Metrics

You can create custom metrics following the same pattern:

from collections.abc import Sequence
from ainxt.data import Instance
from ainxt.models import Prediction

def custom_metric(
    instances: Sequence[Instance],
    predictions: Sequence[Sequence[Prediction]],
    **kwargs
) -> float:
    """Custom metric implementation.

    Args:
        instances: Ground truth data
        predictions: Model predictions
        **kwargs: Custom parameters

    Returns:
        Metric score
    """
    # Extract ground truth labels
    y_true = [inst.annotation.label for inst in instances]

    # Extract predicted labels (highest scoring)
    y_pred = [
        max(pred_list[0].classification.items(), key=lambda x: x[1])[0]
        for pred_list in predictions
    ]

    # Compute your custom score
    score = your_custom_calculation(y_true, y_pred)

    return score


# Usage
score = custom_metric(test_dataset, predictions, param1=value1)

Example: Balanced Accuracy

from sklearn.metrics import balanced_accuracy_score
from ainxt.evaluation.classification.utils import to_arrays

def balanced_accuracy(
    instances: Sequence[Instance],
    predictions: Sequence[Sequence[Prediction]]
) -> float:
    """Balanced accuracy metric.

    Useful for imbalanced datasets - averages recall per class.
    """
    # Convert to arrays using ainxt utility
    y_true, y_pred = to_arrays(instances, predictions, binary=True)

    # Use sklearn's balanced accuracy
    return balanced_accuracy_score(y_true, y_pred)


# Usage
bal_acc = balanced_accuracy(test_dataset, predictions)

6. Utility Functions

to_arrays: Convert to NumPy

Convert Instances and Predictions to sklearn-compatible arrays:

from ainxt.evaluation.classification.utils import to_arrays

# Convert to arrays
y_true, y_pred = to_arrays(
    test_dataset,
    predictions,
    threshold=0.5,  # Optional confidence threshold
    binary=True     # Return binary arrays (0/1)
)

# Now you can use any sklearn metric
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Source: ainxt/evaluation/classification/utils.py

get_labels: Extract All Labels

Get all unique labels from instances and predictions:

from ainxt.evaluation.utils import get_labels

# Get all labels
labels = get_labels(test_dataset, predictions)
print(labels)  # ['1', '2', '3']

# Useful for iterating over all classes
for label in labels:
    prec = precision(test_dataset, predictions, label=label)
    print(f"Precision for {label}: {prec:.3f}")

7. Visualizations

While visualizations are typically project-specific, aiNXT provides utilities to extract data for plotting:

import matplotlib.pyplot as plt
from ainxt.evaluation.classification.utils import to_arrays
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Get arrays
y_true, y_pred = to_arrays(test_dataset, predictions, binary=True)

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["1", "2", "3"])
disp.plot()
plt.title("Seeds Dataset Confusion Matrix")
plt.show()

# ROC curves
from sklearn.metrics import roc_curve, auc

for label in ["1", "2", "3"]:
    # Get probabilities for this label
    y_true_binary = (y_true == label).astype(int)
    y_scores = [pred[0].classification.get(label, 0) for pred in predictions]

    fpr, tpr, _ = roc_curve(y_true_binary, y_scores)
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f"Class {label} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves")
plt.legend()
plt.show()

Best Practices

1. Always Use Consistent Data

Ensure predictions align with instances:

# GOOD - predictions match test_dataset order
predictions = model(test_dataset)
acc = accuracy(test_dataset, predictions)

# AVOID - mismatched data
predictions = model(test_dataset)
acc = accuracy(train_dataset, predictions)  # Wrong! Mismatch!

2. Choose Appropriate Metrics

Different tasks need different metrics:

# Binary classification - ROC AUC is good
auc = roc_auc(test_dataset, predictions)

# Imbalanced dataset - use macro-averaged metrics
prec = precision(test_dataset, predictions, average="macro")

# Multi-label classification - use label-wise metrics
for label in labels:
    prec = precision(test_dataset, predictions, label=label)

3. Report Multiple Metrics

One metric is rarely enough:

# GOOD - comprehensive evaluation
results = {
    "accuracy": accuracy(test_dataset, predictions),
    "precision": precision(test_dataset, predictions),
    "recall": recall(test_dataset, predictions),
    "f1": f1(test_dataset, predictions),
    "top3_accuracy": top_k_accuracy(test_dataset, predictions, k=3)
}

for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

# AVOID - relying on single metric
acc = accuracy(test_dataset, predictions)  # Not enough information!

4. Use Thresholds for Confidence Filtering

For production systems, evaluate at your deployment threshold:

# Training evaluation (no threshold)
train_acc = accuracy(train_dataset, train_predictions)

# Production evaluation (with confidence threshold)
prod_acc = accuracy(test_dataset, predictions, threshold=0.8)

# This shows how model performs when low-confidence predictions are rejected
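
Alongside accuracy at a threshold, it can help to report how many predictions actually clear that threshold (i.e., how many would not be rejected). A minimal sketch, assuming the prediction.classification score dictionaries used earlier on this page:

# Highest confidence score per instance
top_scores = [max(pred_list[0].classification.values()) for pred_list in predictions]

# Share of predictions the system would accept at threshold 0.8
coverage = sum(score > 0.8 for score in top_scores) / len(top_scores)
print(f"Coverage at threshold 0.8: {coverage:.2%}")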

Summary

Metrics and Evaluations measure model quality:

  1. Consistent Interface: All metrics take (instances, predictions, **kwargs)
  2. Classification Metrics: accuracy, precision, recall, f1, top-k, ROC AUC
  3. Flexible Parameters: threshold, average, label for fine-grained control
  4. Utility Functions: to_arrays, get_labels for custom analysis
  5. Visualization-Ready: Easy integration with matplotlib, sklearn

Key principles:

- Always evaluate on held-out test data
- Report multiple complementary metrics
- Choose metrics appropriate for your task
- Consider class imbalance and thresholds
- Visualize results for deeper understanding

See Also

  • Models - Generating predictions
  • Datasets - Ground truth data
  • Context - Complete training and evaluation pipelines