Core Concept: Models and Predictions

Overview

Models are the heart of aiNXT's machine learning system. They consume Dataset objects and produce Prediction objects. The Model abstraction provides a standardized interface that works with any ML framework (PyTorch, TensorFlow, scikit-learn, etc.).

Real-world analogy: A Model is like a trained expert:

  • Input: Receives an Instance (a data point to analyze)
  • Processing: Applies learned knowledge
  • Output: Returns Prediction objects (diagnoses, classifications, scores)

This structure, inspired by notebooks/model/SH_Model_Prediction_and_Metrics.ipynb, ensures every model in aiNXT works the same way, regardless of the underlying framework.


1. Model: The Base Interface

What is a Model?

A Model makes predictions on Instance objects. It defines three core methods that every model must implement:

  • predict(): Make predictions for a single instance
  • save(): Serialize model to disk for deployment
  • load(): Restore model from disk

Source: ainxt/models/model.py

The Model Interface

from ainxt.models import Model
from ainxt.data import Instance
from ainxt.models import Prediction
from typing import Sequence

class MyModel(Model):
    """Base model interface."""

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make predictions for a single instance.

        Args:
            instance: The data point to make predictions for

        Returns:
            List of Prediction objects (even if only one prediction)
        """
        raise NotImplementedError

    def save(self, model_dir: str):
        """Save all model files to directory.

        Args:
            model_dir: Path to directory where model files should be saved
        """
        raise NotImplementedError

    def load(self, model_dir: str):
        """Load model files from directory.

        Args:
            model_dir: Path to directory containing model files
        """
        raise NotImplementedError

Why Always Return a Sequence?

Different ML tasks return different numbers of predictions:

  • Classification: 1 prediction for the entire image
  • Object Detection: N predictions (one per detected object, possibly 0)
  • Named Entity Recognition: M predictions (one per entity found)

To maintain consistency, ALL models return a Sequence[Prediction]:

# Classification model - always returns list with 1 prediction
predictions = classifier.predict(image)  # [Prediction(label="cat", score=0.95)]

# Object detection - returns list with N predictions (or empty list)
predictions = detector.predict(image)  # [Prediction(...), Prediction(...), ...]

# If no objects found
predictions = detector.predict(empty_image)  # []
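
Because every model returns a sequence, downstream code can consume predictions uniformly, without special-casing the task (a minimal sketch):

# The same loop handles classification (one prediction),
# detection (many), and the empty case (none)
for prediction in model.predict(instance):
    print(prediction.label, prediction.score)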

Complete Example: Logistic Regression Model

Based on notebooks/model/SH_Model_Prediction_and_Metrics.ipynb:

from pathlib import Path
from typing import Sequence, Optional, MutableMapping, Any
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

from ainxt.models import TrainableModel
from ainxt.data import Dataset, RawInstance
from ainxt.models import Prediction
from ainxt.typing import PathLike


class LogisticRegressionModel(TrainableModel):
    """Logistic Regression model using scikit-learn."""

    def __init__(
        self,
        labels: Sequence[str],
        params: Optional[MutableMapping[str, Any]] = None
    ):
        """Initialize Logistic Regression model.

        Args:
            labels: List of possible label values (e.g., ["1", "2", "3"])
            params: Optional sklearn LogisticRegression parameters
        """
        self.model = LogisticRegression()
        self.labels = labels
        self.params = params

    def predict(self, instance: RawInstance) -> Sequence[Prediction]:
        """Make predictions for a single instance.

        Args:
            instance: The Instance to classify

        Returns:
            List containing single Prediction with classification scores
        """
        # Reshape data for sklearn (expects 2D array)
        data = np.array(instance.data).reshape(1, -1)

        # Get predicted label and probabilities
        predicted_label = str(self.model.predict(data)[0])
        probabilities = self.model.predict_proba(data)[0]

        # Create classification dict {label: score}
        # (assumes self.labels is ordered like self.model.classes_)
        classification = dict(zip(self.labels, probabilities))

        # Create prediction with metadata
        meta = {"predicted_label": predicted_label}

        return [Prediction(classification=classification, meta=meta)]

    def fit(self, dataset: Dataset, params: Optional[MutableMapping[str, Any]] = None):
        """Train the model on a dataset.

        Args:
            dataset: Dataset containing RawInstances with labels
            params: Optional training parameters to override class params
        """
        # Extract features and labels from dataset
        X = [obj.data for obj in dataset]
        y = [obj.annotation.label for obj in dataset]

        # Call-time params take precedence over constructor params
        if params is None:
            params = self.params

        # Set model parameters if provided
        if params is not None:
            self.model.set_params(**params)
            print("Params set.")

        # Train the model
        self.model.fit(X, y)
        print("Model fitted.")

        return self

    def save(self, model_dir: PathLike):
        """Save model files to directory.

        Saves two files:
        - lr_labels.joblib: The label list
        - lr_estimator.joblib: The trained sklearn model

        Args:
            model_dir: Directory to save model files
        """
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)

        dump(self.labels, model_dir / "lr_labels.joblib")
        dump(self.model, model_dir / "lr_estimator.joblib")

    def load(self, model_dir: PathLike):
        """Load model files from directory.

        Args:
            model_dir: Directory containing model files
        """
        self.labels = load(Path(model_dir) / "lr_labels.joblib")
        self.model = load(Path(model_dir) / "lr_estimator.joblib")


# Usage example
from ainxt.data.split import train_test_split_dataset

# Assume we have SeedsDataset
train_dataset, test_dataset, _ = train_test_split_dataset(
    seeds_dataset,
    test_size=0.1,
    shuffle=True,
    random_state=42
)

# Initialize and train model
model = LogisticRegressionModel(labels=["1", "2", "3"])
model.fit(dataset=train_dataset)

# Make predictions on test set
predictions = model(test_dataset)  # Callable interface!

# Predictions is a list of Prediction lists
print(f"Made {len(predictions)} predictions")
print(predictions[0])  # Prediction list for the first instance
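
The trained model can be persisted and restored with the save()/load() pair (a minimal sketch reusing the class above; the directory name is arbitrary):

# Persist the trained model
model.save("lr_model")

# Later, e.g. in a serving process: restore into a fresh instance
restored = LogisticRegressionModel(labels=["1", "2", "3"])
restored.load("lr_model")
predictions = restored(test_dataset)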

Calling Models: Flexible Interface

Models are callable and support multiple input types:

# 1. Single instance prediction
prediction = model(instance)
# Returns: [Prediction(...)]

# 2. Dataset prediction (iterates over dataset)
predictions = model(dataset)
# Returns: [[Prediction(...)], [Prediction(...)], ...]  (one list per instance)

# 3. List of instances
predictions = model([instance1, instance2, instance3])
# Returns: [[Prediction(...)], [Prediction(...)], [Prediction(...)]]
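
When a flat list of predictions is more convenient than one list per instance, the nested result can be flattened (a minimal sketch):

from itertools import chain

# One Sequence[Prediction] per instance -> flat list of Predictions
all_predictions = list(chain.from_iterable(model(dataset)))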

2. TrainableModel: Models That Learn

What is a TrainableModel?

A TrainableModel extends the base Model interface with a fit() method for training. This separates models that need training (neural networks, sklearn models) from models that don't (rule-based systems, lookup tables).

Source: ainxt/models/model.py

The TrainableModel Interface

from ainxt.models import TrainableModel
from ainxt.data import Dataset

class MyTrainableModel(TrainableModel):
    """Trainable model interface."""

    def fit(self, dataset: Dataset, **kwargs):
        """Train the model on the dataset.

        Args:
            dataset: Dataset containing training instances
            **kwargs: Training parameters (epochs, learning_rate, etc.)

        Returns:
            self (for method chaining)
        """
        for instance in dataset:
            # Training logic here
            loss = self.train_step(instance)

        return self  # Return self for chaining

    # Also implement predict, save, load from Model
    def predict(self, instance): ...
    def save(self, model_dir): ...
    def load(self, model_dir): ...
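
For contrast, a model with no learned state implements only the base Model interface. The sketch below is illustrative; the threshold rule is not part of aiNXT:

from typing import Sequence
from ainxt.data import Instance
from ainxt.models import Model, Prediction

class ThresholdModel(Model):
    """Rule-based binary classifier: no fit() required."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        # Apply a fixed rule instead of learned weights
        score = float(instance.data[0])
        label = "positive" if score >= self.threshold else "negative"
        return [Prediction(classification={label: 1.0})]

    def save(self, model_dir: str):
        # Nothing is learned, but persist the rule's parameter anyway
        import json
        with open(f"{model_dir}/rule.json", "w") as f:
            json.dump({"threshold": self.threshold}, f)

    def load(self, model_dir: str):
        import json
        with open(f"{model_dir}/rule.json") as f:
            self.threshold = json.load(f)["threshold"]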

Training with Parsed Components

The real power of aiNXT shines when training with parsed configuration. Parameters like optimizer, loss_function, and callbacks can be automatically created from YAML:

class TensorFlowClassifier(TrainableModel):
    def fit(
        self,
        dataset: Dataset,
        optimizer=None,      # ← Parsed from config by OPTIMIZERS factory!
        loss_function=None,  # ← Parsed from config by LOSSES factory!
        callbacks=None,      # ← Parsed from config by CALLBACKS factory!
        epochs=10,
        batch_size=32
    ):
        """Train with TensorFlow.

        The optimizer, loss_function, and callbacks are already instantiated
        objects, not config dictionaries!
        """
        # Compile with parsed objects
        self.model.compile(optimizer=optimizer, loss=loss_function)

        # Keras' fit loop handles epochs, batching, and callbacks.
        # _to_arrays is a hypothetical helper that stacks the dataset
        # into feature and target arrays.
        X, y = self._to_arrays(dataset)
        self.model.fit(
            X, y,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
        )

        return self


# Configuration (YAML)
"""
training:
  epochs: 100
  batch_size: 32
  optimizer:           # ← Parsed by OPTIMIZERS factory
    name: adam
    learning_rate: 0.001
  loss_function:       # ← Parsed by LOSSES factory
    name: categorical_crossentropy
  callbacks:           # ← Parsed by CALLBACKS factory
    - name: early_stopping
      patience: 5
    - name: model_checkpoint
      save_best_only: true
"""

# The Context automatically parses these into objects before calling fit()
# No manual object creation needed!

See Parsers for more details on how this works.


3. Prediction: The Output Format

What is a Prediction?

A Prediction object represents a single prediction made by a model. It contains:

  • classification: Dictionary mapping labels to scores (for classification)
  • label: The predicted label (convenience property)
  • score: The prediction confidence/probability
  • meta: Additional metadata about the prediction

Source: ainxt/models/prediction.py

Creating Predictions

from ainxt.models import Prediction

# Classification prediction with multiple class scores
prediction = Prediction(
    classification={
        "cat": 0.85,
        "dog": 0.10,
        "bird": 0.05
    },
    meta={"model_version": "v2.1", "predicted_label": "cat"}
)

# Access properties
print(prediction.label)  # "cat" (highest scoring class)
print(prediction.score)  # 0.85 (score of highest class)
print(prediction.classification)  # {"cat": 0.85, "dog": 0.10, "bird": 0.05}
print(prediction.meta)  # {"model_version": "v2.1", ...}

# Simple binary prediction
prediction = Prediction(
    classification={"positive": 0.92, "negative": 0.08}
)

Why Prediction Objects?

  1. Standardization: All models return the same format
  2. Rich information: Not just a label, but scores for all classes
  3. Metadata support: Store model version, confidence, etc.
  4. Evaluation-ready: Direct input to metrics functions (see the sketch below)
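
As an example of point 4, a metric can consume predictions directly. The accuracy helper below is a hypothetical sketch, not an aiNXT API:

def accuracy(instances, model) -> float:
    """Fraction of instances whose top-scoring class matches the gold label."""
    correct = sum(
        model.predict(instance)[0].label == instance.annotation.label
        for instance in instances
    )
    return correct / len(instances)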

Multiple Predictions per Instance

Some tasks return multiple predictions per instance:

# Object detection: Multiple bounding box predictions
def predict(self, instance: Instance) -> Sequence[Prediction]:
    """Detect all objects in image."""
    detections = self.detector(instance.data)

    predictions = []
    for box in detections:
        predictions.append(Prediction(
            classification={"object": box.confidence},
            meta={
                "bbox": box.coordinates,
                "predicted_label": box.class_name
            }
        ))

    return predictions  # Multiple predictions for one instance!


# Named Entity Recognition: One prediction per entity
def predict(self, instance: Instance) -> Sequence[Prediction]:
    """Find all named entities in text."""
    entities = self.ner_model(instance.data)

    predictions = []
    for entity in entities:
        predictions.append(Prediction(
            classification={entity.type: entity.confidence},
            meta={
                "text": entity.text,
                "start": entity.start_pos,
                "end": entity.end_pos
            }
        ))

    return predictions

Framework Integration Examples

PyTorch Model

import torch
import torch.nn as nn
from ainxt.models import TrainableModel

class PyTorchClassifier(TrainableModel):
    """PyTorch neural network classifier."""

    def __init__(self, input_size: int, num_classes: int):
        self.model = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes)
        )
        self.num_classes = num_classes

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make prediction using PyTorch model."""
        # Convert to tensor
        tensor = torch.tensor(instance.data, dtype=torch.float32)

        # Forward pass
        with torch.no_grad():
            logits = self.model(tensor)
            probs = torch.softmax(logits, dim=0)

        # Create prediction
        classification = {
            f"class_{i}": float(probs[i])
            for i in range(self.num_classes)
        }

        return [Prediction(classification=classification)]

    def fit(self, dataset: Dataset, optimizer=None, epochs=10, **kwargs):
        """Train PyTorch model."""
        # optimizer is already instantiated by parser!

        criterion = nn.CrossEntropyLoss()

        for epoch in range(epochs):
            for instance in dataset:
                # Convert to tensors
                x = torch.tensor(instance.data, dtype=torch.float32)
                y = torch.tensor(int(instance.annotation.label), dtype=torch.long)

                # Forward pass (add batch dimension for the loss)
                logits = self.model(x)
                loss = criterion(logits.unsqueeze(0), y.unsqueeze(0))

                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        return self

    def save(self, model_dir: str):
        """Save PyTorch model."""
        torch.save(self.model.state_dict(), f"{model_dir}/model.pth")

    def load(self, model_dir: str):
        """Load PyTorch model."""
        self.model.load_state_dict(torch.load(f"{model_dir}/model.pth"))

Scikit-learn Model

from sklearn.ensemble import RandomForestClassifier
from ainxt.models import TrainableModel

class SKLearnClassifier(TrainableModel):
    """Random Forest classifier using scikit-learn."""

    def __init__(self, n_estimators: int = 100):
        self.model = RandomForestClassifier(n_estimators=n_estimators)
        self.classes_ = None

    def fit(self, dataset: Dataset, **kwargs):
        """Train sklearn model."""
        # Extract features and labels
        X = [inst.data for inst in dataset]
        y = [inst.annotation.label for inst in dataset]

        # Train
        self.model.fit(X, y)
        self.classes_ = self.model.classes_

        return self

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make prediction."""
        # Get probabilities for all classes
        probs = self.model.predict_proba([instance.data])[0]

        # Create classification dict
        classification = {
            str(cls): float(prob)
            for cls, prob in zip(self.classes_, probs)
        }

        return [Prediction(classification=classification)]

    def save(self, model_dir: str):
        """Save sklearn model."""
        import joblib
        joblib.dump(self.model, f"{model_dir}/model.joblib")
        joblib.dump(self.classes_, f"{model_dir}/classes.joblib")

    def load(self, model_dir: str):
        """Load sklearn model."""
        import joblib
        self.model = joblib.load(f"{model_dir}/model.joblib")
        self.classes_ = joblib.load(f"{model_dir}/classes.joblib")

Best Practices

1. Always Implement save() and load()

Models need to be deployable in offline environments:

# GOOD - complete implementation
def save(self, model_dir: str):
    # Save weights
    torch.save(self.model.state_dict(), f"{model_dir}/weights.pth")
    # Save config
    import json
    with open(f"{model_dir}/config.json", "w") as f:
        json.dump(self.config, f)

def load(self, model_dir: str):
    # Load weights
    self.model.load_state_dict(torch.load(f"{model_dir}/weights.pth"))
    # Load config
    import json
    with open(f"{model_dir}/config.json", "r") as f:
        self.config = json.load(f)

# AVOID - empty implementation
def save(self, model_dir: str):
    pass  # Don't do this!

2. Return Predictions Consistently

Always return a list, even for single predictions:

# GOOD
def predict(self, instance):
    result = self.model(instance.data)
    return [Prediction(classification={...})]  # Always a list!

# AVOID
def predict(self, instance):
    result = self.model(instance.data)
    return Prediction(classification={...})  # Not a list!

3. Use Type Hints

# GOOD - clear types
from typing import Sequence
from ainxt.data import Instance
from ainxt.models import Prediction

def predict(self, instance: Instance) -> Sequence[Prediction]:
    ...

# AVOID - no type information
def predict(self, instance):
    ...

4. Make fit() Return self

This enables method chaining:

# GOOD - return self
def fit(self, dataset: Dataset, **kwargs):
    # Training logic
    return self

# Usage with chaining
model = MyModel().fit(train_data).fit(additional_data)

# AVOID - return None
def fit(self, dataset: Dataset, **kwargs):
    # Training logic
    pass  # Returns None implicitly

Registration with Factory

Register models for configuration-based creation:

from ainxt.serving import MODELS

# Manual registration
MODELS.register(
    task="classification",
    name="logistic_regression",
    constructor=LogisticRegressionModel
)

# Configuration-based creation
config = {
    "name": "logistic_regression",
    "labels": ["1", "2", "3"]
}

model = MODELS.build(**config)

Or use Loaders for automatic discovery (see Loaders).


Summary

Models and Predictions provide standardized ML interfaces:

  1. Model: Makes predictions on instances
     • predict(): Instance → Sequence[Prediction]
     • save(): Serialize to disk
     • load(): Restore from disk
     • Callable interface for flexibility

  2. TrainableModel: Extends Model with training
     • fit(): Train on Dataset
     • Works with parsed components (optimizer, loss, etc.)
     • Returns self for method chaining

  3. Prediction: Standardized output format
     • Contains classification scores
     • Includes metadata
     • Ready for evaluation metrics

This architecture enables:

  • Framework-agnostic model development
  • Configuration-driven training
  • Standardized evaluation
  • Easy deployment and serving

See Also