Core Concept: Models and Predictions
Overview
Models are the heart of aiNXT's machine learning system. They take Instance objects, individually or as a whole Dataset, and produce Prediction objects. The Model abstraction provides a standardized interface that works with any ML framework (PyTorch, TensorFlow, scikit-learn, etc.).
Real-world analogy: A Model is like a trained expert:
- Input: Receives an Instance (a data point to analyze)
- Processing: Applies learned knowledge
- Output: Returns Prediction objects (diagnoses, classifications, scores)
This structure, inspired by notebooks/model/SH_Model_Prediction_and_Metrics.ipynb, ensures every model in aiNXT works the same way, regardless of the underlying framework.
1. Model: The Base Interface
What is a Model?
A Model makes predictions on Instance objects. It defines three core methods that every model must implement:
- predict(): Make predictions for a single instance
- save(): Serialize model to disk for deployment
- load(): Restore model from disk
Source: ainxt/models/model.py
The Model Interface
from typing import Sequence

from ainxt.data import Instance
from ainxt.models import Model, Prediction


class MyModel(Model):
    """Base model interface."""

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make predictions for a single instance.

        Args:
            instance: The data point to make predictions for

        Returns:
            List of Prediction objects (even if only one prediction)
        """
        raise NotImplementedError

    def save(self, model_dir: str):
        """Save all model files to directory.

        Args:
            model_dir: Path to directory where model files should be saved
        """
        raise NotImplementedError

    def load(self, model_dir: str):
        """Load model files from directory.

        Args:
            model_dir: Path to directory containing model files
        """
        raise NotImplementedError
Why Always Return a Sequence?
Different ML tasks return different numbers of predictions:
- Classification: 1 prediction for the entire image
- Object Detection: N predictions (one per detected object, possibly 0)
- Named Entity Recognition: M predictions (one per entity found)
To maintain consistency, ALL models return a Sequence[Prediction]:
# Classification model - always returns list with 1 prediction
predictions = classifier.predict(image)  # [Prediction(classification={"cat": 0.95, "dog": 0.05})]
# Object detection - returns list with N predictions (or empty list)
predictions = detector.predict(image) # [Prediction(...), Prediction(...), ...]
# If no objects found
predictions = detector.predict(empty_image) # []
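Because every model returns a sequence, downstream code can handle all tasks with the same loop. The following is a minimal sketch of that idea, not aiNXT code: `Prediction` here is a stand-in dataclass and `summarize` is a hypothetical helper introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Prediction:
    """Stand-in for ainxt.models.Prediction, for illustration only."""
    classification: Dict[str, float]
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize(predictions: Sequence[Prediction]) -> List[str]:
    """Return the top-scoring label of each prediction; works for any length."""
    return [max(p.classification, key=p.classification.get) for p in predictions]


# One prediction (classification), several (detection), or none at all
classifier_output = [Prediction({"cat": 0.95, "dog": 0.05})]
detector_output = [Prediction({"car": 0.8}), Prediction({"person": 0.6})]
empty_output: List[Prediction] = []

for output in (classifier_output, detector_output, empty_output):
    print(len(output), summarize(output))
```

No special case is needed for the empty result: the same comprehension simply produces an empty list.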
Complete Example: Logistic Regression Model
Based on notebooks/model/SH_Model_Prediction_and_Metrics.ipynb:
from pathlib import Path
from typing import Sequence, Optional, MutableMapping, Any

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

from ainxt.data import Dataset, RawInstance
from ainxt.models import Model, Prediction, TrainableModel
from ainxt.typing import PathLike


class LogisticRegressionModel(TrainableModel, Model):
    """Logistic Regression model using scikit-learn."""

    def __init__(
        self,
        labels: Sequence[str],
        params: Optional[MutableMapping[str, Any]] = None
    ):
        """Initialize Logistic Regression model.

        Args:
            labels: List of possible label values (e.g., ["1", "2", "3"])
            params: Optional sklearn LogisticRegression parameters
        """
        self.model = LogisticRegression()
        self.labels = labels
        self.params = params

    def predict(self, instance: RawInstance) -> Sequence[Prediction]:
        """Make predictions for a single instance.

        Args:
            instance: The Instance to classify

        Returns:
            List containing single Prediction with classification scores
        """
        # Reshape data for sklearn (expects 2D array)
        data = np.array(instance.data).reshape(1, -1)

        # Get predicted label and probabilities
        predicted_label = str(self.model.predict(data)[0])
        probabilities = self.model.predict_proba(data)[0]

        # Create classification dict {label: score}
        classification = dict(zip(self.labels, probabilities))

        # Create prediction with metadata
        meta = {"predicted_label": predicted_label}
        return [Prediction(classification=classification, meta=meta)]

    def fit(self, dataset: Dataset, params: Optional[dict] = None):
        """Train the model on a dataset.

        Args:
            dataset: Dataset containing RawInstances with labels
            params: Optional training parameters to override class params

        Returns:
            self (for method chaining)
        """
        # Extract features and labels from dataset
        X = [obj.data for obj in dataset]
        y = [obj.annotation.label for obj in dataset]

        # Determine which params to use
        if params is not None:
            params = params["params"]
        elif self.params is not None:
            params = self.params

        # Set model parameters if provided (params is a list of dicts)
        if params is not None:
            param_dict = {k: v for d in params for k, v in d.items()}
            self.model.set_params(**param_dict)
            print("Params set.")

        # Train the model
        self.model.fit(X, y)
        print("Model fitted.")
        return self

    def save(self, model_dir: PathLike):
        """Save model files to directory.

        Saves two files:
        - lr_labels.joblib: The label list
        - lr_estimator.joblib: The trained sklearn model

        Args:
            model_dir: Directory to save model files
        """
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)
        dump(self.labels, model_dir / "lr_labels.joblib")
        dump(self.model, model_dir / "lr_estimator.joblib")

    def load(self, model_dir: PathLike):
        """Load model files from directory.

        Args:
            model_dir: Directory containing model files
        """
        if model_dir is None:
            model_dir = "/"
        self.labels = load(Path(model_dir) / "lr_labels.joblib")
        self.model = load(Path(model_dir) / "lr_estimator.joblib")
# Usage example
from ainxt.data.split import train_test_split_dataset

# Assume seeds_dataset is an existing SeedsDataset
train_dataset, test_dataset, _ = train_test_split_dataset(
    seeds_dataset,
    test_size=0.1,
    shuffle=True,
    random_state=42
)

# Initialize and train model
model = LogisticRegressionModel(labels=["1", "2", "3"])
model.fit(dataset=train_dataset)

# Make predictions on test set
predictions = model(test_dataset)  # Callable interface!

# predictions is a list with one list of Prediction objects per instance
print(f"Made predictions for {len(predictions)} instances")
print(predictions[0])     # Predictions for the first instance: [Prediction(...)]
print(predictions[0][0])  # First Prediction object
Calling Models: Flexible Interface
Models are callable and support multiple input types:
# 1. Single instance prediction
predictions = model(instance)
# Returns: [Prediction(...)]
# 2. Dataset prediction (iterates over dataset)
predictions = model(dataset)
# Returns: [[Prediction(...)], [Prediction(...)], ...] (one list per instance)
# 3. List of instances
predictions = model([instance1, instance2, instance3])
# Returns: [[Prediction(...)], [Prediction(...)], [Prediction(...)]]
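How the callable interface tells these input types apart is not shown here. One plausible way to implement such a dispatch is sketched below, as an assumption rather than the aiNXT implementation: `Instance`, `CallableModel`, and `EchoModel` are illustrative stand-ins, and the toy "predictions" are plain strings.

```python
from typing import Any, Iterable, Sequence, Union


class Instance:
    """Stand-in for ainxt.data.Instance (illustration only)."""

    def __init__(self, data: Any):
        self.data = data


class CallableModel:
    """Hypothetical base class showing one possible __call__ dispatch."""

    def predict(self, instance: Instance) -> Sequence[str]:
        raise NotImplementedError

    def __call__(self, inputs: Union[Instance, Iterable[Instance]]):
        # A single Instance yields one list of predictions.
        if isinstance(inputs, Instance):
            return self.predict(inputs)
        # Any other iterable (dataset, list) yields one list per instance.
        return [self.predict(instance) for instance in inputs]


class EchoModel(CallableModel):
    """Toy model whose 'prediction' is a string describing the input."""

    def predict(self, instance: Instance) -> Sequence[str]:
        return [f"seen:{instance.data}"]


model = EchoModel()
single = model(Instance(1))
many = model([Instance(1), Instance(2)])
print(single)
print(many)
```

The nesting mirrors the return shapes listed above: a flat list for one instance, a list of lists for a collection.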
2. TrainableModel: Models That Learn
What is a TrainableModel?
A TrainableModel extends the base Model interface with a fit() method for training. This separates models that need training (neural networks, sklearn models) from models that don't (rule-based systems, lookup tables).
Source: ainxt/models/model.py
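The split between trainable and non-trainable models can be illustrated with a small sketch. The classes below are simplified stand-ins (only `predict` and `fit`, plain dicts instead of Prediction objects), and `prepare` is a hypothetical helper; none of this is taken from the aiNXT source.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Model(ABC):
    """Stand-in for the Model interface (predict only, for brevity)."""

    @abstractmethod
    def predict(self, instance: Sequence[float]) -> List[dict]: ...


class TrainableModel(Model):
    """Stand-in for the TrainableModel interface."""

    @abstractmethod
    def fit(self, dataset: Sequence[Sequence[float]]) -> "TrainableModel": ...


class ThresholdRule(Model):
    """Rule-based model: nothing to learn, so it has no fit()."""

    def predict(self, instance):
        label = "high" if sum(instance) > 1.0 else "low"
        return [{label: 1.0}]


class MeanThreshold(TrainableModel):
    """Learns its threshold as the mean row sum of the training data."""

    def __init__(self):
        self.threshold = 0.0

    def fit(self, dataset):
        self.threshold = sum(sum(row) for row in dataset) / len(dataset)
        return self

    def predict(self, instance):
        label = "high" if sum(instance) > self.threshold else "low"
        return [{label: 1.0}]


def prepare(model: Model, dataset) -> Model:
    """Train only the models that support training."""
    if isinstance(model, TrainableModel):
        model.fit(dataset)
    return model


data = [[0.2, 0.3], [1.0, 2.0]]
rule = prepare(ThresholdRule(), data)      # used as-is
learned = prepare(MeanThreshold(), data)   # fitted first
```

Code that orchestrates training can check the type once and skip `fit()` for models that have nothing to learn.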
The TrainableModel Interface
from ainxt.models import TrainableModel
from ainxt.data import Dataset


class MyTrainableModel(TrainableModel):
    """Trainable model interface."""

    def fit(self, dataset: Dataset, **kwargs):
        """Train the model on the dataset.

        Args:
            dataset: Dataset containing training instances
            **kwargs: Training parameters (epochs, learning_rate, etc.)

        Returns:
            self (for method chaining)
        """
        for instance in dataset:
            # Training logic here (train_step is a placeholder)
            loss = self.train_step(instance)
        return self  # Return self for chaining

    # Also implement predict, save, load from Model
    def predict(self, instance): ...
    def save(self, model_dir): ...
    def load(self, model_dir): ...
Training with Parsed Components
The real power of aiNXT shines when training with parsed configuration. Parameters like optimizer, loss_function, and callbacks can be automatically created from YAML:
from ainxt.data import Dataset
from ainxt.models import TrainableModel


class TensorFlowClassifier(TrainableModel):
    def fit(
        self,
        dataset: Dataset,
        optimizer=None,      # ← Parsed from config by OPTIMIZERS factory!
        loss_function=None,  # ← Parsed from config by LOSSES factory!
        callbacks=None,      # ← Parsed from config by CALLBACKS factory!
        epochs=10,
        batch_size=32
    ):
        """Train with TensorFlow.

        The optimizer, loss_function, and callbacks are already instantiated
        objects, not config dictionaries!
        """
        # Compile with parsed objects
        self.model.compile(optimizer=optimizer, loss=loss_function)

        # Train. Keras' train_on_batch() takes no callbacks argument, so use
        # fit(), which handles epochs, batching, and callbacks itself.
        # _to_arrays is a placeholder helper returning (features, targets).
        x, y = self._to_arrays(dataset)
        self.model.fit(
            x, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks
        )
        return self


# Configuration (YAML)
"""
training:
  epochs: 100
  batch_size: 32
  optimizer:          # ← Parsed by OPTIMIZERS factory
    name: adam
    learning_rate: 0.001
  loss_function:      # ← Parsed by LOSSES factory
    name: categorical_crossentropy
  callbacks:          # ← Parsed by CALLBACKS factory
    - name: early_stopping
      patience: 5
    - name: model_checkpoint
      save_best_only: true
"""

# The Context automatically parses these into objects before calling fit()
# No manual object creation needed!
See Parsers for more details on how this works.
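To make the parsing step more concrete, here is a minimal sketch of how a name-keyed factory could turn a config entry into an object before `fit()` is called. `Registry` and the toy `Adam` class are hypothetical stand-ins written for this example; the actual aiNXT factories and parsers may work differently.

```python
from typing import Any, Callable, Dict, Mapping


class Registry:
    """Hypothetical name-to-constructor registry (not the aiNXT factory)."""

    def __init__(self) -> None:
        self._constructors: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, constructor: Callable[..., Any]) -> None:
        self._constructors[name] = constructor

    def build(self, config: Mapping[str, Any]) -> Any:
        # "name" selects the constructor; remaining keys become keyword arguments.
        kwargs = {k: v for k, v in config.items() if k != "name"}
        return self._constructors[config["name"]](**kwargs)


class Adam:
    """Toy optimizer standing in for a real framework object."""

    def __init__(self, learning_rate: float = 0.01) -> None:
        self.learning_rate = learning_rate


OPTIMIZERS = Registry()
OPTIMIZERS.register("adam", Adam)

training_config = {
    "epochs": 100,
    "optimizer": {"name": "adam", "learning_rate": 0.001},
}

# Replace the config dictionary with the object it describes.
fit_kwargs = dict(training_config)
fit_kwargs["optimizer"] = OPTIMIZERS.build(training_config["optimizer"])
print(type(fit_kwargs["optimizer"]).__name__, fit_kwargs["optimizer"].learning_rate)
```

After this step, `fit_kwargs` could be passed as `model.fit(dataset, **fit_kwargs)`, which is why `fit()` can treat `optimizer` as a ready-made object.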
3. Prediction: The Output Format
What is a Prediction?
A Prediction object represents a single prediction made by a model. It contains:
- classification: Dictionary mapping labels to scores (for classification)
- label: The predicted label (convenience property)
- score: The prediction confidence/probability
- meta: Additional metadata about the prediction
Source: ainxt/models/prediction.py
Creating Predictions
from ainxt.models import Prediction

# Classification prediction with multiple class scores
prediction = Prediction(
    classification={
        "cat": 0.85,
        "dog": 0.10,
        "bird": 0.05
    },
    meta={"model_version": "v2.1", "predicted_label": "cat"}
)

# Access properties
print(prediction.label)           # "cat" (highest scoring class)
print(prediction.score)           # 0.85 (score of highest class)
print(prediction.classification)  # {"cat": 0.85, "dog": 0.10, "bird": 0.05}
print(prediction.meta)            # {"model_version": "v2.1", ...}

# Simple binary prediction
prediction = Prediction(
    classification={"positive": 0.92, "negative": 0.08}
)
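The `label` and `score` properties are described above as the highest-scoring class and its score. A minimal sketch of how such properties could be derived from the classification dictionary follows; `SketchPrediction` is an illustrative stand-in, not the class in ainxt/models/prediction.py.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchPrediction:
    """Illustrative stand-in for a Prediction-like object."""
    classification: Dict[str, float]
    meta: Dict[str, Any] = field(default_factory=dict)

    @property
    def label(self) -> str:
        # Label with the highest score
        return max(self.classification, key=self.classification.get)

    @property
    def score(self) -> float:
        # Score belonging to that label
        return self.classification[self.label]


p = SketchPrediction({"cat": 0.85, "dog": 0.10, "bird": 0.05})
print(p.label, p.score)
```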
Why Prediction Objects?
- Standardization: All models return the same format
- Rich information: Not just a label, but scores for all classes
- Metadata support: Store model version, confidence, etc.
- Evaluation-ready: Direct input to metrics functions
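As an illustration of the "evaluation-ready" point, the sketch below computes accuracy directly from the nested list that a model returns for a dataset. It is an assumption-laden example: plain dictionaries stand in for each Prediction's classification scores, and `top_label` and `accuracy` are hypothetical helpers, not aiNXT metrics.

```python
from typing import Dict, Sequence


def top_label(classification: Dict[str, float]) -> str:
    """Label with the highest score."""
    return max(classification, key=classification.get)


def accuracy(
    predictions: Sequence[Sequence[Dict[str, float]]],
    true_labels: Sequence[str],
) -> float:
    """Fraction of instances whose first prediction matches the true label."""
    correct = 0
    for per_instance, truth in zip(predictions, true_labels):
        # An instance with no predictions counts as incorrect.
        if per_instance and top_label(per_instance[0]) == truth:
            correct += 1
    return correct / len(true_labels)


# One inner list per instance, as returned by model(dataset)
preds = [
    [{"1": 0.7, "2": 0.2, "3": 0.1}],
    [{"1": 0.1, "2": 0.3, "3": 0.6}],
    [{"1": 0.5, "2": 0.4, "3": 0.1}],
    [],
]
truth = ["1", "3", "2", "1"]
print(accuracy(preds, truth))
```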
Multiple Predictions per Instance
Some tasks return multiple predictions per instance:
# Object detection: Multiple bounding box predictions
def predict(self, instance: Instance) -> Sequence[Prediction]:
    """Detect all objects in image."""
    detections = self.detector(instance.data)

    predictions = []
    for box in detections:
        predictions.append(Prediction(
            classification={"object": box.confidence},
            meta={
                "bbox": box.coordinates,
                "predicted_label": box.class_name
            }
        ))
    return predictions  # Multiple predictions for one instance!


# Named Entity Recognition: One prediction per entity
def predict(self, instance: Instance) -> Sequence[Prediction]:
    """Find all named entities in text."""
    entities = self.ner_model(instance.data)

    predictions = []
    for entity in entities:
        predictions.append(Prediction(
            classification={entity.type: entity.confidence},
            meta={
                "text": entity.text,
                "start": entity.start_pos,
                "end": entity.end_pos
            }
        ))
    return predictions
Framework Integration Examples
PyTorch Model
from typing import Sequence

import torch
import torch.nn as nn

from ainxt.data import Dataset, Instance
from ainxt.models import Prediction, TrainableModel


class PyTorchClassifier(TrainableModel):
    """PyTorch neural network classifier."""

    def __init__(self, input_size: int, num_classes: int):
        self.model = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes)
        )
        self.num_classes = num_classes

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make prediction using PyTorch model."""
        # Convert to tensor
        tensor = torch.tensor(instance.data, dtype=torch.float32)

        # Forward pass in eval mode so that Dropout is disabled
        self.model.eval()
        with torch.no_grad():
            logits = self.model(tensor)
            probs = torch.softmax(logits, dim=0)

        # Create prediction
        classification = {
            f"class_{i}": float(probs[i])
            for i in range(self.num_classes)
        }
        return [Prediction(classification=classification)]

    def fit(self, dataset: Dataset, optimizer=None, epochs=10, **kwargs):
        """Train PyTorch model."""
        # optimizer is already instantiated by parser!
        loss_fn = nn.CrossEntropyLoss()
        self.model.train()  # Enable Dropout during training

        for epoch in range(epochs):
            for instance in dataset:
                # Convert to tensor (label read as in the other examples)
                x = torch.tensor(instance.data, dtype=torch.float32)
                y = torch.tensor(int(instance.annotation.label), dtype=torch.long)

                # Forward pass
                logits = self.model(x)
                loss = loss_fn(logits.unsqueeze(0), y.unsqueeze(0))

                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return self

    def save(self, model_dir: str):
        """Save PyTorch model."""
        torch.save(self.model.state_dict(), f"{model_dir}/model.pth")

    def load(self, model_dir: str):
        """Load PyTorch model."""
        self.model.load_state_dict(torch.load(f"{model_dir}/model.pth"))
Scikit-learn Model
from typing import Sequence

import joblib
from sklearn.ensemble import RandomForestClassifier

from ainxt.data import Dataset, Instance
from ainxt.models import Prediction, TrainableModel


class SKLearnClassifier(TrainableModel):
    """Random Forest classifier using scikit-learn."""

    def __init__(self, n_estimators: int = 100):
        self.model = RandomForestClassifier(n_estimators=n_estimators)
        self.classes_ = None

    def fit(self, dataset: Dataset, **kwargs):
        """Train sklearn model."""
        # Extract features and labels
        X = [inst.data for inst in dataset]
        y = [inst.annotation.label for inst in dataset]

        # Train
        self.model.fit(X, y)
        self.classes_ = self.model.classes_
        return self

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make prediction."""
        # Get probabilities for all classes
        probs = self.model.predict_proba([instance.data])[0]

        # Create classification dict
        classification = {
            str(cls): float(prob)
            for cls, prob in zip(self.classes_, probs)
        }
        return [Prediction(classification=classification)]

    def save(self, model_dir: str):
        """Save sklearn model."""
        joblib.dump(self.model, f"{model_dir}/model.joblib")
        joblib.dump(self.classes_, f"{model_dir}/classes.joblib")

    def load(self, model_dir: str):
        """Load sklearn model."""
        self.model = joblib.load(f"{model_dir}/model.joblib")
        self.classes_ = joblib.load(f"{model_dir}/classes.joblib")
Best Practices
1. Always Implement save() and load()
Models need to be deployable in offline environments:
# GOOD - complete implementation
import json

def save(self, model_dir: str):
    # Save weights
    torch.save(self.model.state_dict(), f"{model_dir}/weights.pth")
    # Save config
    with open(f"{model_dir}/config.json", "w") as f:
        json.dump(self.config, f)

def load(self, model_dir: str):
    # Load weights
    self.model.load_state_dict(torch.load(f"{model_dir}/weights.pth"))
    # Load config
    with open(f"{model_dir}/config.json", "r") as f:
        self.config = json.load(f)

# AVOID - empty implementation
def save(self, model_dir: str):
    pass  # Don't do this!
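A simple way to check that `save()` and `load()` are complete is a round-trip test: save a model, load it into a fresh instance, and compare state. The sketch below shows the idea with a toy `TinyModel` whose state is JSON-serializable; both the class and the `round_trip` helper are hypothetical and exist only for this example.

```python
import json
import tempfile
from pathlib import Path


class TinyModel:
    """Toy model with JSON-serializable state, for illustration only."""

    def __init__(self, labels=None, threshold=0.5):
        self.labels = list(labels or [])
        self.threshold = threshold

    def save(self, model_dir: str) -> None:
        path = Path(model_dir)
        path.mkdir(parents=True, exist_ok=True)
        state = {"labels": self.labels, "threshold": self.threshold}
        (path / "state.json").write_text(json.dumps(state))

    def load(self, model_dir: str) -> None:
        state = json.loads((Path(model_dir) / "state.json").read_text())
        self.labels = state["labels"]
        self.threshold = state["threshold"]


def round_trip(model: TinyModel) -> TinyModel:
    """Save to a temporary directory and load into a fresh instance."""
    with tempfile.TemporaryDirectory() as tmp:
        model.save(tmp)
        restored = TinyModel()
        restored.load(tmp)
    return restored


original = TinyModel(labels=["1", "2", "3"], threshold=0.25)
restored = round_trip(original)
print(restored.labels, restored.threshold)
```

If any attribute needed by `predict()` is missing after the round trip, `save()` is not yet storing everything the model needs.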
2. Return Predictions Consistently
Always return a list, even for single predictions:
# GOOD
def predict(self, instance):
    result = self.model(instance.data)
    return [Prediction(classification={...})]  # Always a list!

# AVOID
def predict(self, instance):
    result = self.model(instance.data)
    return Prediction(classification={...})  # Not a list!
3. Use Type Hints
# GOOD - clear types
from typing import Sequence
from ainxt.data import Instance
from ainxt.models import Prediction

def predict(self, instance: Instance) -> Sequence[Prediction]:
    ...

# AVOID - no type information
def predict(self, instance):
    ...
4. Make fit() Return self
This enables method chaining:
# GOOD - return self
def fit(self, dataset: Dataset, **kwargs):
    # Training logic
    return self

# Usage with chaining
model = MyModel().fit(train_data).fit(additional_data)

# AVOID - return None
def fit(self, dataset: Dataset, **kwargs):
    # Training logic
    pass  # Returns None implicitly
Registration with Factory
Register models for configuration-based creation:
from ainxt.serving import MODELS

# Manual registration
MODELS.register(
    task="classification",
    name="logistic_regression",
    constructor=LogisticRegressionModel
)

# Configuration-based creation
config = {
    "name": "logistic_regression",
    "labels": ["1", "2", "3"]
}
model = MODELS.build(**config)
Or use Loaders for automatic discovery (see Loaders).
Summary
Models and Predictions provide standardized ML interfaces:
- Model: Makes predictions on instances
  - predict(): Instance → Sequence[Prediction]
  - save(): Serialize to disk
  - load(): Restore from disk
  - Callable interface for flexibility
- TrainableModel: Extends Model with training
  - fit(): Train on Dataset
  - Works with parsed components (optimizer, loss, etc.)
  - Returns self for method chaining
- Prediction: Standardized output format
  - Contains classification scores
  - Includes metadata
  - Ready for evaluation metrics
This architecture enables:
- Framework-agnostic model development
- Configuration-driven training
- Standardized evaluation
- Easy deployment and serving
See Also
- Datasets - Input data for models
- Metrics & Evaluations - Evaluate predictions
- Parsers - Automatic creation of training components
- Factory - Create models from configuration