Core Abstractions
Introduction
aiNXT provides a set of abstract base classes that define standardized interfaces for data handling and model predictions. These abstractions ensure consistency across different ML applications while maintaining type safety and flexibility.
Data Abstractions
The data layer consists of three interconnected classes that represent the fundamental building blocks of ML data handling:
graph LR
A[Annotation] --> I[Instance]
I --> D[Dataset]
style A fill:#FF6B35
style I fill:#FF6B35
style D fill:#0F596E,color:#fff
Annotation
Purpose: Represents labels and metadata for data points.
An Annotation stores ground truth information about your data:
- Labels: Classification labels (single or multiple)
- Text: Free-form text annotations (for captioning, translation, etc.)
- Image: Image-based annotations (numpy arrays for segmentation masks)
- Span: Partial annotations indicating which part of an instance the annotation applies to
- Meta: Additional metadata dictionary
Example Usage:
from ainxt.data import Annotation
# Single label classification
annotation = Annotation(labels="positive")
# Multiple labels
annotation = Annotation(labels=["happy", "excited"])
# Text annotation with span (for NER, span classification)
annotation = Annotation(
    text="Great product",
    span=(0, 12),
    labels="positive"
)
# Image segmentation mask
import numpy as np
mask = np.zeros((100, 100))
annotation = Annotation(image=mask, labels="background")
# With metadata
annotation = Annotation(
    labels="1",
    meta={"1": "Class A", "2": "Class B"}  # Label mapping
)
Key Properties:
- label: Returns the single label (raises an error if there are multiple labels or none)
- labels: Returns the set of all labels
- whole: Boolean indicating whether the annotation applies to the entire instance
- slice: Integer tuple (start, end) from span, for indexing
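For example (a minimal sketch of the accessors described above; it assumes whole is False whenever a span is set):
from ainxt.data import Annotation

annotation = Annotation(text="Great product", span=(0, 12), labels="positive")

print(annotation.label)   # "positive"
print(annotation.labels)  # {"positive"}
print(annotation.whole)   # False, because a span is set
print(annotation.slice)   # (0, 12), e.g. for indexing text[0:12]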
Why Annotation?
- Standardized Format: Consistent structure across text, image, and span annotations
- Type Safety: Automatic conversion of labels to sets, proper numpy array handling
- Flexibility: Supports both full-instance and partial (span) annotations
- Hashable: Can be used in sets and as dictionary keys
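Because annotations are hashable, they work with ordinary Python sets and dicts. A small sketch (assuming equal annotations hash identically):
from ainxt.data import Annotation

seen = {Annotation(labels="positive"), Annotation(labels="positive")}
print(len(seen))  # 1 if equal annotations hash identically

label_counts = {Annotation(labels="negative"): 3}  # usable as a dict key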
Instance
Purpose: Combines raw data with its annotations to represent a single training/inference example.
An Instance is a generic container that holds:
- Data: The actual content (text, image, features, etc.)
- Annotations: Ground truth labels/annotations
- Meta: Mutable metadata dictionary
- Hash: Deterministic identifier for the instance
Abstract Interface:
from abc import ABC, abstractmethod
from collections.abc import MutableMapping, Sequence
from typing import Any

from ainxt.data import Annotation

class Instance[R](ABC):
    @property
    @abstractmethod
    def data(self) -> R:
        """Returns the raw data"""
        pass

    @property
    @abstractmethod
    def annotations(self) -> Sequence[Annotation]:
        """Returns ground-truth annotations"""
        pass

    @property
    @abstractmethod
    def meta(self) -> MutableMapping[str, Any]:
        """Returns metadata dictionary"""
        pass

    @property
    @abstractmethod
    def hash(self) -> int:
        """Returns deterministic hash value"""
        pass
Using RawInstance (Concrete Implementation):
aiNXT provides RawInstance as a ready-to-use implementation:
from ainxt.data import RawInstance, Annotation
# Create instance with raw data
annotation = Annotation(labels="positive")
instance = RawInstance(
    data="This is great!",
    annotations=[annotation],
    meta={"source": "customer_review"}
)
# Access properties
print(instance.data) # "This is great!"
print(instance.label) # "positive"
print(instance.meta) # {"source": "customer_review"}
print(instance.hash) # Deterministic integer hash
Creating Custom Instances:
For domain-specific needs, inherit from Instance or use RawMixin:
import numpy as np

from ainxt.data import Instance, RawMixin

# Option 1: Using RawMixin (provides all implementations)
class TextInstance(RawMixin[str], Instance[str]):
    pass

# Option 2: Custom implementation
class ImageInstance(Instance[np.ndarray]):
    def __init__(self, image_path: str):
        self._image = load_image(image_path)                 # your image loader
        self._annotations = extract_annotations(image_path)  # your annotation parser
        self._meta = {}

    @property
    def data(self) -> np.ndarray:
        return self._image

    # ... implement the other abstract methods (annotations, meta, hash)
Key Features:
- Generic Typing: Instance[R] where R is your data type (str, np.ndarray, etc.)
- Convenience Properties:
  - annotation: Returns the single top-level annotation (or None)
  - label: Returns the single label from the top-level annotation
  - labels: Returns all labels from the top-level annotation
- Hashable and Comparable: Can be used in sets, dictionaries, and equality checks
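A short sketch of these convenience properties (it assumes meta is optional on RawInstance):
from ainxt.data import RawInstance, Annotation

instance = RawInstance(
    data="This is great!",
    annotations=[Annotation(labels="positive")]
)
print(instance.annotation)  # the single top-level Annotation
print(instance.label)       # "positive"
print(instance.labels)      # {"positive"}

empty = RawInstance(data="unlabeled", annotations=[])
print(empty.annotation)     # None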
Dataset
Purpose: A collection of instances with standardized iteration, sizing, and persistence capabilities.
A Dataset is an iterable, sized collection of instances that can be looped over multiple times:
Abstract Interface:
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator, Sized

class Dataset[X](Sized, Iterable[X], ABC):
    @abstractmethod
    def __len__(self) -> int:
        """Returns the number of instances"""
        pass

    @abstractmethod
    def __iter__(self) -> Iterator[X]:
        """Returns an iterator over instances"""
        pass
Creating Custom Datasets:
from ainxt.data import Dataset, RawInstance, Annotation

class TextClassificationDataset(Dataset):
    def __init__(self, texts: list[str], labels: list[str]):
        self.instances = [
            RawInstance(
                data=text,
                annotations=[Annotation(labels=label)]
            )
            for text, label in zip(texts, labels)
        ]

    def __len__(self):
        return len(self.instances)

    def __iter__(self):
        return iter(self.instances)
# Usage
dataset = TextClassificationDataset(
    texts=["Great!", "Terrible"],
    labels=["positive", "negative"]
)

# Standard Python iteration; datasets can be iterated multiple times
for instance in dataset:
    print(f"{instance.data} -> {instance.label}")

print(f"Dataset size: {len(dataset)}")
Built-in Methods:
# Materialize all instances into memory
instances = dataset.collect() # Returns Sequence[Instance]
# Save to JSON
from ainxt.serving.serialization import ainxtJSONEncoder
dataset.save("data.json", ainxtJSONEncoder)
Key Features:
- Reusable Iteration: Unlike iterators, datasets can be looped over multiple times
- Type Safety: Generic Dataset[X] where X is your Instance type
- Lazy or Eager: Can load data on-the-fly or materialize all at once
- Serialization: Built-in JSON save support with custom encoder
Example: Lazy Loading Dataset:
class LazyCSVDataset(Dataset):
    def __init__(self, csv_path: str):
        self.csv_path = csv_path
        self._length = count_lines(csv_path)  # helper assumed to count data lines

    def __len__(self):
        return self._length

    def __iter__(self):
        # Read the file fresh on each iteration
        with open(self.csv_path) as f:
            for line in f:
                text, label = line.strip().split(',')
                yield RawInstance(
                    data=text,
                    annotations=[Annotation(labels=label)]
                )
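Usage is identical to an in-memory dataset; because __iter__ re-opens the file, repeated passes are safe. A sketch (assuming a reviews.csv of text,label rows and the count_lines helper above):
dataset = LazyCSVDataset("reviews.csv")

for epoch in range(3):
    # Each pass re-reads the file from the start
    for instance in dataset:
        print(instance.data, "->", instance.label)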
Model Abstractions
The model layer defines interfaces for making predictions and training models:
graph TB
M[Model] --> TM[TrainableModel]
TM -.predicts.-> P[Prediction]
D[Dataset] -.feeds.-> TM
style M fill:#FF6B35
style TM fill:#0F596E,color:#fff
style P fill:#FF6B35
Model
Purpose: Base class defining the prediction interface.
All models must implement:
from abc import ABC, abstractmethod
from collections.abc import Sequence
from os import PathLike

from ainxt.data import Instance
from ainxt.models import Prediction

class Model(ABC):
    @abstractmethod
    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make predictions for a single instance"""
        pass

    @abstractmethod
    def save(self, model_dir: PathLike):
        """Save model files to directory"""
        pass

    @abstractmethod
    def load(self, model_dir: PathLike):
        """Load model files from directory"""
        pass
Callable Interface:
Models are callable and support both single instances and batches:
# Single instance prediction
predictions = model(instance)
# Batch prediction
all_predictions = model(dataset)
# Batch with batch_size
all_predictions = model(dataset, batch_size=32)
# Also accepts raw data (if model supports it)
predictions = model("This is text input")
TrainableModel
Purpose: Extends Model with training capability.
from collections.abc import Sequence
from os import PathLike

from ainxt.data import Dataset, Instance
from ainxt.models import Prediction, TrainableModel

class MyClassifier(TrainableModel):
    def fit(self, dataset: Dataset):
        """Training logic"""
        for instance in dataset:
            # Training code here
            pass

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Inference logic"""
        # Prediction code (placeholder scores)
        return [Prediction(classification={"positive": 1.0})]

    def save(self, model_dir: PathLike):
        # Save model weights
        pass

    def load(self, model_dir: PathLike):
        # Load model weights
        pass
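A typical round trip with this class might look as follows (a sketch; dataset and instance are the objects built in the Dataset examples above):
from pathlib import Path

model = MyClassifier()
model.fit(dataset)                     # train on a Dataset

predictions = model.predict(instance)  # Sequence[Prediction] for one Instance

model.save(Path("models/my_classifier"))   # persist to a directory
restored = MyClassifier()
restored.load(Path("models/my_classifier"))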
Additional Mixins:
- EvaluatableMixin: Adds an evaluate(dataset) method
- TrainableEvaluatableModel: Combines both training and evaluation
Prediction
Purpose: Standardized representation of model outputs.
A Prediction can contain:
- Classification: Dictionary of label → score mappings
- Embedding: Vector representation
- Text: Generated text (translation, captioning, etc.)
- Image: Generated image (numpy array)
- Span: Position information (start, end)
- Meta: Additional metadata
Example Usage:
from ainxt.models import Prediction
# Classification prediction
pred = Prediction(classification={"positive": 0.8, "negative": 0.2})
print(pred.label) # "positive"
print(pred.confidence) # 0.8
# Multi-modal prediction
pred = Prediction(
    classification={"cat": 0.9, "dog": 0.1},
    embedding=[0.1, -0.5, 0.8, ...],
    span=(10, 25),
    meta={"model_version": "v2"}
)
# Comparing predictions with tolerance
pred1 = Prediction(classification={"A": 0.6, "B": 0.4})
pred2 = Prediction(classification={"A": 0.61, "B": 0.39})
assert pred1.is_close(pred2, epsilon=0.02) # True
Key Properties:
- label: Highest scoring classification label
- confidence: Score of the highest scoring label
- whole: Whether the prediction applies to the entire input
- slice: Integer tuple (start, end) from span
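Span-level predictions mirror span annotations. A sketch (assuming whole is False whenever a span is set):
pred = Prediction(classification={"PERSON": 0.95}, span=(10, 25))

print(pred.label)       # "PERSON"
print(pred.confidence)  # 0.95
print(pred.whole)       # False, because a span is set
print(pred.slice)       # (10, 25)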
Type Safety and Generics
aiNXT uses Python generics for type safety:
from collections.abc import Iterator, Sequence

# Dataset of a specific instance type
class MyDataset(Dataset[TextInstance]):
    def __iter__(self) -> Iterator[TextInstance]:
        # IDE knows instances are TextInstance
        ...

# Model operating on a specific instance type
class MyModel(TrainableModel[TextInstance]):
    def predict(self, instance: TextInstance) -> Sequence[Prediction]:
        # Type-safe: IDE knows instance is TextInstance
        ...
Summary
| Abstraction | Purpose | Key Methods |
|---|---|---|
| Annotation | Store labels and metadata | label, labels, whole, slice |
| Instance | Combine data with annotations | data, annotations, meta, hash |
| Dataset | Collection of instances | __len__, __iter__, collect, save |
| Model | Make predictions | predict, save, load, __call__ |
| TrainableModel | Train and predict | fit, predict, save, load |
| Prediction | Model output representation | label, confidence, is_close |
These abstractions form the foundation upon which all aiNXT-based ML applications are built. They ensure consistency, type safety, and reusability across different projects and domains.
Next Steps
- Factory System - How to register and create these objects from configuration
- Training Pipeline - Using these abstractions in the train script
- Evaluation Pipeline - Evaluating models with standardized metrics