Core Abstractions
Introduction
aiNXT provides a set of abstract base classes that define standardized interfaces for data handling and model predictions. These abstractions ensure consistency across different ML applications while maintaining type safety and flexibility.
Data Abstractions
The data layer consists of three interconnected classes that represent the fundamental building blocks of ML data handling:
graph LR
A[Annotation] --> I[Instance]
I --> D[Dataset]
style A fill:#FF6B35
style I fill:#FF6B35
style D fill:#0F596E,color:#fff
Annotation
Purpose: Represents labels and metadata for data points.
An Annotation stores ground truth information about your data:
- Labels: Classification labels (single or multiple)
- Text: Free-form text annotations (for captioning, translation, etc.)
- Image: Image-based annotations (numpy arrays for segmentation masks)
- Span: Partial annotations indicating which part of an instance the annotation applies to
- Meta: Additional metadata dictionary
Example Usage:
from ainxt.data import Annotation
# Single label classification
annotation = Annotation(labels="positive")
# Multiple labels
annotation = Annotation(labels=["happy", "excited"])
# Text annotation with span (for NER, span classification)
annotation = Annotation(
    text="Great product",
    span=(0, 12),
    labels="positive"
)
# Image segmentation mask
import numpy as np
mask = np.zeros((100, 100))
annotation = Annotation(image=mask, labels="background")
# With metadata
annotation = Annotation(
    labels="1",
    meta={"1": "Class A", "2": "Class B"}  # Label mapping
)
Key Properties:
- label: Returns the single label (raises an error if there are multiple labels or none)
- labels: Returns the set of all labels
- whole: Boolean indicating whether the annotation applies to the entire instance
- slice: Integer tuple (start, end) from span, for indexing
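For example (a minimal sketch of the accessors described above; it assumes whole is False whenever a span is set):
from ainxt.data import Annotation

annotation = Annotation(text="Great product", span=(0, 12), labels="positive")

print(annotation.label)   # "positive"
print(annotation.labels)  # {"positive"}
print(annotation.whole)   # False, because a span is set
print(annotation.slice)   # (0, 12), e.g. for indexing text[0:12]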
Why Annotation?
- Standardized Format: Consistent structure across text, image, and span annotations
- Type Safety: Automatic conversion of labels to sets, proper numpy array handling
- Flexibility: Supports both full-instance and partial (span) annotations
- Hashable: Can be used in sets and as dictionary keys
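Because annotations are hashable, they work with ordinary Python sets and dicts. A small sketch (assuming equal annotations hash identically):
from ainxt.data import Annotation

seen = {Annotation(labels="positive"), Annotation(labels="positive")}
print(len(seen))  # 1 if equal annotations hash identically

label_counts = {Annotation(labels="negative"): 3}  # usable as a dict key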
Instance
Purpose: Combines raw data with its annotations to represent a single training/inference example.
An Instance is a generic container that holds:
- Data: The actual content (text, image, features, etc.)
- Annotations: Ground truth labels/annotations
- Meta: Mutable metadata dictionary
- Hash: Deterministic identifier for the instance
Abstract Interface:
from abc import ABC, abstractmethod
from collections.abc import MutableMapping, Sequence
from typing import Any

from ainxt.data import Annotation

class Instance[R](ABC):
    @property
    @abstractmethod
    def data(self) -> R:
        """Returns the raw data"""
        pass

    @property
    @abstractmethod
    def annotations(self) -> Sequence[Annotation]:
        """Returns ground-truth annotations"""
        pass

    @property
    @abstractmethod
    def meta(self) -> MutableMapping[str, Any]:
        """Returns metadata dictionary"""
        pass

    @property
    @abstractmethod
    def hash(self) -> int:
        """Returns deterministic hash value"""
        pass
Using RawInstance (Concrete Implementation):
aiNXT provides RawInstance as a ready-to-use implementation:
from ainxt.data import RawInstance, Annotation
# Create instance with raw data
annotation = Annotation(labels="positive")
instance = RawInstance(
    data="This is great!",
    annotations=[annotation],
    meta={"source": "customer_review"}
)
# Access properties
print(instance.data) # "This is great!"
print(instance.label) # "positive"
print(instance.meta) # {"source": "customer_review"}
print(instance.hash) # Deterministic integer hash
Creating Custom Instances:
For domain-specific needs, inherit from Instance or use RawMixin:
import numpy as np

from ainxt.data import Instance, RawMixin

# Option 1: Using RawMixin (provides all implementations)
class TextInstance(RawMixin[str], Instance[str]):
    pass

# Option 2: Custom implementation
class ImageInstance(Instance[np.ndarray]):
    def __init__(self, image_path: str):
        self._image = load_image(image_path)                 # your image loader
        self._annotations = extract_annotations(image_path)  # your annotation parser
        self._meta = {}

    @property
    def data(self) -> np.ndarray:
        return self._image

    # ... implement the other abstract methods (annotations, meta, hash)
Key Features:
- Generic Typing: Instance[R] where R is your data type (str, np.ndarray, etc.)
- Convenience Properties:
  - annotation: Returns the single top-level annotation (or None)
  - label: Returns the single label from the top-level annotation
  - labels: Returns all labels from the top-level annotation
- Hashable and Comparable: Can be used in sets, dictionaries, and equality checks
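A short sketch of these convenience properties (it assumes meta is optional on RawInstance):
from ainxt.data import RawInstance, Annotation

instance = RawInstance(
    data="This is great!",
    annotations=[Annotation(labels="positive")]
)
print(instance.annotation)  # the single top-level Annotation
print(instance.label)       # "positive"
print(instance.labels)      # {"positive"}

empty = RawInstance(data="unlabeled", annotations=[])
print(empty.annotation)     # None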
Dataset
Purpose: A collection of instances with standardized iteration, sizing, and persistence capabilities.
A Dataset is an iterable, sized collection of instances that can be looped over multiple times:
Abstract Interface:
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator, Sized

class Dataset[X](Sized, Iterable[X], ABC):
    @abstractmethod
    def __len__(self) -> int:
        """Returns the number of instances"""
        pass

    @abstractmethod
    def __iter__(self) -> Iterator[X]:
        """Returns an iterator over instances"""
        pass
Creating Custom Datasets:
from ainxt.data import Dataset, RawInstance, Annotation

class TextClassificationDataset(Dataset):
    def __init__(self, texts: list[str], labels: list[str]):
        self.instances = [
            RawInstance(
                data=text,
                annotations=[Annotation(labels=label)]
            )
            for text, label in zip(texts, labels)
        ]

    def __len__(self):
        return len(self.instances)

    def __iter__(self):
        return iter(self.instances)
# Usage
dataset = TextClassificationDataset(
    texts=["Great!", "Terrible"],
    labels=["positive", "negative"]
)

# Standard Python iteration; datasets can be iterated multiple times
for instance in dataset:
    print(f"{instance.data} -> {instance.label}")

print(f"Dataset size: {len(dataset)}")
Built-in Methods:
# Materialize all instances into memory
instances = dataset.collect() # Returns Sequence[Instance]
# Save to JSON
from ainxt.serving.serialization import ainxtJSONEncoder
dataset.save("data.json", ainxtJSONEncoder)
Key Features:
- Reusable Iteration: Unlike iterators, datasets can be looped over multiple times
- Type Safety: Generic Dataset[X] where X is your Instance type
- Lazy or Eager: Can load data on-the-fly or materialize all at once
- Serialization: Built-in JSON save support with custom encoder
Example: Lazy Loading Dataset:
class LazyCSVDataset(Dataset):
    def __init__(self, csv_path: str):
        self.csv_path = csv_path
        self._length = count_lines(csv_path)  # helper assumed to count data lines

    def __len__(self):
        return self._length

    def __iter__(self):
        # Read the file fresh on each iteration
        with open(self.csv_path) as f:
            for line in f:
                text, label = line.strip().split(',')
                yield RawInstance(
                    data=text,
                    annotations=[Annotation(labels=label)]
                )
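Usage is identical to an in-memory dataset; because __iter__ re-opens the file, repeated passes are safe. A sketch (assuming a reviews.csv of text,label rows and the count_lines helper above):
dataset = LazyCSVDataset("reviews.csv")

for epoch in range(3):
    # Each pass re-reads the file from the start
    for instance in dataset:
        print(instance.data, "->", instance.label)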
Model Abstractions
The model layer defines interfaces for making predictions and training models:
graph TB
M[Model] --> TM[TrainableModel]
TM -.predicts.-> P[Prediction]
D[Dataset] -.feeds.-> TM
style M fill:#FF6B35
style TM fill:#0F596E,color:#fff
style P fill:#FF6B35
Model
Purpose: Base class defining the prediction interface.
All models must implement:
from abc import ABC, abstractmethod
from collections.abc import Sequence
from os import PathLike

from ainxt.data import Instance
from ainxt.models import Prediction

class Model(ABC):
    @abstractmethod
    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Make predictions for a single instance"""
        pass

    @abstractmethod
    def save(self, model_dir: PathLike):
        """Save model files to directory"""
        pass

    @abstractmethod
    def load(self, model_dir: PathLike):
        """Load model files from directory"""
        pass
Callable Interface:
Models are callable and support both single instances and batches:
# Single instance prediction
predictions = model(instance)
# Batch prediction
all_predictions = model(dataset)
# Batch with batch_size
all_predictions = model(dataset, batch_size=32)
# Also accepts raw data (if model supports it)
predictions = model("This is text input")
TrainableModel
Purpose: Extends Model with training capability.
from collections.abc import Sequence
from os import PathLike

from ainxt.data import Dataset, Instance
from ainxt.models import Prediction, TrainableModel

class MyClassifier(TrainableModel):
    def fit(self, dataset: Dataset):
        """Training logic"""
        for instance in dataset:
            # Training code here
            pass

    def predict(self, instance: Instance) -> Sequence[Prediction]:
        """Inference logic"""
        # Prediction code (placeholder scores)
        return [Prediction(classification={"positive": 1.0})]

    def save(self, model_dir: PathLike):
        # Save model weights
        pass

    def load(self, model_dir: PathLike):
        # Load model weights
        pass
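A typical round trip with this class might look as follows (a sketch; dataset and instance are the objects built in the Dataset examples above):
from pathlib import Path

model = MyClassifier()
model.fit(dataset)                     # train on a Dataset

predictions = model.predict(instance)  # Sequence[Prediction] for one Instance

model.save(Path("models/my_classifier"))   # persist to a directory
restored = MyClassifier()
restored.load(Path("models/my_classifier"))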
Additional Mixins:
- EvaluatableMixin: Adds an evaluate(dataset) method
- TrainableEvaluatableModel: Combines both training and evaluation
Prediction
Purpose: Standardized representation of model outputs.
A Prediction can contain:
- Classification: Dictionary of label → score mappings
- Embedding: Vector representation
- Text: Generated text (translation, captioning, etc.)
- Image: Generated image (numpy array)
- Span: Position information (start, end)
- Meta: Additional metadata
Example Usage:
from ainxt.models import Prediction
# Classification prediction
pred = Prediction(classification={"positive": 0.8, "negative": 0.2})
print(pred.label) # "positive"
print(pred.confidence) # 0.8
# Multi-modal prediction
pred = Prediction(
    classification={"cat": 0.9, "dog": 0.1},
    embedding=[0.1, -0.5, 0.8, ...],
    span=(10, 25),
    meta={"model_version": "v2"}
)
# Comparing predictions with tolerance
pred1 = Prediction(classification={"A": 0.6, "B": 0.4})
pred2 = Prediction(classification={"A": 0.61, "B": 0.39})
assert pred1.is_close(pred2, epsilon=0.02) # True
Key Properties:
- label: Highest scoring classification label
- confidence: Score of the highest scoring label
- whole: Whether the prediction applies to the entire input
- slice: Integer tuple (start, end) from span
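Span-level predictions mirror span annotations. A sketch (assuming whole is False whenever a span is set):
pred = Prediction(classification={"PERSON": 0.95}, span=(10, 25))

print(pred.label)       # "PERSON"
print(pred.confidence)  # 0.95
print(pred.whole)       # False, because a span is set
print(pred.slice)       # (10, 25)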
Type Safety and Generics
aiNXT uses Python generics for type safety:
from collections.abc import Iterator, Sequence

# Dataset of a specific instance type
class MyDataset(Dataset[TextInstance]):
    def __iter__(self) -> Iterator[TextInstance]:
        # IDE knows instances are TextInstance
        ...

# Model operating on a specific instance type
class MyModel(TrainableModel[TextInstance]):
    def predict(self, instance: TextInstance) -> Sequence[Prediction]:
        # Type-safe: IDE knows instance is TextInstance
        ...
Summary
| Abstraction | Purpose | Key Methods |
|---|---|---|
| Annotation | Store labels and metadata | label, labels, whole, slice |
| Instance | Combine data with annotations | data, annotations, meta, hash |
| Dataset | Collection of instances | __len__, __iter__, collect, save |
| Model | Make predictions | predict, save, load, __call__ |
| TrainableModel | Train and predict | fit, predict, save, load |
| Prediction | Model output representation | label, confidence, is_close |
These abstractions form the foundation upon which all aiNXT-based ML applications are built. They ensure consistency, type safety, and reusability across different projects and domains.
Next Steps
- Factory System - How to register and create these objects from configuration
- Training Pipeline - Using these abstractions in the train script
- Evaluation Pipeline - Evaluating models with standardized metrics