Core Concept: Datasets

Overview

Datasets are the foundation of aiNXT's data handling system. Every dataset in aiNXT consists of three key components that work together: Annotation, Instance, and Dataset. This architecture ensures standardized interaction with any data source.

Real-world analogy: Think of a library:

  • Annotation: The label on a book (title, author, category)
  • Instance: A complete book (pages + label)
  • Dataset: The entire library collection (all books organized together)

This structure, inspired by the notebook notebooks/data/SH_Annotation_Instance_Dataset.ipynb, provides a consistent interface for all machine learning data.


1. Annotation: Labels and Metadata

What is an Annotation?

An Annotation object specifies all labels and metadata that belong to a data point. While commonly thought of as just a "label," annotations can contain various types of information:

  • Labels: Classification categories, numeric values
  • Text: Sentences, descriptions, or other text representations
  • Image annotations: Bounding boxes, segmentation masks
  • Metadata: Additional information about the annotation

Source: ainxt/data/annotation.py

Creating an Annotation

from ainxt.data import Annotation

# Simple classification annotation
annotation = Annotation(
    labels="1",  # The actual label value
    meta={1: "Kama", 2: "Rosa", 3: "Canadian"}  # Label descriptions
)

# Access the label
print(annotation.label)  # "1"

# Access the metadata
print(annotation.meta)  # {1: "Kama", 2: "Rosa", 3: "Canadian"}

# Get human-readable description
print(annotation.meta[int(annotation.label)])  # "Kama"

Why Annotations?

  1. Separation of concerns: Keep labels separate from raw data
  2. Metadata support: Store label descriptions, confidence scores, etc.
  3. Type flexibility: Support classification, regression, multi-label, etc.
  4. Consistency: Standard interface across all datasets

Common Annotation Patterns

# Multi-class classification
annotation = Annotation(
    labels="cat",
    meta={"classes": ["cat", "dog", "bird"], "confidence": 0.95}
)

# Regression
annotation = Annotation(
    labels=42.5,
    meta={"unit": "celsius", "sensor_id": "temp_01"}
)

# Multi-label classification
annotation = Annotation(
    labels=["cat", "indoor", "sleeping"],
    meta={"annotator": "expert_1", "date": "2024-10-14"}
)

2. Instance: The Data Container

What is an Instance?

An Instance object represents a single data point in your dataset. It contains:

  • Raw data (data): The actual input (array, image, text, etc.)
  • Annotations (annotations): One or more Annotation objects
  • Metadata (meta): Information about the data point itself

Real-world analogy: An Instance is like a labeled photograph in your photo library:

  • Photo pixels = data
  • Tag/category = annotations
  • Date taken, camera model = meta

Source: ainxt/data/instance.py

Creating Instances

from ainxt.data import RawInstance, Annotation

# Example: Seeds dataset row
raw_data = [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]  # 7 features
label = "1"

# Create annotation
annotation = Annotation(
    labels=label,
    meta={1: "Kama", 2: "Rosa", 3: "Canadian"}
)

# Create metadata for the instance
column_names = {
    0: "Area",
    1: "Perimeter",
    2: "Compactness",
    3: "Length of kernel",
    4: "Width of kernel",
    5: "Asymmetry coefficient",
    6: "Length of kernel groove"
}

# Create instance
instance = RawInstance(
    data=raw_data,
    annotations=[annotation],
    meta=column_names
)

# Access instance properties
print(instance.data)        # [15.26, 14.84, ...]
print(instance.label)       # "1"
print(instance.annotation)  # First annotation
print(instance.meta)        # {0: "Area", 1: "Perimeter", ...}

Custom Instance Classes

You can create custom Instance classes for specific data types. This is useful when:

  • Data has a specific structure (images, audio, documents)
  • You need special parsing logic
  • You want domain-specific methods
  • You need custom equality or hashing

Example: Seeds-specific Instance

import hashlib
from ainxt.data import RawInstance, Annotation
from typing import Any, MutableMapping, Optional, Sequence

class SeedsInstance(RawInstance):
    """Instance class specifically for Seeds dataset."""

    COLUMNS = {
        0: "Area",
        1: "Perimeter",
        2: "Compactness",
        3: "Length of kernel",
        4: "Width of kernel",
        5: "Asymmetry coefficient",
        6: "Length of kernel groove"
    }

    CLASSES = {"1": "Kama", "2": "Rosa", "3": "Canadian"}

    @classmethod
    def create_raw(
        cls,
        data,
        annotations: Optional[Sequence[Annotation]] = None,
        meta: Optional[MutableMapping[str, Any]] = None
    ) -> "SeedsInstance":
        """Create SeedsInstance directly from raw row (including label).

        Args:
            data: Row with 8 values (7 features + 1 label)
            annotations: Optional annotations (auto-created if None)
            meta: Optional metadata (auto-set if None)

        Returns:
            SeedsInstance with parsed data and annotations
        """
        if annotations is None:
            # Extract label from last column
            label = str(int(data[7]))
            annotations = [Annotation(labels=label, meta=cls.CLASSES)]
            # Use only first 7 columns as features
            data = data[:7].copy()

        if meta is None:
            meta = cls.COLUMNS

        return cls(data, annotations, meta)

    @property
    def hash(self) -> str:
        """Generate deterministic hash for this instance."""
        string_representation = "".join(str(element) for element in self.data)
        hash_object = hashlib.sha256(string_representation.encode())
        return hash_object.hexdigest()

    def __eq__(self, other):
        """Custom equality based on hash and annotations."""
        return (
            isinstance(other, SeedsInstance)
            and self.hash == other.hash
            and self.annotations == other.annotations
            and self.meta == other.meta
        )


# Usage with __init__ (manual annotation)
instance1 = SeedsInstance(
    data=[15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22],
    annotations=[Annotation(labels="1", meta=SeedsInstance.CLASSES)],
    meta=SeedsInstance.COLUMNS
)

# Usage with create_raw (auto-extracts label from row)
raw_row = [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1]  # Label at end
instance2 = SeedsInstance.create_raw(data=raw_row)

# Both produce equivalent instances
print(instance1.label)  # "1"
print(instance2.label)  # "1"
print(instance2.hash)   # SHA256 hash of the data

The AcceptsRawInputMixin

The AcceptsRawInputMixin class provides automatic conversion from raw data to Instance objects:

from ainxt.data import AcceptsRawInputMixin, RawInstance

class MyInstance(AcceptsRawInputMixin, RawInstance):
    @classmethod
    def create_raw(cls, data):
        # Your custom parsing logic
        return cls(data=data, annotations=[...], meta={...})

# The mixin provides parse_input() which calls create_raw()
# This enables datasets to accept raw data directly

3. Dataset: The Collection

What is a Dataset?

A Dataset is an iterable, sized collection of Instance objects. It represents your entire dataset and provides standardized access to your data.

Source: ainxt/data/dataset.py

The Dataset Interface

Every dataset must implement two methods:

from ainxt.data import Dataset

class MyDataset(Dataset):
    def __len__(self) -> int:
        """Return the number of instances in the dataset."""
        raise NotImplementedError

    def __iter__(self):
        """Yield instances one by one."""
        raise NotImplementedError
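
Before the full example below, a minimal in-memory sketch can make the contract concrete. InMemoryDataset is a hypothetical name used only for illustration; it simply wraps a pre-built list of instances.

from typing import Iterator, Sequence
from ainxt.data import Dataset, RawInstance

class InMemoryDataset(Dataset):
    """Hypothetical minimal dataset that wraps an existing list of instances."""

    def __init__(self, instances: Sequence[RawInstance]):
        self._instances = list(instances)

    def __len__(self) -> int:
        # Size of the wrapped list
        return len(self._instances)

    def __iter__(self) -> Iterator[RawInstance]:
        # Yield the stored instances one by one
        yield from self._instances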

Complete Dataset Example: Seeds Dataset

Based on notebooks/data/SH_Annotation_Instance_Dataset.ipynb:

import csv
from pathlib import Path
from typing import Union, Sequence, Iterator, Optional
from ainxt.data import Dataset, AcceptsRawInputMixin
from ainxt.typing import PathLike

class SeedsDataset(AcceptsRawInputMixin, Dataset[SeedsInstance]):
    """Dataset class for the Seeds dataset.

    Supports three initialization methods:
    1. From file path (reads TSV file)
    2. From list of SeedsInstance objects
    3. From raw numpy array or list of lists
    """

    def __init__(
        self,
        input: Union[PathLike, Sequence[Sequence[float]], Sequence[SeedsInstance]],
        meta: Optional[str] = None
    ):
        self.meta = meta

        if isinstance(input, (str, Path)):
            # Load from file
            self.data = self.parse_seed_data(input)
        elif isinstance(input, Sequence):
            if len(input) > 0 and isinstance(input[0], SeedsInstance):
                # Already SeedsInstance objects
                self.data = list(input)
            else:
                # Raw data - parse each row
                self.data = [self.parse_input(row) for row in input]
        else:
            raise TypeError(
                "Expected PathLike, Sequence[float], or Sequence[SeedsInstance]"
            )

    def parse_seed_data(self, path: PathLike) -> list[SeedsInstance]:
        """Parse TSV file into SeedsInstance objects."""
        with open(path, "r", newline="") as datafile:
            csvreader = csv.reader(datafile, delimiter="\t")
            return [self.parse_input(row) for row in csvreader]

    def __iter__(self) -> Iterator[SeedsInstance]:
        """Iterate over all instances."""
        yield from self.data

    def __len__(self) -> int:
        """Return number of instances."""
        return len(self.data)

    # Optional: Additional utility methods
    def get_values(self, index: int):
        """Get feature values for a specific instance as DataFrame."""
        import pandas as pd
        instance = self.data[index]
        values = [instance.data]
        columns = list(instance.COLUMNS.values())
        return pd.DataFrame(values, columns=columns)

    def get_label_value(self, index: int) -> str:
        """Get human-readable label for a specific instance."""
        instance_annotation = self.data[index].annotation
        return instance_annotation.meta[instance_annotation.label]


# Usage 1: Initialize from file path
dataset = SeedsDataset("../files/seeds_dataset.txt")
print(len(dataset))  # 210
print(dataset.get_label_value(42))  # "Kama"

# Usage 2: Initialize from list of instances
subset = dataset.data[:5]
small_dataset = SeedsDataset(subset)
print(len(small_dataset))  # 5

# Usage 3: Initialize from raw numpy array
import pandas as pd
df = pd.read_csv("seeds_dataset.txt", sep="\t", header=None)
raw_array_dataset = SeedsDataset(df.to_numpy())
print(len(raw_array_dataset))  # 210

Iterating Over Datasets

Datasets are designed to be iterable:

# Iterate over all instances
for instance in dataset:
    print(instance.data)
    print(instance.label)

# Convert to list
all_instances = list(dataset)

# Get specific instance by index
first_instance = dataset.data[0]

# Check size
print(f"Dataset contains {len(dataset)} instances")

Dataset Splitting

aiNXT provides built-in dataset splitting functionality:

from ainxt.data.split import train_test_split_dataset

# Simple train/test split
train_dataset, test_dataset, _ = train_test_split_dataset(
    dataset=seeds_dataset,
    test_size=0.2,
    shuffle=True,
    random_state=42
)

print(f"Train: {len(train_dataset)} instances")  # 168
print(f"Test: {len(test_dataset)} instances")    # 42

# With validation set
train, test, val = train_test_split_dataset(
    dataset=seeds_dataset,
    test_size=0.2,
    validation_size=0.1,  # 10% of the full dataset reserved for validation
    shuffle=True,
    stratify=True,  # Maintain label distribution
    random_state=42
)

# Check label distributions
from ainxt.data.utils import get_label_distribution

print(get_label_distribution(train))  # {'1': 49, '2': 49, '3': 49}
print(get_label_distribution(test))   # {'1': 14, '2': 14, '3': 14}
print(get_label_distribution(val))    # {'1': 7, '2': 7, '3': 7}

Key insight: The split datasets are NEW Dataset objects of the same class, containing different subsets of instances. All functionality of the original dataset is preserved.
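
As a small illustrative check of that point (the printed values are examples only), the train split from above should still be a SeedsDataset and keep its helper methods:

# The split is still a SeedsDataset, so custom helpers keep working
print(type(train_dataset).__name__)      # "SeedsDataset"
print(train_dataset.get_label_value(0))  # e.g. "Rosa"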


Key Design Principles

1. Separation of Concerns

# Annotation: Just the label + metadata
annotation = Annotation(labels="cat", meta={...})

# Instance: Raw data + annotations + instance metadata
instance = RawInstance(data=image, annotations=[annotation], meta={...})

# Dataset: Collection of instances + dataset-level operations
dataset = MyDataset(path="/data/images")

2. Immutability

Datasets and instances should not be modified after creation. Use decorators for transformations:

# GOOD - use decorators for modifications
dataset = MyDataset("/data")
augmented = AugmentedDataset(dataset, augmenters=[...])
filtered = FilterMetaDataset(augmented, attribute="quality", values=[4, 5])

# AVOID - modifying datasets directly
dataset.data.append(new_instance)  # Don't do this!

3. Type Consistency

Use type hints to document what your dataset contains:

import numpy as np
from ainxt.data import Dataset, RawInstance

# Clear type information
class ImageDataset(Dataset[RawInstance[np.ndarray]]):
    """Dataset yielding numpy array images."""
    pass

class TextDataset(Dataset[RawInstance[str]]):
    """Dataset yielding text strings."""
    pass

Registration with Factory

Once created, datasets can be registered for configuration-based loading:

from ainxt.serving import DATASETS

# Manual registration
DATASETS.register(
    task="classification",
    name="seeds",
    constructor=SeedsDataset
)

# Use in configuration
config = {
    "name": "seeds",
    "input": "/data/seeds_dataset.txt"
}

dataset = DATASETS.build(**config)

Or use Loaders for automatic discovery (see Loaders).


Summary

The three-layer architecture provides standardization:

  1. Annotation: Labels and their metadata
     • Created once per data point
     • Separates labels from raw data
     • Supports multiple annotation types

  2. Instance: Data + Annotation + metadata
     • Represents a single training example
     • Consistent interface across datasets
     • Can be customized for specific domains

  3. Dataset: Collection of Instances
     • Iterable and sized
     • Supports multiple initialization methods
     • Can be split, filtered, and decorated

This architecture enables:

  • Consistent interaction with any data source
  • Easy integration with Factories and Loaders
  • Standardized training pipelines
  • Framework-agnostic data handling

See Also