Core Concept: Datasets
Overview
Datasets are the foundation of aiNXT's data handling system. Every dataset in aiNXT consists of three key components that work together: Annotation, Instance, and Dataset. This architecture ensures standardized interaction with any data source.
Real-world analogy: Think of a library:
- Annotation: The label on a book (title, author, category)
- Instance: A complete book (pages + label)
- Dataset: The entire library collection (all books organized together)
This structure, demonstrated in the notebook notebooks/data/SH_Annotation_Instance_Dataset.ipynb, provides a consistent interface for all machine learning data.
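As a quick preview of how the three pieces fit together (the concrete classes are introduced in the sections below; the feature values here are purely illustrative):
from ainxt.data import Annotation, RawInstance

# The label and its metadata for one data point
annotation = Annotation(labels="1", meta={1: "Kama", 2: "Rosa", 3: "Canadian"})

# The data point itself: raw features plus its annotation
instance = RawInstance(
    data=[15.26, 14.84, 0.871],
    annotations=[annotation],
    meta={0: "Area", 1: "Perimeter", 2: "Compactness"}
)

# A Dataset (section 3) is simply a sized, iterable collection of such instances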
1. Annotation: Labels and Metadata
What is an Annotation?
An Annotation object specifies all labels and metadata that belong to a data point. While commonly thought of as just a "label," annotations can contain various types of information:
- Labels: Classification categories, numeric values
- Text: Sentences, descriptions, or other text representations
- Image annotations: Bounding boxes, segmentation masks
- Metadata: Additional information about the annotation
Source: ainxt/data/annotation.py
Creating an Annotation
from ainxt.data import Annotation
# Simple classification annotation
annotation = Annotation(
    labels="1",  # The actual label value
    meta={1: "Kama", 2: "Rosa", 3: "Canadian"}  # Label descriptions
)
# Access the label
print(annotation.label) # "1"
# Access the metadata
print(annotation.meta) # {1: "Kama", 2: "Rosa", 3: "Canadian"}
# Get human-readable description
print(annotation.meta[int(annotation.label)]) # "Kama"
Why Annotations?
- Separation of concerns: Keep labels separate from raw data
- Metadata support: Store label descriptions, confidence scores, etc.
- Type flexibility: Support classification, regression, multi-label, etc.
- Consistency: Standard interface across all datasets
Common Annotation Patterns
# Multi-class classification
annotation = Annotation(
    labels="cat",
    meta={"classes": ["cat", "dog", "bird"], "confidence": 0.95}
)

# Regression
annotation = Annotation(
    labels=42.5,
    meta={"unit": "celsius", "sensor_id": "temp_01"}
)

# Multi-label classification
annotation = Annotation(
    labels=["cat", "indoor", "sleeping"],
    meta={"annotator": "expert_1", "date": "2024-10-14"}
)
2. Instance: The Data Container
What is an Instance?
An Instance object represents a single data point in your dataset. It contains:
- Raw data (data): The actual input (array, image, text, etc.)
- Annotations (annotations): One or more Annotation objects
- Metadata (meta): Information about the data point itself
Real-world analogy: An Instance is like a labeled photograph in your photo library:
- Photo pixels = data
- Tag/category = annotations
- Date taken, camera model = meta
Source: ainxt/data/instance.py
Creating Instances
from ainxt.data import RawInstance, Annotation
# Example: Seeds dataset row
raw_data = [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22] # 7 features
label = "1"
# Create annotation
annotation = Annotation(
    labels=label,
    meta={1: "Kama", 2: "Rosa", 3: "Canadian"}
)
# Create metadata for the instance
column_names = {
    0: "Area",
    1: "Perimeter",
    2: "Compactness",
    3: "Length of kernel",
    4: "Width of kernel",
    5: "Asymmetry coefficient",
    6: "Length of kernel groove"
}
# Create instance
instance = RawInstance(
    data=raw_data,
    annotations=[annotation],
    meta=column_names
)
# Access instance properties
print(instance.data) # [15.26, 14.84, ...]
print(instance.label) # "1"
print(instance.annotation) # First annotation
print(instance.meta) # {0: "Area", 1: "Perimeter", ...}
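An instance may also carry several annotations, for example one per annotator. instance.annotation returns the first one, while the full sequence is available via annotations (the same attribute used by the custom __eq__ further below); a short sketch reusing the variables above:
# Two annotators label the same data point
multi = RawInstance(
    data=raw_data,
    annotations=[
        Annotation(labels="1", meta={"annotator": "expert_1"}),
        Annotation(labels="1", meta={"annotator": "expert_2"}),
    ],
    meta=column_names
)
print(multi.annotation.meta)   # {"annotator": "expert_1"} -- the first annotation
print(len(multi.annotations))  # 2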
Custom Instance Classes
You can create custom Instance classes for specific data types. This is useful when:
- Data has a specific structure (images, audio, documents)
- You need special parsing logic
- You want domain-specific methods
- You need custom equality or hashing
Example: Seeds-specific Instance
import hashlib
from typing import Any, MutableMapping, Optional, Sequence

from ainxt.data import RawInstance, Annotation

class SeedsInstance(RawInstance):
    """Instance class specifically for the Seeds dataset."""

    COLUMNS = {
        0: "Area",
        1: "Perimeter",
        2: "Compactness",
        3: "Length of kernel",
        4: "Width of kernel",
        5: "Asymmetry coefficient",
        6: "Length of kernel groove"
    }
    CLASSES = {"1": "Kama", "2": "Rosa", "3": "Canadian"}

    @classmethod
    def create_raw(
        cls,
        data,
        annotations: Optional[Sequence[Annotation]] = None,
        meta: Optional[MutableMapping[str, Any]] = None
    ) -> "SeedsInstance":
        """Create a SeedsInstance directly from a raw row (including label).

        Args:
            data: Row with 8 values (7 features + 1 label)
            annotations: Optional annotations (auto-created if None)
            meta: Optional metadata (auto-set if None)

        Returns:
            SeedsInstance with parsed data and annotations
        """
        if annotations is None:
            # Extract label from the last column
            label = str(int(data[7]))
            annotations = [Annotation(labels=label, meta=cls.CLASSES)]
            # Use only the first 7 columns as features
            data = data[:7].copy()
        if meta is None:
            meta = cls.COLUMNS
        return cls(data, annotations, meta)

    @property
    def hash(self) -> str:
        """Generate a deterministic hash for this instance."""
        string_representation = "".join(str(element) for element in self.data)
        hash_object = hashlib.sha256(string_representation.encode())
        return hash_object.hexdigest()

    def __eq__(self, other):
        """Custom equality based on hash and annotations."""
        return (
            isinstance(other, SeedsInstance)
            and self.hash == other.hash
            and self.annotations == other.annotations
            and self.meta == other.meta
        )
# Usage with __init__ (manual annotation)
instance1 = SeedsInstance(
    data=[15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22],
    annotations=[Annotation(labels="1", meta=SeedsInstance.CLASSES)],
    meta=SeedsInstance.COLUMNS
)
# Usage with create_raw (auto-extracts label from row)
raw_row = [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1] # Label at end
instance2 = SeedsInstance.create_raw(data=raw_row)
# Both produce equivalent instances
print(instance1.label) # "1"
print(instance2.label) # "1"
print(instance2.hash) # SHA256 hash of the data
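Because the hash depends only on the feature values, it can be used, for example, to drop duplicate rows before building a dataset. A small sketch (the third row below is made-up illustrative data):
rows = [
    [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1],
    [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1],  # exact duplicate
    [14.88, 14.57, 0.881, 5.554, 3.333, 1.018, 4.956, 1],
]
instances = [SeedsInstance.create_raw(row) for row in rows]
unique = {instance.hash: instance for instance in instances}
print(len(unique))  # 2 -- the duplicate row collapses into a single entry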
The AcceptsRawInputMixin
The AcceptsRawInputMixin class provides automatic conversion from raw data to Instance objects:
from ainxt.data import AcceptsRawInputMixin, RawInstance

class MyInstance(AcceptsRawInputMixin, RawInstance):
    @classmethod
    def create_raw(cls, data):
        # Your custom parsing logic
        return cls(data=data, annotations=[...], meta={...})

# The mixin provides parse_input(), which calls create_raw()
# This enables datasets to accept raw data directly
3. Dataset: The Collection
What is a Dataset?
A Dataset is an iterable, sized collection of Instance objects. It represents your entire dataset and provides standardized access to your data.
Source: ainxt/data/dataset.py
The Dataset Interface
Every dataset must implement two methods:
from ainxt.data import Dataset
class MyDataset(Dataset):
    def __len__(self) -> int:
        """Return the number of instances in the dataset."""
        raise NotImplementedError

    def __iter__(self):
        """Yield instances one by one."""
        raise NotImplementedError
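A minimal concrete implementation can be as small as a wrapper around an in-memory list. The sketch below assumes Dataset can be subclassed directly, as in the interface example above; InMemoryDataset is a hypothetical helper, not part of ainxt:
from typing import Iterator, Sequence

from ainxt.data import Dataset, RawInstance

class InMemoryDataset(Dataset):
    """Smallest possible Dataset: wraps an existing list of instances."""

    def __init__(self, instances: Sequence[RawInstance]):
        self._instances = list(instances)

    def __len__(self) -> int:
        return len(self._instances)

    def __iter__(self) -> Iterator[RawInstance]:
        yield from self._instances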
Complete Dataset Example: Seeds Dataset
Based on notebooks/data/SH_Annotation_Instance_Dataset.ipynb:
import csv
from pathlib import Path
from typing import Iterator, Optional, Sequence, Union

import numpy as np

from ainxt.data import Dataset, AcceptsRawInputMixin
from ainxt.typing import PathLike
class SeedsDataset(AcceptsRawInputMixin, Dataset[SeedsInstance]):
    """Dataset class for the Seeds dataset.

    Supports three initialization methods:
    1. From a file path (reads a TSV file)
    2. From a list of SeedsInstance objects
    3. From a raw numpy array or list of lists
    """

    def __init__(
        self,
        input: Union[PathLike, Sequence[Sequence[float]], Sequence[SeedsInstance], np.ndarray],
        meta: Optional[str] = None
    ):
        self.meta = meta
        if isinstance(input, (str, Path)):
            # Load from file
            self.data = self.parse_seed_data(input)
        elif isinstance(input, (Sequence, np.ndarray)):
            if len(input) > 0 and isinstance(input[0], SeedsInstance):
                # Already SeedsInstance objects
                self.data = list(input)
            else:
                # Raw data - parse each row
                self.data = [self.parse_input(row) for row in input]
        else:
            raise TypeError(
                "Expected PathLike, a sequence of raw rows, or Sequence[SeedsInstance]"
            )

    def parse_seed_data(self, path: PathLike) -> list[SeedsInstance]:
        """Parse a TSV file into SeedsInstance objects."""
        with open(path, "r", newline="") as datafile:
            csvreader = csv.reader(datafile, delimiter="\t")
            return [self.parse_input(row) for row in csvreader]

    def __iter__(self) -> Iterator[SeedsInstance]:
        """Iterate over all instances."""
        yield from self.data

    def __len__(self) -> int:
        """Return the number of instances."""
        return len(self.data)

    # Optional: Additional utility methods
    def get_values(self, index: int):
        """Get feature values for a specific instance as a DataFrame."""
        import pandas as pd
        instance = self.data[index]
        values = [instance.data]
        columns = list(instance.COLUMNS.values())
        return pd.DataFrame(values, columns=columns)

    def get_label_value(self, index: int) -> str:
        """Get the human-readable label for a specific instance."""
        instance_annotation = self.data[index].annotation
        return instance_annotation.meta[instance_annotation.label]
# Usage 1: Initialize from file path
dataset = SeedsDataset("../files/seeds_dataset.txt")
print(len(dataset)) # 210
print(dataset.get_label_value(42)) # "Kama"
# Usage 2: Initialize from list of instances
subset = dataset.data[:5]
small_dataset = SeedsDataset(subset)
print(len(small_dataset)) # 5
# Usage 3: Initialize from raw numpy array
import pandas as pd
df = pd.read_csv("seeds_dataset.txt", sep="\t", header=None)
raw_array_dataset = SeedsDataset(df.to_numpy())
print(len(raw_array_dataset)) # 210
Iterating Over Datasets
Datasets are designed to be iterable:
# Iterate over all instances
for instance in dataset:
print(instance.data)
print(instance.label)
# Convert to list
all_instances = list(dataset)
# Get specific instance by index
first_instance = dataset.data[0]
# Check size
print(f"Dataset contains {len(dataset)} instances")
Dataset Splitting
aiNXT provides built-in dataset splitting functionality:
from ainxt.data.split import train_test_split_dataset
# Simple train/test split
train_dataset, test_dataset, _ = train_test_split_dataset(
    dataset=seeds_dataset,
    test_size=0.2,
    shuffle=True,
    random_state=42
)
print(f"Train: {len(train_dataset)} instances") # 168
print(f"Test: {len(test_dataset)} instances") # 42
# With validation set
train, test, val = train_test_split_dataset(
    dataset=seeds_dataset,
    test_size=0.2,
    validation_size=0.1,  # Additional 10% held out for validation
    shuffle=True,
    stratify=True,  # Maintain label distribution
    random_state=42
)
# Check label distributions
from ainxt.data.utils import get_label_distribution
print(get_label_distribution(train)) # {'1': 49, '2': 49, '3': 49}
print(get_label_distribution(test)) # {'1': 14, '2': 14, '3': 14}
print(get_label_distribution(val)) # {'1': 7, '2': 7, '3': 7}
Key insight: The split datasets are NEW Dataset objects of the same class, containing different subsets of instances. All functionality of the original dataset is preserved.
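A quick way to see this in practice (assuming the split wraps the selected instances in a new SeedsDataset, so domain-specific helpers remain usable):
print(type(train).__name__)               # "SeedsDataset" -- same class as the original
print(len(train) + len(test) + len(val))  # 210 -- every instance accounted for
print(train.get_label_value(0))           # e.g. "Kama" -- custom methods still work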
Key Design Principles
1. Separation of Concerns
# Annotation: Just the label + metadata
annotation = Annotation(labels="cat", meta={...})
# Instance: Raw data + annotations + instance metadata
instance = RawInstance(data=image, annotations=[annotation], meta={...})
# Dataset: Collection of instances + dataset-level operations
dataset = MyDataset(path="/data/images")
2. Immutability
Datasets and instances should not be modified after creation. Use decorators for transformations:
# GOOD - use decorators for modifications
dataset = MyDataset("/data")
augmented = AugmentedDataset(dataset, augmenters=[...])
filtered = FilterMetaDataset(augmented, attribute="quality", values=[4, 5])
# AVOID - modifying datasets directly
dataset.data.append(new_instance) # Don't do this!
3. Type Consistency
Use type hints to document what your dataset contains:
import numpy as np

from ainxt.data import Dataset, RawInstance

# Clear type information
class ImageDataset(Dataset[RawInstance[np.ndarray]]):
    """Dataset yielding numpy array images."""
    pass

class TextDataset(Dataset[RawInstance[str]]):
    """Dataset yielding text strings."""
    pass
Registration with Factory
Once created, datasets can be registered for configuration-based loading:
from ainxt.serving import DATASETS
# Manual registration
DATASETS.register(
    task="classification",
    name="seeds",
    constructor=SeedsDataset
)

# Use in configuration
config = {
    "name": "seeds",
    "input": "/data/seeds_dataset.txt"
}
dataset = DATASETS.build(**config)
Or use Loaders for automatic discovery (see Loaders).
Summary
The three-layer architecture provides standardization:
- Annotation: Labels and their metadata
  - Created once per data point
  - Separates labels from raw data
  - Supports multiple annotation types
- Instance: Data + Annotation + metadata
  - Represents a single training example
  - Consistent interface across datasets
  - Can be customized for specific domains
- Dataset: Collection of Instances
  - Iterable and sized
  - Supports multiple initialization methods
  - Can be split, filtered, and decorated
This architecture enables:
- Consistent interaction with any data source
- Easy integration with Factories and Loaders
- Standardized training pipelines
- Framework-agnostic data handling
See Also
- Dataset Decorators - Transform datasets without modifying them
- Augmenters - Data augmentation for datasets
- Factory - Create datasets from configuration
- Loaders - Automatic dataset discovery