
Core Concept: Dataset Decorators

Prerequisites

Before reading this document, you should understand:

  • Datasets - the base dataset system
  • Parsers - how configuration is transformed into objects

Dataset Decorators use both Parsers AND decorator wrapping, which can be confusing without understanding Parsers first!

What are Dataset Decorators?

Dataset Decorators are a special category of functionality that wraps and modifies datasets without changing the base dataset class. They enable powerful data transformations, filtering, and augmentation through simple YAML configuration.

Real-world analogy: Think of your base dataset as a book. Decorators are like transparent overlays you place on top:

  • One overlay highlights important passages (filtering)
  • Another adds translations in the margins (mapping)
  • Another adds bookmarks and notes (metadata)

Each overlay modifies what you see without changing the original book.


Dataset Decorators vs Regular Parsers

Before diving in, it's crucial to understand how Dataset Decorators differ from regular Parsers:

Regular Parsers

Purpose: Create NEW objects from configuration (optimizers, loss functions, augmenters, etc.)

Flow:

Configuration (dict)  →  Parser  →  New Object Created
─────────────────────────────────────────────────────

Example:
optimizer:            →  OPTIMIZERS  →  Adam(lr=0.001)
  name: adam             Parser
  learning_rate: 0.001

Code:

# Parser creates a NEW optimizer object
OPTIMIZERS = Factory()
OPTIMIZERS.register(None, "adam", AdamOptimizer)

# Usage
config = {"optimizer": {"name": "adam", "learning_rate": 0.001}}
parsed = parse_config(config, {"optimizer": OPTIMIZERS})
# Result: parsed["optimizer"] = Adam(learning_rate=0.001)  ← NEW object

Dataset Decorators

Purpose: WRAP existing datasets to modify their behavior (not create new ones)

Flow:

Base Dataset  →  Decorator  →  Wrapped Dataset (same data, modified behavior)
──────────────────────────────────────────────────────────────────────────────

Example:
ImageNetDataset  →  FilterMetaDataset  →  ImageNetDataset (filtered)
     (1000)              (quality>=4)            (800 instances)

Code:

# Decorator WRAPS the existing dataset
base_dataset = ImageNetDataset(path="/data")  # 1000 instances

# Configuration triggers decorator
config = {
    "name": "imagenet",
    "path": "/data",
    "filter_meta": {  # ← This key triggers FilterMetaDataset decorator
        "meta_attribute": "quality_score",
        "values": [4, 5],
        "operator": "in"
    }
}

# Factory creates base dataset, THEN wraps it with decorator
dataset = DATASETS.build(**config)
# Result: FilterMetaDataset(ImageNetDataset(...))  ← WRAPPED, not new
# Still 'ImageNetDataset' underneath, but filtered behavior on top

Visual Comparison

┌─────────────────────────────────────────────────────────────┐
│ REGULAR PARSER                                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Config Dict          Parser              New Object       │
│  ┌──────────┐        ┌──────┐           ┌─────────┐       │
│  │optimizer:│   →    │ OPTS │    →      │ Adam    │       │
│  │  adam    │        │Factory│           │ Object  │       │
│  └──────────┘        └──────┘           └─────────┘       │
│                                                             │
│  Input: Configuration                                      │
│  Output: NEW object (optimizer, loss, scheduler, etc.)     │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ DATASET DECORATOR                                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Base Dataset        Decorator           Wrapped Dataset   │
│  ┌──────────┐       ┌─────────┐        ┌──────────────┐   │
│  │ImageNet  │   →   │Filter   │   →    │FilterMeta    │   │
│  │Dataset   │       │Meta     │        │(ImageNet)    │   │
│  │1000 imgs │       │Decorator│        │800 imgs      │   │
│  └──────────┘       └─────────┘        └──────────────┘   │
│                                              ↑              │
│                                         Same dataset,       │
│                                         modified behavior   │
│                                                             │
│  Input: Existing dataset object                            │
│  Output: WRAPPED dataset (same data, different iteration)  │
└─────────────────────────────────────────────────────────────┘

Key Differences

| Aspect | Regular Parser | Dataset Decorator |
|--------|----------------|-------------------|
| Creates | New objects | Wraps existing datasets |
| Triggered by | Config key matches parser name | Config key matches decorator name |
| Applied to | Configuration values | Dataset objects |
| Result | Brand new object | Wrapped dataset |
| Examples | Optimizers, losses, callbacks | Filtering, augmentation, metadata |
| When | During config parsing | After dataset creation |

The Complete Flow

Here's how they work together:

1. CONFIGURATION
   ┌──────────────────────────────────────┐
   │ dataset:                             │
   │   name: imagenet                     │
   │   path: /data                        │
   │   augmenters:  ← Parser (creates)   │
   │     - name: flip                     │
   │   filter_meta:  ← Decorator (wraps) │
   │     quality: 4                       │
   └──────────────────────────────────────┘
2. REGULAR PARSER (parse_config)
   Creates augmenter object from config
   ┌──────────────────────────────────────┐
   │ augmenters: [FlipAugmenter()]        │
   │ name: imagenet                       │
   │ path: /data                          │
   │ filter_meta: {quality: 4}            │
   └──────────────────────────────────────┘
3. FACTORY CREATES BASE DATASET
   ┌──────────────────────────────────────┐
   │ base = ImageNetDataset(              │
   │     path="/data",                    │
   │     augmenters=[FlipAugmenter()]     │
   │ )                                    │
   └──────────────────────────────────────┘
4. DATASET DECORATOR WRAPS
   ┌──────────────────────────────────────┐
   │ wrapped = FilterMetaDataset(         │
   │     base,                            │
   │     meta_attribute="quality",        │
   │     values=[4]                       │
   │ )                                    │
   └──────────────────────────────────────┘
5. FINAL RESULT
   ┌──────────────────────────────────────┐
   │ FilterMetaDataset                    │
   │   └─> ImageNetDataset                │
   │         └─> with FlipAugmenter       │
   │                                      │
   │ Behavior: Iterates only over        │
   │ high-quality images, applying flip  │
   └──────────────────────────────────────┘

Summary:

  • Parsers create components that get passed TO datasets/models
  • Dataset Decorators wrap datasets to modify HOW they behave


How Are Decorators Applied? (The Mechanics)

This is the crucial part that ties everything together. Let's trace through the EXACT code path to understand WHEN and HOW decorators are applied.

The Three Patterns in aiNXT

There are actually THREE different patterns that work together:

| Aspect | Simple Decorators | Decorators with Parsers | Pure Parsed Arguments |
|--------|-------------------|-------------------------|-----------------------|
| Examples | filter_meta, hash, meta | augmenters, map (with func) | optimizer, loss_function (in models) |
| Has Decorator? | ✓ YES | ✓ YES | ✗ NO |
| Has Parser? | ✗ NO | ✓ YES | ✓ YES |
| Registered where | factory.register_decorator() | Both decorator registry AND PARSERS | Only PARSERS dict |
| Parse phase | N/A | parse_config() creates sub-objects | parse_config() creates objects |
| Wrap phase | Factory.build() wraps dataset | Factory.build() wraps with parsed objects | N/A - passed to constructor |
| Decorator gets | Simple values (dict/string) | Parsed objects | N/A |
| Final result | Wrapped dataset | Wrapped dataset with objects | Base dataset/model with objects |

The Key Insight: augmenters uses BOTH a parser AND a decorator working together!

Pattern 1: Simple Decorators (filter_meta, hash, meta)

Setup Phase (Happens Once)

# File: ainxt/serving/singletons.py
def create_dataset_factory(package, tasks):
    """Create factory with decorators."""

    # 1. Create loader for dataset classes
    dataset_loader = Loader(template=f"{package}.data.datasets.{{task}}", ...)

    # 2. Create loader for DECORATORS
    decorator_loader = Loader(
        template=f"{package}.data.datasets.{{task}}.decorators",  # ← Finds decorators!
        tasks=tasks
    )

    # 3. Create factory and register decorators
    factory = Factory(dataset_loader)
    factory.register_decorator(decorator_loader)  # ← KEY: Registers decorators!

    return factory

# Result: Factory has internal decorator registry
# factory.decorators = [
#     {(None, "filter_meta"): FilterMetaDataset},
#     {(None, "map"): MapDataset},
#     {(None, "hash"): HashFilterDataset},
#     ...
# ]

Execution Phase (Every time you load a dataset)

# User configuration
config = {
    "name": "imagenet",
    "path": "/data",
    "filter_meta": {  # ← Decorator trigger!
        "meta_attribute": "quality",
        "values": [4, 5]
    }
}

# User calls
dataset = CONTEXT.load_dataset(config)

Step-by-step execution:

┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config)                                  │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: DATASETS.build(**config)                               │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 2. Factory.build(name="imagenet", path="/data",                  │
│                  filter_meta={...})                              │
├──────────────────────────────────────────────────────────────────┤
│ → Looks up constructor: DATASETS[("classification", "imagenet")]│
│ → Returns: WRAPPED function (not raw constructor!)              │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 3. Factory.__getitem__(("classification", "imagenet"))          │
│    [ainxt/factory/factory.py:165-237]                           │
├──────────────────────────────────────────────────────────────────┤
│ # Get the raw constructor                                        │
│ constructor = ImageNetDataset.__init__                           │
│                                                                  │
│ # Create wrapper function                                        │
│ def wrapper(**kwargs):                                           │
│     # Inspect constructor signature                             │
│     needed_args = ["path"]  # ImageNetDataset needs "path"      │
│                                                                  │
│     # Find extra keys (decorator triggers!)                     │
│     extra_keys = set(kwargs) - set(needed_args)                 │
│     # extra_keys = {"filter_meta"}                              │
│                                                                  │
│     # Check if extra keys match decorators                      │
│     if "filter_meta" in self.decorators:                        │
│         # FOUND DECORATOR!                                       │
│         # Wrap constructor with decorator                       │
│         constructor = wrap_with_filter_meta(constructor)        │
│                                                                  │
│     return constructor(**kwargs)                                 │
│                                                                  │
│ return wrapper  # ← Returns WRAPPED function                    │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 4. wrapper(name="imagenet", path="/data", filter_meta={...})    │
├──────────────────────────────────────────────────────────────────┤
│ # Pop decorator args                                             │
│ filter_args = kwargs.pop("filter_meta")                          │
│ # kwargs = {"name": "imagenet", "path": "/data"}                │
│                                                                  │
│ # Call original constructor                                      │
│ base = ImageNetDataset(path="/data")                            │
│                                                                  │
│ # Apply decorator                                                │
│ wrapped = FilterMetaDataset(base, **filter_args)                │
│                                                                  │
│ return wrapped                                                   │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ RESULT: User gets FilterMetaDataset(ImageNetDataset(...))       │
└──────────────────────────────────────────────────────────────────┘

Key points:

1. Decorator names MUST match config keys: "filter_meta" in config → "filter_meta" decorator
2. The Factory automatically detects extra kwargs not needed by the constructor (see the sketch below)
3. Wrapping happens invisibly during Factory.__getitem__()
4. The user never sees the wrapping - it's automatic!
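
To make the detection step concrete, here is a minimal, self-contained sketch of the mechanism with toy classes. This is not the actual aiNXT Factory code; ToyDataset, FilterMeta, and build are illustrative stand-ins. The idea: inspect the constructor's signature, pop any kwargs that match a registered decorator, build the base object, then wrap it.

import inspect

# Toy stand-ins for illustration; these are not aiNXT classes.
class ToyDataset:
    def __init__(self, path):
        self.path = path
        self.items = [{"quality": q} for q in (1, 3, 5, 5, 2)]

    def __iter__(self):
        return iter(self.items)

class FilterMeta:
    """Wraps a dataset, yielding only items whose attribute is in values."""
    def __init__(self, dataset, meta_attribute, values):
        self.dataset = dataset
        self.attr = meta_attribute
        self.values = values

    def __iter__(self):
        return (i for i in self.dataset if i[self.attr] in self.values)

DECORATORS = {"filter_meta": FilterMeta}

def build(constructor, **kwargs):
    # 1. Which kwargs does the constructor actually accept?
    needed = set(inspect.signature(constructor).parameters)
    # 2. Extra keys that match a registered decorator are triggers.
    triggers = {k: kwargs.pop(k) for k in set(kwargs) - needed
                if k in DECORATORS}
    obj = constructor(**kwargs)             # build the base object first
    for key, args in triggers.items():      # then wrap it
        obj = DECORATORS[key](obj, **args)
    return obj

ds = build(ToyDataset, path="/data",
           filter_meta={"meta_attribute": "quality", "values": [4, 5]})
print(list(ds))  # [{'quality': 5}, {'quality': 5}] - a filtered view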

Pattern 2: Decorators with Parsers (augmenters, map with func)

This is the most complex pattern - it uses BOTH a parser AND a decorator!

The augmenters case:

  • AUGMENTERS Parser (in the PARSERS dict) - creates individual augmenter objects
  • AugmentedDataset Decorator (in the decorator registry) - wraps the dataset with those objects

Setup Phase

# File: myapp/parsers/augmenter.py
# This creates INDIVIDUAL augmenter objects from config
AUGMENTERS = Factory()
AUGMENTERS.register(None, "flip", FlipAugmenter)
AUGMENTERS.register(None, "rotate", RotateAugmenter)

# File: myapp/serving/singletons.py
# Register as PARSER (for creating augmenter objects)
PARSERS = {
    "augmenters": AUGMENTERS  # ← Parser for augmenter objects
}

# File: ainxt/data/datasets/decorators.py (line 87)
# The DECORATOR is built-in to aiNXT
@builder_name("augmenters")  # ← Decorator with same name!
class AugmentedDataset(MapDataset):
    def __init__(self, dataset, augmenters):
        # Receives PARSED augmenter objects
        # Wraps the dataset to apply them
        ...

# File: ainxt/serving/singletons.py
# Decorator auto-registered by Loader
decorator_loader = Loader(
    template="ainxt.data.datasets.decorators",
    # Finds AugmentedDataset class
)
DATASETS.register_decorator(decorator_loader)

Execution Phase

# User configuration
config = {
    "name": "imagenet",
    "path": "/data",
    "augmenters": [  # ← Triggers BOTH parser AND decorator!
        {"name": "flip"},
        {"name": "rotate", "degrees": 15}
    ]
}

dataset = CONTEXT.load_dataset(config)

Step-by-step execution:

┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config)                                  │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: Context._build(DATASETS, config)                       │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 2. Context._build() - FIRST calls parse_config()                │
│    [ainxt/scripts/context.py:96-109]                            │
├──────────────────────────────────────────────────────────────────┤
│ parsed = parse_config(config, PARSERS)                          │
│                                                                  │
│ # Check: Is "augmenters" in PARSERS?                            │
│ # YES! Use AUGMENTERS parser                                    │
│                                                                  │
│ for aug_config in config["augmenters"]:                         │
│     # Build each augmenter object                               │
│     aug = AUGMENTERS.build(name="flip")                         │
│     # aug = FlipAugmenter()                                     │
│                                                                  │
│ # Replace config value with parsed objects                      │
│ parsed = {                                                       │
│     "name": "imagenet",                                          │
│     "path": "/data",                                             │
│     "augmenters": [FlipAugmenter(), RotateAugmenter()]  ← OBJECTS│
│ }                                                                │
│                                                                  │
│ return DATASETS.build(**parsed)                                 │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 3. Factory.build() - Gets wrapped constructor                   │
│    [ainxt/factory/factory.py:165-237]                           │
├──────────────────────────────────────────────────────────────────┤
│ kwargs = {"name": "imagenet", "path": "/data",                  │
│           "augmenters": [FlipAugmenter(), ...]}  ← Objects now! │
│                                                                  │
│ # Find constructor                                               │
│ constructor = ImageNetDataset.__init__                          │
│                                                                  │
│ # Check constructor signature                                    │
│ needed_args = ["path"]  # ImageNetDataset(path)                 │
│                                                                  │
│ # Find extra keys                                                │
│ extra_keys = {"augmenters"}                                     │
│                                                                  │
│ # Check: Is "augmenters" a decorator?                           │
│ # YES! Found AugmentedDataset decorator                         │
│ # (registered by decorator_loader)                              │
│                                                                  │
│ # Wrap constructor with AugmentedDataset decorator              │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 4. Wrapped constructor executes                                  │
├──────────────────────────────────────────────────────────────────┤
│ # Pop augmenters from kwargs                                     │
│ augmenter_objects = kwargs.pop("augmenters")                    │
│ # augmenter_objects = [FlipAugmenter(), RotateAugmenter()]     │
│                                                                  │
│ # kwargs = {"path": "/data"}                                    │
│                                                                  │
│ # Create base dataset                                            │
│ base = ImageNetDataset(path="/data")                            │
│                                                                  │
│ # Apply AugmentedDataset decorator with parsed objects          │
│ wrapped = AugmentedDataset(                                      │
│     base,                                                        │
│     augmenters=[FlipAugmenter(), RotateAugmenter()]  ← Objects! │
│ )                                                                │
│                                                                  │
│ return wrapped                                                   │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ RESULT                                                           │
├──────────────────────────────────────────────────────────────────┤
│ dataset = AugmentedDataset(                                      │
│     ImageNetDataset(path="/data"),                              │
│     augmenters=[FlipAugmenter(), RotateAugmenter()]            │
│ )                                                                │
│                                                                  │
│ It's WRAPPED (decorator) AND receives PARSED OBJECTS!           │
└──────────────────────────────────────────────────────────────────┘

Key points:

1. TWO registrations share the same name "augmenters":
   • AUGMENTERS in PARSERS → creates augmenter objects
   • AugmentedDataset as decorator → wraps the dataset
2. TWO-PHASE process:
   • Phase 1 (parse_config): list of dicts → list of augmenter objects
   • Phase 2 (Factory.build): base dataset → wrapped with augmenter objects
3. The decorator receives parsed objects, not raw config! (A runnable sketch follows.)
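
A compressed, runnable sketch of the two-phase idea, using toy stand-ins rather than aiNXT's real classes: the same key triggers a parser (dicts → objects) during parsing, then a decorator (dataset → wrapper) during build.

# Toy stand-ins for illustration only.
class Flip:
    def __call__(self, x):
        return x[::-1]

AUGMENTER_PARSER = {"flip": Flip}   # phase 1 registry: name -> class

def parse_augmenters(configs):
    """Phase 1: list of config dicts -> list of augmenter objects."""
    return [AUGMENTER_PARSER[c["name"]]() for c in configs]

class AugmentedDataset:
    """Phase 2: wraps a dataset, applying augmenters during iteration."""
    def __init__(self, dataset, augmenters):
        self.dataset = dataset
        self.augmenters = augmenters

    def __iter__(self):
        for item in self.dataset:
            for aug in self.augmenters:
                item = aug(item)
            yield item

base = ["abc", "xyz"]                          # pretend this is a dataset
objs = parse_augmenters([{"name": "flip"}])    # phase 1 (parser)
wrapped = AugmentedDataset(base, objs)         # phase 2 (decorator)
print(list(wrapped))                           # ['cba', 'zyx']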

What ARE Augmenters?

Augmenters are classes that transform data on the fly (flip, rotate, blur, etc.). They inherit from the Augmenter base class and implement a __call__ method, which makes them callable like functions.

For complete details on Augmenters, see Augmenters.

Quick example:

# Augmenter is a class configured once, called many times
flip_augmenter = FlipAugmenter(direction="horizontal")
rotate_augmenter = RotateAugmenter(max_degrees=15)

# Parser creates these from config
# Decorator wraps dataset with them
# Iteration applies them automatically

Pattern 3: Pure Parsed Arguments (optimizer, loss_function)

This pattern has NO decorator - objects are only passed to constructor.

Setup Phase

# File: myapp/parsers/augmenter.py
AUGMENTERS = Factory()
AUGMENTERS.register(None, "flip", FlipAugmenter)
AUGMENTERS.register(None, "rotate", RotateAugmenter)

# File: myapp/serving/singletons.py
PARSERS = {
    "augmenters": AUGMENTERS,  # ← Registered as PARSER, not decorator!
    "optimizer": OPTIMIZERS,
    "loss_function": LOSSES
}

# File: myapp/context.py
CONTEXT = Context(
    dataset_builder=DATASETS,
    parsers=PARSERS  # ← Parsers passed to Context!
)

Execution Phase

# User configuration
config = {
    "name": "imagenet",
    "path": "/data",
    "augmenters": [  # ← Parser trigger (NOT decorator!)
        {"name": "flip"},
        {"name": "rotate", "degrees": 15}
    ]
}

# User calls
dataset = CONTEXT.load_dataset(config)

Step-by-step execution:

┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config)                                  │
│    [ainxt/scripts/context.py:49-58]                             │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: self._build(self.dataset_builder, config)              │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 2. Context._build(builder, config)                               │
│    [ainxt/scripts/context.py:96-109]                            │
├──────────────────────────────────────────────────────────────────┤
│ # FIRST: Parse configuration                                     │
│ parsed_config = parse_config(config, self.parsers)              │
│                                                                  │
│ # THEN: Build with parsed config                                │
│ return builder.build(**parsed_config)                           │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 3. parse_config(config, PARSERS)                                 │
│    [ainxt/serving/config.py:12-79]                              │
├──────────────────────────────────────────────────────────────────┤
│ for key, value in config.items():                               │
│     if key in parsers:  # Check if "augmenters" in PARSERS      │
│         # FOUND PARSER!                                          │
│         # Transform the value                                    │
│         if isinstance(value, list):                             │
│             # Parse each augmenter config                        │
│             augmenter_objects = []                              │
│             for aug_config in value:                            │
│                 aug = parsers[key].build(**aug_config)          │
│                 augmenter_objects.append(aug)                   │
│             config[key] = augmenter_objects                     │
│                                                                  │
│ return config                                                    │
│                                                                  │
│ # Result: config is MODIFIED                                     │
│ # config["augmenters"] = [FlipAugmenter(), RotateAugmenter()]  │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ 4. DATASETS.build(name="imagenet", path="/data",                 │
│                   augmenters=[FlipAugmenter(), ...])            │
├──────────────────────────────────────────────────────────────────┤
│ # Factory checks: Is "augmenters" a decorator?                   │
│ # NO! It's not in factory.decorators                            │
│                                                                  │
│ # Check constructor signature                                    │
│ # ImageNetDataset.__init__(path, augmenters=None)               │
│ # "augmenters" IS a constructor parameter!                      │
│                                                                  │
│ # Pass augmenters TO constructor (no wrapping!)                 │
│ base = ImageNetDataset(                                          │
│     path="/data",                                                │
│     augmenters=[FlipAugmenter(), RotateAugmenter()]            │
│ )                                                                │
│                                                                  │
│ return base  # ← Returns base dataset, NOT wrapped!            │
└───────────────────────────────┬──────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ RESULT: User gets ImageNetDataset with augmenters stored inside │
│ The dataset applies augmentation internally during iteration    │
└──────────────────────────────────────────────────────────────────┘

Key points:

1. Parsing happens BEFORE factory.build(), in Context._build()
2. parse_config() transforms dicts into actual objects (see the sketch below)
3. Objects are passed as regular constructor arguments
4. NO wrapping occurs - the dataset uses them internally
5. The constructor signature must accept these parameters!
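
The parse step in box 3 can be reduced to a few lines. This is a simplified sketch of what parse_config does, not the real implementation from ainxt/serving/config.py (which also handles nesting, tasks, and error reporting); _UpperParser is a hypothetical stand-in with the Factory-style build() API.

def parse_config(config, parsers):
    """Replace config values with built objects wherever a parser matches."""
    parsed = dict(config)
    for key, value in config.items():
        if key not in parsers:
            continue  # plain value, leave untouched
        if isinstance(value, list):
            parsed[key] = [parsers[key].build(**v) for v in value]
        else:
            parsed[key] = parsers[key].build(**value)
    return parsed

class _UpperParser:
    """Hypothetical parser stand-in with a Factory-style build() method."""
    def build(self, name):
        return name.upper()

cfg = {"path": "/data", "augmenters": [{"name": "flip"}, {"name": "rotate"}]}
print(parse_config(cfg, {"augmenters": _UpperParser()}))
# {'path': '/data', 'augmenters': ['FLIP', 'ROTATE']}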

Visual Comparison

TRUE DECORATORS (filter_meta, map, hash):
┌────────────┐
│ Config     │
└──────┬─────┘
┌──────────────┐      ┌─────────────┐
│ Factory.     │  →   │ Base        │
│ build()      │      │ Dataset     │
└──────────────┘      └──────┬──────┘
       ↓                     ↓
┌──────────────┐      ┌─────────────┐
│ Detects      │  →   │ Wraps with  │
│ extra kwargs │      │ Decorator   │
└──────────────┘      └──────┬──────┘
                      ┌─────────────┐
                      │ Wrapped     │
                      │ Dataset     │
                      └─────────────┘

PARSED ARGUMENTS (augmenters, optimizer):
┌────────────┐
│ Config     │
└──────┬─────┘
┌──────────────┐      ┌─────────────┐
│ parse_       │  →   │ Creates     │
│ config()     │      │ Objects     │
└──────────────┘      └──────┬──────┘
       ↓                     ↓
┌──────────────┐      ┌─────────────┐
│ Factory.     │  →   │ Passes to   │
│ build()      │      │ Constructor │
└──────────────┘      └──────┬──────┘
                      ┌─────────────┐
                      │ Base        │
                      │ Dataset     │
                      │ (with       │
                      │  objects)   │
                      └─────────────┘

How to Tell the Difference?

Check 1: Is it in PARSERS?

if "augmenters" in PARSERS:  # YES → Parsed argument
if "filter_meta" in PARSERS:  # NO → Decorator

Check 2: When is it processed?

parse_config()  # Parsed arguments processed here (before build)
Factory.build()  # Decorators processed here (during build)

Check 3: What's the constructor signature?

def __init__(self, path, augmenters=None):  # augmenters → Parsed argument
def __init__(self, path):  # No filter_meta → Decorator (added after)

Summary Table

| Feature | True Decorators | Parsed Arguments |
|---------|-----------------|------------------|
| Config Key | filter_meta, map, hash | augmenters, optimizer |
| Registered | factory.register_decorator() | In PARSERS dict |
| When Processed | During Factory.build() | During parse_config() |
| How Applied | Wraps after construction | Passed to constructor |
| Constructor Knows | No | Yes (has parameter) |
| Final Result | Wrapped dataset | Base dataset with objects |

Why Use Decorators?

Without Decorators (Bad)

class MyDatasetWithAugmentation(Dataset):
    def __init__(self, path, augment=False, filter_quality=False, add_metadata=False):
        self.path = path
        self.augment = augment
        self.filter_quality = filter_quality
        self.add_metadata = add_metadata
        # Complex logic mixing concerns...

    def __getitem__(self, idx):
        item = self.load_item(idx)
        if self.filter_quality and not self.is_quality(item):
            ...  # What to return here??? (there is no good answer)
        if self.augment:
            item = self.augment_item(item)
        if self.add_metadata:
            item.meta.update(...)
        return item

With Decorators (Good)

# Base dataset stays simple
class MyDataset(Dataset):
    def __init__(self, path):
        self.path = path

    def __getitem__(self, idx):
        return self.load_item(idx)  # Just load, nothing else!

All transformations go in the configuration instead:

dataset:
  name: my_dataset
  path: /data
  filter_meta:  # Decorator 1
    meta_attribute: quality_score
    values: [4, 5]
    operator: in
  augmenters:  # Decorator 2
    - name: flip
  meta:  # Decorator 3
    source: "train_v2"

Available Decorators

aiNXT provides several built-in decorators in ainxt/data/datasets/decorators.py:

| Decorator | Config Key | Purpose |
|-----------|------------|---------|
| MapDataset | map | Apply custom transformation to each instance |
| AugmentedDataset | augmenters, augmenter | Apply data augmentation |
| FilterDataset | filter | Remove instances based on custom criteria |
| FilterMetaDataset | filter_meta | Filter by metadata attributes |
| MetaAttributeDataset | meta | Add metadata to all instances |
| Hash-based filters | hash, exclude_hash | Deterministic splitting and exclusion |

1. MapDataset (map)

Purpose: Apply a custom transformation function to each instance.

Source: ainxt/data/datasets/decorators.py:17-83

When to Use

  • Custom preprocessing (normalize text, resize images, extract features)
  • Data type conversions
  • Feature engineering
  • Final transformations before model input

How It Works - Visual Flow

┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION                                                   │
├─────────────────────────────────────────────────────────────────┤
│ dataset:                                                        │
│   name: text_dataset                                            │
│   path: /data/texts.csv                                         │
│   map:  ← Decorator key                                         │
│     func: tokenize_and_normalize                                │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Factory creates base dataset                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ base_dataset = TextDataset(path="/data/texts.csv")             │
│                                                                 │
│ Instances:                                                      │
│   [0]: Instance(data="Hello World")                            │
│   [1]: Instance(data="GOODBYE Python")                         │
│   [2]: Instance(data="  Machine Learning  ")                   │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: MapDataset wraps base dataset                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ wrapped = MapDataset(                                           │
│     base_dataset,                                               │
│     func=tokenize_and_normalize  ← Transform function          │
│ )                                                               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, applies function to each instance      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ for instance in wrapped:                                        │
│     # Gets from base, applies function, yields result          │
│                                                                 │
│ [0]: "Hello World" → tokenize_and_normalize → ["hello", "world"]│
│ [1]: "GOODBYE Python" → tokenize_and_normalize → ["goodbye", "python"]│
│ [2]: "  Machine Learning  " → tokenize_and_normalize → ["machine", "learning"]│
└─────────────────────────────────────────────────────────────────┘

Configuration

name: my_dataset
data_path: /path/to/data.csv
map:
  func: my_preprocessing_function  # Must be registered

Example

# Register transformation function
def tokenize_and_normalize(instance):
    """Tokenize text and normalize."""
    instance.data = instance.data.lower().strip().split()
    return instance

# Register with factory
TRANSFORMATIONS = Factory()
TRANSFORMATIONS.register(None, "tokenize_and_normalize", tokenize_and_normalize)


Use in config:

dataset:
  name: text_dataset
  path: /data/texts.csv
  map:
    func: tokenize_and_normalize

Key Features

  • Lazy evaluation (applied during iteration)
  • Chain multiple map operations
  • Access full instance (data + metadata)
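
The wrapper itself is small. A minimal sketch of a lazy map-style dataset (illustrative; not the aiNXT MapDataset source, and LazyMap is a hypothetical name):

class LazyMap:
    """Illustrative map wrapper: applies func on access, not upfront."""
    def __init__(self, dataset, func):
        self.dataset = dataset
        self.func = func

    def __len__(self):
        return len(self.dataset)             # mapping never changes the size

    def __getitem__(self, idx):
        return self.func(self.dataset[idx])  # lazy: applied per access

    def __iter__(self):
        return (self.func(item) for item in self.dataset)

# Chaining works because the wrapper is itself a dataset:
words = ["Hello World", "  Machine Learning  "]
lowered = LazyMap(words, str.lower)
tokens = LazyMap(lowered, str.split)
print(list(tokens))   # [['hello', 'world'], ['machine', 'learning']]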

2. AugmentedDataset (augmenters)

Purpose: Apply data augmentation transformations.

Source: ainxt/data/datasets/decorators.py:85-162

When to Use

  • Image augmentation (flipping, rotation, color adjustment)
  • Text augmentation (paraphrasing, typos)
  • Audio augmentation (noise, tempo changes)
  • Training data diversity

How It Works - Visual Flow

┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION                                                   │
├─────────────────────────────────────────────────────────────────┤
│ dataset:                                                        │
│   name: image_dataset                                           │
│   path: /data/images                                            │
│   augmenters:  ← Decorator key (note: parsed by AUGMENTERS)    │
│     - name: flip                                                │
│       probability: 0.5                                          │
│     - name: rotate                                              │
│       degrees: 15                                               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Regular parser creates augmenter objects               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ augmenters = [                                                  │
│     FlipAugmenter(probability=0.5),      ← NEW object created  │
│     RotateAugmenter(degrees=15)          ← NEW object created  │
│ ]                                                               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: Factory creates base dataset WITH parsed augmenters    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ base_dataset = ImageDataset(                                    │
│     path="/data/images",                                        │
│     augmenters=[FlipAugmenter(...), RotateAugmenter(...)]      │
│ )                                                               │
│                                                                 │
│ Original Instances (100 images):                               │
│   [0]: Instance(data=<cat.jpg>)                                │
│   [1]: Instance(data=<dog.jpg>)                                │
│   ...                                                           │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, applies augmenters in sequence         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ for instance in base_dataset:                                   │
│     # Applies each augmenter in order                          │
│                                                                 │
│ Instance flow:                                                  │
│   Original: <cat.jpg>                                          │
│       ↓                                                         │
│   FlipAugmenter (p=0.5): <cat.jpg> or <flipped_cat.jpg>       │
│       ↓                                                         │
│   RotateAugmenter: <rotated_15deg>                            │
│       ↓                                                         │
│   Final: <augmented_cat.jpg>                                   │
│                                                                 │
│ Note: Augmentation happens during iteration, not wrapping!     │
└─────────────────────────────────────────────────────────────────┘

Important: Unlike the other decorators, augmenters here uses BOTH:

1. A Regular Parser (AUGMENTERS) - creates augmenter objects from config
2. The Dataset Constructor - receives the augmenter objects as parameters

In this flow the augmentation is applied internally by the dataset, not by a wrapper decorator (Pattern 3 above). When AugmentedDataset is registered as a decorator instead, the wrapping variant from Pattern 2 applies.

Regular Decorator Pattern:          Augmenter Pattern:
┌─────────────┐                    ┌─────────────┐
│ Decorator   │                    │ Parser      │
│ wraps       │                    │ creates     │
│ Dataset     │                    │ Augmenter   │
└─────────────┘                    └─────────────┘
       ↓                                  ↓
┌─────────────┐                    ┌─────────────┐
│ Base        │                    │ Dataset     │
│ Dataset     │                    │ applies it  │
│ inside      │                    │ internally  │
└─────────────┘                    └─────────────┘

Configuration

name: my_dataset
data_path: /path/to/data.csv

# Single augmenter
augmenters:
  - name: flip_horizontal

# Multiple augmenters (applied in sequence)
augmenters:
  - name: flip_horizontal
    probability: 0.5
  - name: rotate
    angle: 15
  - name: add_gaussian_noise
    std: 0.05

Example

# Define augmenters
from ainxt.data.augmentation import Augmenter

class RandomFlip(Augmenter):
    def __init__(self, probability=0.5):
        self.probability = probability

    def __call__(self, instance):
        if random.random() < self.probability:
            instance.data = flip(instance.data)
        return instance

# Register
AUGMENTERS = Factory()
AUGMENTERS.register("image", "flip_horizontal", RandomFlip)


Use in config:

dataset:
  task: image
  name: imagenet
  augmenters:
    - task: image
      name: flip_horizontal
      probability: 0.7

Key Features

  • Sequential application (chained together)
  • Supports single augmenter or list
  • Lazy evaluation (computed during training)
  • Uses chain() utility to combine augmenters
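
The last bullet mentions a chain() utility. A possible shape for such a helper, shown with toy string augmenters (illustrative; the real chain() in aiNXT may differ):

def chain(*augmenters):
    """Compose augmenters into one callable, applied left to right."""
    def chained(instance):
        for aug in augmenters:
            instance = aug(instance)
        return instance
    return chained

# Toy augmenters operating on strings, for illustration:
upper = lambda s: s.upper()
exclaim = lambda s: s + "!"
pipeline = chain(upper, exclaim)
print(pipeline("cat"))   # "CAT!"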

Best Practices

  • Apply only to training data, not validation/test
  • Order from least to most destructive
  • Use probability for stochastic augmentation
  • Balance diversity vs. training speed

3. FilterDataset (filter)

Purpose: Remove instances based on custom criteria.

Source: ainxt/data/datasets/decorators.py:164-250

When to Use

  • Quality filtering (remove corrupted data)
  • Content filtering (specific criteria)
  • Size filtering (too large/small)
  • Custom business logic

Configuration

name: my_dataset
data_path: /path/to/data.csv
filter:
  func: quality_filter  # Returns True to keep, False to remove
  size: 1000  # Optional: expected size after filtering

Example

# Define filter function
def quality_filter(instance):
    """Keep only high-quality instances."""
    return instance.meta.get('quality_score', 0) >= 7.0

# Register
FILTERS = Factory()
FILTERS.register(None, "quality_filter", quality_filter)


Use in config:

dataset:
  name: document_dataset
  path: /data/docs
  filter:
    func: quality_filter
    size: 800  # Helps with length calculation

Key Features

  • Lazy filtering (during iteration)
  • Optional size parameter for performance
  • Without size, requires full iteration to compute length

Performance Tip

Specify size if you know the expected filtered dataset size to avoid iterating through the entire dataset just to get its length.
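
To see why the size hint helps, here is an illustrative sketch (not the real FilterDataset; LazyFilter is a hypothetical name): without the hint, len() has to walk the whole dataset once just to count survivors.

class LazyFilter:
    """Illustrative filter wrapper; size avoids a full pass for len()."""
    def __init__(self, dataset, func, size=None):
        self.dataset = dataset
        self.func = func
        self._size = size

    def __iter__(self):
        return (item for item in self.dataset if self.func(item))

    def __len__(self):
        if self._size is None:
            # No hint: must iterate everything once just to count.
            self._size = sum(1 for _ in self)
        return self._size

data = LazyFilter(range(1000), lambda x: x % 5 == 0, size=200)
print(len(data))   # instant: uses the hint instead of iterating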


4. FilterMetaDataset (filter_meta)

Purpose: Filter instances by metadata attributes (simplified filtering).

Source: ainxt/data/datasets/decorators.py:252-352

When to Use

  • Category filtering (include/exclude specific labels)
  • Quality filtering (by scores or ratings)
  • Source filtering (by data source)
  • Any metadata-based filtering

How It Works - Visual Flow

┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION                                                   │
├─────────────────────────────────────────────────────────────────┤
│ dataset:                                                        │
│   name: document_dataset                                        │
│   path: /data/docs.csv                                          │
│   filter_meta:  ← Decorator key                                 │
│     meta_attribute: quality_score                               │
│     values: [4, 5]                                              │
│     operator: in                                                │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Factory creates base dataset                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ base_dataset = DocumentDataset(path="/data/docs.csv")          │
│                                                                 │
│ All Instances (1000 total):                                    │
│   [0]: Instance(data="Doc A", meta={quality_score: 5}) ✓       │
│   [1]: Instance(data="Doc B", meta={quality_score: 2}) ✗       │
│   [2]: Instance(data="Doc C", meta={quality_score: 4}) ✓       │
│   [3]: Instance(data="Doc D", meta={quality_score: 1}) ✗       │
│   [4]: Instance(data="Doc E", meta={quality_score: 5}) ✓       │
│   ...                                                           │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: FilterMetaDataset wraps base dataset                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ wrapped = FilterMetaDataset(                                    │
│     base_dataset,                                               │
│     meta_attribute="quality_score",                             │
│     values=[4, 5],        ← Only keep these scores             │
│     operator="in"                                               │
│ )                                                               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, filters based on metadata              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ for instance in wrapped:                                        │
│     # Only yields instances where quality_score in [4, 5]      │
│                                                                 │
│ Filtered Instances (800 remaining):                            │
│   [0]: Instance(data="Doc A", meta={quality_score: 5}) ✓ KEPT │
│   [1]: Instance(data="Doc B", meta={quality_score: 2})   SKIPPED│
│   [2]: Instance(data="Doc C", meta={quality_score: 4}) ✓ KEPT │
│   [3]: Instance(data="Doc D", meta={quality_score: 1})   SKIPPED│
│   [4]: Instance(data="Doc E", meta={quality_score: 5}) ✓ KEPT │
│   ...                                                           │
│                                                                 │
│ Result: len(wrapped) = 800  (was 1000)                         │
└─────────────────────────────────────────────────────────────────┘

Decision Logic:

For each instance:
    ┌─────────────────────────────────────┐
    │ Get meta[quality_score]             │
    └──────────────┬──────────────────────┘
    ┌─────────────────────────────────────┐
    │ Is value in [4, 5]?                 │
    └──────┬──────────────────┬───────────┘
           ↓ YES              ↓ NO
    ┌──────────┐       ┌──────────┐
    │ YIELD    │       │ SKIP     │
    │ instance │       │ instance │
    └──────────┘       └──────────┘

Configuration

name: my_dataset
data_path: /path/to/data.csv

# Keep instances where category is in specified values
filter_meta:
  meta_attribute: category
  values: ["positive", "neutral"]
  operator: in  # or "not in"

# Exclude low-quality samples
filter_meta:
  meta_attribute: quality_score
  values: [1, 2]
  operator: not in

Example

# Keep only validated documents
dataset:
  name: document_dataset
  path: /data/docs.csv
  filter_meta:
    meta_attribute: is_validated
    values: [true]
    operator: in

# Exclude specific problematic sources
dataset:
  name: text_dataset
  path: /data/texts.csv
  filter_meta:
    meta_attribute: source
    values: ["web_scraping_2021", "manual_entry"]
    operator: not in

Key Features

  • Simple and efficient (O(1) metadata lookup)
  • Supports in and not in operators
  • Works with any metadata attribute
  • Much faster than custom filter functions for metadata checks

Supported Operators

  • "in": Keep if instance.meta[attribute] is in values
  • "not in": Keep if instance.meta[attribute] is NOT in values

5. MetaAttributeDataset (meta)

Purpose: Add metadata attributes to all instances.

Source: ainxt/data/datasets/decorators.py:354-465

When to Use

  • Dataset versioning (track processing version)
  • Source tracking (identify data source)
  • Experiment tagging (add experiment IDs)
  • Split identification (mark as train/val/test)

Configuration

name: my_dataset
data_path: /path/to/data.csv

meta:
  source: "training_set"
  version: "v2.1"
  processed_date: "2024-01-15"
  experiment_id: 42
  # Any key-value pairs

Example

# Tag all instances with experiment info
dataset:
  name: image_dataset
  path: /data/images
  meta:
    dataset_version: "v3.0"
    preprocessing: "standardized"
    augmentation_applied: false
    split: "train"

Key Features

  • Adds metadata to every instance
  • Supports in-place or copy mode
  • Can use dictionary or kwargs

In-place vs Copy

# In-place (default) - modifies original instances
meta:
  inplace: true  # Default
  source: "train"

# Copy - creates deep copies (safer but uses more memory)
meta:
  inplace: false
  source: "train"

Warning: In-place modification affects the original dataset permanently. Use inplace: false if you need to preserve originals.
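
A sketch of the inplace distinction, using plain dicts as stand-in instances (add_meta is a hypothetical helper; copy.deepcopy is what makes the copy mode safe):

import copy

def add_meta(instances, new_meta, inplace=True):
    """Attach new_meta to every instance; copy first if inplace is False."""
    for instance in instances:
        target = instance if inplace else copy.deepcopy(instance)
        target["meta"].update(new_meta)
        yield target

original = [{"data": "doc A", "meta": {}}]
tagged = list(add_meta(original, {"split": "train"}, inplace=False))
print(original[0]["meta"])   # {}  - untouched, because we copied
print(tagged[0]["meta"])     # {'split': 'train'}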


6. Hash-based Filtering

Hash Range (hash)

Purpose: Create deterministic dataset splits based on hash values.

Source: ainxt/data/datasets/decorators.py:621-714

Configuration

# Training set (hash ends in 0-7, ~80% of data)
dataset:
  name: my_dataset
  path: /data
  hash:
    hash_range: [0, 7]

# Test set (hash ends in 8-9, ~20% of data)
dataset:
  name: my_dataset
  path: /data
  hash:
    hash_range: [8, 9]

# Single value (hash ends in exactly 5, ~10%)
dataset:
  name: my_dataset
  path: /data
  hash:
    hash_range: 5

How It Works

  1. Each instance has a deterministic hash based on content
  2. Last N digits of hash are examined (N = digits in upper bound)
  3. Only instances with hash suffix in range are included
  4. Split is consistent across runs

Split Size Estimation

  • [0, 7]: approximately 80% of data
  • [8, 9]: approximately 20% of data
  • [0, 4]: approximately 50% of data
  • 5: approximately 10% of data (single digit)
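
A sketch of the idea using hashlib for a content hash that is stable across runs (Python's built-in hash() is salted per process, so it would not give consistent splits). The helper names are illustrative, not aiNXT's:

import hashlib

def stable_hash(content: str) -> int:
    """Deterministic hash of content - identical across runs and machines."""
    return int(hashlib.md5(content.encode()).hexdigest(), 16)

def in_split(content: str, lo: int, hi: int) -> bool:
    """Keep the instance if the last digit of its hash falls in [lo, hi]."""
    return lo <= stable_hash(content) % 10 <= hi

docs = [f"document-{i}" for i in range(1000)]
train = [d for d in docs if in_split(d, 0, 7)]   # ~80%
test = [d for d in docs if in_split(d, 8, 9)]    # ~20%
print(len(train), len(test))  # roughly 800 / 200, identical every run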

Hash Exclusion (exclude_hash)

Purpose: Exclude specific instances by their hash values.

Source: ainxt/data/datasets/decorators.py:716-833

Configuration

# Exclude specific instances
dataset:
  name: my_dataset
  path: /data
  exclude_hash:
    sources: [12345, 67890, 11111]

# Exclude from file
dataset:
  name: my_dataset
  path: /data
  exclude_hash:
    sources: /path/to/excluded_hashes.txt

# Mixed sources
dataset:
  name: my_dataset
  path: /data
  exclude_hash:
    sources: [12345, "/path/to/more.txt", 67890]

File Format

# excluded_hashes.txt (one hash per line)
12345
67890
11111

22222
33333

When to Use

  • Exclude corrupted instances
  • Remove known duplicates
  • Filter out problematic data
  • Blacklist specific instances
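
A sketch of how mixed sources might be flattened into a single exclusion set (illustrative; the real loader in aiNXT may differ, and load_excluded_hashes is a hypothetical name):

from pathlib import Path

def load_excluded_hashes(sources):
    """Flatten literal hashes and file paths into one set of excluded hashes."""
    excluded = set()
    for source in sources:
        if isinstance(source, int):
            excluded.add(source)
        else:  # treat as a path to a file with one hash per line
            for line in Path(source).read_text().splitlines():
                if line.strip():          # skip blank lines
                    excluded.add(int(line))
    return excluded

# excluded = load_excluded_hashes([12345, "/path/to/more.txt", 67890])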

Combining Decorators

Decorators are applied in the order they appear in configuration:

dataset:
  name: my_dataset
  path: /data/texts.csv

  # Step 1: Add metadata
  meta:
    dataset_version: "v2.1"
    source: "web_scraping"

  # Step 2: Filter by quality
  filter_meta:
    meta_attribute: quality_score
    values: [4, 5]
    operator: in

  # Step 3: Create train split
  hash:
    hash_range: [0, 7]  # 80% for training

  # Step 4: Apply augmentation
  augmenters:
    - name: paraphrase
      probability: 0.3
    - name: add_typos
      error_rate: 0.02

  # Step 5: Final preprocessing
  map:
    func: tokenize_and_normalize

Processing flow:

Original Dataset
    ↓
MetaAttributeDataset (adds metadata)
    ↓
FilterMetaDataset (filters by quality)
    ↓
HashFilterDataset (creates train split)
    ↓
AugmentedDataset (applies augmentations)
    ↓
MapDataset (final preprocessing)
    ↓
Final Dataset
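
Written as code, top-to-bottom config keys become inside-out wrapping. A toy sketch using strings in place of the real dataset classes, just to show the ordering:

# Illustrative only: each config key pairs with the wrapper it triggers.
pipeline = [
    ("meta",        lambda ds: f"MetaAttributeDataset({ds})"),
    ("filter_meta", lambda ds: f"FilterMetaDataset({ds})"),
    ("hash",        lambda ds: f"HashFilterDataset({ds})"),
    ("augmenters",  lambda ds: f"AugmentedDataset({ds})"),
    ("map",         lambda ds: f"MapDataset({ds})"),
]

dataset = "BaseDataset"
for key, wrap in pipeline:
    dataset = wrap(dataset)   # the first key ends up innermost

print(dataset)
# MapDataset(AugmentedDataset(HashFilterDataset(
#     FilterMetaDataset(MetaAttributeDataset(BaseDataset)))))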


Best Practices

1. Order Matters

Apply decorators in logical order:

# GOOD ORDER
dataset:
  meta: ...          # 1. Add metadata first
  filter_meta: ...   # 2. Filter to reduce size
  hash: ...          # 3. Create split
  augmenters: ...    # 4. Augment (expensive)
  map: ...           # 5. Final transform

# BAD ORDER (inefficient)
dataset:
  augmenters: ...    # Augments everything (wasteful!)
  filter_meta: ...   # Then filters out augmented data

2. Filter Early

Apply filters early to reduce dataset size before expensive operations:

# GOOD
dataset:
  filter_meta: ...   # Filter first (fast)
  augmenters: ...    # Then augment (only filtered data)

# BAD
dataset:
  augmenters: ...    # Augment everything (slow)
  filter_meta: ...   # Filter after (wasted work)

3. Use Hash for Reproducibility

For train/test splits, use hash-based splitting instead of random sampling:

# GOOD - reproducible
train_dataset:
  hash:
    hash_range: [0, 7]

test_dataset:
  hash:
    hash_range: [8, 9]

# AVOID - not reproducible
# random_split: 0.8  (different every run)

4. Augmentation Only for Training

Never augment validation or test sets:

# train.yaml
dataset:
  augmenters:  # OK for training
    - name: flip

# val.yaml or test.yaml
dataset:
  # NO augmenters!  # Keep evaluation data unchanged

5. Document Your Pipeline

# GOOD - documented pipeline
dataset:
  name: text_dataset
  path: /data/texts.csv

  # Remove low-quality samples (expect ~80% to pass)
  filter_meta:
    meta_attribute: quality_score
    values: [4, 5]
    operator: in

  # Training split (80% of filtered data)
  hash:
    hash_range: [0, 7]

  # Data augmentation for training diversity
  augmenters:
    - name: synonym_replacement
      probability: 0.3

Common Patterns

Pattern 1: Training Pipeline

dataset:
  name: my_dataset
  path: /data
  filter_meta:
    meta_attribute: is_valid
    values: [true]
  hash:
    hash_range: [0, 7]
  augmenters:
    - name: augment_data

Pattern 2: Validation Pipeline

dataset:
  name: my_dataset
  path: /data
  filter_meta:
    meta_attribute: is_valid
    values: [true]
  hash:
    hash_range: [8, 8]  # Different range from training ([0, 7])
  # NO augmentation!

Pattern 3: Test Pipeline

dataset:
  name: my_dataset
  path: /data
  filter_meta:
    meta_attribute: is_valid
    values: [true]
  hash:
    hash_range: [9, 9]  # Holdout set
  # NO augmentation!

Troubleshooting

Issue 1: Decorator Not Applied

Problem: Configuration has decorator key but nothing happens

Solution: Check that decorator is registered

# Ensure decorators are registered with factory
from ainxt.serving import create_dataset_factory

DATASETS = create_dataset_factory("myapp", tasks)
# This automatically registers built-in decorators

Issue 2: Wrong Decorator Order

Problem: Results don't match expectations

Solution: Check decorator application order in config

# Decorators are applied top-to-bottom
dataset:
  augmenters: ...  # Applied FIRST
  filter: ...      # Applied SECOND (might filter augmented data!)

Issue 3: Performance Issues

Problem: Dataset loading is slow

Solutions:

  • Filter early (before augmentation)
  • Specify the size parameter in FilterDataset
  • Use filter_meta instead of a custom filter when possible
  • Check whether augmentation is too expensive


Summary

  • Dataset Decorators modify datasets without changing base classes
  • Configuration-driven through YAML keys
  • Composable - combine multiple decorators
  • Order matters - decorators apply sequentially
  • Built-in decorators cover common use cases (map, filter, augment, metadata)
  • Hash-based operations ensure reproducibility

Dataset decorators are a powerful feature that keeps your dataset classes simple while enabling complex data pipelines through configuration alone.

See Also