Core Concept: Dataset Decorators
Prerequisites
Before reading this document, you should understand:
- Datasets - The base dataset system
- Parsers - How configuration is transformed into objects
Dataset Decorators use both Parsers AND decorator wrapping, which can be confusing without understanding Parsers first!
What are Dataset Decorators?
Dataset Decorators are a special category of functionality that wraps and modifies datasets without changing the base dataset class. They enable powerful data transformations, filtering, and augmentation through simple YAML configuration.
Real-world analogy: Think of your base dataset as a book. Decorators are like transparent overlays you place on top:
- One overlay highlights important passages (filtering)
- Another adds translations in margins (mapping)
- Another adds bookmarks and notes (metadata)
Each overlay modifies what you see without changing the original book.
Dataset Decorators vs Regular Parsers
Before diving in, it's crucial to understand how Dataset Decorators differ from regular Parsers:
Regular Parsers
Purpose: Create NEW objects from configuration (optimizers, loss functions, augmenters, etc.)
Flow:
Configuration (dict) → Parser → New Object Created
─────────────────────────────────────────────────────
Example:
optimizer:                →   OPTIMIZERS   →   Adam(lr=0.001)
  name: adam                  Parser
  learning_rate: 0.001
Code:
# Parser creates a NEW optimizer object
OPTIMIZERS = Factory()
OPTIMIZERS.register(None, "adam", AdamOptimizer)
# Usage
config = {"optimizer": {"name": "adam", "learning_rate": 0.001}}
parsed = parse_config(config, {"optimizer": OPTIMIZERS})
# Result: parsed["optimizer"] = Adam(learning_rate=0.001) ← NEW object
Dataset Decorators
Purpose: WRAP existing datasets to modify their behavior (not create new ones)
Flow:
Base Dataset → Decorator → Wrapped Dataset (same data, modified behavior)
──────────────────────────────────────────────────────────────────────────────
Example:
ImageNetDataset   →   FilterMetaDataset   →   ImageNetDataset (filtered)
(1000)                (quality>=4)            (800 instances)
Code:
# Decorator WRAPS the existing dataset
base_dataset = ImageNetDataset(path="/data") # 1000 instances
# Configuration triggers decorator
config = {
"name": "imagenet",
"path": "/data",
"filter_meta": { # ← This key triggers FilterMetaDataset decorator
"meta_attribute": "quality_score",
"values": [4, 5],
"operator": "in"
}
}
# Factory creates base dataset, THEN wraps it with decorator
dataset = DATASETS.build_from_config(config)
# Result: FilterMetaDataset(ImageNetDataset(...)) ← WRAPPED, not new
# Still 'ImageNetDataset' underneath, but filtered behavior on top
Visual Comparison
┌─────────────────────────────────────────────────────────────┐
│ REGULAR PARSER │
├─────────────────────────────────────────────────────────────┤
│ │
│ Config Dict Parser New Object │
│ ┌──────────┐ ┌──────┐ ┌─────────┐ │
│ │optimizer:│ → │ OPTS │ → │ Adam │ │
│ │ adam │ │Factory│ │ Object │ │
│ └──────────┘ └──────┘ └─────────┘ │
│ │
│ Input: Configuration │
│ Output: NEW object (optimizer, loss, scheduler, etc.) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DATASET DECORATOR │
├─────────────────────────────────────────────────────────────┤
│ │
│ Base Dataset Decorator Wrapped Dataset │
│ ┌──────────┐ ┌─────────┐ ┌──────────────┐ │
│ │ImageNet │ → │Filter │ → │FilterMeta │ │
│ │Dataset │ │Meta │ │(ImageNet) │ │
│ │1000 imgs │ │Decorator│ │800 imgs │ │
│ └──────────┘ └─────────┘ └──────────────┘ │
│ ↑ │
│ Same dataset, │
│ modified behavior │
│ │
│ Input: Existing dataset object │
│ Output: WRAPPED dataset (same data, different iteration) │
└─────────────────────────────────────────────────────────────┘
Key Differences
| Aspect | Regular Parser | Dataset Decorator |
|---|---|---|
| Creates | New objects | Wraps existing datasets |
| Triggered by | Config key matches parser name | Config key matches decorator name |
| Applied to | Configuration values | Dataset objects |
| Result | Brand new object | Wrapped dataset |
| Examples | Optimizers, losses, callbacks | Filtering, augmentation, metadata |
| When | During config parsing | After dataset creation |
The Complete Flow
Here's how they work together:
1. CONFIGURATION
┌──────────────────────────────────────┐
│ dataset: │
│ name: imagenet │
│ path: /data │
│ augmenters: ← Parser (creates) │
│ - name: flip │
│ filter_meta: ← Decorator (wraps) │
│ quality: 4 │
└──────────────────────────────────────┘
↓
2. REGULAR PARSER (parse_config)
Creates augmenter object from config
┌──────────────────────────────────────┐
│ augmenters: [FlipAugmenter()] │
│ name: imagenet │
│ path: /data │
│ filter_meta: {quality: 4} │
└──────────────────────────────────────┘
↓
3. FACTORY CREATES BASE DATASET
┌──────────────────────────────────────┐
│ base = ImageNetDataset( │
│ path="/data", │
│ augmenters=[FlipAugmenter()] │
│ ) │
└──────────────────────────────────────┘
↓
4. DATASET DECORATOR WRAPS
┌──────────────────────────────────────┐
│ wrapped = FilterMetaDataset( │
│ base, │
│ meta_attribute="quality", │
│ values=[4] │
│ ) │
└──────────────────────────────────────┘
↓
5. FINAL RESULT
┌──────────────────────────────────────┐
│ FilterMetaDataset │
│ └─> ImageNetDataset │
│ └─> with FlipAugmenter │
│ │
│ Behavior: Iterates only over │
│ high-quality images, applying flip │
└──────────────────────────────────────┘
Summary:
- Parsers create components that get passed TO datasets/models
- Dataset Decorators wrap datasets to modify HOW they behave
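To make the contrast concrete before diving into the mechanics, here is a minimal, framework-free sketch. All names (ToyDataset, FlipAugmenter, FilteredView) are illustrative stand-ins rather than aiNXT classes: the augmenter is the kind of object a parser creates and passes to a dataset, while the filtered view is the kind of wrapper a decorator layers on top.

```python
class FlipAugmenter:                      # what a PARSER creates from config
    def __call__(self, item):
        return item[::-1]

class ToyDataset:
    def __init__(self, items, augmenters=None):
        self.items = items
        self.augmenters = augmenters or []    # parsed objects are passed IN

    def __iter__(self):
        for item in self.items:
            for aug in self.augmenters:       # the dataset applies them itself
                item = aug(item)
            yield item

class FilteredView:                       # what a DECORATOR does: wrap, not rebuild
    def __init__(self, dataset, keep):
        self.dataset, self.keep = dataset, keep

    def __iter__(self):
        return (x for x in self.dataset if self.keep(x))

base = ToyDataset(["ab", "xy", "no"], augmenters=[FlipAugmenter()])
wrapped = FilteredView(base, keep=lambda s: s != "on")   # "no" flipped -> "on", filtered out
print(list(wrapped))                                     # ['ba', 'yx']
```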
How Are Decorators Applied? (The Mechanics)
This is the crucial part that ties everything together. Let's trace through the EXACT code path to understand WHEN and HOW decorators are applied.
The Three Patterns in aiNXT
There are actually THREE different patterns that work together:
| Aspect | Simple Decorators | Decorators with Parsers | Pure Parsed Arguments |
|---|---|---|---|
| Examples | filter_meta, hash, meta | augmenters, map (with func) | optimizer, loss_function (in models) |
| Has Decorator? | ✓ YES | ✓ YES | ✗ NO |
| Has Parser? | ✗ NO | ✓ YES | ✓ YES |
| Registered where | factory.register_decorator() | Both decorator registry AND PARSERS | Only PARSERS dict |
| Parse phase | N/A | parse_config() creates sub-objects | parse_config() creates objects |
| Wrap phase | Factory.build() wraps dataset | Factory.build() wraps with parsed objects | N/A - passed to constructor |
| Decorator gets | Simple values (dict/string) | Parsed objects | N/A |
| Final result | Wrapped dataset | Wrapped dataset with objects | Base dataset/model with objects |
The Key Insight: augmenters uses BOTH a parser AND a decorator working together!
Pattern 1: Simple Decorators (filter_meta, hash, meta)
Setup Phase (Happens Once)
# File: ainxt/serving/singletons.py
def create_dataset_factory(package, tasks):
"""Create factory with decorators."""
# 1. Create loader for dataset classes
dataset_loader = Loader(template=f"{package}.data.datasets.{{task}}", ...)
# 2. Create loader for DECORATORS
decorator_loader = Loader(
template=f"{package}.data.datasets.{{task}}.decorators", # ← Finds decorators!
tasks=tasks
)
# 3. Create factory and register decorators
factory = Factory(dataset_loader)
factory.register_decorator(decorator_loader) # ← KEY: Registers decorators!
return factory
# Result: Factory has internal decorator registry
# factory.decorators = [
# {(None, "filter_meta"): FilterMetaDataset},
# {(None, "map"): MapDataset},
# {(None, "hash"): HashFilterDataset},
# ...
# ]
Execution Phase (Every time you load a dataset)
# User configuration
config = {
"name": "imagenet",
"path": "/data",
"filter_meta": { # ← Decorator trigger!
"meta_attribute": "quality",
"values": [4, 5]
}
}
# User calls
dataset = CONTEXT.load_dataset(config)
Step-by-step execution:
┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config) │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: DATASETS.build(**config) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 2. Factory.build(name="imagenet", path="/data", │
│ filter_meta={...}) │
├──────────────────────────────────────────────────────────────────┤
│ → Looks up constructor: DATASETS[("classification", "imagenet")]│
│ → Returns: WRAPPED function (not raw constructor!) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 3. Factory.__getitem__(("classification", "imagenet")) │
│ [ainxt/factory/factory.py:165-237] │
├──────────────────────────────────────────────────────────────────┤
│ # Get the raw constructor │
│ constructor = ImageNetDataset.__init__ │
│ │
│ # Create wrapper function │
│ def wrapper(**kwargs): │
│ # Inspect constructor signature │
│ needed_args = ["path"] # ImageNetDataset needs "path" │
│ │
│ # Find extra keys (decorator triggers!) │
│ extra_keys = set(kwargs) - set(needed_args) │
│ # extra_keys = {"filter_meta"} │
│ │
│ # Check if extra keys match decorators │
│ if "filter_meta" in self.decorators: │
│ # FOUND DECORATOR! │
│ # Wrap constructor with decorator │
│ constructor = wrap_with_filter_meta(constructor) │
│ │
│ return constructor(**kwargs) │
│ │
│ return wrapper # ← Returns WRAPPED function │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 4. wrapper(name="imagenet", path="/data", filter_meta={...}) │
├──────────────────────────────────────────────────────────────────┤
│ # Pop decorator args │
│ filter_args = kwargs.pop("filter_meta") │
│ # kwargs = {"name": "imagenet", "path": "/data"} │
│ │
│ # Call original constructor │
│ base = ImageNetDataset(path="/data") │
│ │
│ # Apply decorator │
│ wrapped = FilterMetaDataset(base, **filter_args) │
│ │
│ return wrapped │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ RESULT: User gets FilterMetaDataset(ImageNetDataset(...)) │
└──────────────────────────────────────────────────────────────────┘
Key points:
1. Decorator names MUST match config keys: "filter_meta" in config → "filter_meta" decorator
2. Factory automatically detects extra kwargs not needed by constructor
3. Wrapping happens invisibly during Factory.__getitem__()
4. User never sees the wrapping - it's automatic!
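The detection logic in step 3 can be summarized in a few lines. This is a hedged sketch of the idea only; the real implementation lives in Factory.__getitem__ (ainxt/factory/factory.py), and the names DECORATORS and build here are placeholders for illustration.

```python
import inspect

DECORATORS = {}   # would hold e.g. {"filter_meta": FilterMetaDataset}

def build(constructor, **kwargs):
    # Which kwargs does the constructor actually accept?
    needed = set(inspect.signature(constructor).parameters)   # e.g. {"path"}
    extra = set(kwargs) - needed                               # e.g. {"filter_meta"}

    # Extra keys that match registered decorators are pulled out of kwargs...
    decorator_args = {k: kwargs.pop(k) for k in extra if k in DECORATORS}

    obj = constructor(**kwargs)                 # ...the base dataset is built first...
    for key, args in decorator_args.items():    # ...then wrapped, one decorator per key
        obj = DECORATORS[key](obj, **args)
    return obj
```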
Pattern 2: Decorators with Parsers (augmenters, map with func)
This is the most complex pattern - it uses BOTH a parser AND a decorator!
The augmenters case:
- AUGMENTERS Parser (in PARSERS dict) - Creates individual augmenter objects
- AugmentedDataset Decorator (in decorator registry) - Wraps dataset with those objects
Setup Phase
# File: myapp/parsers/augmenter.py
# This creates INDIVIDUAL augmenter objects from config
AUGMENTERS = Factory()
AUGMENTERS.register(None, "flip", FlipAugmenter)
AUGMENTERS.register(None, "rotate", RotateAugmenter)
# File: myapp/serving/singletons.py
# Register as PARSER (for creating augmenter objects)
PARSERS = {
"augmenters": AUGMENTERS # ← Parser for augmenter objects
}
# File: ainxt/data/datasets/decorators.py (line 87)
# The DECORATOR is built-in to aiNXT
@builder_name("augmenters") # ← Decorator with same name!
class AugmentedDataset(MapDataset):
def __init__(self, dataset, augmenters):
# Receives PARSED augmenter objects
# Wraps the dataset to apply them
...
# File: ainxt/serving/singletons.py
# Decorator auto-registered by Loader
decorator_loader = Loader(
template="ainxt.data.datasets.decorators",
# Finds AugmentedDataset class
)
DATASETS.register_decorator(decorator_loader)
Execution Phase
# User configuration
config = {
"name": "imagenet",
"path": "/data",
"augmenters": [ # ← Triggers BOTH parser AND decorator!
{"name": "flip"},
{"name": "rotate", "degrees": 15}
]
}
dataset = CONTEXT.load_dataset(config)
Step-by-step execution:
┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config) │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: Context._build(DATASETS, config) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 2. Context._build() - FIRST calls parse_config() │
│ [ainxt/scripts/context.py:96-109] │
├──────────────────────────────────────────────────────────────────┤
│ parsed = parse_config(config, PARSERS) │
│ │
│ # Check: Is "augmenters" in PARSERS? │
│ # YES! Use AUGMENTERS parser │
│ │
│ for aug_config in config["augmenters"]: │
│ # Build each augmenter object │
│ aug = AUGMENTERS.build(name="flip") │
│ # aug = FlipAugmenter() │
│ │
│ # Replace config value with parsed objects │
│ parsed = { │
│ "name": "imagenet", │
│ "path": "/data", │
│ "augmenters": [FlipAugmenter(), RotateAugmenter()] ← OBJECTS│
│ } │
│ │
│ return DATASETS.build(**parsed) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 3. Factory.build() - Gets wrapped constructor │
│ [ainxt/factory/factory.py:165-237] │
├──────────────────────────────────────────────────────────────────┤
│ kwargs = {"name": "imagenet", "path": "/data", │
│ "augmenters": [FlipAugmenter(), ...]} ← Objects now! │
│ │
│ # Find constructor │
│ constructor = ImageNetDataset.__init__ │
│ │
│ # Check constructor signature │
│ needed_args = ["path"] # ImageNetDataset(path) │
│ │
│ # Find extra keys │
│ extra_keys = {"augmenters"} │
│ │
│ # Check: Is "augmenters" a decorator? │
│ # YES! Found AugmentedDataset decorator │
│ # (registered by decorator_loader) │
│ │
│ # Wrap constructor with AugmentedDataset decorator │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 4. Wrapped constructor executes │
├──────────────────────────────────────────────────────────────────┤
│ # Pop augmenters from kwargs │
│ augmenter_objects = kwargs.pop("augmenters") │
│ # augmenter_objects = [FlipAugmenter(), RotateAugmenter()] │
│ │
│ # kwargs = {"path": "/data"} │
│ │
│ # Create base dataset │
│ base = ImageNetDataset(path="/data") │
│ │
│ # Apply AugmentedDataset decorator with parsed objects │
│ wrapped = AugmentedDataset( │
│ base, │
│ augmenters=[FlipAugmenter(), RotateAugmenter()] ← Objects! │
│ ) │
│ │
│ return wrapped │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ RESULT │
├──────────────────────────────────────────────────────────────────┤
│ dataset = AugmentedDataset( │
│ ImageNetDataset(path="/data"), │
│ augmenters=[FlipAugmenter(), RotateAugmenter()] │
│ ) │
│ │
│ It's WRAPPED (decorator) AND receives PARSED OBJECTS! │
└──────────────────────────────────────────────────────────────────┘
Key points:
1. TWO registrations with same name "augmenters":
   - AUGMENTERS in PARSERS → Creates augmenter objects
   - AugmentedDataset as decorator → Wraps dataset
2. TWO-PHASE process:
   - Phase 1 (parse_config): List of dicts → List of augmenter objects
   - Phase 2 (Factory.build): Base dataset → Wrapped with augmenter objects
3. The decorator receives parsed objects, not raw config!
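The two phases can be mimicked in a few lines of plain Python. This is a hedged sketch with illustrative names (FlipAugmenter, parse_augmenters, AugmentedWrapper), not the aiNXT implementation:

```python
class FlipAugmenter:
    def __call__(self, item):
        return item[::-1]

AUGMENTER_PARSERS = {"flip": FlipAugmenter}

def parse_augmenters(configs):
    """Phase 1 (parse_config): list of config dicts -> list of augmenter objects."""
    return [AUGMENTER_PARSERS[c.pop("name")](**c) for c in configs]

class AugmentedWrapper:
    """Phase 2 (Factory.build): wrap the base dataset so the parsed objects run on iteration."""
    def __init__(self, dataset, augmenters):
        self.dataset, self.augmenters = dataset, augmenters

    def __iter__(self):
        for item in self.dataset:
            for aug in self.augmenters:
                item = aug(item)
            yield item

augs = parse_augmenters([{"name": "flip"}])      # phase 1: config dicts -> objects
dataset = AugmentedWrapper(["ab", "cd"], augs)   # phase 2: wrap the base dataset
print(list(dataset))                             # ['ba', 'dc']
```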
What ARE Augmenters?
Augmenters are classes that transform data on the fly (flip, rotate, blur, etc.). They inherit from the Augmenter base class and have a __call__ method, making them callable like functions.
For complete details on Augmenters, see Augmenters.
Quick example:
# Augmenter is a class configured once, called many times
flip_augmenter = FlipAugmenter(direction="horizontal")
rotated_augmenter = RotateAugmenter(max_degrees=15)
# Parser creates these from config
# Decorator wraps dataset with them
# Iteration applies them automatically
Pattern 3: Pure Parsed Arguments (optimizer, loss_function)
This pattern has NO decorator - objects are only passed to constructor.
Setup Phase
# File: myapp/parsers/augmenter.py
AUGMENTERS = Factory()
AUGMENTERS.register(None, "flip", FlipAugmenter)
AUGMENTERS.register(None, "rotate", RotateAugmenter)
# File: myapp/serving/singletons.py
PARSERS = {
"augmenters": AUGMENTERS, # ← Registered as PARSER, not decorator!
"optimizer": OPTIMIZERS,
"loss_function": LOSSES
}
# File: myapp/context.py
CONTEXT = Context(
dataset_builder=DATASETS,
parsers=PARSERS # ← Parsers passed to Context!
)
Execution Phase
# User configuration
config = {
"name": "imagenet",
"path": "/data",
"augmenters": [ # ← Parser trigger (NOT decorator!)
{"name": "flip"},
{"name": "rotate", "degrees": 15}
]
}
# User calls
dataset = CONTEXT.load_dataset(config)
Step-by-step execution:
┌──────────────────────────────────────────────────────────────────┐
│ 1. CONTEXT.load_dataset(config) │
│ [ainxt/scripts/context.py:49-58] │
├──────────────────────────────────────────────────────────────────┤
│ → Calls: self._build(self.dataset_builder, config) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 2. Context._build(builder, config) │
│ [ainxt/scripts/context.py:96-109] │
├──────────────────────────────────────────────────────────────────┤
│ # FIRST: Parse configuration │
│ parsed_config = parse_config(config, self.parsers) │
│ │
│ # THEN: Build with parsed config │
│ return builder.build(**parsed_config) │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 3. parse_config(config, PARSERS) │
│ [ainxt/serving/config.py:12-79] │
├──────────────────────────────────────────────────────────────────┤
│ for key, value in config.items(): │
│ if key in parsers: # Check if "augmenters" in PARSERS │
│ # FOUND PARSER! │
│ # Transform the value │
│ if isinstance(value, list): │
│ # Parse each augmenter config │
│ augmenter_objects = [] │
│ for aug_config in value: │
│ aug = parsers[key].build(**aug_config) │
│ augmenter_objects.append(aug) │
│ config[key] = augmenter_objects │
│ │
│ return config │
│ │
│ # Result: config is MODIFIED │
│ # config["augmenters"] = [FlipAugmenter(), RotateAugmenter()] │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ 4. DATASETS.build(name="imagenet", path="/data", │
│ augmenters=[FlipAugmenter(), ...]) │
├──────────────────────────────────────────────────────────────────┤
│ # Factory checks: Is "augmenters" a decorator? │
│ # NO! It's not in factory.decorators │
│ │
│ # Check constructor signature │
│ # ImageNetDataset.__init__(path, augmenters=None) │
│ # "augmenters" IS a constructor parameter! │
│ │
│ # Pass augmenters TO constructor (no wrapping!) │
│ base = ImageNetDataset( │
│ path="/data", │
│ augmenters=[FlipAugmenter(), RotateAugmenter()] │
│ ) │
│ │
│ return base # ← Returns base dataset, NOT wrapped! │
└───────────────────────────────┬──────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ RESULT: User gets ImageNetDataset with augmenters stored inside │
│ The dataset applies augmentation internally during iteration │
└──────────────────────────────────────────────────────────────────┘
Key points:
1. Parsing happens BEFORE factory.build() in Context._build()
2. parse_config() transforms dicts into actual objects
3. Objects are passed as regular constructor arguments
4. NO wrapping occurs - dataset uses them internally
5. The constructor signature must accept these parameters!
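For comparison, here is a hedged sketch of the parse-then-construct path with no wrapping at all. parse_config and the Rotate class are simplified placeholders; the real parsing lives in ainxt/serving/config.py and calls the registered factory's build() method.

```python
def parse_config(config, parsers):
    """Replace config values with built objects wherever a parser is registered."""
    parsed = dict(config)
    for key, value in config.items():
        if key in parsers:
            build = parsers[key]
            parsed[key] = [build(**v) for v in value] if isinstance(value, list) else build(**value)
    return parsed

class Rotate:
    def __init__(self, degrees=0):
        self.degrees = degrees

parsed = parse_config(
    {"name": "imagenet", "augmenters": [{"degrees": 15}]},
    parsers={"augmenters": Rotate},
)
print(parsed["augmenters"])   # [<Rotate object>] -> handed straight to the dataset constructor
```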
Visual Comparison
TRUE DECORATORS (filter_meta, map, hash):
┌────────────┐
│ Config │
└──────┬─────┘
↓
┌──────────────┐ ┌─────────────┐
│ Factory. │ → │ Base │
│ build() │ │ Dataset │
└──────────────┘ └──────┬──────┘
↓ ↓
┌──────────────┐ ┌─────────────┐
│ Detects │ → │ Wraps with │
│ extra kwargs │ │ Decorator │
└──────────────┘ └──────┬──────┘
↓
┌─────────────┐
│ Wrapped │
│ Dataset │
└─────────────┘
PARSED ARGUMENTS (augmenters, optimizer):
┌────────────┐
│ Config │
└──────┬─────┘
↓
┌──────────────┐ ┌─────────────┐
│ parse_ │ → │ Creates │
│ config() │ │ Objects │
└──────────────┘ └──────┬──────┘
↓ ↓
┌──────────────┐ ┌─────────────┐
│ Factory. │ → │ Passes to │
│ build() │ │ Constructor │
└──────────────┘ └──────┬──────┘
↓
┌─────────────┐
│ Base │
│ Dataset │
│ (with │
│ objects) │
└─────────────┘
How to Tell the Difference?
Check 1: Is it in PARSERS?
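A quick way to answer this in an interactive session, assuming your application exposes a PARSERS dict as in the examples above:

```python
print("augmenters" in PARSERS)    # True  -> parsed argument: objects are created up front
print("filter_meta" in PARSERS)   # False -> decorator: wrapping happens in Factory.build()
```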
Check 2: When is it processed?
parse_config() # Parsed arguments processed here (before build)
Factory.build() # Decorators processed here (during build)
Check 3: What's the constructor signature?
def __init__(self, path, augmenters=None): # augmenters → Parsed argument
def __init__(self, path): # No filter_meta → Decorator (added after)
Summary Table
| Feature | True Decorators | Parsed Arguments |
|---|---|---|
| Config Key | filter_meta, map, hash | augmenters, optimizer |
| Registered | factory.register_decorator() | In PARSERS dict |
| When Processed | During Factory.build() | During parse_config() |
| How Applied | Wraps after construction | Passed to constructor |
| Constructor Knows | No | Yes (has parameter) |
| Final Result | Wrapped dataset | Base dataset with objects |
Why Use Decorators?
Without Decorators (Bad)
class MyDatasetWithAugmentation(Dataset):
def __init__(self, path, augment=False, filter_quality=False, add_metadata=False):
self.path = path
self.augment = augment
self.filter_quality = filter_quality
# Complex logic mixing concerns...
def __getitem__(self, idx):
item = self.load_item(idx)
if self.filter_quality and not self.is_quality(item):
# What to return here???
if self.augment:
item = self.augment_item(item)
if self.add_metadata:
item.meta.update(...)
return item
With Decorators (Good)
# Base dataset stays simple
class MyDataset(Dataset):
def __init__(self, path):
self.path = path
def __getitem__(self, idx):
return self.load_item(idx) # Just load, nothing else!
# All transformations in config
dataset:
name: my_dataset
path: /data
filter_meta: # Decorator 1
meta_attribute: quality_score
values: [4, 5]
operator: in
augmenters: # Decorator 2
- name: flip
meta: # Decorator 3
source: "train_v2"
Available Decorators
aiNXT provides several built-in decorators in ainxt/data/datasets/decorators.py:
| Decorator | Config Key | Purpose |
|---|---|---|
| MapDataset | map | Apply custom transformation to each instance |
| AugmentedDataset | augmenters, augmenter | Apply data augmentation |
| FilterDataset | filter | Remove instances based on custom criteria |
| FilterMetaDataset | filter_meta | Filter by metadata attributes |
| MetaAttributeDataset | meta | Add metadata to all instances |
| Hash-based filters | hash, exclude_hash | Deterministic splitting and exclusion |
1. MapDataset (map)
Purpose: Apply a custom transformation function to each instance.
Source: ainxt/data/datasets/decorators.py:17-83
When to Use
- Custom preprocessing (normalize text, resize images, extract features)
- Data type conversions
- Feature engineering
- Final transformations before model input
How It Works - Visual Flow
┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ dataset: │
│ name: text_dataset │
│ path: /data/texts.csv │
│ map: ← Decorator key │
│ func: tokenize_and_normalize │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Factory creates base dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ base_dataset = TextDataset(path="/data/texts.csv") │
│ │
│ Instances: │
│ [0]: Instance(data="Hello World") │
│ [1]: Instance(data="GOODBYE Python") │
│ [2]: Instance(data=" Machine Learning ") │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: MapDataset wraps base dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ wrapped = MapDataset( │
│ base_dataset, │
│ func=tokenize_and_normalize ← Transform function │
│ ) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, applies function to each instance │
├─────────────────────────────────────────────────────────────────┤
│ │
│ for instance in wrapped: │
│ # Gets from base, applies function, yields result │
│ │
│ [0]: "Hello World" → tokenize_and_normalize → ["hello", "world"]│
│ [1]: "GOODBYE Python" → tokenize_and_normalize → ["goodbye", "python"]│
│ [2]: " Machine Learning " → tokenize_and_normalize → ["machine", "learning"]│
└─────────────────────────────────────────────────────────────────┘
Configuration
name: my_dataset
data_path: /path/to/data.csv
map:
func: my_preprocessing_function # Must be registered
Example
# Register transformation function
def tokenize_and_normalize(instance):
"""Tokenize text and normalize."""
instance.data = instance.data.lower().strip().split()
return instance
# Register with factory
TRANSFORMATIONS = Factory()
TRANSFORMATIONS.register(None, "tokenize_and_normalize", tokenize_and_normalize)
# Use in config
```yaml
dataset:
  name: text_dataset
  path: /data/texts.csv
  map:
    func: tokenize_and_normalize
```
Key Features
- Lazy evaluation (applied during iteration)
- Chain multiple map operations
- Access full instance (data + metadata)
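A hedged sketch of what a map-style wrapper does; MapDatasetSketch below is illustrative, not the real MapDataset, but it shows the lazy, per-instance application and the unchanged length.

```python
class MapDatasetSketch:
    def __init__(self, dataset, func):
        self.dataset, self.func = dataset, func

    def __len__(self):
        return len(self.dataset)          # mapping never changes the size

    def __iter__(self):
        for instance in self.dataset:
            yield self.func(instance)     # applied only when iterated (lazy)

def tokenize_and_normalize(text):
    return text.lower().strip().split()

mapped = MapDatasetSketch(["Hello World", " Machine Learning "], tokenize_and_normalize)
print(list(mapped))   # [['hello', 'world'], ['machine', 'learning']]
```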
2. AugmentedDataset (augmenters)
Purpose: Apply data augmentation transformations.
Source: ainxt/data/datasets/decorators.py:85-162
When to Use
- Image augmentation (flipping, rotation, color adjustment)
- Text augmentation (paraphrasing, typos)
- Audio augmentation (noise, tempo changes)
- Training data diversity
How It Works - Visual Flow
┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ dataset: │
│ name: image_dataset │
│ path: /data/images │
│ augmenters: ← Decorator key (note: parsed by AUGMENTERS) │
│ - name: flip │
│ probability: 0.5 │
│ - name: rotate │
│ degrees: 15 │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Regular parser creates augmenter objects │
├─────────────────────────────────────────────────────────────────┤
│ │
│ augmenters = [ │
│ FlipAugmenter(probability=0.5), ← NEW object created │
│ RotateAugmenter(degrees=15) ← NEW object created │
│ ] │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: Factory creates base dataset WITH parsed augmenters │
├─────────────────────────────────────────────────────────────────┤
│ │
│ base_dataset = ImageDataset( │
│ path="/data/images", │
│ augmenters=[FlipAugmenter(...), RotateAugmenter(...)] │
│ ) │
│ │
│ Original Instances (100 images): │
│ [0]: Instance(data=<cat.jpg>) │
│ [1]: Instance(data=<dog.jpg>) │
│ ... │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, applies augmenters in sequence │
├─────────────────────────────────────────────────────────────────┤
│ │
│ for instance in base_dataset: │
│ # Applies each augmenter in order │
│ │
│ Instance flow: │
│ Original: <cat.jpg> │
│ ↓ │
│ FlipAugmenter (p=0.5): <cat.jpg> or <flipped_cat.jpg> │
│ ↓ │
│ RotateAugmenter: <rotated_15deg> │
│ ↓ │
│ Final: <augmented_cat.jpg> │
│ │
│ Note: Augmentation happens during iteration, not wrapping! │
└─────────────────────────────────────────────────────────────────┘
Important: Unlike other decorators, augmenters uses BOTH:
1. Regular Parser (AUGMENTERS) - Creates augmenter objects from config
2. Dataset Constructor - Receives the parsed augmenter objects as parameters
In the flow shown above, the augmentation is applied internally by the dataset, not by a wrapper decorator. If the dataset's constructor does not accept an augmenters parameter, the AugmentedDataset decorator wraps the dataset instead (see Patterns 2 and 3 earlier in this document).
Regular Decorator Pattern: Augmenter Pattern:
┌─────────────┐ ┌─────────────┐
│ Decorator │ │ Parser │
│ wraps │ │ creates │
│ Dataset │ │ Augmenter │
└─────────────┘ └─────────────┘
↓ ↓
┌─────────────┐ ┌─────────────┐
│ Base │ │ Dataset │
│ Dataset │ │ applies it │
│ inside │ │ internally │
└─────────────┘ └─────────────┘
Configuration
name: my_dataset
data_path: /path/to/data.csv
# Single augmenter
augmenters:
- name: flip_horizontal
# Multiple augmenters (applied in sequence)
augmenters:
- name: flip_horizontal
probability: 0.5
- name: rotate
angle: 15
- name: add_gaussian_noise
std: 0.05
Example
# Define augmenters
import random

from ainxt.data.augmentation import Augmenter
class RandomFlip(Augmenter):
def __init__(self, probability=0.5):
self.probability = probability
def __call__(self, instance):
if random.random() < self.probability:
instance.data = flip(instance.data)
return instance
# Register
AUGMENTERS = Factory()
AUGMENTERS.register("image", "flip_horizontal", RandomFlip)
# Use in config
```yaml
dataset:
  task: image
  name: imagenet
  augmenters:
    - task: image
      name: flip_horizontal
      probability: 0.7
```
Key Features
- Sequential application (chained together)
- Supports single augmenter or list
- Lazy evaluation (computed during training)
- Uses chain() utility to combine augmenters
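A hedged sketch of sequential chaining; the real framework uses a chain() utility, while the helper below (chain_augmenters) is only illustrative.

```python
import random

class RandomFlip:
    def __init__(self, probability=0.5):
        self.probability = probability
    def __call__(self, item):
        return item[::-1] if random.random() < self.probability else item

class AddSuffix:
    def __call__(self, item):
        return item + "!"

def chain_augmenters(augmenters):
    def apply_all(item):
        for aug in augmenters:          # applied in the order they were configured
            item = aug(item)
        return item
    return apply_all

augment = chain_augmenters([RandomFlip(0.5), AddSuffix()])
print(augment("abc"))   # 'abc!' or 'cba!'
```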
Best Practices
- Apply only to training data, not validation/test
- Order from least to most destructive
- Use probability for stochastic augmentation
- Balance diversity vs. training speed
3. FilterDataset (filter)
Purpose: Remove instances based on custom criteria.
Source: ainxt/data/datasets/decorators.py:164-250
When to Use
- Quality filtering (remove corrupted data)
- Content filtering (specific criteria)
- Size filtering (too large/small)
- Custom business logic
Configuration
name: my_dataset
data_path: /path/to/data.csv
filter:
func: quality_filter # Returns True to keep, False to remove
size: 1000 # Optional: expected size after filtering
Example
# Define filter function
def quality_filter(instance):
"""Keep only high-quality instances."""
return instance.meta.get('quality_score', 0) >= 7.0
# Register
FILTERS = Factory()
FILTERS.register(None, "quality_filter", quality_filter)
# Use in config
```yaml
dataset:
  name: document_dataset
  path: /data/docs
  filter:
    func: quality_filter
    size: 800  # Helps with length calculation
```
Key Features
- Lazy filtering (during iteration)
- Optional size parameter for performance
- Without size, requires full iteration to compute length
Performance Tip
Specify size if you know the expected filtered dataset size to avoid iterating through the entire dataset just to get its length.
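The trade-off is easy to see in a hedged sketch (FilterSketch is illustrative, not the real FilterDataset): with size, len() is O(1); without it, len() must scan the whole dataset.

```python
class FilterSketch:
    def __init__(self, dataset, func, size=None):
        self.dataset, self.func, self.size = dataset, func, size

    def __iter__(self):
        return (x for x in self.dataset if self.func(x))

    def __len__(self):
        if self.size is not None:
            return self.size              # O(1): trust the configured size
        return sum(1 for _ in self)       # O(n): must iterate everything

keep_even = lambda x: x % 2 == 0
print(len(FilterSketch(range(1_000_000), keep_even)))                 # slow path: full scan
print(len(FilterSketch(range(1_000_000), keep_even, size=500_000)))   # instant
```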
4. FilterMetaDataset (filter_meta)
Purpose: Filter instances by metadata attributes (simplified filtering).
Source: ainxt/data/datasets/decorators.py:252-352
When to Use
- Category filtering (include/exclude specific labels)
- Quality filtering (by scores or ratings)
- Source filtering (by data source)
- Any metadata-based filtering
How It Works - Visual Flow
┌─────────────────────────────────────────────────────────────────┐
│ CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ dataset: │
│ name: document_dataset │
│ path: /data/docs.csv │
│ filter_meta: ← Decorator key │
│ meta_attribute: quality_score │
│ values: [4, 5] │
│ operator: in │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Factory creates base dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ base_dataset = DocumentDataset(path="/data/docs.csv") │
│ │
│ All Instances (1000 total): │
│ [0]: Instance(data="Doc A", meta={quality_score: 5}) ✓ │
│ [1]: Instance(data="Doc B", meta={quality_score: 2}) ✗ │
│ [2]: Instance(data="Doc C", meta={quality_score: 4}) ✓ │
│ [3]: Instance(data="Doc D", meta={quality_score: 1}) ✗ │
│ [4]: Instance(data="Doc E", meta={quality_score: 5}) ✓ │
│ ... │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: FilterMetaDataset wraps base dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ wrapped = FilterMetaDataset( │
│ base_dataset, │
│ meta_attribute="quality_score", │
│ values=[4, 5], ← Only keep these scores │
│ operator="in" │
│ ) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: When iterating, filters based on metadata │
├─────────────────────────────────────────────────────────────────┤
│ │
│ for instance in wrapped: │
│ # Only yields instances where quality_score in [4, 5] │
│ │
│ Filtered Instances (800 remaining): │
│ [0]: Instance(data="Doc A", meta={quality_score: 5}) ✓ KEPT │
│ [1]: Instance(data="Doc B", meta={quality_score: 2}) SKIPPED│
│ [2]: Instance(data="Doc C", meta={quality_score: 4}) ✓ KEPT │
│ [3]: Instance(data="Doc D", meta={quality_score: 1}) SKIPPED│
│ [4]: Instance(data="Doc E", meta={quality_score: 5}) ✓ KEPT │
│ ... │
│ │
│ Result: len(wrapped) = 800 (was 1000) │
└─────────────────────────────────────────────────────────────────┘
Decision Logic:
For each instance:
┌─────────────────────────────────────┐
│ Get meta[quality_score] │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ Is value in [4, 5]? │
└──────┬──────────────────┬───────────┘
↓ YES ↓ NO
┌──────────┐ ┌──────────┐
│ YIELD │ │ SKIP │
│ instance │ │ instance │
└──────────┘ └──────────┘
Configuration
name: my_dataset
data_path: /path/to/data.csv
# Keep instances where category is in specified values
filter_meta:
meta_attribute: category
values: ["positive", "neutral"]
operator: in # or "not in"
# Exclude low-quality samples
filter_meta:
meta_attribute: quality_score
values: [1, 2]
operator: not in
Example
# Keep only validated documents
dataset:
name: document_dataset
path: /data/docs.csv
filter_meta:
meta_attribute: is_validated
values: [true]
operator: in
# Exclude specific problematic sources
dataset:
name: text_dataset
path: /data/texts.csv
filter_meta:
meta_attribute: source
values: ["web_scraping_2021", "manual_entry"]
operator: not in
Key Features
- Simple and efficient (O(1) metadata lookup)
- Supports in and not in operators
- Works with any metadata attribute
- Much faster than custom filter functions for metadata checks
Supported Operators
"in": Keep ifinstance.meta[attribute]is in values"not in": Keep ifinstance.meta[attribute]is NOT in values
5. MetaAttributeDataset (meta)
Purpose: Add metadata attributes to all instances.
Source: ainxt/data/datasets/decorators.py:354-465
When to Use
- Dataset versioning (track processing version)
- Source tracking (identify data source)
- Experiment tagging (add experiment IDs)
- Split identification (mark as train/val/test)
Configuration
name: my_dataset
data_path: /path/to/data.csv
meta:
source: "training_set"
version: "v2.1"
processed_date: "2024-01-15"
experiment_id: 42
# Any key-value pairs
Example
# Tag all instances with experiment info
dataset:
name: image_dataset
path: /data/images
meta:
dataset_version: "v3.0"
preprocessing: "standardized"
augmentation_applied: false
split: "train"
Key Features
- Adds metadata to every instance
- Supports in-place or copy mode
- Can use dictionary or kwargs
In-place vs Copy
# In-place (default) - modifies original instances
meta:
inplace: true # Default
source: "train"
# Copy - creates deep copies (safer but uses more memory)
meta:
inplace: false
source: "train"
Warning: In-place modification affects the original dataset permanently. Use inplace: false if you need to preserve originals.
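A hedged sketch of the difference, using plain dicts in place of Instance objects (add_meta is an illustrative helper, not the real MetaAttributeDataset):

```python
import copy

def add_meta(instances, new_meta, inplace=True):
    for inst in instances:
        target = inst if inplace else copy.deepcopy(inst)
        target["meta"].update(new_meta)
        yield target

originals = [{"data": "doc A", "meta": {}}]
tagged = list(add_meta(originals, {"split": "train"}, inplace=False))
print(originals[0]["meta"])   # {} -> untouched, because we copied
print(tagged[0]["meta"])      # {'split': 'train'}
```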
6. Hash-based Filtering
Hash Range (hash)
Purpose: Create deterministic dataset splits based on hash values.
Source: ainxt/data/datasets/decorators.py:621-714
Configuration
# Training set (hash ends in 0-7, ~80% of data)
dataset:
name: my_dataset
path: /data
hash:
hash_range: [0, 7]
# Test set (hash ends in 8-9, ~20% of data)
dataset:
name: my_dataset
path: /data
hash:
hash_range: [8, 9]
# Single value (hash ends in exactly 5, ~10%)
dataset:
name: my_dataset
path: /data
hash:
hash_range: 5
How It Works
- Each instance has a deterministic hash based on content
- Last N digits of hash are examined (N = digits in upper bound)
- Only instances with hash suffix in range are included
- Split is consistent across runs
Split Size Estimation
- [0, 7]: approximately 80% of data
- [8, 9]: approximately 20% of data
- [0, 4]: approximately 50% of data
- 5: approximately 10% of data (single digit)
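A hedged sketch of the mechanism: the real decorator inspects each instance's deterministic hash, while this toy version hashes the raw content with hashlib (so the split is stable across runs, which Python's salted built-in hash() would not guarantee). hash_suffix and in_split are illustrative names only.

```python
import hashlib

def hash_suffix(content, digits=1):
    digest = int(hashlib.md5(content.encode()).hexdigest(), 16)
    return digest % (10 ** digits)            # last N decimal digits of the hash

def in_split(content, hash_range):
    lo, hi = (hash_range, hash_range) if isinstance(hash_range, int) else hash_range
    return lo <= hash_suffix(content, digits=len(str(hi))) <= hi

docs = [f"doc-{i}" for i in range(1000)]
train = [d for d in docs if in_split(d, (0, 7))]
test = [d for d in docs if in_split(d, (8, 9))]
print(len(train), len(test))   # roughly 800 / 200, identical on every run
```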
Hash Exclusion (exclude_hash)
Purpose: Exclude specific instances by their hash values.
Source: ainxt/data/datasets/decorators.py:716-833
Configuration
# Exclude specific instances
dataset:
name: my_dataset
path: /data
exclude_hash:
sources: [12345, 67890, 11111]
# Exclude from file
dataset:
name: my_dataset
path: /data
exclude_hash:
sources: /path/to/excluded_hashes.txt
# Mixed sources
dataset:
name: my_dataset
path: /data
exclude_hash:
sources: [12345, "/path/to/more.txt", 67890]
File Format
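The exact file format is not spelled out here; a plain text file with one hash value per line is a reasonable assumption, but check the exclude-hash decorator source (ainxt/data/datasets/decorators.py:716-833) for the authoritative format.

```
# excluded_hashes.txt (assumed format: one hash value per line)
12345
67890
11111
```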
When to Use
- Exclude corrupted instances
- Remove known duplicates
- Filter out problematic data
- Blacklist specific instances
Combining Decorators
Decorators are applied in the order they appear in configuration:
dataset:
name: my_dataset
path: /data/texts.csv
# Step 1: Add metadata
meta:
dataset_version: "v2.1"
source: "web_scraping"
# Step 2: Filter by quality
filter_meta:
meta_attribute: quality_score
values: [4, 5]
operator: in
# Step 3: Create train split
hash:
hash_range: [0, 7] # 80% for training
# Step 4: Apply augmentation
augmenters:
- name: paraphrase
probability: 0.3
- name: add_typos
error_rate: 0.02
# Step 5: Final preprocessing
map:
func: tokenize_and_normalize
Processing flow:
Original Dataset
      ↓
MetaAttributeDataset (adds metadata)
      ↓
FilterMetaDataset (filters by quality)
      ↓
Hash FilterDataset (creates train split)
      ↓
AugmentedDataset (applies augmentations)
      ↓
MapDataset (final preprocessing)
      ↓
Final Dataset
Best Practices
1. Order Matters
Apply decorators in logical order:
# GOOD ORDER
dataset:
meta: ... # 1. Add metadata first
filter_meta: ... # 2. Filter to reduce size
hash: ... # 3. Create split
augmenters: ... # 4. Augment (expensive)
map: ... # 5. Final transform
# BAD ORDER (inefficient)
dataset:
augmenters: ... # Augments everything (wasteful!)
filter_meta: ... # Then filters out augmented data
2. Filter Early
Apply filters early to reduce dataset size before expensive operations:
# GOOD
dataset:
filter_meta: ... # Filter first (fast)
augmenters: ... # Then augment (only filtered data)
# BAD
dataset:
augmenters: ... # Augment everything (slow)
filter_meta: ... # Filter after (wasted work)
3. Use Hash for Reproducibility
For train/test splits, use hash-based splitting instead of random sampling:
# GOOD - reproducible
train_dataset:
hash:
hash_range: [0, 7]
test_dataset:
hash:
hash_range: [8, 9]
# AVOID - not reproducible
# random_split: 0.8 (different every run)
4. Augmentation Only for Training
Never augment validation or test sets:
# train.yaml
dataset:
augmenters: # OK for training
- name: flip
# val.yaml or test.yaml
dataset:
# NO augmenters! # Keep evaluation data unchanged
5. Document Your Pipeline
# GOOD - documented pipeline
dataset:
name: text_dataset
path: /data/texts.csv
# Remove low-quality samples (expect ~80% to pass)
filter_meta:
meta_attribute: quality_score
values: [4, 5]
operator: in
# Training split (80% of filtered data)
hash:
hash_range: [0, 7]
# Data augmentation for training diversity
augmenters:
- name: synonym_replacement
probability: 0.3
Common Patterns
Pattern 1: Training Pipeline
dataset:
name: my_dataset
path: /data
filter_meta:
meta_attribute: is_valid
values: [true]
hash:
hash_range: [0, 7]
augmenters:
- name: augment_data
Pattern 2: Validation Pipeline
dataset:
name: my_dataset
path: /data
filter_meta:
meta_attribute: is_valid
values: [true]
hash:
hash_range: [8, 8] # Different range (does not overlap the training split)
# NO augmentation!
Pattern 3: Test Pipeline
dataset:
name: my_dataset
path: /data
filter_meta:
meta_attribute: is_valid
values: [true]
hash:
hash_range: [9, 9] # Holdout set
# NO augmentation!
Troubleshooting
Issue 1: Decorator Not Applied
Problem: Configuration has decorator key but nothing happens
Solution: Check that decorator is registered
# Ensure decorators are registered with factory
from ainxt.serving import create_dataset_factory
DATASETS = create_dataset_factory("myapp", tasks)
# This automatically registers built-in decorators
Issue 2: Wrong Decorator Order
Problem: Results don't match expectations
Solution: Check decorator application order in config
# Decorators are applied top-to-bottom
dataset:
augmenters: ... # Applied FIRST
filter: ... # Applied SECOND (might filter augmented data!)
Issue 3: Performance Issues
Problem: Dataset loading is slow
Solutions:
- Filter early (before augmentation)
- Specify size parameter in FilterDataset
- Use filter_meta instead of custom filter when possible
- Check if augmentation is too expensive
Summary
- Dataset Decorators modify datasets without changing base classes
- Configuration-driven through YAML keys
- Composable - combine multiple decorators
- Order matters - decorators apply sequentially
- Built-in decorators cover common use cases (map, filter, augment, metadata)
- Hash-based operations ensure reproducibility
Dataset decorators are a powerful feature that keeps your dataset classes simple while enabling complex data pipelines through configuration alone.
See Also
- Datasets and Models - Base classes for datasets
- Parsers - How decorators are triggered
- Factory - How decorators integrate with factories