Training Data Generation

This section describes how AMLRO incorporates experimental or computational feedback to build the training dataset used for active learning.

This step bridges reaction space definition and model-driven optimization.

Overview

After generating the initial reaction conditions (training_combo.csv), objective values must be provided before AMLRO can train a model.

The training data generation step:

Collects reaction conditions and objective values
Builds a cumulative training dataset
Supports both open-loop (manual or interactive notebook) and closed-loop (automated) workflows

This functionality is accessed through the generate_training_data entry-point function.

Entry-Point Function

generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    filename='reactions_data.csv',
    termination=False
)

This function is designed to be called iteratively, each time new objective values become available.

Generated Files

During training data generation, AMLRO creates or updates the following files inside exp_dir. The file name is user-defined through the filename argument and defaults to reactions_data.csv:

reactions_data.csv: Encoded dataset used for machine learning model training
reactions_data_decoded.csv: Human-readable version of the training dataset

These files grow incrementally as new experimental or computational results are added.

Open-Loop Workflow (Manual Update)

This option is recommended for experimental workflows or expensive simulations where objective values are not available programmatically.

Workflow

Perform experiments or simulations for the conditions listed in training_combo.csv
Manually create or update reactions_data.csv
Proceed to active learning once sufficient data is available

File Format Requirements

When creating or editing reactions_data.csv manually:

Include only:
- Feature columns defined in config["continuous"]["feature_names"]
- Feature columns defined in config["categorical"]["feature_names"]
- Objective columns defined in config["objectives"]
Do not include additional columns
Categorical variables must be encoded as integer indices corresponding to their order in config["categorical"]["values"]
Also create a reactions_data_decoded.csv file containing the same reaction data with the actual (unencoded) categorical values
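The requirements above can be sketched with the Python standard library. This is a minimal illustration, not AMLRO's own code. The configuration contents, the sample results, and the helper name write_training_files are hypothetical. The column order (continuous features, then categorical features, then objectives) and the layout of config["categorical"]["values"] as one list of allowed values per categorical feature are also assumptions.

```python
import csv
import os
import tempfile

# Hypothetical configuration; only the key names come from the text above.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {
        "feature_names": ["solvent"],
        "values": [["water", "ethanol", "toluene"]],  # assumed: one list per feature
    },
    "objectives": ["yield"],
}

# Hypothetical results, recorded with human-readable categorical values.
results = [
    {"temperature": 60.0, "time": 2.0, "solvent": "ethanol", "yield": 71.5},
    {"temperature": 80.0, "time": 1.0, "solvent": "toluene", "yield": 64.2},
]


def write_training_files(exp_dir, config, rows, filename="reactions_data.csv"):
    """Write the encoded CSV and its decoded counterpart; return both paths."""
    cont = config["continuous"]["feature_names"]
    cat = config["categorical"]["feature_names"]
    columns = cont + cat + config["objectives"]  # assumed column order

    # Map each categorical value to its integer index in the configuration.
    index_of = {
        name: {value: i for i, value in enumerate(values)}
        for name, values in zip(cat, config["categorical"]["values"])
    }

    encoded_path = os.path.join(exp_dir, filename)
    stem, ext = os.path.splitext(filename)
    decoded_path = os.path.join(exp_dir, f"{stem}_decoded{ext}")

    with open(encoded_path, "w", newline="") as enc, \
         open(decoded_path, "w", newline="") as dec:
        enc_writer = csv.DictWriter(enc, fieldnames=columns)
        dec_writer = csv.DictWriter(dec, fieldnames=columns)
        enc_writer.writeheader()
        dec_writer.writeheader()
        for row in rows:
            # Decoded file keeps the actual categorical values.
            dec_writer.writerow({c: row[c] for c in columns})
            # Encoded file replaces them with integer indices.
            encoded = {c: row[c] for c in columns}
            for name in cat:
                encoded[name] = index_of[name][row[name]]
            enc_writer.writerow(encoded)
    return encoded_path, decoded_path


exp_dir = tempfile.mkdtemp()
encoded_path, decoded_path = write_training_files(exp_dir, config, results)
```

In this sketch, "ethanol" is written as 1 and "toluene" as 2 in the encoded file, because those are their positions in the configured value list.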

Important

The column names and ordering must match the configuration exactly. AMLRO does not perform automatic column reconciliation.
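Because column mismatches are not reconciled automatically, it may help to check a manually edited file before passing it on. The following is a minimal sketch of such a check. The helper names expected_columns and check_header are hypothetical and not part of AMLRO, and the assumed column order is continuous features, then categorical features, then objectives.

```python
import csv
import io


def expected_columns(config):
    """Column order assumed here: continuous, categorical, then objectives."""
    return (
        list(config["continuous"]["feature_names"])
        + list(config["categorical"]["feature_names"])
        + list(config["objectives"])
    )


def check_header(csv_file, config):
    """Raise ValueError if the CSV header differs from the configuration."""
    header = next(csv.reader(csv_file), [])
    expected = expected_columns(config)
    if header != expected:
        raise ValueError(f"expected columns {expected}, found {header}")


# Hypothetical configuration and file contents for illustration.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {"feature_names": ["solvent"], "values": [["water", "ethanol"]]},
    "objectives": ["yield"],
}
good = io.StringIO("temperature,time,solvent,yield\n60.0,2.0,1,71.5\n")
bad = io.StringIO("time,temperature,solvent,yield\n2.0,60.0,1,71.5\n")

check_header(good, config)  # matching header: no exception
try:
    check_header(bad, config)  # swapped columns: raises ValueError
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

For a file on disk, the same function can be called on an open file handle, for example one opened with open(path, newline="").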

Open-loop workflows allow AMLRO to be used with laboratory notebooks, external data acquisition systems, or third-party simulation pipelines.

Interactive Open-Loop Workflows

AMLRO is designed to support interactive optimization workflows without requiring manual editing of CSV files.

This is achieved by separating:

The AMLRO backend (reaction space, training data, optimization)
The interactive frontends (currently an interactive Google Colab notebook)

*A local web-based interface will be released in the near future.

Closed-Loop Workflow (Automated / Benchmarking)

For simulations, benchmarks, or algorithm development, AMLRO supports a fully automated closed-loop setup.

In this mode, objective values are computed programmatically and fed back into AMLRO in each iteration.

Minimal Closed-Loop Example

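# exp_dir, config, training_size, and objective_function are defined by the user.
parameters = []  # no reaction condition has been retrieved yet
objectives = []  # no objective values are available yet

for i in range(training_size):
    # Pass back the latest results and retrieve the next reaction condition.
    parameters = generate_training_data(
        exp_dir=exp_dir,
        config=config,
        parameters=parameters,
        obj_values=objectives
    )

    # Evaluate the objectives for the retrieved condition.
    objectives = objective_function(parameters)

# Final call: write the remaining results without requesting further feedback.
generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    termination=True
)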

Explanation

Each iteration retrieves the next reaction condition
The user-defined objective_function evaluates the objectives
Results are appended to reactions_data.csv
The final call with termination=True ensures that all remaining reaction conditions are written without requesting further feedback

This workflow enables:

End-to-end autonomous optimization
Synthetic benchmarks (e.g., Branin, analytical test functions)
Integration with external automation systems
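
For a synthetic benchmark, the user-defined objective_function could wrap an analytical test function such as Branin. The sketch below assumes that parameters is a flat sequence of two continuous values (x1, x2) and that the objective is returned as a one-element list. Neither the input layout nor the return type is confirmed by this section, so both may need adapting to the actual AMLRO interface.

```python
import math


def branin(x1, x2):
    """Branin test function with its standard constants."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s


def objective_function(parameters):
    """Assumed interface: parameters = [x1, x2]; returns [objective value]."""
    x1, x2 = parameters
    return [branin(x1, x2)]


# (pi, 2.275) is one of the known global minimizers of Branin.
value = objective_function([math.pi, 2.275])[0]
```

Branin is usually evaluated on x1 in [-5, 10] and x2 in [0, 15], where its global minimum value is approximately 0.3979. This makes it convenient for checking that a closed-loop run converges toward a known optimum.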

Relationship to the AMLRO Workflow

Training data generation:

Converts raw experimental results into a structured dataset
Maintains full compatibility with manual workflows
Produces the only input required for active learning optimization

Once training data is available, users may proceed to:

Batch selection and model training
Prediction of optimal reaction conditions