Training Data Generation

This section describes how AMLRO incorporates experimental or computational feedback to build the training dataset used for active learning.

This step bridges reaction space definition and model-driven optimization.

Overview

After generating the initial reaction conditions (training_combo.csv), objective values must be provided before AMLRO can train a model.

The training data generation step:

  • Collects reaction conditions and objective values

  • Builds a cumulative training dataset

  • Supports both open-loop (manual or interactive notebook) and closed-loop (automated) workflows

This functionality is accessed through the generate_training_data entry-point function.

Entry-Point Function

generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    filename='reactions_data.csv',
    termination=False
)

This function is designed to be called iteratively, each time new objective values become available.

Generated Files

During training data generation, AMLRO creates or updates the following files inside exp_dir. The file name is set by the filename argument and defaults to reactions_data.csv:

  • reactions_data.csv: Encoded dataset used for machine learning model training

  • reactions_data_decoded.csv: Human-readable version of the training dataset

These files grow incrementally as new experimental or computational results are added.

Open-Loop Workflow (Manual Update)

This option is recommended for experimental workflows or expensive simulations where objective values are not available programmatically.

Workflow

  1. Perform experiments or simulations for the conditions listed in training_combo.csv

  2. Manually create or update reactions_data.csv

  3. Proceed to active learning once sufficient data is available

File Format Requirements

When creating or editing reactions_data.csv manually:

  • Include only:

      - Feature columns defined in config["continuous"]["feature_names"]

      - Feature columns defined in config["categorical"]["feature_names"]

      - Objective columns defined in config["objectives"]

  • Do not include additional columns

  • Categorical variables must be encoded as integer indices corresponding to their order in config["categorical"]["values"]

  • Also create a reactions_data_decoded.csv file that contains the same reaction data with the actual (unencoded) categorical values
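The encoding rule above can be sketched with the standard library. This is a minimal illustration, not AMLRO's own code. Only the config key names come from this section; the feature names, the solvent values, the objective name, the assumption that config["categorical"]["values"] holds one list of allowed values per categorical feature, and the continuous/categorical/objective column order are hypothetical.

```python
import csv
import io

# Hypothetical config: only the key names are taken from the text above.
# Assumption: "values" holds one list of allowed values per categorical feature.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {
        "feature_names": ["solvent"],
        "values": [["water", "ethanol", "toluene"]],
    },
    "objectives": ["yield"],
}

# Assumed column order: continuous features, categorical features, objectives.
columns = (
    config["continuous"]["feature_names"]
    + config["categorical"]["feature_names"]
    + config["objectives"]
)


def encode_row(decoded_row, config):
    """Copy one decoded row, replacing categorical values with their indices."""
    encoded = dict(decoded_row)
    cat = config["categorical"]
    for name, values in zip(cat["feature_names"], cat["values"]):
        # list.index raises ValueError if the value is not in the config
        encoded[name] = values.index(decoded_row[name])
    return encoded


def to_csv_text(rows, columns):
    """Serialize rows to CSV text with a header in the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


decoded_rows = [
    {"temperature": 60.0, "time": 30.0, "solvent": "ethanol", "yield": 0.42},
    {"temperature": 80.0, "time": 45.0, "solvent": "water", "yield": 0.57},
]
encoded_rows = [encode_row(row, config) for row in decoded_rows]

encoded_csv = to_csv_text(encoded_rows, columns)  # content for reactions_data.csv
decoded_csv = to_csv_text(decoded_rows, columns)  # content for reactions_data_decoded.csv
```

Writing encoded_csv and decoded_csv to the two files inside exp_dir keeps the encoded and human-readable datasets row-aligned.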

Important

The column names and ordering must match the configuration exactly. AMLRO does not perform automatic column reconciliation.
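Because no reconciliation is performed, it can help to check a manually edited file before handing it to AMLRO. The sketch below is one possible pre-check, not part of AMLRO. The helper names expected_columns and check_header, the example config contents, and the assumed column order (continuous features, then categorical features, then objectives) are hypothetical.

```python
import csv
import io

# Hypothetical config: only the key names are taken from this section.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {
        "feature_names": ["solvent"],
        "values": [["water", "ethanol", "toluene"]],
    },
    "objectives": ["yield"],
}


def expected_columns(config):
    """Assumed order: continuous features, categorical features, objectives."""
    return (
        list(config["continuous"]["feature_names"])
        + list(config["categorical"]["feature_names"])
        + list(config["objectives"])
    )


def check_header(csv_text, config):
    """Raise ValueError if the CSV header differs from the configured columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = expected_columns(config)
    if header != expected:
        raise ValueError(f"expected columns {expected}, found {header}")
    return header


good = "temperature,time,solvent,yield\n60.0,30.0,1,0.42\n"
swapped = "time,temperature,solvent,yield\n30.0,60.0,1,0.42\n"

check_header(good, config)  # passes silently
try:
    check_header(swapped, config)
    swapped_rejected = False
except ValueError:
    swapped_rejected = True  # reordered columns are reported
```

The comparison is exact on purpose, so extra columns, missing columns, and reordered columns are all rejected.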

Open-loop workflows allow AMLRO to be used with laboratory notebooks, external data acquisition systems, or third-party simulation pipelines.

Interactive Open-Loop Workflows

AMLRO is designed to support interactive optimization workflows without requiring manual editing of CSV files.

This is achieved by separating:

  • The AMLRO backend (reaction space, training data, optimization)

  • Interactive frontends, currently an interactive Google Colab notebook

*A local web-based interface will be released in the near future.

Closed-Loop Workflow (Automated / Benchmarking)

For simulations, benchmarks, or algorithm development, AMLRO supports a fully automated closed-loop setup.

In this mode, objective values are computed programmatically and fed back into AMLRO in each iteration.

Minimal Closed-Loop Example

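# exp_dir, config, training_size, and objective_function are defined by the user.
parameters = []   # no reaction condition has been retrieved yet
objectives = []   # no objective values are available yet

for i in range(training_size):

    # Pass in the latest results (empty on the first call)
    # and retrieve the next reaction condition
    parameters = generate_training_data(
        exp_dir=exp_dir,
        config=config,
        parameters=parameters,
        obj_values=objectives
    )

    # Evaluate the objectives for the condition just retrieved
    objectives = objective_function(parameters)

# Final call: write the remaining results without requesting further feedback
generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    termination=True
)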

Explanation

  • Each iteration retrieves the next reaction condition

  • The user-defined objective_function evaluates the objectives

  • Results are appended to reactions_data.csv

  • The final call with termination=True ensures that all remaining reaction conditions are written without requesting further feedback

This workflow enables:

  • End-to-end autonomous optimization

  • Synthetic benchmarks (e.g., Branin, analytical test functions)

  • Integration with external automation systems
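For the synthetic benchmark case, a user-defined objective_function could look like the sketch below, which evaluates the standard Branin test function. The input and output shapes are assumptions rather than documented AMLRO behavior: parameters is taken to be a sequence [x1, x2] of two continuous features, and the result is returned as a one-element list of objective values.

```python
import math


def objective_function(parameters):
    """Evaluate the Branin function for one condition [x1, x2].

    Assumption: AMLRO passes one reaction condition as a sequence of two
    continuous feature values and accepts a list of objective values back.
    """
    x1, x2 = parameters

    # Standard Branin constants
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)

    value = a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s
    return [value]


# One of the three known global minimizers of Branin, where f is about 0.397887
best = objective_function([math.pi, 2.275])
```

Branin is usually evaluated on x1 in [-5, 10] and x2 in [0, 15], so the continuous feature ranges in the configuration would need to match that domain for the benchmark to be meaningful.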

Relationship to the AMLRO Workflow

Training data generation:

  • Converts raw experimental results into a structured dataset

  • Maintains full compatibility with manual workflows

  • Acts as the only required input for active learning optimization

Once training data is available, users may proceed to:

  • Batch selection and model training

  • Prediction of optimal reaction conditions