Training Data Generation
This section describes how AMLRO incorporates experimental or computational feedback to build the training dataset used for active learning.
This step bridges reaction space definition and model-driven optimization.
Overview
After generating the initial reaction conditions (training_combo.csv),
objective values must be provided before AMLRO can train a model.
The training data generation step:
Collects reaction conditions and objective values
Builds a cumulative training dataset
Supports both open-loop (manual or interactive notebook) and closed-loop (automated) workflows
This functionality is accessed through the generate_training_data entry-point
function.
Entry-Point Function
generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    filename='reactions_data.csv',
    termination=False
)
This function is designed to be called iteratively, once objective values become available.
Generated Files
During training data generation, AMLRO creates or updates the following files
inside exp_dir. The file name is set by the filename argument and defaults to reactions_data.csv:
reactions_data.csv: Encoded dataset used for machine learning model training
reactions_data_decoded.csv: Human-readable version of the training dataset
These files grow incrementally as new experimental or computational results are added.
Open-Loop Workflow (Manual Update)
This option is recommended for experimental workflows or expensive simulations where objective values are not available programmatically.
Workflow
Perform experiments or simulations for the conditions listed in training_combo.csv
Manually create or update reactions_data.csv
Proceed to active learning once sufficient data is available
File Format Requirements
When creating or editing reactions_data.csv manually:
Include only:
- Feature columns defined in config["continuous"]["feature_names"]
- Feature columns defined in config["categorical"]["feature_names"]
- Objective columns defined in config["objectives"]
Do not include additional columns
Categorical variables must be encoded as integer indices corresponding to their order in config["categorical"]["values"]
Also create a reactions_data_decoded.csv file containing the reaction data with the actual categorical values
Important
The column names and ordering must match the configuration exactly. AMLRO does not perform automatic column reconciliation.
Open-loop workflows allow AMLRO to be used with laboratory notebooks, external data acquisition systems, or third-party simulation pipelines.
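The file format requirements above can be sketched with the Python standard library. This is a minimal illustration, not part of AMLRO. The example config, the helper names (expected_columns, encode_row, write_training_files), and the feature names are hypothetical. It also assumes that config["categorical"]["values"] holds one list of allowed values per categorical feature, that config["objectives"] is a list of column names, that columns are ordered as continuous features, categorical features, then objectives, and that the decoded file name is derived by appending _decoded to the file stem.

```python
import csv
from pathlib import Path

# Hypothetical config: key names follow this section, value layout is assumed.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {
        "feature_names": ["solvent"],
        "values": [["water", "ethanol", "toluene"]],  # one list per feature
    },
    "objectives": ["yield"],
}


def expected_columns(config):
    """Column order assumed here: continuous, categorical, objectives."""
    return (
        list(config["continuous"]["feature_names"])
        + list(config["categorical"]["feature_names"])
        + list(config["objectives"])
    )


def encode_row(row, config):
    """Replace each categorical value by its index in the configured list."""
    encoded = dict(row)
    cat = config["categorical"]
    for name, values in zip(cat["feature_names"], cat["values"]):
        encoded[name] = values.index(row[name])  # ValueError if unknown
    return encoded


def write_training_files(rows, config, exp_dir, filename="reactions_data.csv"):
    """Write the encoded file and its human-readable decoded counterpart."""
    columns = expected_columns(config)
    for row in rows:
        if set(row) != set(columns):
            raise ValueError(f"columns {sorted(row)} do not match {columns}")

    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    encoded_path = exp_dir / filename
    decoded_path = exp_dir / f"{Path(filename).stem}_decoded.csv"

    outputs = (
        (encoded_path, [encode_row(r, config) for r in rows]),
        (decoded_path, rows),
    )
    for path, data in outputs:
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns)
            writer.writeheader()
            writer.writerows(data)
    return encoded_path, decoded_path
```

Calling write_training_files([{"temperature": 80, "time": 2.5, "solvent": "ethanol", "yield": 71.2}], config, exp_dir) would write solvent as 1 in the encoded file and as ethanol in the decoded file. Rejecting rows whose keys differ from the configured columns is a user-side safeguard, since AMLRO does not reconcile columns automatically.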
Interactive Open-Loop Workflows
AMLRO is designed to support interactive optimization workflows without requiring manual editing of CSV files.
This is achieved by separating:
The AMLRO backend (reaction space, training data, optimization)
The interactive frontends, currently an interactive Google Colab notebook
*A local web-based interface will be released in the near future.
Closed-Loop Workflow (Automated / Benchmarking)
For simulations, benchmarks, or algorithm development, AMLRO supports a fully automated closed-loop setup.
In this mode, objective values are computed programmatically and fed back into AMLRO in each iteration.
Minimal Closed-Loop Example
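# training_size and objective_function must be defined by the user beforehand
parameters = []
objectives = []
for i in range(training_size):
    # Record the previous result (if any) and retrieve the next reaction condition
    parameters = generate_training_data(
        exp_dir=exp_dir,
        config=config,
        parameters=parameters,
        obj_values=objectives
    )
    # Evaluate the objectives for the retrieved condition
    objectives = objective_function(parameters)

# Final call: write the remaining results without requesting further feedback
generate_training_data(
    exp_dir=exp_dir,
    config=config,
    parameters=parameters,
    obj_values=objectives,
    termination=True
)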
Explanation
Each iteration retrieves the next reaction condition
The user-defined objective_function evaluates the objectives
Results are appended to reactions_data.csv
The final call with termination=True ensures that all remaining reaction conditions are written without requesting further feedback
This workflow enables:
End-to-end autonomous optimization
Synthetic benchmarks (e.g., Branin, analytical test functions)
Integration with external automation systems
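For a synthetic benchmark, the user-defined objective_function could wrap an analytical test function such as Branin. The sketch below is illustrative only. It assumes that parameters arrives as a flat sequence of two numeric values and that a list with a single objective value is an acceptable return type. Neither is specified in this section, so adapt both to the actual data passed by generate_training_data.

```python
import math


def branin(x1, x2):
    """Standard Branin test function (global minimum is about 0.397887)."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s


def objective_function(parameters):
    # Assumption: parameters is a flat sequence [x1, x2] of numeric values.
    x1, x2 = (float(v) for v in parameters)
    # Assumption: objectives are returned as a list with one entry per objective.
    return [branin(x1, x2)]
```

Branin is usually evaluated on x1 in [-5, 10] and x2 in [0, 15], so the continuous features in the configuration would need matching bounds. Because its minimum is known, it gives a convenient reference point when comparing optimization runs.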
Relationship to the AMLRO Workflow
Training data generation:
Converts raw experimental results into a structured dataset
Maintains full compatibility with manual workflows
Acts as the only required input for active learning optimization
Once training data is available, users may proceed to:
Batch selection and model training
Prediction of optimal reaction conditions