.. _training_data:

Training Data Generation
========================

This section describes how AMLRO incorporates **experimental or computational feedback** to build the training dataset used for active learning. This step bridges **reaction space definition** and **model-driven optimization**.

Overview
--------

After generating the initial reaction conditions (``training_combo.csv``), objective values must be provided before AMLRO can train a model.

The training data generation step:

- Collects reaction conditions and objective values
- Builds a cumulative training dataset
- Supports both **open-loop** (manual or interactive notebook) and **closed-loop** (automated) workflows

This functionality is accessed through the ``generate_training_data`` entry-point function.

Entry-Point Function
--------------------

.. code-block:: python

   generate_training_data(
       exp_dir=exp_dir,
       config=config,
       parameters=parameters,
       obj_values=objectives,
       filename='reactions_data.csv',
       termination=False
   )

This function is designed to be called **iteratively**, once objective values become available.

Generated Files
---------------

During training data generation, AMLRO creates or updates the following files inside ``exp_dir``. The file name is set by the ``filename`` argument and defaults to ``reactions_data.csv``:

- ``reactions_data.csv``
  Encoded dataset used for machine learning model training

- ``reactions_data_decoded.csv``
  Human-readable version of the training dataset

These files grow incrementally as new experimental or computational results are added.

Open-Loop Workflow (Manual Update)
----------------------------------

This option is recommended for **experimental workflows** or expensive simulations where objective values are not available programmatically.

Workflow
~~~~~~~~

1. Perform experiments or simulations for the conditions listed in ``training_combo.csv``
2. Manually create or update ``reactions_data.csv``
3. Proceed to active learning once sufficient data is available

File Format Requirements
~~~~~~~~~~~~~~~~~~~~~~~~

When creating or editing ``reactions_data.csv`` manually:

- Include **only**:

  - Feature columns defined in ``config["continuous"]["feature_names"]``
  - Feature columns defined in ``config["categorical"]["feature_names"]``
  - Objective columns defined in ``config["objectives"]``

- Do **not** include additional columns
- Categorical variables must be encoded as integer indices corresponding to their order in ``config["categorical"]["values"]``
- Also create a ``reactions_data_decoded.csv`` file containing the same reaction data with the actual categorical values

.. important::

   The column names and ordering must match the configuration exactly.
   AMLRO does not perform automatic column reconciliation.

Open-loop workflows allow AMLRO to be used with laboratory notebooks, external data acquisition systems, or third-party simulation pipelines.

Interactive Open-Loop Workflows
-------------------------------

AMLRO is designed to support **interactive optimization workflows** without requiring manual editing of CSV files.

This is achieved by separating:

- The **AMLRO backend** (reaction space, training data, optimization)
- The **interactive frontends**:

  - Interactive Google Colab notebook

A local web-based interface will be released in the near future.

Closed-Loop Workflow (Automated / Benchmarking)
-----------------------------------------------

For simulations, benchmarks, or algorithm development, AMLRO supports a **fully automated closed-loop setup**.

In this mode, objective values are computed programmatically and fed back into AMLRO in each iteration.

Minimal Closed-Loop Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Start with no reaction condition and no objective values
   parameters = []
   objectives = []

   for _ in range(training_size):
       # Pass the previous result back and retrieve the next reaction condition
       parameters = generate_training_data(
           exp_dir=exp_dir,
           config=config,
           parameters=parameters,
           obj_values=objectives
       )
       # Evaluate the user-defined objective function for that condition
       objectives = objective_function(parameters)

   # Final call: write the remaining data without requesting further feedback
   generate_training_data(
       exp_dir=exp_dir,
       config=config,
       parameters=parameters,
       obj_values=objectives,
       termination=True
   )

Explanation
~~~~~~~~~~~

- Each iteration retrieves the **next reaction condition**
- The user-defined ``objective_function`` evaluates the objectives
- Results are appended to ``reactions_data.csv``
- The final call with ``termination=True`` ensures that all remaining reaction conditions are written without requesting further feedback

This workflow enables:

- End-to-end autonomous optimization
- Synthetic benchmarks (e.g., Branin, analytical test functions)
- Integration with external automation systems

Relationship to the AMLRO Workflow
----------------------------------

Training data generation:

- Converts raw experimental results into a structured dataset
- Maintains full compatibility with manual workflows
- Acts as the **only required input** for active learning optimization

Once training data is available, users may proceed to:

- Batch selection and model training
- Prediction of optimal reaction conditions
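
As a concrete illustration of the file format requirements for the open-loop workflow, the sketch below writes ``reactions_data.csv`` and ``reactions_data_decoded.csv`` by hand using only the Python standard library. It is not part of AMLRO: the helper names (``column_order``, ``encode_row``, ``write_training_files``), the feature names, and the result values are hypothetical. It also assumes that ``config["objectives"]`` is a list of column names and that columns are ordered as continuous features, then categorical features, then objectives; check both assumptions against your own configuration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical configuration: only the key names come from the text above.
config = {
    "continuous": {"feature_names": ["temperature", "time"]},
    "categorical": {
        "feature_names": ["solvent"],
        "values": [["water", "ethanol", "dmso"]],
    },
    "objectives": ["yield"],  # assumption: a list of objective column names
}


def column_order(config):
    """Continuous features, then categorical features, then objectives (assumed order)."""
    return (
        list(config["continuous"]["feature_names"])
        + list(config["categorical"]["feature_names"])
        + list(config["objectives"])
    )


def encode_row(row, config):
    """Replace each categorical value with its index in config["categorical"]["values"]."""
    encoded = dict(row)
    names = config["categorical"]["feature_names"]
    for name, choices in zip(names, config["categorical"]["values"]):
        encoded[name] = choices.index(row[name])  # ValueError on an unknown value
    return encoded


def write_training_files(rows, config, exp_dir, filename="reactions_data.csv"):
    """Write the encoded file and its decoded companion into exp_dir."""
    columns = column_order(config)
    encoded_path = Path(exp_dir) / filename
    decoded_path = Path(exp_dir) / filename.replace(".csv", "_decoded.csv")
    encoded_rows = [encode_row(row, config) for row in rows]
    for path, data in ((encoded_path, encoded_rows), (decoded_path, rows)):
        with open(path, "w", newline="") as handle:
            # DictWriter raises ValueError if a row has a column not in `columns`.
            writer = csv.DictWriter(handle, fieldnames=columns)
            writer.writeheader()
            writer.writerows(data)
    return encoded_path, decoded_path


# Made-up results for two conditions, with categorical values in readable form.
results = [
    {"temperature": 60.0, "time": 10.0, "solvent": "ethanol", "yield": 0.42},
    {"temperature": 80.0, "time": 5.0, "solvent": "dmso", "yield": 0.57},
]

with tempfile.TemporaryDirectory() as tmp:
    encoded_path, decoded_path = write_training_files(results, config, tmp)
    encoded_text = encoded_path.read_text()
    decoded_text = decoded_path.read_text()
```

Keeping the decoded rows as the single source and deriving the encoded rows from them means the two files cannot drift apart, and the ``DictWriter`` check rejects any row that carries a column outside the configuration.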