Sunthetics Inc

Accelerating Innovation in the Chemical Industry

Authors

Ethan Tenison

Daniela Blanco

César A. Urbina-Blanco

Published

1/12/24

This report was completed to analyze the cross-sectional time series dataset AMC1 Modified.xlsx.

1 Data

EtACAC (equiv) Lewis Acid (equiv) Solvent (vol) Temp. (℃) Time (h) BQO (% IPC) AMC1 (% IPC) Dimer (% IPC) HQN Imp Key Imp

Descriptive Statistics

| | EtACAC (equiv) | Lewis Acid (equiv) | Solvent (vol) | Temp. (℃) | Time (h) | BQO (% IPC) | AMC1 (% IPC) | Dimer (% IPC) | HQN Imp | Key Imp |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 | 17.000000 |
| mean | 1.011765 | 0.700000 | 20.411765 | 100.588235 | 6.705882 | 0.661765 | 46.412353 | 16.786471 | 4.835294 | 8.758235 |
| std | 0.239485 | 0.203101 | 14.495943 | 16.093705 | 3.653161 | 1.047707 | 10.824634 | 12.469459 | 2.620609 | 6.162751 |
| min | 0.800000 | 0.400000 | 8.000000 | 85.000000 | 3.000000 | 0.000000 | 20.420000 | 2.320000 | 2.240000 | 0.000000 |
| 25% | 0.800000 | 0.500000 | 10.000000 | 85.000000 | 4.000000 | 0.100000 | 39.930000 | 4.860000 | 3.020000 | 4.960000 |
| 50% | 1.000000 | 0.800000 | 10.000000 | 95.000000 | 6.000000 | 0.540000 | 44.300000 | 16.750000 | 4.360000 | 7.630000 |
| 75% | 1.200000 | 0.800000 | 30.000000 | 115.000000 | 12.000000 | 0.730000 | 55.020000 | 24.700000 | 4.990000 | 14.380000 |
| max | 1.600000 | 1.100000 | 50.000000 | 120.000000 | 12.000000 | 4.530000 | 60.680000 | 38.950000 | 10.920000 | 19.520000 |

Input Variable Configuration

For reference, the following Python dictionary describes the input and output conditions indicated by the client for the optimization.

```
{
    "EtACAC (equiv)": {
        "type": "float",
        "lim": [
            0.5,
            3.0
        ],
        "sigfigs": 1
    },
    "Lewis Acid (equiv)": {
        "type": "float",
        "lim": [
            0.4,
            2.0
        ],
        "sigfigs": 1
    },
    "Solvent (vol)": {
        "type": "int",
        "lim": [
            5,
            50
        ]
    },
    "Temp. (\u2103)": {
        "type": "int",
        "lim": [
            25,
            120
        ]
    },
    "Time (h)": {
        "type": "time",
        "lim": [
            3,
            4,
            6,
            12
        ],
        "int_mapping": {
            "3": 0,
            "4": 1,
            "6": 2,
            "12": 3
        },
        "doe_int_lim": [
            0,
            3
        ]
    }
}
```
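The `int_mapping` and `doe_int_lim` entries for "Time (h)" suggest that the four allowed reaction times are handled as integer levels 0 to 3 during the design of experiments. The following is a minimal sketch of that encoding, assuming this interpretation is correct. The helper names `encode_time` and `decode_time` are hypothetical and are not taken from the report's codebase; only the dictionary values come from the configuration above.

```python
# Values copied from the "Time (h)" entry of the input configuration.
# The helper functions are hypothetical illustrations.
TIME_CONFIG = {
    "type": "time",
    "lim": [3, 4, 6, 12],
    "int_mapping": {"3": 0, "4": 1, "6": 2, "12": 3},
    "doe_int_lim": [0, 3],
}


def encode_time(hours, config=TIME_CONFIG):
    """Map an allowed reaction time in hours to its integer level."""
    if int(hours) != hours or str(int(hours)) not in config["int_mapping"]:
        raise ValueError(f"{hours} h is not one of the allowed times {config['lim']}")
    return config["int_mapping"][str(int(hours))]


def decode_time(level, config=TIME_CONFIG):
    """Map an integer level back to the reaction time in hours."""
    low, high = config["doe_int_lim"]
    if not low <= level <= high:
        raise ValueError(f"level {level} is outside [{low}, {high}]")
    inverse = {value: int(key) for key, value in config["int_mapping"].items()}
    return inverse[level]


levels = [encode_time(t) for t in TIME_CONFIG["lim"]]
hours = [decode_time(level) for level in levels]
```

Encoding the times as evenly spaced levels lets an optimizer treat the variable as a bounded integer while still proposing only the four time points that were actually sampled.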

Output Variable Configuration

```
{
    "BQO (% IPC)": {
        "optimization_target": "min",
        "optimization_important": true,
        "y_bounds": [
            0,
            null
        ],
        "y_constraint": null,
        "weight": 0.5
    },
    "AMC1 (% IPC)": {
        "optimization_target": "max",
        "optimization_important": true,
        "y_bounds": [
            0,
            null
        ],
        "y_constraint": null,
        "weight": 0.9
    },
    "Dimer (% IPC)": {
        "optimization_target": "min",
        "optimization_important": true,
        "y_bounds": [
            0,
            null
        ],
        "y_constraint": null,
        "weight": 0.5
    },
    "HQN Imp": {
        "optimization_target": "min",
        "optimization_important": true,
        "y_bounds": [
            0,
            null
        ],
        "y_constraint": null,
        "weight": 0.5
    },
    "Key Imp": {
        "optimization_target": "min",
        "optimization_important": true,
        "y_bounds": [
            0,
            null
        ],
        "y_constraint": null,
        "weight": 0.5
    }
}
```
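The report does not state how the `weight` and `optimization_target` fields are combined during optimization. One common way to use such settings is a weighted sum of scaled outputs, sketched below purely as an assumption. The weights and targets are taken from the configuration above, the scaling ranges are the observed minima and maxima from the descriptive statistics, and the two example rows are the predicted outputs of suggested points 0 and 4 from Section 4. The function names are hypothetical.

```python
# Weights and targets come from the output configuration.
# Ranges are the observed min/max values from the descriptive statistics.
# The scoring rule is an ASSUMPTION, not the report's documented method.
OUTPUTS = {
    "BQO (% IPC)": {"target": "min", "weight": 0.5, "range": (0.00, 4.53)},
    "AMC1 (% IPC)": {"target": "max", "weight": 0.9, "range": (20.42, 60.68)},
    "Dimer (% IPC)": {"target": "min", "weight": 0.5, "range": (2.32, 38.95)},
    "HQN Imp": {"target": "min", "weight": 0.5, "range": (2.24, 10.92)},
    "Key Imp": {"target": "min", "weight": 0.5, "range": (0.00, 19.52)},
}


def desirability(value, target, value_range):
    """Scale a value to [0, 1], where 1 is best for the given target."""
    low, high = value_range
    scaled = min(1.0, max(0.0, (value - low) / (high - low)))
    return scaled if target == "max" else 1.0 - scaled


def weighted_score(predictions):
    """Sum the weighted desirabilities of all outputs for one point."""
    return sum(
        cfg["weight"] * desirability(predictions[name], cfg["target"], cfg["range"])
        for name, cfg in OUTPUTS.items()
    )


point_0 = {"BQO (% IPC)": 1.353281, "AMC1 (% IPC)": 55.156666,
           "Dimer (% IPC)": 7.940756, "HQN Imp": 5.656612, "Key Imp": 0.0}
point_4 = {"BQO (% IPC)": 0.043860, "AMC1 (% IPC)": 43.066807,
           "Dimer (% IPC)": 25.331287, "HQN Imp": 4.780689, "Key Imp": 3.957733}

score_0 = weighted_score(point_0)
score_4 = weighted_score(point_4)
```

Under this scoring rule the larger weight on AMC1 (% IPC) means that a gain in AMC1 counts for almost twice as much as an equal scaled reduction in any single impurity.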

Variable Distribution

Figure 1: Each histogram indicates the frequency of variable values throughout the full dataset

Variable Association

WARNING: EtACAC (equiv) and Solvent (vol) have a correlation of 0.91
Figure 2: Values close to either -1 or 1 indicate correlation. A value of 0 indicates no correlation.
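A warning like the one above can be produced by computing the Pearson correlation for each pair of inputs and flagging pairs above a cut-off. The sketch below uses only the standard library. The five rows are illustrative values, not the contents of AMC1 Modified.xlsx, and the 0.9 threshold is a hypothetical choice, since the report does not state which cut-off it uses.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only; these are NOT rows of the real dataset.
columns = {
    "EtACAC (equiv)": [0.8, 0.8, 1.0, 1.2, 1.6],
    "Solvent (vol)": [8, 10, 10, 30, 50],
    "Temp. (℃)": [120, 85, 95, 85, 115],
}

THRESHOLD = 0.9  # hypothetical cut-off for raising a warning

flagged = []
for name_a, name_b in combinations(columns, 2):
    r = pearson(columns[name_a], columns[name_b])
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b, r))

for name_a, name_b, r in flagged:
    print(f"WARNING: {name_a} and {name_b} have a correlation of {r:.2f}")
```

When two inputs are this strongly correlated, their individual effects are difficult to separate, so importance values for either variable should be read with caution.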

2 Variable Effects

Variable Importance

Figure 3: Variable importance is in descending order, with the most important variable at the top.

Importance Across Predictions

This beeswarm plot visualizes the feature importance from the machine learning model. Each point on the plot corresponds to a prediction made by the model. The position on the x-axis shows the impact of the feature on the model’s prediction, while the color of the point represents the value of the feature.

Importance is in descending order.

Parallel Coordinates

A parallel coordinates plot displays multivariate data by giving each variable its own parallel vertical axis and drawing each data point as a line that crosses every axis at that variable's value. This allows multiple variables to be compared at once.

Interaction

The interaction heatmaps visualize potential interaction effects between inputs. Higher values indicate a possible interaction. The interaction values are relative and are not definitive indicators of interaction. Check the partial dependence plots for more information.

3 Model

| Output | Model Type |
|---|---|
| BQO (% IPC) | Gradient Boosted Trees |
| AMC1 (% IPC) | Gradient Boosted Trees |
| Dimer (% IPC) | Gradient Boosted Trees |
| HQN Imp | Gradient Boosted Trees |
| Key Imp | Gradient Boosted Trees |

Observed vs. Predicted

The observed vs. predicted plot shows the model’s predictions against the actual values. The closer the points are to the diagonal line, the better the model reproduces the training set.

Error Summary

Table 1: These bootstrapped error metrics allow for a more comprehensive assessment of the model’s performance by considering its variability. Smaller standard deviations indicate greater confidence in the reported metric, while a larger std signals the presence of variability and potential model limitations.
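| Output | Metric | Mean | StdDev |
|---|---|---|---|
| BQO (% IPC) | mae | 0.444510 | 0.069216 |
| BQO (% IPC) | medae | 0.261022 | 0.064624 |
| BQO (% IPC) | mse | 0.667227 | 0.360903 |
| BQO (% IPC) | rmse | 0.783348 | 0.232081 |
| BQO (% IPC) | rsquared | 0.354163 | 0.349333 |
| AMC1 (% IPC) | mae | 3.768082 | 0.659122 |
| AMC1 (% IPC) | medae | 2.272002 | 0.908754 |
| AMC1 (% IPC) | mse | 36.542757 | 12.463797 |
| AMC1 (% IPC) | rmse | 5.970611 | 0.948186 |
| AMC1 (% IPC) | rsquared | 0.668637 | 0.113019 |
| Dimer (% IPC) | mae | 4.384472 | 1.215224 |
| Dimer (% IPC) | medae | 2.216677 | 0.871223 |
| Dimer (% IPC) | mse | 55.313808 | 24.643873 |
| Dimer (% IPC) | rmse | 7.283778 | 1.507229 |
| Dimer (% IPC) | rsquared | 0.622021 | 0.168400 |
| HQN Imp | mae | 1.161839 | 0.109560 |
| HQN Imp | medae | 0.473629 | 0.234369 |
| HQN Imp | mse | 3.630523 | 0.855918 |
| HQN Imp | rmse | 1.893809 | 0.210314 |
| HQN Imp | rsquared | 0.438314 | 0.132421 |
| Key Imp | mae | 3.007029 | 0.663680 |
| Key Imp | medae | 1.290539 | 0.629583 |
| Key Imp | mse | 27.521770 | 10.643890 |
| Key Imp | rmse | 5.158392 | 0.957785 |
| Key Imp | rsquared | 0.230062 | 0.297769 |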
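The Mean and StdDev columns summarize each metric over many bootstrap resamples. The report does not describe its exact resampling procedure (for example, whether the model is refit on each resample), so the sketch below shows only the simplest variant: resampling observed/predicted pairs with replacement and recomputing the metric each time. The pairs are illustrative values, not the report's data, and `bootstrap_metric` is a hypothetical helper.

```python
import math
import random
import statistics


def mae(pairs):
    """Mean absolute error over (observed, predicted) pairs."""
    return sum(abs(obs - pred) for obs, pred in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))


def bootstrap_metric(pairs, metric, n_resamples=500, seed=0):
    """Return the mean and standard deviation of a metric across resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_resamples):
        # Draw len(pairs) pairs with replacement, then recompute the metric.
        sample = [rng.choice(pairs) for _ in pairs]
        values.append(metric(sample))
    return statistics.mean(values), statistics.stdev(values)


# Illustrative (observed, predicted) pairs; NOT taken from the dataset.
pairs = [(44.3, 46.1), (55.0, 52.4), (39.9, 41.0),
         (60.7, 57.9), (20.4, 27.5), (47.2, 45.8)]

mae_mean, mae_std = bootstrap_metric(pairs, mae)
rmse_mean, rmse_std = bootstrap_metric(pairs, rmse)
print(f"mae  mean={mae_mean:.3f} std={mae_std:.3f}")
print(f"rmse mean={rmse_mean:.3f} std={rmse_std:.3f}")
```

With only 17 observations, individual resamples can differ substantially, which is one plausible reason for the large standard deviations on some R² values in Table 1.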

Bootstrapped Error Distributions

Bootstrapped Error Matrix

The error matrices above display histograms of the bootstrapped error metrics, showing how each performance metric is distributed. Examine the shape of each histogram to understand the central tendency and dispersion of the error metric. A symmetrical, bell-shaped histogram suggests stable and consistent model performance, while skewed or multi-modal distributions may indicate variability or outliers.

4 Optimal Points

| | EtACAC (equiv) | Lewis Acid (equiv) | Solvent (vol) | Temp. (℃) | Time (h) | BQO (% IPC) | AMC1 (% IPC) | Dimer (% IPC) | HQN Imp | Key Imp | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.3 | 0.5 | 49.0 | 120.0 | 4.0 | 1.353281 | 55.156666 | 7.940756 | 5.656612 | 0.000000 | Optimal |
| 1 | 2.1 | 0.8 | 34.0 | 118.0 | 6.0 | 1.913893 | 56.423599 | 2.668903 | 6.290057 | 11.587929 | Optimal |
| 2 | 2.4 | 1.0 | 27.0 | 34.0 | 3.0 | 0.714154 | 51.085278 | 21.338291 | 3.851384 | 12.596950 | Optimal |
| 3 | 2.4 | 1.2 | 17.0 | 65.0 | 3.0 | 0.714154 | 51.085278 | 21.338291 | 3.851384 | 12.596950 | Optimal |
| 4 | 1.1 | 0.7 | 17.0 | 84.0 | 12.0 | 0.043860 | 43.066807 | 25.331287 | 4.780689 | 3.957733 | Optimal |
| 5 | 1.4 | 2.0 | 32.0 | 120.0 | 4.0 | 1.583246 | 56.348412 | 8.207374 | 6.527459 | 5.418442 | Uncertain |
| 6 | 1.5 | 2.0 | 50.0 | 36.0 | 12.0 | 1.497841 | 58.229034 | 4.846352 | 6.245393 | 12.059023 | Uncertain |

These are the suggested points from the algorithm, and they include a mix of optimal and uncertain points to test.

| | BQO (% IPC) | BQO (% IPC) -stdev | AMC1 (% IPC) | AMC1 (% IPC) -stdev | Dimer (% IPC) | Dimer (% IPC) -stdev | HQN Imp | HQN Imp -stdev | Key Imp | Key Imp -stdev |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.353281 | 0.720668 | 55.156666 | 2.747704 | 7.940754 | 5.670068 | 5.656613 | 1.073428 | 0.000000 | 4.426401 |
| 1 | 1.913893 | 0.909842 | 56.423599 | 2.211352 | 2.668901 | 4.453705 | 6.290058 | 0.943162 | 11.587931 | 2.814724 |
| 2 | 0.714154 | 0.775063 | 51.085278 | 1.983932 | 21.338289 | 4.580088 | 3.851384 | 0.852979 | 12.596950 | 3.117950 |
| 3 | 0.714154 | 0.762962 | 51.085278 | 1.888568 | 21.338289 | 4.513475 | 3.851384 | 0.852665 | 12.596950 | 3.057258 |
| 4 | 0.043860 | 0.513532 | 43.066811 | 2.488346 | 25.331285 | 7.110478 | 4.780689 | 0.955404 | 3.957733 | 2.831276 |
| 5 | 1.583246 | 0.942538 | 56.348412 | 3.075012 | 8.207374 | 7.006069 | 6.527459 | 0.852663 | 5.418442 | 4.111855 |
| 6 | 1.497841 | 1.148828 | 58.229034 | 2.120657 | 4.846352 | 4.460425 | 6.245393 | 1.050397 | 12.059023 | 3.575100 |

The table above contains the predicted outputs along with the associated uncertainty for each point, calculated using the bootstrapped standard deviations. A smaller standard deviation indicates greater confidence in the model’s prediction, while a larger standard deviation signals the presence of variability associated with the prediction.
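One way to read these standard deviations is as a rough interval around each prediction. The sketch below computes the prediction plus or minus two standard deviations and clips the lower end at 0, matching the lower `y_bounds` value in the output configuration. Treating two standard deviations as an approximate 95% interval assumes the bootstrap distribution is roughly normal, which the report does not confirm. The function name `approx_interval` is hypothetical.

```python
def approx_interval(mean, stdev, k=2.0, lower_bound=0.0):
    """Return (low, high) = mean -/+ k*stdev, with low clipped at lower_bound."""
    low = max(lower_bound, mean - k * stdev)
    high = mean + k * stdev
    return low, high


# Predicted value and bootstrapped stdev for suggested point 0 (table above).
amc1_low, amc1_high = approx_interval(55.156666, 2.747704)
key_low, key_high = approx_interval(0.000000, 4.426401)

print(f"AMC1 (% IPC): {amc1_low:.2f} to {amc1_high:.2f}")
print(f"Key Imp:      {key_low:.2f} to {key_high:.2f}")
```

For point 0, the Key Imp prediction of 0.000000 comes with a standard deviation of 4.426401, so the predicted absence of this impurity is far from certain and is worth confirming experimentally.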

| | Output Optimized | EtACAC (equiv) | Lewis Acid (equiv) | Solvent (vol) | Temp. (℃) | Time (h) | BQO (% IPC) | AMC1 (% IPC) | Dimer (% IPC) | HQN Imp | Key Imp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BQO (% IPC) | 0.5 | 0.6 | 6.0 | 120.0 | 3.0 | 0.000000 | 39.298847 | 21.002005 | 2.625654 | 0.485350 |
| 1 | AMC1 (% IPC) | 2.0 | 2.0 | 50.0 | 105.0 | 12.0 | 2.411254 | 58.527691 | 0.444288 | 5.476033 | 12.269161 |
| 2 | Dimer (% IPC) | 2.6 | 0.7 | 42.0 | 114.0 | 12.0 | 1.570588 | 57.193531 | 0.115234 | 6.573816 | 4.419123 |
| 3 | HQN Imp | 0.5 | 1.7 | 6.0 | 119.0 | 3.0 | 0.000000 | 34.954605 | 21.507814 | 2.168596 | 10.947542 |
| 4 | Key Imp | 1.0 | 0.6 | 39.0 | 120.0 | 3.0 | 0.748022 | 46.258125 | 23.999989 | 5.076136 | 0.000000 |

The table above contains the optimal points for each output variable, optimized separately. It is helpful for seeing the best achievable value of each output without considering the multi-objective trade-offs.
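The multi-objective trade-offs can be made concrete with a Pareto dominance check: one point dominates another if it is at least as good on every output and strictly better on at least one. The sketch below applies that check to the predicted outputs of the seven suggested points listed earlier in this section, using the min/max targets from the output configuration. It is an illustration of the concept and is not claimed to be the algorithm that generated the suggestions.

```python
# Targets for (BQO, AMC1, Dimer, HQN Imp, Key Imp) from the output configuration.
DIRECTIONS = ("min", "max", "min", "min", "min")

# Predicted outputs of suggested points 0 to 6, copied from the first table
# in this section.
SUGGESTED = [
    (1.353281, 55.156666, 7.940756, 5.656612, 0.000000),
    (1.913893, 56.423599, 2.668903, 6.290057, 11.587929),
    (0.714154, 51.085278, 21.338291, 3.851384, 12.596950),
    (0.714154, 51.085278, 21.338291, 3.851384, 12.596950),
    (0.043860, 43.066807, 25.331287, 4.780689, 3.957733),
    (1.583246, 56.348412, 8.207374, 6.527459, 5.418442),
    (1.497841, 58.229034, 4.846352, 6.245393, 12.059023),
]


def dominates(a, b, directions=DIRECTIONS):
    """True if a is no worse than b on every output and better on at least one."""
    strictly_better = False
    for x, y, direction in zip(a, b, directions):
        if direction == "max":
            x, y = -x, -y  # flip sign so that smaller is always better
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def pareto_front(points):
    """Indices of points that are not dominated by any other point."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


front = pareto_front(SUGGESTED)
print("Non-dominated suggested points:", front)
```

None of the seven suggested points dominates another in this check: each one is best, or tied for best, on at least one output, so choosing among them means deciding which impurity or yield trade-off matters most.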

5 Suggested Points Across Time

Suggested Time Series

6 Pareto