XGBoost regression for `matbench_mp_e_form` task using basic crystallographic features

Created April 1, 2021

Description

Give a brief overview of this notebook and your algorithm.

This directory is an example of a matbench submission, which should be made via pull request (PR). It is also a minimal working example of how to use the matbench Python package to run, record, and submit benchmarks with nested cross-validation.
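To make "nested cross-validation" concrete, here is a minimal pure-Python sketch of the idea: an outer loop over test folds and, inside each, a hold-out validation split drawn only from the remaining data. The function names `outer_folds` and `inner_split`, the fold count, the validation fraction, and the seeds are illustrative assumptions; this is not how the matbench package generates its own folds.

```python
import random
from typing import Iterator, List, Tuple


def outer_folds(n_samples: int, n_folds: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_val_indices, test_indices) for each outer fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th shuffled sample, offset by k
        test_set = set(test)
        train_val = [i for i in indices if i not in test_set]
        yield train_val, test


def inner_split(train_val: List[int], val_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split the train+validation indices into (train, validation) parts."""
    shuffled = train_val[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


for fold_id, (train_val, test) in enumerate(outer_folds(n_samples=20)):
    train, val = inner_split(train_val)
    # Tune on (train, val) only; the test indices are scored once per fold.
    print(fold_id, len(train), len(val), len(test))
```

The key property is that the test indices of a fold never appear in that fold's training or validation data, so any tuning stays inside the train+validation part.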

The benchmark used here is the original Matbench v0.1, as described in Dunn et al.

All submissions should include the following in a PR:
- Description
- Benchmark name
- Package versions
- Algorithm description
- Relevant citations
- Any other relevant info

Benchmark name

Name the benchmark you are reporting results for.

Matbench v0.1

Package versions

List all versions of packages required to run your notebook, including the matbench version used.

matbench==0.5
xgboost==0.90
scikit-learn==1.0
numpy==1.21.6
pandas==1.3.5
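One way to collect such a list from the running environment is to query the installed package metadata. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` and the package list passed to it are illustrative assumptions.

```python
from importlib import metadata  # standard library, Python 3.8+
from typing import Dict, Iterable


def installed_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Print one line per package of interest for this notebook.
for name, version in installed_versions(["matbench", "xgboost", "scikit-learn", "numpy", "pandas"]).items():
    print(f"{name}: {version}")
```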

Algorithm description

An in-depth explanation of your algorithm.

Submissions are limited to one algorithm per notebook.

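The model here is a gradient-boosted tree regressor (XGBoost) trained on eight basic crystallographic features read from each pymatgen structure:
- Lattice lengths a, b, c
- Lattice angles alpha, beta, gamma
- Unit-cell volume
- Space group number

For each fold, a new model is trained with `xgb.train` for 100 boosting rounds using fixed hyperparameters; no hyperparameter tuning is performed.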
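Among the features used later in this notebook, the cell volume is not independent of the six lattice parameters a, b, c, alpha, beta, gamma: the standard triclinic cell formula relates them. The sketch below checks that relation in plain Python; the function name `cell_volume` is illustrative and is not part of matbench or pymatgen.

```python
import math


def cell_volume(a: float, b: float, c: float, alpha: float, beta: float, gamma: float) -> float:
    """Unit-cell volume from lattice lengths and angles given in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 + 2 * ca * cb * cg - ca**2 - cb**2 - cg**2)


# For a cubic cell all angles are 90 degrees, so the volume reduces to a**3.
print(round(cell_volume(4.0, 4.0, 4.0, 90.0, 90.0, 90.0), 6))
```

Because the volume is determined by the other six values, it adds no new information in principle, although a tree model may still find the precomputed product convenient to split on.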

Relevant citations

List all relevant citations for your algorithm

Dunn et al.
Your model’s other citations go here.

Any other relevant info

Freeform field to include any other relevant info about this notebook, your benchmark, or your PR submission.

General notes on notebooks:
- Please provide a short description for each code block, either
  - in markdown, as a separate cell
  - as inline comments
- Keep the output of each cell in the final notebook
- The notebook must be named `notebook.ipynb`!

[1]:

# Install and import the required libraries and classes
%pip install matbench xgboost

from typing import List

import pandas as pd
import xgboost as xgb
from matbench.bench import MatbenchBenchmark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matbench
  Downloading matbench-0.5-py3-none-any.whl (9.9 MB)
     |████████████████████████████████| 9.9 MB 2.9 MB/s
Requirement already satisfied: xgboost in /usr/local/lib/python3.7/dist-packages (0.90)
Collecting matminer==0.7.4
  Downloading matminer-0.7.4-py3-none-any.whl (1.4 MB)
     |████████████████████████████████| 1.4 MB 14.5 MB/s
Collecting monty==2021.8.17
  Downloading monty-2021.8.17-py3-none-any.whl (65 kB)
     |████████████████████████████████| 65 kB 886 kB/s
Collecting scikit-learn==1.0
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
     |████████████████████████████████| 23.1 MB 6.4 MB/s
Requirement already satisfied: pandas>=1.3.1 in /usr/local/lib/python3.7/dist-packages (from matminer==0.7.4->matbench) (1.3.5)
Requirement already satisfied: pymongo>=3.12.0 in /usr/local/lib/python3.7/dist-packages (from matminer==0.7.4->matbench) (4.1.1)
Collecting pint>=0.17
  Downloading Pint-0.18-py2.py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 45.7 MB/s
Collecting pymatgen>=2022.0.11
  Downloading pymatgen-2022.0.17.tar.gz (40.6 MB)
     |████████████████████████████████| 40.6 MB 1.3 MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting future>=0.18.2
  Downloading future-0.18.2.tar.gz (829 kB)
     |████████████████████████████████| 829 kB 50.5 MB/s
Requirement already satisfied: jsonschema>=3.2.0 in /usr/local/lib/python3.7/dist-packages (from matminer==0.7.4->matbench) (4.3.3)
Requirement already satisfied: tqdm>=4.62.0 in /usr/local/lib/python3.7/dist-packages (from matminer==0.7.4->matbench) (4.64.0)
Collecting six>=1.16.0
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: numpy>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from matminer==0.7.4->matbench) (1.21.6)
Collecting sympy>=1.8
  Downloading sympy-1.10.1-py3-none-any.whl (6.4 MB)
     |████████████████████████████████| 6.4 MB 50.3 MB/s
Collecting requests>=2.26.0
  Downloading requests-2.28.0-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.7 MB/s
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==1.0->matbench) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==1.0->matbench) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==1.0->matbench) (3.1.0)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema>=3.2.0->matminer==0.7.4->matbench) (21.4.0)
Requirement already satisfied: importlib-resources>=1.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema>=3.2.0->matminer==0.7.4->matbench) (5.7.1)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from jsonschema>=3.2.0->matminer==0.7.4->matbench) (4.2.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema>=3.2.0->matminer==0.7.4->matbench) (0.18.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from jsonschema>=3.2.0->matminer==0.7.4->matbench) (4.11.4)
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.7/dist-packages (from importlib-resources>=1.4.0->jsonschema>=3.2.0->matminer==0.7.4->matbench) (3.8.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=1.3.1->matminer==0.7.4->matbench) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=1.3.1->matminer==0.7.4->matbench) (2022.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from pint>=0.17->matminer==0.7.4->matbench) (21.3)
Collecting scipy>=1.1.0
  Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
     |████████████████████████████████| 38.1 MB 1.2 MB/s
Requirement already satisfied: palettable>=3.1.1 in /usr/local/lib/python3.7/dist-packages (from pymatgen>=2022.0.11->matminer==0.7.4->matbench) (3.3.0)
Requirement already satisfied: plotly>=4.5.0 in /usr/local/lib/python3.7/dist-packages (from pymatgen>=2022.0.11->matminer==0.7.4->matbench) (5.5.0)
Collecting uncertainties>=3.1.4
  Downloading uncertainties-3.1.6-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 6.9 MB/s
Requirement already satisfied: tabulate in /usr/local/lib/python3.7/dist-packages (from pymatgen>=2022.0.11->matminer==0.7.4->matbench) (0.8.9)
Requirement already satisfied: networkx>=2.2 in /usr/local/lib/python3.7/dist-packages (from pymatgen>=2022.0.11->matminer==0.7.4->matbench) (2.6.3)
Collecting spglib>=1.9.9.44
  Downloading spglib-1.16.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
     |████████████████████████████████| 325 kB 59.9 MB/s
Requirement already satisfied: matplotlib>=1.5 in /usr/local/lib/python3.7/dist-packages (from pymatgen>=2022.0.11->matminer==0.7.4->matbench) (3.2.2)
Collecting ruamel.yaml>=0.15.6
  Downloading ruamel.yaml-0.17.21-py3-none-any.whl (109 kB)
     |████████████████████████████████| 109 kB 58.5 MB/s
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5->pymatgen>=2022.0.11->matminer==0.7.4->matbench) (1.4.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5->pymatgen>=2022.0.11->matminer==0.7.4->matbench) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5->pymatgen>=2022.0.11->matminer==0.7.4->matbench) (3.0.9)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from plotly>=4.5.0->pymatgen>=2022.0.11->matminer==0.7.4->matbench) (8.0.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26.0->matminer==0.7.4->matbench) (2.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26.0->matminer==0.7.4->matbench) (2022.5.18.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26.0->matminer==0.7.4->matbench) (1.24.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26.0->matminer==0.7.4->matbench) (2.10)
Collecting ruamel.yaml.clib>=0.2.6
  Downloading ruamel.yaml.clib-0.2.6-cp37-cp37m-manylinux1_x86_64.whl (546 kB)
     |████████████████████████████████| 546 kB 49.5 MB/s
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.7/dist-packages (from sympy>=1.8->matminer==0.7.4->matbench) (1.2.1)
Building wheels for collected packages: future, pymatgen
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491070 sha256=9d5ea7315ad5a4caa7e03026756b57d4dd3f27f0bef7b8a7414ea288b2a0dbfe
  Stored in directory: /root/.cache/pip/wheels/56/b0/fe/4410d17b32f1f0c3cf54cdfb2bc04d7b4b8f4ae377e2229ba0
  Building wheel for pymatgen (PEP 517) ... done
  Created wheel for pymatgen: filename=pymatgen-2022.0.17-cp37-cp37m-linux_x86_64.whl size=41841052 sha256=0ed5b117c6a8e3b46d6a9afefe6ed3c570eefdf9158f3d034141b32d1141eb6b
  Stored in directory: /root/.cache/pip/wheels/cf/f6/22/58a9be23c5f1b452770e02ff42047175eaf0f9c2f15219fc76
Successfully built future pymatgen
Installing collected packages: six, ruamel.yaml.clib, future, uncertainties, sympy, spglib, scipy, ruamel.yaml, requests, monty, scikit-learn, pymatgen, pint, matminer, matbench
  Attempting uninstall: six
    Found existing installation: six 1.15.0
    Uninstalling six-1.15.0:
      Successfully uninstalled six-1.15.0
  Attempting uninstall: future
    Found existing installation: future 0.16.0
    Uninstalling future-0.16.0:
      Successfully uninstalled future-0.16.0
  Attempting uninstall: sympy
    Found existing installation: sympy 1.7.1
    Uninstalling sympy-1.7.1:
      Successfully uninstalled sympy-1.7.1
  Attempting uninstall: scipy
    Found existing installation: scipy 1.4.1
    Uninstalling scipy-1.4.1:
      Successfully uninstalled scipy-1.4.1
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.28.0 which is incompatible.
google-colab 1.0.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed future-0.18.2 matbench-0.5 matminer-0.7.4 monty-2021.8.17 pint-0.18 pymatgen-2022.0.17 requests-2.28.0 ruamel.yaml-0.17.21 ruamel.yaml.clib-0.2.6 scikit-learn-1.0 scipy-1.7.3 six-1.16.0 spglib-1.16.5 sympy-1.10.1 uncertainties-3.1.6

[2]:

def training_model():
  # Build the training DataFrame from the module-level feature lists
  # (latt_a, ..., formation_energy) that the benchmark loop below fills in
  X = pd.DataFrame(
      {
          "a": latt_a,
          "b": latt_b,
          "c": latt_c,
          "alpha": alpha,
          "beta": beta,
          "gamma": gamma,
          "volume": volume,
          "space_group": space_group
      },
      # index=material_id
  )
  y = pd.Series(name="formation_energy", data=formation_energy)

  # X and y must have the same number of rows, so X is used unsliced
  train = xgb.DMatrix(X, label=y)

  hyperparam = {
      'max_depth': 4,
      'learning_rate':0.05,
      'n_estimators':1000,
      'verbosity':1,
      'booster':"gbtree",
      'tree_method':"auto",
      'n_jobs':1,
      'gamma':0.0001,
      'min_child_weight':8,
      'max_delta_step':0,
      'subsample':0.6,
      'colsample_bytree':0.7,
      'colsample_bynode':1,
      'reg_alpha':0,
      'reg_lambda':4,
      'scale_pos_weight':1,
      'base_score':0.6,
      'num_parallel_tree':1,
      'importance_type':"gain",
      'eval_metric':"rmse",
      'nthread':4 }

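  # Number of boosting rounds passed to xgb.train; the 'n_estimators' entry
  # above is a scikit-learn wrapper parameter and likely has no effect here
  num_round = 100

  # Train the model (no separate validation split is used)
  my_model = xgb.train(hyperparam, train, num_round)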
  return my_model

def testing_model():
  # Create dataframe for test_inputs and test model
  test_inputs = pd.DataFrame(
    {
        "a": t_latt_a,
        "b": t_latt_b,
        "c": t_latt_c,
        "alpha": t_alpha,
        "beta": t_beta,
        "gamma": t_gamma,
        "volume":t_volume,
        "space_group": t_space_group
    },
  )

  test = xgb.DMatrix(test_inputs)
  return test

Running the actual benchmark

Create a benchmark restricted to the `matbench_mp_e_form` task of Matbench v0.1, train a model on each fold of that task, and record the results with any salient metadata.

[5]:

# Create a benchmark
mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

# Run our benchmark on xgboost model
for task in mb.tasks:
  task.load()

  for fold in task.folds:
    # Per-fold feature and target lists (reset at the start of every fold)
    latt_a: List[float] = []
    latt_b: List[float] = []
    latt_c: List[float] = []
    alpha: List[float] = []
    beta: List[float] = []
    gamma: List[float] = []
    volume: List[float] = []
    space_group: List[int] = []

    formation_energy: List[float] = []

    t_latt_a: List[float] = []
    t_latt_b: List[float] = []
    t_latt_c: List[float] = []
    t_alpha: List[float] = []
    t_beta: List[float] = []
    t_gamma: List[float] = []
    t_volume: List[float] = []
    t_space_group: List[int] = []

    # Get the training inputs (an array of pymatgen.Structure or string Compositions, e.g. "Fe2O3")
    train_inputs, train_outputs = task.get_train_and_val_data(fold)

    # Extract the eight features from each training structure
    for i in range(len(train_inputs)):
      structure = train_inputs.iloc[i]
      latt_a.append(structure.lattice.a)
      latt_b.append(structure.lattice.b)
      latt_c.append(structure.lattice.c)
      alpha.append(structure.lattice.angles[0])
      beta.append(structure.lattice.angles[1])
      gamma.append(structure.lattice.angles[2])
      volume.append(structure.volume)
      # get_space_group_info() returns (symbol, number); keep the number
      space_group.append(structure.get_space_group_info()[1])

    # Get the training outputs (formation energies, floats for this task)
    for i in range(len(train_outputs)):
      formation_energy.append(train_outputs.iloc[i])

    # Do all model tuning and selection with the training data only
    # The split of training/validation is up to you and your algorithm
    # Transfer train_inputs and train_outputs into a pandas DataFrame

    X = pd.DataFrame(
        {
            "a": latt_a,
            "b": latt_b,
            "c":latt_c,
            "alpha": alpha,
            "beta": beta,
            "gamma": gamma,
            "volume": volume,
            "space_group": space_group
        },
        # index=material_id
    )
    y = pd.Series(name="formation_energy", data=formation_energy)

    train = xgb.DMatrix(X, label=y)

    hyperparam = {
        'max_depth': 4,
        'learning_rate':0.05,
        'n_estimators':1000,
        'verbosity':1,
        'booster':"gbtree",
        'tree_method':"auto",
        'n_jobs':1,
        'gamma':0.0001,
        'min_child_weight':8,
        'max_delta_step':0,
        'subsample':0.6,
        'colsample_bytree':0.7,
        'colsample_bynode':1,
        'reg_alpha':0,
        'reg_lambda':4,
        'scale_pos_weight':1,
        'base_score':0.6,
        'num_parallel_tree':1,
        'importance_type':"gain",
        'eval_metric':"rmse",
        'nthread':4 }

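    # Number of boosting rounds passed to xgb.train; the 'n_estimators' entry
    # above is a scikit-learn wrapper parameter and likely has no effect here
    num_round = 100

    # Train the model (no separate validation split is used)
    my_model = xgb.train(hyperparam, train, num_round)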

    # Get test data (an array of pymatgen.Structure or string compositions, e.g., "Fe2O3")
    test_inputs_raw = task.get_test_data(fold, include_target=False)

    # Extract the same eight features from each test structure
    for i in range(len(test_inputs_raw)):
      structure = test_inputs_raw.iloc[i]
      t_latt_a.append(structure.lattice.a)
      t_latt_b.append(structure.lattice.b)
      t_latt_c.append(structure.lattice.c)
      t_alpha.append(structure.lattice.angles[0])
      t_beta.append(structure.lattice.angles[1])
      t_gamma.append(structure.lattice.angles[2])
      t_volume.append(structure.volume)
      t_space_group.append(structure.get_space_group_info()[1])

    test_inputs = pd.DataFrame(
      {
          "a": t_latt_a,
          "b": t_latt_b,
          "c": t_latt_c,
          "alpha": t_alpha,
          "beta": t_beta,
          "gamma": t_gamma,
          "volume":t_volume,
          "space_group": t_space_group
      },
    )

    test = xgb.DMatrix(test_inputs)

    # Make predictions on the test data, returning an array of either bool or float, depending on problem
    predictions = my_model.predict(test)

    # Record our predictions into the benchmark object
    # you can optionally add parameters corresponding to the particular model in this fold
    # if particular hyperparameters or configurations are chosen based on training/validation
    task.record(fold, predictions)

2022-06-10 22:23:12 INFO     Initialized benchmark 'matbench_v0.1' with 1 tasks:
['matbench_mp_e_form']
2022-06-10 22:23:12 INFO     Loading dataset 'matbench_mp_e_form'...
2022-06-10 22:26:56 INFO     Dataset 'matbench_mp_e_form loaded.
2022-06-10 22:39:05 INFO     Recorded fold matbench_mp_e_form-0 successfully.
2022-06-10 22:51:13 INFO     Recorded fold matbench_mp_e_form-1 successfully.
2022-06-10 23:03:22 INFO     Recorded fold matbench_mp_e_form-2 successfully.
2022-06-10 23:15:32 INFO     Recorded fold matbench_mp_e_form-3 successfully.
2022-06-10 23:27:40 INFO     Recorded fold matbench_mp_e_form-4 successfully.
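The training and test loops above extract the same eight features twice. One way to remove that duplication is a single helper that turns one structure into a feature row. The sketch below is duck-typed: it only assumes the object exposes `lattice.a`, `lattice.b`, `lattice.c`, `lattice.angles`, `volume`, and `get_space_group_info()`, which are the attributes the benchmark code reads from pymatgen structures (here through the public `lattice` property). The names `featurize`, `featurize_all`, and `FEATURE_NAMES`, and the stand-in structure built with `SimpleNamespace`, are hypothetical and exist only so the example runs without pymatgen.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

FEATURE_NAMES = ["a", "b", "c", "alpha", "beta", "gamma", "volume", "space_group"]


def featurize(structure: Any) -> Dict[str, float]:
    """Return the eight crystallographic features for one structure-like object."""
    lattice = structure.lattice
    alpha, beta, gamma = lattice.angles
    _symbol, number = structure.get_space_group_info()  # (symbol, number)
    values = [lattice.a, lattice.b, lattice.c, alpha, beta, gamma, structure.volume, number]
    return dict(zip(FEATURE_NAMES, values))


def featurize_all(structures: List[Any]) -> List[Dict[str, float]]:
    """Featurize many structures; the list of dicts can be passed to pandas.DataFrame."""
    return [featurize(s) for s in structures]


# Hypothetical stand-in for a cubic structure (not a real pymatgen object).
fake_structure = SimpleNamespace(
    lattice=SimpleNamespace(a=4.0, b=4.0, c=4.0, angles=(90.0, 90.0, 90.0)),
    volume=64.0,
    get_space_group_info=lambda: ("Pm-3m", 221),
)
rows = featurize_all([fake_structure])
print(rows[0])
```

With such a helper, the training and test tables would be built by the same code path, which rules out the two loops drifting apart in column order or content.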

[ ]:

print(len(mb.tasks))

[6]:

import pickle

# Save the last fold's model (my_model is overwritten in every fold)
with open("xgbmodel.dat", "wb") as f:
  pickle.dump(my_model, f)
# To reload: with open("xgbmodel.dat", "rb") as f: model = pickle.load(f)

Check out the results of the benchmark

First, validate the benchmark to make sure everything is ok - if you did not get any error messages during the recording process your benchmark results will almost certainly be valid.

Next, get a feeling for how our benchmark is doing, in terms of MAE or ROCAUC, along with various other scores.

Finally, add some metadata related to this benchmark, if applicable.

[7]:

# Make sure our benchmark is valid
valid = mb.is_valid
print(f"is valid: {valid}")


# Check out how our algorithm is doing using scores
import pprint
pprint.pprint(mb.scores)

# Get some more info about the benchmark
mb.get_info()

# Add some additional metadata about our algorithm
# These sections are very freeform; any and all data you think are relevant to your benchmark
# mb.add_metadata({"algorithm": "xgboost", "features": "lattice parameters, volume, space group number"})

is valid: True
{'matbench_mp_e_form': {'mae': {'max': 0.7559645762744662,
                                'mean': 0.7514603730363221,
                                'min': 0.7463943260812504,
                                'std': 0.004167347004583424},
                        'mape': {'max': 8.208108588940437,
                                 'mean': 6.904368768866061,
                                 'min': 4.8884393331071925,
                                 'std': 1.323520300873098},
                        'max_error': {'max': 4.242506746409874,
                                      'mean': 4.057536813383573,
                                      'min': 3.9335069535836924,
                                      'std': 0.10426539042254096},
                        'rmse': {'max': 0.9454158134116134,
                                 'mean': 0.9414775887737938,
                                 'min': 0.936303190895938,
                                 'std': 0.0038121183426142904}}}
2022-06-11 00:01:46 INFO
Matbench package 0.5 running benchmark 'matbench_v0.1'
        is complete: False
        is recorded: True
        is valid: True

Results:
        - 'matbench_mp_e_form' MAE mean: 0.7514603730363221
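The score dictionary above reports four regression metrics per fold, each summarized by mean, min, max, and std across the five folds. The sketch below shows how such numbers can be computed in plain Python. It is not matbench's implementation: the function names `regression_metrics` and `summarize` are illustrative, the use of the population standard deviation is an assumption, and the toy inputs are made up for the demonstration.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, MAPE, and max error for one fold."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "mae": mean(errors),
        "rmse": math.sqrt(mean(e**2 for e in errors)),
        # MAPE divides by |y_true|, so it is undefined for a true value of zero
        # and grows very large for true values close to zero.
        "mape": mean(e / abs(t) for e, t in zip(errors, y_true)),
        "max_error": max(errors),
    }


def summarize(fold_scores: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-fold metrics into mean/min/max/std, keyed by metric name."""
    summary = {}
    for name in fold_scores[0]:
        values = [scores[name] for scores in fold_scores]
        summary[name] = {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "std": pstdev(values),  # population std: an assumption of this sketch
        }
    return summary


# Toy data for two folds (made up for illustration only).
fold_scores = [
    regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]),
    regression_metrics([1.0, 2.0, 4.0], [1.0, 1.0, 4.0]),
]
print(summarize(fold_scores)["mae"])
```

The comment on MAPE is one plausible reason the reported MAPE is far above 1: formation energies near zero make the relative error large even when the absolute error is moderate.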

Save our benchmark to file

Make sure you use the filename results.json.gz - this is important for our automated leaderboard to work properly!

[8]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

[9]:

# Save the valid benchmark to file to include with your submission
mb.to_file("/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz")

2022-06-11 00:02:22 INFO     Successfully wrote MatbenchBenchmark to file '/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz'.
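Because the file name ends in `.json.gz`, the saved benchmark is presumably gzip-compressed JSON and can be inspected with the standard library alone. The sketch below writes a small stand-in file and reads it back; the helper name `read_json_gz` and the stand-in payload are illustrative, and no assumption is made about the keys matbench actually writes.

```python
import gzip
import json
import os
import tempfile
from typing import Any


def read_json_gz(path: str) -> Any:
    """Load a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in file so the example runs anywhere; point demo_path at the real
# results.json.gz to inspect an actual benchmark file instead.
demo_path = os.path.join(tempfile.mkdtemp(), "results.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    json.dump({"example": [1, 2, 3]}, handle)

loaded = read_json_gz(demo_path)
print(type(loaded).__name__, list(loaded))
```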

Citation: Dunn, A., Wang, Q., Ganose, A., Dopp, D., Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Computational Materials 6, 138 (2020). https://doi.org/10.1038/s41524-020-00406-3
