Stats
The stats module contains a bootstrap method for addressing uncertainty.
Bootstrap
- class amp.stats.bootstrap.BootStrap(log=None, label='bootstrap', retrain_initiated=False, from_scratch=False)
Bases: object
A bootstrap ensemble calculator which serves as a wrapper around an Amp calculator. Initialize with an amp.utilities.Logger instance as log.
If an existing trained bootstrap calculator is available, it can be loaded by passing its filename to the load classmethod.
Note that the ‘train’ method is meant to be a job-submission and -management script; e.g., it will typically be run at the command line to both submit jobs and monitor their convergence.
- get_atomic_energies(atoms, output=('m',), k=None)
Returns the energy per atom from the ensemble.
The output keyword controls what is returned; see get_potential_energy for the full description of output tokens (m, <q>, e) and the k parameter.
- get_charges(atoms, output=('m',), k=None)
Returns the atomic charges from the ensemble. Charges are given in units of the negative electron charge (i.e., -0.1 means one tenth of an electron more than neutral).
The output keyword controls what is returned; see get_potential_energy for the full description of output tokens (m, <q>, e) and the k parameter.
- get_excess_Ne(atoms, output=('m',), k=None)
Returns the number of excess electrons from the ensemble for the atoms object.
By default returns only the mean ensemble prediction.
The output keyword controls what is returned; see get_potential_energy for the full description of output tokens (m, <q>, e) and the k parameter.
- get_forces(atoms, output=('m',), k=None)
Returns the atomic forces from the ensemble for the atoms object.
By default returns only the mean ensemble prediction, making this a drop-in ASE calculator. The mean is used (not the median) because for a conservative system the mean force is the negative gradient of the mean energy – a consistency that does not hold for the median.
The output keyword controls what is returned; see get_potential_energy for the full description of output tokens (m, <q>, e) and the k parameter. Note that quantiles are computed per force component independently (not per atom vector magnitude).
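The consistency argument for the mean can be checked numerically. The sketch below is not Amp code: it uses three hypothetical one-dimensional harmonic models (the MEMBERS constants are made up for illustration) and compares the ensemble force with a finite-difference gradient of the ensemble energy, once for the mean and once for the median.

```python
from statistics import fmean, median

# Hypothetical ensemble: each member is a 1-D harmonic model
# E(x) = 0.5 * k * (x - x0)**2, stored as (k, x0). Values are illustrative.
MEMBERS = [(1.0, 0.0), (1.6, 1.0), (0.2, -4.0)]


def energies(x):
    """Energy predicted by every ensemble member at position x."""
    return [0.5 * k * (x - x0) ** 2 for k, x0 in MEMBERS]


def forces(x):
    """Analytic force -dE/dx predicted by every ensemble member at x."""
    return [-k * (x - x0) for k, x0 in MEMBERS]


def negative_gradient(reducer, x, h=1e-4):
    """Central finite difference of -d/dx reducer(energies(x))."""
    return -(reducer(energies(x + h)) - reducer(energies(x - h))) / (2 * h)


x = 2.0
mean_force = fmean(forces(x))
mean_energy_force = negative_gradient(fmean, x)
median_force = median(forces(x))
median_energy_force = negative_gradient(median, x)

print(f"mean:   {mean_force:.4f} vs {mean_energy_force:.4f}")
print(f"median: {median_force:.4f} vs {median_energy_force:.4f}")
```

At this point the two mean-based values agree because differentiation is linear. The median values differ because the member holding the median energy is not the member holding the median force.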
- get_potential_energy(atoms, output=('m',), k=None)
Returns the potential energy from the ensemble for the atoms object.
By default returns only the mean ensemble prediction, making this a drop-in ASE calculator. The mean is used (not the median) because for a conservative system the mean force is the negative gradient of the mean energy – a consistency that does not hold for the median.
To get uncertainty information, use the output keyword with the following codes:
m: return the mean of the ensemble predictions
<q>: (where <q> is a float between 0 and 1) return the q quantile of the ensemble, e.g. 0.5 for the 50th percentile
e: return the whole ensemble predictions as a list
Tokens can be combined in any order, e.g. output=['m', .05, .95] returns the mean plus a 90% centered interval, and appending 'e' also returns the raw ensemble. A scalar is returned when only one token is requested, so that the default call is ASE-like.
- k (int or None) – If specified, use only the first k ensemble members. Useful for quick estimates without running the full ensemble.
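How the tokens might map onto an ensemble of scalar predictions can be sketched in plain Python. This is not Amp's implementation: summarize and quantile are hypothetical helpers, the energies are made-up numbers, and the linear-interpolation quantile rule is an assumption (the library may use a different rule).

```python
from statistics import fmean


def quantile(values, q):
    """Linear-interpolation quantile of a list of floats, with 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(predictions, output=('m',), k=None):
    """Resolve output tokens against a list of ensemble predictions."""
    # Use only the first k members when k is given.
    members = list(predictions[:k]) if k is not None else list(predictions)
    results = []
    for token in output:
        if token == 'm':
            results.append(fmean(members))
        elif token == 'e':
            results.append(members)
        else:
            # Any other token is treated as a quantile in [0, 1].
            results.append(quantile(members, float(token)))
    # A single token yields a bare value rather than a one-element list.
    return results[0] if len(results) == 1 else results


# Made-up ensemble energies (eV) for one image.
ensemble_energies = [-10.2, -10.0, -9.9, -10.1, -10.3]

mean_only = summarize(ensemble_energies)
mean, low, high = summarize(ensemble_energies, output=['m', .05, .95])
quick_mean = summarize(ensemble_energies, k=2)
print(mean_only, (low, high), quick_mean)
```

Returning a bare value for a single token is what lets the default call behave like an ordinary calculator method, while a list of tokens yields results in the order requested.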
- classmethod load(file, label='amp', dblabel=False, log=None)
Load a trained bootstrap ensemble from a .ensemble file.
- Parameters:
file (str) – Path to the .ensemble file to load.
label (str) – Label assigned to each loaded Amp calculator; controls the log file name.
dblabel (str, None, or False) – Prefix/location for fingerprint database files. Defaults to False, meaning fingerprints are held only in memory and never written to disk (recommended for inference). All ensemble members share a single descriptor instance, so fingerprints are computed once per image regardless. Pass a string path to re-enable on-disk caching, e.g. for re-training after load.
log (Logger or None) – Logger instance. If None, logs to <label>-log.txt.
- set(**kwargs)
Forward parameter settings to all ensemble calculators.
This mirrors the ASE/Amp calculator interface. In particular, electrode_potential must be set here before calling get_potential_energy() / get_forces() when using a ChargeNeuralNetwork ensemble, just as calc.set(electrode_potential=…) is required for a plain Amp calculator.
- train(images, n=50, calc_text="\nfrom amp import Amp\nfrom amp.descriptor.gaussian import Gaussian\nfrom amp.model.neuralnetwork import NeuralNetwork\n\ncalc = Amp(descriptor=Gaussian(),\n model=NeuralNetwork(),\n dblabel='../amp-db')\ncalc.model.lossfunction.parameters['weight_duplicates'] = False\n", headerlines='', start_command='python run.py', sleep=0.1, expired=3600.0, train_line='calc.train(images=trainfile)', label='bootstrap', new_images=None, nft_ids=None, charge_training=False, archive=True, remove_dir=True)
Trains a bootstrap ensemble of calculators.
This is set up to enable the submission of each calculator as a separate job through the local queuing system, but can also run in serial. On the first call to this method, jobs are created and submitted. On subsequent calls, jobs are analyzed for convergence. If all are converged, an ensemble is created and the training directory is archived.
- Parameters:
n (int) – size of ensemble (number of calculators to train)
calc_text (str) – text used to initiate the Amp calculator; see the default value of calc_text in this module for an example. Must produce a ‘calc’ object.
headerlines (str) – lines placed at the top of the Python script that will be submitted. These would typically contain comment lines for the batching system, such as ‘#SBATCH -n=8…’
start_command (str) – command to start the job in the current queuing system, such as ‘sbatch run.py’ (‘run.py’ is the script name here). For serial operation use ‘python run.py’.
sleep (float) – time (s) to sleep between job submissions
train_line (str) – line used to train each Amp instance. The default is usually fine, but the user may want to use this to insert additional keywords such as train_forces=False.
label (string) – label to give final trained calculator
expired (float) – When checking jobs, age (s) of log file at which to consider that the job is no longer running (timed out) and should be restarted.
retrain (bool) – whether the run is for retraining.
new_images (str) – new images added to the original images for retraining.
nft_ids (list of tuples) – list of length-two tuples indicating images to be trained only on forces of central atoms.
charge_training (bool) – whether charge training is applied.
archive (bool) – whether the training directory is archived after training.
remove_dir (bool) – whether the directory is deleted after archiving.
- Returns:
results – A dictionary indicating the state of training. This dictionary always contains a ‘complete’ key with a value of True or False indicating whether training is complete. If False, it also provides statistics on the number converged.
- Return type:
dict
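Because train submits jobs on the first call and only checks convergence on later calls, a driver script typically calls it repeatedly. The following is a minimal sketch of such a polling loop, not part of Amp: poll_training is a hypothetical helper, and fake_train is a stand-in for a call such as calc.train(images=images, start_command='sbatch run.py') on a BootStrap instance. Only the ‘complete’ key of the returned dictionary is relied upon.

```python
import time


def poll_training(train_once, poll_seconds=60.0, max_polls=100):
    """Call train_once() until it reports completion or max_polls is reached.

    train_once is any zero-argument callable returning a dict that
    contains a boolean 'complete' key.
    """
    results = {'complete': False}
    attempt = 0
    for attempt in range(1, max_polls + 1):
        results = train_once()
        if results['complete']:
            break
        # Give queued jobs time to make progress before checking again.
        time.sleep(poll_seconds)
    return results, attempt


# Stand-in for the real training call: reports completion on the third call.
calls = {'count': 0}


def fake_train():
    calls['count'] += 1
    return {'complete': calls['count'] >= 3}


results, attempts = poll_training(fake_train, poll_seconds=0.0)
print(results, attempts)
```

In practice the same effect is often achieved by simply re-running the submission script by hand or from a scheduler; the loop only automates that repetition.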
- class amp.stats.bootstrap.TrainingArchive(name)
Bases: object
Helper to get training trajectories and Amp calc instances from the training tarball. Initialize with the archive name. The get commands use the path the file would have had if the archive were extracted.
- amp.stats.bootstrap.archive_directory(source_dir, remove_dir=False, suffix='')
Turns <source_dir> into a .tar.gz file and, if remove_dir is True, removes the original directory.
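The archiving step can be illustrated with the standard library. This is a sketch under stated assumptions, not the Amp source: archive_dir is a hypothetical function, the archive naming (<source_dir><suffix>.tar.gz) is a guess at how suffix is used, and the directory and file names are invented. It also reads one member by the path it would have had after extraction, which is the addressing convention described for TrainingArchive.

```python
import os
import shutil
import tarfile
import tempfile


def archive_dir(source_dir, remove_dir=False, suffix=''):
    """Pack source_dir into <source_dir><suffix>.tar.gz (naming is assumed)."""
    source_dir = source_dir.rstrip(os.sep)
    archive_name = source_dir + suffix + '.tar.gz'
    with tarfile.open(archive_name, 'w:gz') as tar:
        # Store paths relative to the parent so they match the extracted layout.
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    if remove_dir:
        shutil.rmtree(source_dir)
    return archive_name


# Demonstration in a throwaway directory with an invented log file.
workdir = tempfile.mkdtemp()
run_dir = os.path.join(workdir, 'bootstrap-run')
os.makedirs(run_dir)
with open(os.path.join(run_dir, 'log.txt'), 'w') as handle:
    handle.write('converged\n')

archive_name = archive_dir(run_dir, remove_dir=True)

# Read a member by the path it would have had if the archive were extracted.
with tarfile.open(archive_name, 'r:gz') as tar:
    log_text = tar.extractfile('bootstrap-run/log.txt').read().decode()

directory_removed = not os.path.exists(run_dir)
shutil.rmtree(workdir)  # clean up the demonstration files
```

Reading members directly from the tarball avoids unpacking a potentially large training directory just to inspect one log or trajectory.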
- amp.stats.bootstrap.bootstrap(vector, size=None, return_missing=False)
Returns a randomly chosen, with replacement, version of the data set. If size is None, returns a vector of the same length. To sample from multiple vectors, zip and unzip them like:
>>> xsbs, ysbs = zip(*bootstrap(zip(xs, ys)))
If return_missing == True, also finds and returns the missing elements not sampled from the vector as a second output.
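Resampling with replacement, including the bookkeeping of unsampled elements, can be sketched as follows. This is an illustrative reimplementation rather than the Amp function: resample and its rng argument are hypothetical, and missing elements are identified by position, which is an assumption about the intended behavior.

```python
import random


def resample(vector, size=None, return_missing=False, rng=None):
    """Draw `size` elements from `vector` with replacement."""
    rng = rng or random.Random()
    vector = list(vector)
    size = len(vector) if size is None else size
    # Sample positions rather than values so duplicates are tracked correctly.
    indices = [rng.randrange(len(vector)) for _ in range(size)]
    sample = [vector[i] for i in indices]
    if return_missing:
        chosen = set(indices)
        missing = [item for i, item in enumerate(vector) if i not in chosen]
        return sample, missing
    return sample


data = ['a', 'b', 'c', 'd', 'e']
sample, missing = resample(data, return_missing=True, rng=random.Random(0))

# Paired vectors stay aligned when they are zipped before resampling.
xs = [1, 2, 3]
ys = [10, 20, 30]
xsbs, ysbs = zip(*resample(list(zip(xs, ys)), rng=random.Random(1)))
print(sample, missing, xsbs, ysbs)
```

The elements left out of a bootstrap sample are commonly used as a held-out set for that ensemble member, which is why returning them alongside the sample is convenient.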