Fingerprint databases

During training, Amp saves fingerprints (and the more expensive fingerprint derivatives) to an on-disk database so that they can be reused across runs or shared between multiple calculators trained on the same images.

During inference (including MD simulations, relaxations, and any use of a loaded calculator), Amp defaults to ephemeral mode: fingerprints are kept only in memory and are never written to disk. Each new image evicts the previous one, so memory usage stays bounded regardless of trajectory length. This is the default behaviour of Amp.load() and Bootstrap.load().
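The eviction behaviour described above can be pictured as a one-slot cache. The following is only an illustrative sketch: the class name EphemeralFingerprintCache and its methods are hypothetical and are not part of Amp's API.

```python
class EphemeralFingerprintCache:
    """Hypothetical one-slot, in-memory cache (illustration only, not Amp code)."""

    def __init__(self, compute):
        self._compute = compute  # callable: image -> fingerprint
        self._key = None         # hash of the image currently held
        self._value = None       # fingerprint of that image

    def get(self, key, image):
        if key != self._key:
            # A new image replaces (evicts) the one held before,
            # so at most one fingerprint is ever kept in memory.
            self._key = key
            self._value = self._compute(image)
        return self._value

    def __len__(self):
        return 0 if self._key is None else 1
```

Because the cache never holds more than one entry, its size does not depend on how many images a trajectory contains.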

To opt into disk caching during inference (e.g., for re-training after load), pass an explicit dblabel string:

calc = Amp.load('amp.amp', dblabel='amp-data')

To share a single on-disk cache across multiple calculators during training, point them all at the same location:

calc1 = Amp(..., dblabel='shared-db')
calc2 = Amp(..., dblabel='shared-db')

Format

The database format is specific to Amp and is designed to be as simple as possible. Amp database names end in the extension .ampdb. In its simplest form, a database is just a directory with one file per image; that is, you will see something like the following:

label-fingerprints.ampdb/
    loose/
        f60b3324f6001d810afbab9f85a6ea5f
        aeaaa21e5faccc62bae94c5c48b04031

In the above, each file in the directory “loose” is named by the hash of an image and contains that image’s fingerprint. We use a file-based “database” because conflicts can arise when multiple processes access a single database file at the same time.
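The one-file-per-image layout can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not Amp's implementation: the helper names (image_key, write_loose, read_loose) are hypothetical, MD5 of a string stands in for Amp's image hashing, and pickle stands in for whatever serialization Amp actually uses.

```python
import hashlib
import pickle
from pathlib import Path


def image_key(image_repr: str) -> str:
    """Stand-in hash: MD5 of a string describing the image (assumption)."""
    return hashlib.md5(image_repr.encode("utf-8")).hexdigest()


def write_loose(db: Path, key: str, fingerprint) -> Path:
    """Store one fingerprint as the file <db>/loose/<key>."""
    loose = db / "loose"
    loose.mkdir(parents=True, exist_ok=True)
    target = loose / key
    target.write_bytes(pickle.dumps(fingerprint))
    return target


def read_loose(db: Path, key: str):
    """Load the fingerprint stored under <db>/loose/<key>."""
    return pickle.loads((db / "loose" / key).read_bytes())
```

Since every image has its own file, two processes working on different images never write to the same file.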

However, for large training sets this can lead to a large number of loose files, which can take up a lot of disk space and slow down jobs that index files (like backups and scans). Therefore, you can compress the database with the amp-compress tool, described below.

Compress

To save disk space, you may periodically want to run the utility amp-compress (contained in the tools directory of the amp package; this should be on your path for normal installations). To do so, run amp-compress <filename>, which changes the above .ampdb directory to:

label-fingerprints.ampdb/
    archive.tar.gz
    loose/

That is, the two fingerprints that were in the “loose” directory are now in the file “archive.tar.gz”.
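The effect of compression on the directory layout can be imitated with the standard tarfile module. The function below is a hedged sketch of the idea only: compress_ampdb is a hypothetical name, and the real amp-compress tool may behave differently in its details (for example, in how it names archive members or handles an existing archive).

```python
import tarfile
from pathlib import Path


def compress_ampdb(db: Path) -> int:
    """Move files from <db>/loose into <db>/archive.tar.gz; return count moved."""
    loose = db / "loose"
    archive = db / "archive.tar.gz"
    files = sorted(p for p in loose.iterdir() if p.is_file())
    if not files:
        return 0
    names = {p.name for p in files}
    tmp = db / "archive.tar.gz.tmp"
    with tarfile.open(tmp, "w:gz") as out:
        if archive.exists():
            # A gzipped tar cannot be appended to, so copy old members across.
            with tarfile.open(archive, "r:gz") as old:
                for member in old.getmembers():
                    if member.name in names:
                        continue  # the loose copy replaces the archived one
                    out.addfile(member, old.extractfile(member))
        for path in files:
            out.add(path, arcname=path.name)
    tmp.replace(archive)  # swap in the new archive only once it is complete
    for path in files:
        path.unlink()     # the "loose" directory itself is kept, now empty
    return len(files)
```

Writing to a temporary file first means an interrupted run leaves the previous archive and the loose files untouched.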

You can also use the --recursive (or -r) flag to compress all ampdb files in or below the specified directory.
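To preview which databases a recursive run would cover, the directory tree can be searched for .ampdb directories. This is a small sketch with a hypothetical helper name (find_ampdbs); it only lists candidates and does not reproduce the tool's own search logic.

```python
from pathlib import Path


def find_ampdbs(root: Path):
    """Return every directory named *.ampdb at or below root, sorted."""
    found = [root] if root.is_dir() and root.suffix == ".ampdb" else []
    found += [p for p in root.rglob("*.ampdb") if p.is_dir()]
    return sorted(found)
```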

When Amp reads from the above database, it first looks for the fingerprint in the “loose” directory. If the fingerprint is not there, Amp looks in “archive.tar.gz”. If it is in neither place, Amp calculates the fingerprint and adds it to the “loose” directory.
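The lookup order just described (loose file, then archive, then compute and store) can be sketched as follows. This is an illustration under assumptions rather than Amp's code: get_fingerprint and its compute argument are hypothetical, and pickle stands in for Amp's actual serialization.

```python
import pickle
import tarfile
from pathlib import Path


def get_fingerprint(db: Path, key: str, compute):
    """Return the fingerprint for key, checking loose files, then the archive."""
    # 1. Look for a loose file named by the image hash.
    loose_file = db / "loose" / key
    if loose_file.is_file():
        return pickle.loads(loose_file.read_bytes())

    # 2. Fall back to the compressed archive, if one exists.
    archive = db / "archive.tar.gz"
    if archive.is_file():
        with tarfile.open(archive, "r:gz") as tar:
            try:
                member = tar.extractfile(key)
            except KeyError:  # no member with this name in the archive
                member = None
            if member is not None:
                return pickle.loads(member.read())

    # 3. Not stored anywhere: compute it and add it to "loose".
    value = compute()
    loose_file.parent.mkdir(parents=True, exist_ok=True)
    loose_file.write_bytes(pickle.dumps(value))
    return value
```

New fingerprints always land in “loose”, so the archive is only ever read here and never modified during a lookup.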

Future

We plan to make the amp-compress tool more automated. If the user does not supply a separate dblabel keyword, we would assume that their process is the only one using the database and that it is safe to compress the database at the end of their training job. This would automatically clean up the loose files at the end of the job.