This example describes how to use Dagster's versioning and memoization features.
Dagster can use versions to determine whether or not it is necessary to re-execute a particular step. Given versions of the code from each op in a job, the system can infer whether an upcoming execution of a step will differ from previous executions by tagging op outputs with a version. This allows for the outputs of previous runs to be re-used. We call this process memoized execution.
You can enable memoization functionality by providing a VersionStrategy
to your job. Dagster provides the SourceHashVersionStrategy
as a top-level export.
from dagster import SourceHashVersionStrategy, job
@job(version_strategy=SourceHashVersionStrategy())
def the_job():
...
When memoization is enabled, the outputs of ops will be cached. Ops will only be re-run if:
SourceHashVersionStrategy
, this only occurs when the code within your ops and resources changes.The following diagram shows how an op output version is computed.
Notice how the version of an output depends on all upstream output versions. Because of this, output versions are computed in topological order.
This diagram describes the computation of the version of a resource.
Resource versions are also computed in topological order, as resources can depend on other resources.
Memoization is enabled by using a version strategy on your job, in tandem with MemoizableIOManager
. In addition to the handle_output
and load_input
methods from the traditional IOManager
, MemoizableIOManager
s also implement a has_output
method. This is intended to check whether an output already exists that matches specifications.
Before execution occurs, the Dagster system will determine a set of which steps actually need to run. If using memoization, Dagster will check whether all the outputs of a given step have already been memoized by calling the has_output
method on the io manager for each output. If has_output
returns True
for all outputs, then the step will not run again.
Several of the persistent IO managers provided by Dagster are memoizable. This includes Dagster's default io manager, the s3_pickle_io_manager
and the fs_io_manager
.
There will likely be cases where the default SourceHashVersionStrategy
will not suffice. In these cases, it is advantageous to implement your own VersionStrategy
to match your requirements.
Check out the implementation of SourceHashVersionStrategy
as an example.
If you are using a custom IO manager and want to make use of memoization functionality, then your custom IO manager must be memoizable. This means they must implement the has_output
function. The OutputContext.get_output_identifier
will provide a path which includes version information that you can both store and check outputs to. Check out our implementations of io managers for inspiration.
Sometimes, you may want to run the whole job from scratch. Memoization can be disabled by setting the MEMOIZED_RUN_TAG
to false on your job.