How HitSave Works
Prerequisites
If you haven’t done so already, follow the steps in the installation guide to get set up with HitSave.
The HitSave Decorators
The core object in HitSave is the @memo decorator. If you’ve used @functools.lru_cache, you’ll be familiar with the notion of caching function executions.
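As a reminder of the behaviour being compared against, here is a minimal sketch using only the standard library. The function `slow_square` and the counter `call_count` are illustrative names, not part of HitSave.

```python
import functools

call_count = 0  # counts how many times the function body actually runs


@functools.lru_cache(maxsize=None)
def slow_square(x: int) -> int:
    """Stand-in for an expensive computation."""
    global call_count
    call_count += 1
    return x * x


first = slow_square(12)   # cache miss: the body runs
second = slow_square(12)  # cache hit: the body is skipped
info = slow_square.cache_info()
print(first, second, call_count, info.hits, info.misses)
```

The cache here lives in the Python process, so it is gone as soon as the interpreter exits.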
@memo
HitSave’s @memo behaves similarly to lru_cache but incorporates a few powerful additions:
Rather than storing previous function evaluations in memory (where they are only available for the current execution session) they are instead persisted to disk and reusable in future execution sessions.
HitSave syncs your cache to the cloud. Soon, HitSave will allow teams to have shared caches, so that if code has been run once on your team, no-one else has to wait for the results if they re-run the same code with the same inputs.
HitSave uses a much more sophisticated caching algorithm than lru_cache. Before running your code, HitSave statically analyses the code dependency tree of your @memo’d function, and uses hash digests of both the code itself and the data you pass in. If you edit your code, or pass different arguments, HitSave automatically invalidates the cache so that you always get correct values.
With this abstraction, HitSave enables you to persistently memoize long-running functions, and avoid manually saving intermediate results to persistent disk storage.
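To make the first and third points concrete, here is a toy sketch of a disk-backed memoizer whose cache key combines a digest of the function’s code with a digest of its arguments. It is not HitSave’s implementation: `disk_memo`, `CACHE_DIR` and the key scheme are assumptions made for illustration, and it hashes only the function’s own bytecode rather than analysing its dependencies.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location for this demo


def disk_memo(fn):
    """Toy persistent memoizer (illustrative only, not HitSave's algorithm)."""
    code = fn.__code__
    # Digest of this function's name, bytecode and constants. A real tool
    # would also follow the functions and values that fn depends on.
    code_digest = hashlib.sha256(
        fn.__qualname__.encode() + code.co_code + repr(code.co_consts).encode()
    ).hexdigest()

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Digest of the arguments; pickle is a simplification, since it is
        # not guaranteed to give identical bytes for equal objects.
        arg_bytes = pickle.dumps((args, sorted(kwargs.items())))
        key = hashlib.sha256(code_digest.encode() + arg_bytes).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():  # hit: reuse the value stored on disk
            return pickle.loads(path.read_bytes())
        result = fn(*args, **kwargs)  # miss: compute and persist
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper


runs = []  # records the arguments of calls that actually executed


@disk_memo
def double(x):
    runs.append(x)
    return x * 2


a = double(21)  # miss: computed and written to disk
b = double(21)  # hit: read back from disk
c = double(4)   # different argument, different key: computed
print(a, b, c, runs)
```

Because the code digest is part of the key, editing the body of `double` would produce a new key and the old entry would simply never be looked up again.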
For example, imagine you are ingesting a large dataset
through an ETL pipeline. Instead of using Python’s native file API to save
the state of your dataset to your local machine at each step, you can
instead use @memo to persist the dataset to a managed local cache, as
well as automatically syncing this to the cloud where it can be accessed
by other team members. Now, if you need to edit a downstream step in your
ETL pipeline, there’s no need to reload the previous step’s output as a
starting point: simply running the whole pipeline again will have the
effect of immediately picking up from the latest unchanged step and
calculating the new results.
Because HitSave is on the cloud, this works even if earlier steps were run on different computers!
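The pipeline idea can be sketched as follows. To keep the example runnable anywhere, `memo` is a stand-in built on `functools.lru_cache`, so the cache only lasts for the current session; with HitSave installed you would use its decorator instead (assumed import: `from hitsave import memo`). The step names and the source path are hypothetical.

```python
import functools

# Stand-in for HitSave's decorator so that this sketch has no dependencies.
memo = functools.lru_cache(maxsize=None)

executed = []  # records which steps actually ran


@memo
def extract(source: str) -> tuple:
    executed.append("extract")
    return tuple(range(5))  # pretend this is a slow download


@memo
def transform(rows: tuple) -> tuple:
    executed.append("transform")
    return tuple(r * 10 for r in rows)


def load(rows: tuple) -> int:
    # Downstream step that is still being edited, so it is not memoized.
    return sum(rows)


def pipeline(source: str) -> int:
    return load(transform(extract(source)))


first = pipeline("data/raw.csv")   # hypothetical path: every step runs
second = pipeline("data/raw.csv")  # upstream steps come from the cache
print(first, second, executed)
```

Only `load` runs on the second call; the two memoized steps are skipped because their inputs are unchanged.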
@experiment
An interesting side-effect of HitSave’s caching mechanics is that it can also be used to manage experiments, similar to tools such as MLflow.
HitSave provides a second core decorator called @experiment. @experiment behaves exactly the same as @memo, but sets a flag telling HitSave that the result is of interest and should never be automatically wiped from the cache.
We’ve built a dashboard in the web interface where you can inspect all your experiments and see visualized arguments and return values in the browser.
Here’s a rough example of how you could use this functionality to get useful experiment tracking up and running in just a few lines of code.
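import pandas as pd
import plotly.express as px
from hitsave import experiment

lr = 0.01
batch_size = 200
epochs = 10  # number of training epochs (illustrative value)

@experiment
def train_model(lr, batch_size):
    # get_datasets, NeuralNetwork and accuracy are placeholders for your own code.
    training_dataset, test_dataset = get_datasets()
    model = NeuralNetwork()
    writer = []  # one record per epoch
    for epoch in range(epochs):
        model.train(lr, training_dataset)
        writer.append({
            'epoch': epoch,
            'accuracy': accuracy(test_dataset, model)
        })
    df = pd.DataFrame(writer)
    fig = px.line(df, x='epoch', y='accuracy')
    return model, fig

model, fig = train_model(lr, batch_size)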
Perhaps you already had some code which looked like this, minus the @experiment decorator.
Here’s what happens:
You define parameters for your experiment like learning rate and batch size. These are passed into the function, so HitSave can interpret the relationship between input and output.
You load in your datasets. The get_datasets function could even be an @memo’d function.
You train the model for a number of epochs and calculate the accuracy on the test dataset at the end of each epoch.
Along the way, you append the accuracy data to a writer (e.g. something like a TensorBoard log).
At the end, you construct a figure, plotting the accuracy at each epoch.
Finally, you return the model and the figure.
Now, when you visit the cloud experiment tracker you’ll see a row in the table displaying the experiment you ran. It includes the values of the learning rate and batch size parameters passed into the function, as well as displaying the returned accuracy plot on screen.
Now you can come back to the editor and try some different values of the
input parameters. Each time you run the code, you’ll get a new row in
the table. You could even call the @experiment function many times in
the same execution session by nesting it in a for-loop - for example to
perform a hyperparameter sweep. Best of all, if you rerun the function
with the same parameters as a previous run, you’ll get the output
instantaneously from the cache (well, at least as quickly as your
computer can download it from the cloud or local disk).
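A sweep of this kind might look like the sketch below. Here `experiment` is a stand-in built on `functools.lru_cache` so that the example runs without HitSave (the real decorator would be imported from the library), and `train_model` returns a made-up score instead of training anything.

```python
import functools
import itertools

# Stand-in for HitSave's @experiment so that this sketch has no dependencies.
experiment = functools.lru_cache(maxsize=None)

trained = []  # records the parameter pairs that were actually "trained"


@experiment
def train_model(lr: float, batch_size: int) -> float:
    trained.append((lr, batch_size))
    # Made-up score standing in for a real training run.
    return 1.0 - lr * 10 / batch_size


learning_rates = [0.1, 0.01]
batch_sizes = [100, 200]

results = {}
for lr, batch_size in itertools.product(learning_rates, batch_sizes):
    results[(lr, batch_size)] = train_model(lr, batch_size)

# Re-running an earlier configuration is answered from the cache.
again = train_model(0.01, 200)
print(len(trained), again == results[(0.01, 200)])
```

Each distinct pair of parameters runs once, which corresponds to one row per configuration in the experiment table.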
In the future, we’re going to make it possible to share caches and experiments among team members, as well as allowing you to directly download cached artifacts into a Python session by referencing a unique identifier which you’ll be able to grab from the interface.
When to Use @memo
Not all functions are appropriate for @memo. A good rule of thumb is to ask whether the function would be a good target for other forms of caching, such as lru_cache. Here are some considerations for deciding whether a function is worth saving:
Does the function take a long time to run? If the function is fast, it’s hardly worth the overhead of hashing arguments and downloading the saved value.
Are the arguments to the function easy to hash? If you pass a huge tensor to the function, HitSave will have to hash the entire object before it can determine whether the result is already cached.
Does the function cause side-effects? A side-effect is a change that the function makes to the environment that isn’t in the function’s return value. Examples are: modifying a global value, or changing a file on disk. On a cache hit the function body does not run, so these changes would be skipped.
Does the function depend on a changing external resource (for example, polling a web API for the weather)? If so, a cached value can go stale.
Does the function implicitly depend on the state of the filesystem? If so, the cached value may no longer match what is on disk.
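The last two kinds of problem can be illustrated with plain `lru_cache`, whose cache key is built from the arguments alone. This sketch does not show how HitSave itself treats these cases; `log_and_add`, `read_setting` and the temporary file are illustrative.

```python
import functools
import tempfile
from pathlib import Path

log = []


@functools.lru_cache(maxsize=None)
def log_and_add(a: int, b: int) -> int:
    log.append(f"adding {a} and {b}")  # side-effect: mutates a global list
    return a + b


log_and_add(1, 2)
log_and_add(1, 2)  # cache hit: the append above does not happen again

setting_path = Path(tempfile.mkdtemp()) / "setting.txt"
setting_path.write_text("1")


@functools.lru_cache(maxsize=None)
def read_setting(path: str) -> int:
    # Implicit dependency on the filesystem: the file's contents are not
    # part of the arguments, so they are not part of the cache key.
    return int(Path(path).read_text())


before = read_setting(str(setting_path))
setting_path.write_text("2")             # the file changes on disk
after = read_setting(str(setting_path))  # stale: still the old value
fresh = int(setting_path.read_text())
print(len(log), before, after, fresh)
```

The function was called twice but logged once, and `after` still holds the value read before the file changed.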