Advanced memory management
Continuing the conversation with @evandertoorn
Memory Management
How does marimo keep things in memory? marimo internally manages a "globals" dict shared between all cells; everything that is defined is put into this dictionary. The DAG primarily relies on static code analysis, without regard to what has already been defined, to determine the order in which to run cells. Since the globals dict persists for the whole session, it could lead to memory build-up. To prevent that, marimo removes and collects a cell's variables when the cell is invalidated.
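To make the idea concrete, here is a toy model of a shared globals dict with cleanup on invalidation. It is only a sketch and not marimo's actual implementation: `shared_globals`, `cell_defs`, and `run_cell` are names made up for illustration.

```python
# Toy model only: this is not how marimo is implemented internally.
shared_globals = {}   # one namespace shared by every "cell"
cell_defs = {}        # cell name -> names that cell defined on its last run


def run_cell(name, source):
    """Invalidate the cell's previous definitions, then run its new source."""
    for var in cell_defs.pop(name, set()):
        shared_globals.pop(var, None)  # drop stale bindings so they can be collected
    before = set(shared_globals)
    exec(source, shared_globals)
    # exec() adds "__builtins__" to the dict; it is not a cell definition.
    cell_defs[name] = set(shared_globals) - before - {"__builtins__"}


run_cell("a", "data = list(range(1000))")
run_cell("b", "total = sum(data)")
run_cell("a", "values = [1, 2, 3]")  # cell "a" was edited: `data` is removed
```

In this toy version nothing reruns cell "b" after `data` disappears; in marimo that is the DAG's job.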
Can we do even better?
Maybe. One of the current experimental features of marimo is "strict mode" enabled with:
This mode actively manages the globals exposed to each cell, creating cell-specific "global" environments, and has additional active cleanup. To prevent cross-cell memory mutation (which is possible but discouraged in marimo's normal mode), strict mode implicitly copies variables between cells (you can wrap variables with zero_copy in this mode to disable this behavior). One advantage of strict mode is that hidden state does not build up, but at the cost of copy overhead. One of the edge cases that normal mode does not catch is the following (maybe this is actually a bug, @Akshay?):
_my_var = 1
Then remove the reference to _my_var, and it will still remain secretly in memory. marimo doesn't clean this up since it has no context wrt the rest of the graph. Since strict mode accounts for all references, private or not, it removes _my_var if it determines it is not needed.
Is strict mode worth it?
I think it depends on your use case. You can try it out and, worst case, disable it. It's experimental for a reason, but the more feedback it gets the better. If you frequently prototype with various private variables, strict mode will prevent this variable build-up, but potentially at the cost of the copy in other cases. You can fight against this with zero_copy, but you lose some of the mutation protections.
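The copy versus zero-copy trade-off can be sketched in plain Python. This is a toy under stated assumptions, not marimo's code: `ZeroCopy` and `cell_view` are hypothetical stand-ins, and marimo's real zero_copy wrapper may behave differently.

```python
import copy


class ZeroCopy:
    """Hypothetical marker: wrapped values are handed over by reference."""

    def __init__(self, value):
        self.value = value


def cell_view(shared):
    """Build the namespace a downstream cell would see in this toy model."""
    view = {}
    for name, value in shared.items():
        if isinstance(value, ZeroCopy):
            view[name] = value.value           # same object, no copy cost
        else:
            view[name] = copy.deepcopy(value)  # isolated copy, costs memory and time
    return view


shared = {"rows": [1, 2, 3], "big": ZeroCopy([0] * 1_000)}
view = cell_view(shared)
view["rows"].append(4)  # only the copy changes
view["big"].append(1)   # the shared object changes too
```

The copied value is protected from downstream mutation; the wrapped one is cheap to pass along but gives that protection up.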
Best case you barely notice strict mode and have a possible memory boost due to the active gc, worst case there's a performance issue.
Example use case:
The DAG could infer for subsequent cells that the variable is no longer usable.
With respect to strict mode, copying large dataframes (e.g. 60% of RAM) between cells would not be feasible.
Can you add the cell divisions or is this all in one cell?
If in a single cell
Works in both modes
Not recommended but possible. Won't work in strict mode
Still not recommended, particularly for dataframes. Will not work in strict mode
Will work in strict mode (not recommended)
---
mo.drop is not easily possible since static analysis primarily works on variable names.
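To illustrate what name-based static analysis can see, here is a rough sketch using Python's standard `ast` module. `defs_and_refs` is a made-up helper and marimo's real analysis is more involved; the point is only that a call such as the proposed `mo.drop(huge_df)` looks like an ordinary read of `huge_df`.

```python
import ast


def defs_and_refs(source):
    """Return (names assigned, names read or deleted) found in a cell's source."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs


assign_defs, assign_refs = defs_and_refs("partial_df = huge_df.head()")
drop_defs, drop_refs = defs_and_refs("mo.drop(huge_df)")
```

Seen this way, the drop cell defines nothing and simply depends on `huge_df`, so nothing in the names alone says the variable is gone for later cells.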
You could just restructure your code though:
That way huge_df is confined to a single cell. But this is annoying, because any change to partial_df or partial_df2 requires a rerun.
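A minimal sketch of that kind of restructuring, with assumptions spelled out: `load_huge` is a hypothetical loader, a list of rows stands in for the dataframe, and `make_partials` plays the role of the single cell.

```python
def load_huge():
    """Hypothetical loader; a list of rows stands in for the large dataframe."""
    return [{"id": i, "value": i * i} for i in range(100_000)]


def make_partials():
    """Body of the single cell: `huge_df` is a local and is released on return."""
    huge_df = load_huge()
    partial_df = [row for row in huge_df if row["id"] < 10]
    partial_df2 = [row for row in huge_df if row["value"] % 1000 == 0]
    return partial_df, partial_df2


# Only the small results are exposed; the big object never leaves the function.
partial_df, partial_df2 = make_partials()
```

The downside mentioned above is visible here: changing either filter means calling `load_huge` again.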
Here's another suggestion with the proposed persistent_cache feature:
In theory, on a secondary run, load_huge will never have to be called, and the cells will auto rerun/reload huge_df if the code changes.
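The behaviour described here can be approximated by hand with a small disk cache. This is a sketch of the general idea, not the proposed persistent_cache feature: `cached_call` is a made-up helper, and it keys the cache on the function's bytecode and constants as a rough stand-in for "the code changed".

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_call(func, cache_dir):
    """Return func()'s result, reusing a pickle on disk while func's code is unchanged."""
    code = func.__code__
    key = hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()
    path = Path(cache_dir) / f"{func.__name__}-{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())  # warm run: func is not called
    result = func()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result


calls = []


def load_huge():
    """Hypothetical expensive loader; records each real invocation."""
    calls.append("loaded")
    return list(range(5))


cache_dir = tempfile.mkdtemp()              # throwaway cache location for the demo
first = cached_call(load_huge, cache_dir)   # runs load_huge and writes the pickle
second = cached_call(load_huge, cache_dir)  # served from disk
```

Editing the body of `load_huge` changes the key, so the next call recomputes and stores a fresh result. Only unpickle files you wrote yourself.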
I guess you can do this now, without the persistent cache blocks actually. Persistence would just potentially make things faster on secondary runs / restarted kernels.
Now this looks absolutely delightful.