dmad, 6mo ago

Advanced memory management

Continuing the conversation with @evandertoorn.

Memory Management

How does marimo keep things in memory? marimo internally manages a "globals" dict shared between all cells; everything that is defined is put into this dictionary. The DAG primarily relies on static code analysis, without regard to what has already been defined, to determine the order in which to run cells. Since the globals dict persists for the whole session, it could potentially lead to memory build-up. To prevent that, marimo removes a cell's variables on cell invalidation so they can be garbage collected.
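To make the idea concrete, here is a toy sketch of a shared namespace with cleanup on invalidation. It is not marimo's implementation: `shared_globals`, `cell_defs`, `run_cell`, and `invalidate_cell` are hypothetical names made up for illustration, and the real runtime finds definitions through static analysis rather than by diffing the dict.

```python
# Toy model only: NOT marimo's actual implementation.
shared_globals: dict = {}             # one namespace shared by every cell
cell_defs: dict[str, set[str]] = {}   # hypothetical bookkeeping: cell id -> names it defined


def invalidate_cell(cell_id: str) -> None:
    """Remove every name the cell defined so the objects can be collected."""
    for name in cell_defs.pop(cell_id, set()):
        shared_globals.pop(name, None)


def run_cell(cell_id: str, code: str) -> None:
    """Execute a cell against the shared namespace and record what it defined."""
    invalidate_cell(cell_id)  # drop stale definitions from a previous run
    before = set(shared_globals)
    exec(code, shared_globals)
    cell_defs[cell_id] = set(shared_globals) - before - {"__builtins__"}


run_cell("load", "data = list(range(5))")
run_cell("sum", "total = sum(data)")
print(shared_globals["total"])      # 10

invalidate_cell("load")
print("data" in shared_globals)     # False
```

Once the name is gone from the dict, the object behind it can be freed as long as nothing else still refers to it.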
4 Replies
dmad (OP), 6mo ago
Can we do even better? Maybe. One of marimo's current experimental features is "strict mode", enabled with:
[experimental]
execution_type = "strict"
This mode actively manages the globals exposed to each cell, creating cell-specific "global" environments, and performs additional active cleanup. To prevent cross-cell memory mutation (which is possible but discouraged in marimo's normal mode), strict mode implicitly copies variables between cells (you can wrap variables with zero_copy in this mode to disable this behavior). One advantage of strict mode is that hidden state does not build up, but at the cost of copy overhead.

One of the edge cases normal mode does not catch is the following (maybe this is actually a bug, @Akshay?):
_my_var = 1
Then remove the reference to _my_var, and it will still secretly remain in memory. marimo doesn't clean this up, since it has no context with respect to the rest of the graph. Since strict mode accounts for all references, private or not, it removes _my_var once it determines the variable is no longer needed.

Is strict mode worth it? I think it depends on your use case. You can try it out and, worst case, disable it. It's experimental for a reason, but the more feedback it gets the better. If you frequently prototype with various private variables, strict mode will prevent this variable build-up, but potentially at the cost of the copy in other cases. You can fight against this with zero_copy, but you lose some of the mutation protections. Best case, you barely notice strict mode and get a possible memory improvement from the active gc; worst case, there's a performance issue.
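The copy versus zero-copy trade-off can be sketched in plain Python. This is a toy model, not how strict mode is implemented: `run_isolated` and `zero_copy_names` are hypothetical names, and the real zero_copy wraps a value rather than taking a set of names.

```python
import copy

# Toy model only: NOT marimo's implementation of strict mode.


def run_isolated(code: str, refs: dict, zero_copy_names: frozenset = frozenset()) -> dict:
    """Run `code` in its own namespace, deep-copying refs unless opted out."""
    namespace = {
        name: value if name in zero_copy_names else copy.deepcopy(value)
        for name, value in refs.items()
    }
    exec(code, namespace)
    return namespace


rows = [1, 2, 3]

# Default: the cell mutates a copy, so the upstream object is untouched.
run_isolated("rows.append(4)", {"rows": rows})
print(rows)  # [1, 2, 3]

# Opting out of the copy: cheaper, but the mutation leaks back upstream.
run_isolated("rows.append(4)", {"rows": rows}, zero_copy_names=frozenset({"rows"}))
print(rows)  # [1, 2, 3, 4]
```

The deep copy is what buys the mutation protection, and it is also where the overhead comes from when the object is large.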
evandertoorn, 6mo ago
Example use case:
# import cell
import polars as pl
import seaborn as sns
# raw parsing cell
huge_df = pl.read_parquet("huge.parquet")
# plot 1.
sns.histplot(huge_df, ...)
# plot 2.
sns.boxplot(huge_df, ...)
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
# Potentially, I'd be fine with indicating that this is where the need for it to exist stops
mo.drop(huge_df)
# ... further analysis
The DAG could then infer that the variable is no longer usable in later cells. With respect to strict mode, copying large dataframes (e.g. 60% of RAM) between cells would not be feasible.
dmad (OP), 6mo ago
Can you add the cell divisions, or is this all in one cell? If in a single cell:
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
del huge_df
Works in both modes
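As a quick check of why the plain del is enough within one cell, here is a small sketch using a weak reference to observe when the object is released. `BigTable` is a hypothetical stand-in for a large dataframe, and the immediate release shown here assumes CPython's reference counting.

```python
import gc
import weakref


class BigTable:
    """Hypothetical stand-in for a large dataframe (weak-referenceable)."""

    def __init__(self, n: int) -> None:
        self.rows = list(range(n))

    def head(self, k: int) -> list:
        return self.rows[:k]  # slicing copies, so the result is independent


huge = BigTable(1_000_000)
probe = weakref.ref(huge)   # lets us observe whether the object is still alive

partial = huge.head(10)     # derived data that holds no reference to `huge`
del huge                    # drop the only strong reference
gc.collect()                # not needed on CPython; included for other runtimes

print(probe() is None)      # True
print(partial[:3])          # [0, 1, 2]
```

This only frees memory if the derived objects do not keep the original alive; whether a filtered or aggregated dataframe shares buffers with its source depends on the library.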
# cell
huge_df = pl.read_parquet("huge.parquet")

# cell
# For further exploration, we actually only need a subset
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)
required_del_ref = None # Trick marimo to always run this cell first

# cell
required_del_ref # included to ensure correct run order
del globals()["huge_df"]
Not recommended, but possible. Won't work in strict mode.
# cell
required_del_ref # included to ensure correct run order
huge_df.drop(huge_df.index, inplace=True)
Still not recommended, and particular to dataframes (this in-place drop is the pandas API; polars has no inplace equivalent). Will not work in strict mode.
# cell
huge_df = zero_copy(pl.read_parquet("huge.parquet"))
# cell
required_del_ref # included to ensure correct run order
huge_df.drop(huge_df.index, inplace=True)
Will work in strict mode (not recommended).

mo.drop is not easily possible, since static analysis primarily works on variable names. You could just restructure your code, though:
# cell
huge_df = pl.read_parquet("huge.parquet")
plot_fig1, _plot_ax1 = plt.subplots()
plot_fig2, _plot_ax2 = plt.subplots()
# Build diagrams without displaying
sns.histplot(huge_df, ..., ax=_plot_ax1)
sns.boxplot(huge_df, ..., ax=_plot_ax2)

# Export partial views
partial_df = huge_df.filter(...)
partial_df2 = huge_df.group_by(...).agg(...)

del huge_df

# cell
plot_fig1

# cell
plot_fig2
That way huge_df is confined to a single cell. But this is annoying, because any change to partial_df or partial_df2 requires rerunning the whole cell. Here's another suggestion with the proposed persistent_cache feature:
# cell
@functools.cache
def load_huge():
    return pl.read_parquet("huge.parquet")

# cell
with mo.persistent_cache(name="figures") as figures:
    plot_fig1, _plot_ax1 = plt.subplots()
    plot_fig2, _plot_ax2 = plt.subplots()
    # Build diagrams without displaying
    sns.histplot(load_huge(), ..., ax=_plot_ax1)
    sns.boxplot(load_huge(), ..., ax=_plot_ax2)

# cell
with mo.persistent_cache(name="partial_df"):
    partial_df = load_huge().filter(...)

# cell
with mo.persistent_cache(name="partial_df2"):
    partial_df2 = load_huge().group_by(...).agg(...)

# cell
figures, partial_df, partial_df2 # Ensure the above cells run first
load_huge.cache_clear() # release the cached dataframe
In theory, on a secondary run, load_huge will never have to be called, and the cells will automatically rerun and reload huge_df if the code changes. I guess you can do this now, without the persistent cache blocks, actually. Persistence would just potentially make things faster on secondary runs or restarted kernels.
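The functools.cache part of this pattern can be tried without marimo. A minimal sketch, where the list returned by `load_huge` is a stand-in for the parquet read and `load_calls` is a hypothetical counter added only to show how often the load really runs:

```python
import functools

load_calls = []  # hypothetical counter: one entry per real (uncached) load


@functools.cache
def load_huge() -> list:
    """Stand-in for pl.read_parquet("huge.parquet")."""
    load_calls.append(1)
    return list(range(1_000_000))


first = load_huge()
second = load_huge()                  # served from the cache, no second load
print(len(load_calls), first is second)   # 1 True

del first, second
load_huge.cache_clear()               # drops the cache's reference to the result
print(load_huge.cache_info().currsize)    # 0
```

After cache_clear(), the next call to load_huge() loads again, which is what lets downstream cells reload on demand when their code changes.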
evandertoorn, 6mo ago
now this looks absolutely delightful