Nathan (2mo ago)

`DataLoader` with `num_workers=1` crashes?

I'm learning about LLMs and was porting this Jupyter notebook to Marimo when I ran into a problem: a `DataLoader` crashes when its workers try to load a `Dataset` class that is defined locally in the notebook. I suppose a solution is to move the `ToyDataset` class into a separate .py file, but is this the expected behavior? Depending on external files also means that my project needs to have modules fully set up and functional.
GitHub embed: LLMs-from-scratch/appendix-A/01_main-chapter-code/code-part2.ipynb ("Implementing a ChatGPT-like LLM in PyTorch from scratch, step by step", rasbt/LLMs-from-scratch)
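For context, here is a minimal sketch of the failing pattern (class and variable names are assumed to roughly match the book's listing, not copied from it). On Windows and macOS, `DataLoader` worker processes are started with the `spawn` method, which re-imports the main module as `__mp_main__`; a class that only exists inside a notebook cell function can't be found there, so unpickling fails with the `AttributeError` below.

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Defined inside a notebook cell, so it is not importable by name
# from the module that spawn'd worker processes re-import.
class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

X_train = torch.rand(6, 2)
y_train = torch.tensor([0, 0, 0, 1, 1, 1])

# num_workers=1 is what raises the AttributeError under spawn;
# num_workers=0 loads data in the main process and works everywhere.
train_loader = DataLoader(ToyDataset(X_train, y_train),
                          batch_size=2, num_workers=0)
for features, labels in train_loader:
    print(features.shape, labels.shape)
```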
9 Replies
Akshay (2mo ago)
Sorry, I don't follow. What problem did you run into?
Nathan (OP, 2mo ago)
This is the exception thrown when running from the command line. Same exception happens in the web notebook.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Program Files\Python312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'ToyDataset' on <module '__mp_main__' from 'listing_A_part_2.py'>
Traceback (most recent call last):
File "listing_A_part_2.py", line 212, in <module>
app.run()
File ".venv\Lib\site-packages\marimo\_ast\app.py", line 298, in run
outputs, glbls = AppScriptRunner(InternalApp(self)).run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\marimo\_runtime\app\script_runner.py", line 111, in run
raise e.__cause__ from None # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\marimo\_runtime\executor.py", line 170, in execute_cell
exec(cell.body, glbls)
File "listing_A_part_2.py", line 189, in <module>
for _batch_idx, (_features, _labels) in enumerate(train_loader):
^^^^^^^^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1327, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1293, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File ".venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1144, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 22260) exited unexpectedly
I got my project directory installed as an editable module, moved the ToyDataset class into its own file, and imported it from there, so I can move on. It's kind of a pain to set up the pyproject.toml and the editable install just for that, and it also doesn't seem to be compatible with Marimo's sandbox feature. What would help is the ability to scaffold out a Python directory from a template that puts users on some sort of happy path for what they will need in the future. This would be especially helpful because there are so many options in Python for hacking the path, etc. I've seen this sort of thing in the JavaScript community with npm init react-app my-app, but can't say I've seen it in Python.
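For anyone hitting the same thing: a minimal pyproject.toml that makes the project importable as an editable install might look roughly like this. The project name and package directory are assumed from the repo, and the build backend here is an arbitrary choice, not the one the repo necessarily uses.

```toml
[project]
name = "llm-from-scratch"
version = "0.1.0"
requires-python = ">=3.12"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

# Tell the backend which directory is the importable package
[tool.hatch.build.targets.wheel]
packages = ["llm_from_scratch"]
```

Followed by `uv pip install -e .` (or `pip install -e .`) from the project root, `import llm_from_scratch` then works from anywhere in the environment, including spawn'd worker processes.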
Akshay (2mo ago)
Are you running it as a script, with marimo edit, or something else? Exact instructions on how to reproduce would help, starting with a link to a marimo notebook. I can't reproduce from a Jupyter notebook.
Nathan (OP, 5w ago)
See the marimo-use-custom-dataloader branch here: https://github.com/ngbrown/build-llm-from-scratch/tree/marimo-use-custom-dataloader/llm_from_scratch/appx_a/listing_A_part_2.py
This commit is how I resolved it: https://github.com/ngbrown/build-llm-from-scratch/commit/a72f3fcf7626719fe71acbbd6f238ac607ae06e6
Now I'm trying to figure out uv and what the right approach is to keep it compatible with marimo edit --sandbox. I was running with marimo edit, but the error wasn't copyable; something in the middle interrupted the selection, which is why I pasted the error from the command line.
Also, I think I've hit a dead end getting it to run with marimo edit --sandbox. I'm getting the following error:
> marimo edit --sandbox .\llm_from_scratch\appx_a\listing_A_part_2.py
Running in a sandbox: uv run --isolated --no-project --with-requirements C:\Users\USER\AppData\Local\Temp\tmpl00x1659.txt marimo edit .\llm_from_scratch\appx_a\listing_A_part_2.py
× No solution found when resolving `--with` dependencies:
╰─▶ Because llm-from-scratch was not found in the package registry and you require llm-from-scratch, we can conclude
that your requirements are unsatisfiable.
To me this says the target file is copied somewhere by itself, so there's no possibility of using shared code in the project folder. You can see this attempt here: https://github.com/ngbrown/build-llm-from-scratch/commit/8fccd4a9d422bfe1085d2f2f7bcb57c69cfee989
I had run uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py ./ and uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121 to populate the /// script header of the .py file, and then had to manually add the extra-index-url option.
@Akshay With the GitHub repo and mentioned branches, this should be reproducible. Any thoughts?
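For reference, the resulting PEP 723 /// script header looks roughly like this. The dependency list is inferred from the commands above, and the spelling of the index option under [tool.uv] varies across uv versions, so treat the exact keys as an approximation; the index entry is the part that was added by hand.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "torch==2.4.1+cu121",
#     "llm-from-scratch",
# ]
#
# [tool.uv]
# extra-index-url = ["https://download.pytorch.org/whl/cu121"]
# ///
```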
Akshay (5w ago)
To me this says the target file is copied somewhere by itself so there's no possibility of using shared code in the project folder.
That's interesting, thanks for letting me know. This is a bug we should fix; llm_from_scratch shouldn't be added as a dependency. I will make a GitHub issue to track it.
Oh wait, you shouldn't be adding a local file as a dependency. I see now that you manually added that.
Nathan (OP, 5w ago)
llm_from_scratch shouldn't be added as a dependency
Because DataLoader spawns separate processes that can't access a Dataset defined inside a Marimo notebook cell function, I need to move that class into its own file, and the notebook then needs some way to import modules from the local directory. As far as I know, Python doesn't have a way to import a bare file (it needs the __init__.py marker file to make the directory's contents importable as a package). Is there another preferred way?
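As an illustration of the split (the file and package names here are hypothetical), the class can live in any importable module. Note that a map-style dataset only strictly needs `__getitem__` and `__len__`, so this sketch leaves out the torch import entirely, although subclassing `torch.utils.data.Dataset` is the usual convention:

```python
# llm_from_scratch/datasets.py  (hypothetical module layout)

class ToyDataset:
    """Map-style dataset: anything with __getitem__ and __len__.

    Living in a real module means spawn'd DataLoader workers can
    re-import it by qualified name instead of failing to unpickle it.
    """

    def __init__(self, features, labels):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return len(self.labels)
```

The notebook cell then just does `from llm_from_scratch.datasets import ToyDataset`, which requires llm_from_scratch to be importable in the worker processes too (hence the editable install).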
Akshay (5w ago)
Mm, I see. Sorry for the short messages, a little busy today. We recently implemented support for uv.sources. I just cloned your repo and tried it, and it works (using marimo 0.9.10). Disregard my previous message about not adding a local file; since you are populating uv.sources, your use case makes sense. Let me know if it works for you now.
Nathan (OP, 4w ago)
I got it to work. The complication is that the paths in the Marimo .py file header are relative to the command line's current directory, not to the file itself. So for this:
# [tool.uv.sources]
# llm-from-scratch = { path = "../../" }
This does not work:
> marimo edit --sandbox .\llm_from_scratch\appx_a\listing_A_part_2.py
But this does:
> cd .\llm_from_scratch\appx_a\
llm_from_scratch\appx_a> marimo edit --sandbox listing_A_part_2.py
I was expecting the paths to be relative to the file itself, not to the current directory of the command running it. Is there a spec for this behavior?
Akshay (4w ago)
I was expecting the paths to be relative to the file itself, not to the current directory of the command running it. Is there a spec for this behavior?
We match the behavior of running Python scripts: with python my_directory/my_script.py, the current working directory is the one the command was run from. We can clarify this in our docs, perhaps in the FAQ. We do have a utility for constructing paths relative to the notebook directory (mo.notebook_dir()), but I guess that won't help for the script metadata. For the particular case of script metadata/sandbox, uv likely also matches the Python CLI's behavior when determining the current working directory.
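A plain-Python illustration of the difference (the helper name and paths are hypothetical): resolving a header path like ../../ against the notebook file's own directory, rather than against the shell's working directory, gives the file-relative behavior Nathan expected:

```python
import posixpath

def resolve_relative_to(anchor_file: str, relative: str) -> str:
    # Hypothetical helper: interpret `relative` against the directory
    # containing `anchor_file` (file-relative resolution), instead of
    # against whatever directory the command happened to be run from.
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(anchor_file), relative)
    )

# File-relative: lands on the project root no matter what the cwd is.
print(resolve_relative_to(
    "/repo/llm_from_scratch/appx_a/listing_A_part_2.py", "../../"))
# prints "/repo"
```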