`DataLoader` with `num_workers=1` crashes?
I'm learning about LLMs and was working through porting this Jupyter notebook to Marimo, and ran into a problem with a `DataLoader` trying to run in workers when there is a notebook-local implementation of a `Dataset`. I suppose a solution is to move the `ToyDataset` class into a separate `.py` file, but is this the expected behavior? Depending on external files also means that my project needs to have modules fully set up and functional.
GitHub
LLMs-from-scratch/appendix-A/01_main-chapter-code/code-part2.ipynb ...
Implementing a ChatGPT-like LLM in PyTorch from scratch, step by step - rasbt/LLMs-from-scratch
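
One plausible reading of the problem described above, offered as an assumption since the thread does not show the traceback: when `DataLoader` starts worker processes with the `spawn` start method (the default on Windows), the `Dataset` has to be pickled, and pickle can only refer to a class by an importable, module-level name. The stdlib-only sketch below imitates that with a hypothetical `make_cell_local_dataset` function standing in for a notebook cell. It uses neither PyTorch nor Marimo.

```python
import pickle


class ModuleLevelDataset:
    """Defined at the top level of a module, so pickle can look it up by name."""

    def __init__(self, items):
        self.items = list(items)


def make_cell_local_dataset(items):
    """Hypothetical stand-in for a notebook cell that defines a class inside a function."""

    class CellLocalDataset:
        def __init__(self, items):
            self.items = list(items)

    return CellLocalDataset(items)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False if pickling raises."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


top_level_ok = can_pickle(ModuleLevelDataset([1, 2, 3]))
cell_local_ok = can_pickle(make_cell_local_dataset([1, 2, 3]))
print("module-level class picklable:", top_level_ok)
print("function-local class picklable:", cell_local_ok)
```

If this is the mechanism at play, moving the class into its own importable module addresses it because the class then has a name that a fresh process can import.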
9 Replies
Sorry, I don't follow. What problem did you run into?
This is the exception thrown when running from the command line. Same exception happens in the web notebook.
I got my project directory to be self-installed as an editable module, moved the `ToyDataset` class into its own file, and imported it from there, so I can move on. It is kind of a pain to set up the `pyproject.toml` and the editable install just for that.
This also doesn't seem to be compatible with Marimo's sandbox feature?
What would help is the ability to scaffold out a Python directory with a template that puts users on some sort of happy path for what they will need in the future. This would be especially helpful because there are so many options around what can be done in Python with hacking the path, etc. I've seen this sort of thing in the JavaScript community with `npm init react-app my-app`, but can't say I've seen it in Python.
Are you running as a script, with `marimo edit`, or something else? Exact instructions on how to reproduce would help, starting by linking to a marimo notebook. I can't reproduce from a Jupyter notebook.
See the `marimo-use-custom-dataloader` branch here: https://github.com/ngbrown/build-llm-from-scratch/tree/marimo-use-custom-dataloader/llm_from_scratch/appx_a/listing_A_part_2.py
This commit is how I resolved it: https://github.com/ngbrown/build-llm-from-scratch/commit/a72f3fcf7626719fe71acbbd6f238ac607ae06e6
Now I'm trying to figure out `uv` and what is the right thing to keep it compatible with `marimo edit --sandbox`.
I was running with `marimo edit`, but the error wasn't copyable. There was something in the middle that interrupted the selection, so that's why I pasted the error from the command line.
Also, I think I've hit a dead end on getting it to run with `marimo edit --sandbox`. I'm getting the following error:
To me this says the target file is copied somewhere by itself so there's no possibility of using shared code in the project folder. You can see this attempt here: https://github.com/ngbrown/build-llm-from-scratch/commit/8fccd4a9d422bfe1085d2f2f7bcb57c69cfee989
I had run `uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py ./` and `uv add --script .\llm_from_scratch\appx_a\listing_A_part_2.py torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121` to populate the `/// script` header of the `.py` file, and then had to manually add the `extra-index-url` option.
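
For readers unfamiliar with the `/// script` header mentioned above: it is the inline script metadata block (PEP 723), a run of `#` comments at the top of the file holding TOML. A minimal sketch of reading such a block follows. The regular expression is adapted from the PEP's reference implementation, `read_script_block` is a hypothetical helper name, and the sample header (including its `requires-python` value) is illustrative rather than the header from the repository in this thread.

```python
import re
from typing import Optional

# Pattern adapted from the reference implementation in PEP 723.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Illustrative notebook source; not the actual file from the thread.
NOTEBOOK_SOURCE = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "torch==2.4.1+cu121",
# ]
# ///

import marimo
'''


def read_script_block(source: str) -> Optional[str]:
    """Return the TOML text of the first `script` metadata block, or None."""
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines()
            # Drop the leading "# " (or a bare "#") from every comment line.
            return "\n".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


block = read_script_block(NOTEBOOK_SOURCE)
print(block)
```

The returned text is plain TOML, so it could be handed to a TOML parser to inspect dependencies or tool-specific tables.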
@Akshay With the GitHub repo and mentioned branches, this should be reproducible. Any thoughts?
> To me this says the target file is copied somewhere by itself so there's no possibility of using shared code in the project folder.
That's interesting, thanks for letting me know. This is a bug we should fix.
`llm_from_scratch` shouldn't be added as a dependency
I will make a GitHub issue to track
Oh wait, you shouldn't be adding a local file as a dependency
I see now that you manually added that
> `llm_from_scratch` shouldn't be added as a dependency
Because `DataLoader` spawns separate processes and can't access the `Dataset` that it needs from within a Marimo notebook cell function, I need to move that class into its own file, and somehow a notebook needs to import modules from the local directory. As far as I know, Python doesn't have a way to import a bare file (it needs the `__init__.py` marker file making the directory contents modules). Is there another preferred way?
Mm I see. Sorry for the short messages, a little busy today.
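
On the question of importing a bare file: the standard library can load a module directly from a file path, and a `.py` file in a directory on `sys.path` can be imported without an `__init__.py` next to it. The sketch below uses `importlib.util.spec_from_file_location`; `load_module_from_path` and the temporary `toy_dataset.py` are hypothetical names for illustration. Whether `DataLoader` worker processes can re-import a module loaded this way is a separate question this sketch does not settle, so an installed or otherwise importable module may still be the more robust route.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(name, path):
    """Load a single .py file as a module named `name`, without any package setup."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found by name afterwards.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy_dataset.py"
    source.write_text(
        "class ToyDataset:\n"
        "    def __init__(self, items):\n"
        "        self.items = list(items)\n"
        "\n"
        "    def __len__(self):\n"
        "        return len(self.items)\n"
    )
    toy_dataset = load_module_from_path("toy_dataset", source)
    dataset = toy_dataset.ToyDataset([1, 2, 3])
    dataset_len = len(dataset)

print(dataset_len)
```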
We recently implemented support for `uv-sources`. I just cloned your repo and tried it, and it works (using marimo 0.9.10).
Disregard my previous message on not adding a local file. Since you are populating `uv-sources`, your use case makes sense.
Let me know if it works for you now?
I got it to work.
The complication is that the paths in the Marimo `.py` file header are relative to the command line, not the file itself. So for this:
This does not work:
But this does:
I was expecting the paths to be relative to the file itself, not the current directory of the command running it. Is there a spec on the behavior?
> I was expecting that the paths to be relative to the file itself, not the current directory of the command running it. Is there a spec on the behavior?
We match the behavior of running Python scripts: in `python my_directory/my_script.py`, the current working directory is the directory the command was run from, not `my_directory`.
We can clarify in our docs, perhaps in the FAQ.
We do have a utility for constructing paths relative to the notebook directory (`mo.notebook_dir()`), but I guess that won't help for the script metadata.
For the particular case of script metadata/sandbox, likely `uv` also matches the Python CLI's behavior when determining the current working directory.
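
The difference between the two conventions discussed above can be sketched with the standard library alone. This is an illustration of how relative paths resolve in plain Python, not of what marimo or `uv` do internally. The helper names and the `project/nested/notebook.py` layout are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_from_cwd(relative):
    """Resolve a relative path against the current working directory."""
    return (Path.cwd() / relative).resolve()


def resolve_from_file(script_path, relative):
    """Resolve a relative path against the directory containing the script."""
    return (Path(script_path).resolve().parent / relative).resolve()


previous_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    script = project / "nested" / "notebook.py"
    script.parent.mkdir(parents=True)
    script.write_text("# placeholder notebook\n")

    # Pretend the command was launched from the project root.
    os.chdir(project)
    try:
        from_cwd = resolve_from_cwd("shared")
        from_file = resolve_from_file(script, "shared")
    finally:
        os.chdir(previous_cwd)

    cwd_target = from_cwd.relative_to(project).as_posix()
    file_target = from_file.relative_to(project).as_posix()

print("relative to the working directory:", cwd_target)
print("relative to the script's directory:", file_target)
```

The same relative string lands in two different places, which is why a path in the header can work when the command is run from one directory and fail from another.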