Unable to download nltk stopwords due to permission error

Hey there folks. First time user of Streamlit and I’m loving it. This is also my first time trying to deploy to Streamlit Cloud. My local setup works fine since it can download the nltk stopwords dataset, but the app doesn’t seem to have permission to do so in the VM.

Here’s the error:

[05:54:32] 🐍 Python dependencies were installed from /mount/src/streamlit_llamadocs_chat/requirements.txt using pip.

Check if streamlit is installed

Streamlit is already installed

[05:54:34] 📦 Processed dependencies!




[nltk_data] Downloading package stopwords to

[nltk_data]     /home/appuser/nltk_data...[2024-02-15 05:54:43.125453] 

[nltk_data]   Unzipping corpora/stopwords.zip.

[nltk_data] Downloading package stopwords to

[nltk_data]     /home/adminuser/venv/lib/python3.10/site-

[nltk_data]     packages/llama_index/core/_static/nltk_cache...

2024-02-15 05:54:43.763 Uncaught app exception

Traceback (most recent call last):

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/utils.py", line 60, in __init__

    nltk.data.find("corpora/stopwords", paths=[self._nltk_data_dir])

  File "/home/adminuser/venv/lib/python3.10/site-packages/nltk/data.py", line 583, in find

    raise LookupError(resource_not_found)

LookupError: 

**********************************************************************

  Resource stopwords not found.

  Please use the NLTK Downloader to obtain the resource:


  >>> import nltk

  >>> nltk.download('stopwords')

  

  For more information see: https://www.nltk.org/data.html


  Attempted to load corpora/stopwords


  Searched in:

    - '/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache'

**********************************************************************



During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script

    exec(code, module.__dict__)

  File "/mount/src/streamlit_llamadocs_chat/main.py", line 8, in <module>

    from llama_index.core import VectorStoreIndex

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/__init__.py", line 8, in <module>

    from llama_index.core.base.response.schema import Response

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/base/response/schema.py", line 7, in <module>

    from llama_index.core.schema import NodeWithScore

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/schema.py", line 14, in <module>

    from llama_index.core.utils import SAMPLE_TEXT, truncate_text

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/utils.py", line 89, in <module>

    globals_helper = GlobalsHelper()

  File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/utils.py", line 62, in __init__

    nltk.download("stopwords", download_dir=self._nltk_data_dir)

  File "/home/adminuser/venv/lib/python3.10/site-packages/nltk/downloader.py", line 777, in download

    for msg in self.incr_download(info_or_id, download_dir, force):

  File "/home/adminuser/venv/lib/python3.10/site-packages/nltk/downloader.py", line 642, in incr_download

    yield from self._download_package(info, download_dir, force)

  File "/home/adminuser/venv/lib/python3.10/site-packages/nltk/downloader.py", line 701, in _download_package

    os.makedirs(os.path.join(download_dir, info.subdir))

  File "/usr/local/lib/python3.10/os.py", line 225, in makedirs

    mkdir(name, mode)

PermissionError: [Errno 13] Permission denied: '/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache/corpora'

[05:54:43] ❗️

Here’s my app: https://llamadocschat.streamlit.app/
Here’s the source code: streamlit_llamadocs_chat/main.py at main · amnotme/streamlit_llamadocs_chat · GitHub
Python: 3.10
I’m using llama-index v0.10.3, which requires nltk 3.8.1, so I can’t downgrade.

Any help is welcome. :slight_smile:


Hi @Leopoldo_Hernandez. Which Python version did you select for your cloud app?


@Guna_Sekhar_Venkata I forgot to add it at the bottom.
Updated it.

3.10 is the version I chose, same as my local setup.


I think the application is otherwise running fine. It’s better to pass the package name in single quotes, like this:

nltk.download('stopwords')

Happy Streamlit-ing :balloon:


Okie dokes. I’ve gone ahead and changed it to single quotes and downgraded to Python 3.9.

Re-deploying and hoping this works. :slight_smile:


@Guna_Sekhar_Venkata Unfortunately it didn’t work. I redeployed with the suggested changes and I still get the permission error.

PermissionError: [Errno 13] Permission denied: '/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/_static/nltk_cache/corpora'

app: https://llamachatdocs.streamlit.app/
repo: streamlit_llamadocs_chat/main.py at main · amnotme/streamlit_llamadocs_chat · GitHub
python version: 3.9

Any other suggestions? :smiley:

[11:29:12] 🐍 Python dependencies were installed from /mount/src/streamlit_llamadocs_chat/requirements.txt using pip.

Check if streamlit is installed

Streamlit is already installed

[11:29:13] 📦 Processed dependencies!




[nltk_data] Downloading package stopwords to

[nltk_data]     /home/appuser/nltk_data...

[nltk_data]   Unzipping corpora/stopwords.zip.

[nltk_data] Downloading package stopwords to

[nltk_data]     /home/adminuser/venv/lib/python3.9/site-

[nltk_data]     packages/llama_index/core/_static/nltk_cache...

2024-02-15 11:29:24.810 Uncaught app exception

Traceback (most recent call last):

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/utils.py", line 60, in __init__

    nltk.data.find("corpora/stopwords", paths=[self._nltk_data_dir])

  File "/home/adminuser/venv/lib/python3.9/site-packages/nltk/data.py", line 583, in find

    raise LookupError(resource_not_found)

LookupError: 

**********************************************************************

  Resource stopwords not found.

  Please use the NLTK Downloader to obtain the resource:


  >>> import nltk

  >>> nltk.download('stopwords')

  

  For more information see: https://www.nltk.org/data.html


  Attempted to load corpora/stopwords


  Searched in:

    - '/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/_static/nltk_cache'

**********************************************************************



During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/adminuser/venv/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script

    exec(code, module.__dict__)

  File "/mount/src/streamlit_llamadocs_chat/main.py", line 8, in <module>

    from llama_index.core import VectorStoreIndex

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/__init__.py", line 8, in <module>

    from llama_index.core.base.response.schema import Response

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/base/response/schema.py", line 7, in <module>

    from llama_index.core.schema import NodeWithScore

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/schema.py", line 14, in <module>

    from llama_index.core.utils import SAMPLE_TEXT, truncate_text

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/utils.py", line 89, in <module>

    globals_helper = GlobalsHelper()

  File "/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/utils.py", line 62, in __init__

    nltk.download("stopwords", download_dir=self._nltk_data_dir)

  File "/home/adminuser/venv/lib/python3.9/site-packages/nltk/downloader.py", line 777, in download

    for msg in self.incr_download(info_or_id, download_dir, force):

  File "/home/adminuser/venv/lib/python3.9/site-packages/nltk/downloader.py", line 642, in incr_download

    yield from self._download_package(info, download_dir, force)

  File "/home/adminuser/venv/lib/python3.9/site-packages/nltk/downloader.py", line 701, in _download_package

    os.makedirs(os.path.join(download_dir, info.subdir))

  File "/usr/local/lib/python3.9/os.py", line 225, in makedirs

    mkdir(name, mode)

PermissionError: [Errno 13] Permission denied: '/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/_static/nltk_cache/corpora'

[11:29:24] ❗️ 

Please fix this; it isn’t under userland control but is done lazily by the library on first run, so we can’t do anything about it. And Streamlit should not restrict anything in the app from doing file operations, IMO; that severely limits things from a UX perspective.

Are you part of the team, @Guna_Sekhar_Venkata?

Why not create proper jails and chown them to the user running the code?


Hi @Morriz. I’m just helping him work through the error.


Thanks for the help. While this hasn’t been resolved, I’ll continue to look for other ways around it. Not having file permissions does hurt overall, but I’m sure this isn’t the first time this has happened, so I’ll check whether anyone else has had file permission issues as well.


Try once with the following:

from nltk.corpus import stopwords

I see that the problem came down to the library attempting to set a download path inside the installed package.

llama_index.core.utils.GlobalsHelper sets the download path here if there is no NLTK_DATA environment variable set:

class GlobalsHelper:
    """Helper to retrieve globals.

    Helpful for global caching of certain variables that can be expensive to load.
    (e.g. tokenization)

    """

    _stopwords: Optional[List[str]] = None
    _nltk_data_dir: Optional[str] = None

    def __init__(self) -> None:
        """Initialize NLTK stopwords and punkt."""
        import nltk

        self._nltk_data_dir = os.environ.get(
            "NLTK_DATA",
            os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                "_static/nltk_cache",
            ),
        )

You can set the environment variable programmatically or simply add it to secrets.toml via the Manage app menu… I set it through the latter.
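If you go the programmatic route instead, a minimal sketch (assuming the same ./resources/nltk_data_dir/ folder; any writable path works) would be to set the variable before llama_index is ever imported, since GlobalsHelper reads it at import time:

import os

# Point NLTK_DATA at a writable folder *before* importing llama_index.
nltk_data_dir = "./resources/nltk_data_dir/"  # example path; any writable dir works
os.makedirs(nltk_data_dir, exist_ok=True)
os.environ["NLTK_DATA"] = nltk_data_dir

from llama_index.core import VectorStoreIndex  # safe to import now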

Then you’ll need to point all of your nltk downloads there:

import os
import nltk

# Use a writable directory inside the app's mount instead of site-packages.
nltk_data_dir = "./resources/nltk_data_dir/"
if not os.path.exists(nltk_data_dir):
    os.makedirs(nltk_data_dir, exist_ok=True)

# Make nltk look only in the writable directory, then download into it.
nltk.data.path.clear()
nltk.data.path.append(nltk_data_dir)
nltk.download("stopwords", download_dir=nltk_data_dir)
nltk.download("punkt", download_dir=nltk_data_dir)

This solved THIS issue, but once the app is up and running and attempts to cache a resource… well… it needs to write to the filesystem, and that breaks as well:

File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
File "/mount/src/streamlit_llamadocs_chat/main.py", line 310, in <module>
    main_chat_functionality()
File "/mount/src/streamlit_llamadocs_chat/main.py", line 286, in main_chat_functionality
    index = get_index(api_key=st.session_state.openai_key)
File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 212, in wrapper
    return cached_func(*args, **kwargs)
File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 241, in __call__
    return self._get_or_create_cached_value(args, kwargs)
File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 268, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "/home/adminuser/venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 324, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
File "/mount/src/streamlit_llamadocs_chat/main.py", line 104, in get_index
    return VectorStoreIndex.from_vector_store(vector_store=vector_store)
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 103, in from_vector_store
    return cls(
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 74, in __init__
    super().__init__(
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 99, in __init__
    or transformations_from_settings_or_context(Settings, service_context)
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/settings.py", line 316, in transformations_from_settings_or_context
    return settings.transformations
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/settings.py", line 243, in transformations
    self._transformations = [self.node_parser]
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/settings.py", line 144, in node_parser
    self._node_parser = SentenceSplitter()
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/node_parser/text/sentence.py", line 91, in __init__
    self._tokenizer = tokenizer or get_tokenizer()
File "/home/adminuser/venv/lib/python3.10/site-packages/llama_index/core/utils.py", line 129, in get_tokenizer
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
File "/home/adminuser/venv/lib/python3.10/site-packages/tiktoken/model.py", line 101, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
File "/home/adminuser/venv/lib/python3.10/site-packages/tiktoken/registry.py", line 73, in get_encoding
    enc = Encoding(**constructor())
File "/home/adminuser/venv/lib/python3.10/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
File "/home/adminuser/venv/lib/python3.10/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/adminuser/venv/lib/python3.10/site-packages/tiktoken/load.py", line 74, in read_file_cached
    with open(tmp_filename, "wb") as f:

Sooo… still looking, but at least nltk is up and running.


The last part for me was to set a caching directory for the tiktoken module. Fortunately there is an environment variable I could set for that as well:

"TIKTOKEN_CACHE_DIR"

App is up and running


Hi Leopoldo, I am having the exact same issue using LlamaIndex and Streamlit. Could you explain in more detail how you set the environment variable in the secrets.toml file? Thank you!


Sure thing. The secrets.toml file is populated via the app’s advanced settings. You can do this JUST before deploying the app or after it’s deployed.

You’ll see the Settings gear icon once you click on the three dots.

You will then see the Secrets menu item, which opens an editor where you can add your secrets. THESE are effectively your environment variables, set at runtime.

I don’t use the secrets.toml file locally, as I use a .env file with the dotenv module to load them.

Add your secrets as follows:

ONE_API="thisIsTheSecret"
ANOTHER_ENV_VAR="thisIsTheOtherSecret"
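For what it’s worth, a quick sketch of reading those back in the app (using the example names above); as far as I know, top-level string secrets on Community Cloud are also exposed as environment variables, which is why setting NLTK_DATA this way is visible to llama_index:

import os
import streamlit as st

one_api = st.secrets["ONE_API"]          # via Streamlit's secrets dict
one_api_env = os.environ.get("ONE_API")  # top-level string secrets should also appear in the environment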

Hi Leopoldo, thanks for your response. Yes, I’ve used secrets before, but I’m confused about what exactly to set NLTK_DATA to.


Sure. That variable should be the directory where you want your nltk data to live.

Please make sure that this folder is writable. I just added mine under resources, directly where the app lives.

NLTK_DATA="./resources/nltk_data_dir/"
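To be safe, I also create that folder early in the script, before llama_index is imported; a minimal sketch along the lines of the earlier snippet:

import os

# Read the directory from the NLTK_DATA secret and make sure it exists.
nltk_data_dir = os.environ.get("NLTK_DATA", "./resources/nltk_data_dir/")
os.makedirs(nltk_data_dir, exist_ok=True)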

Thanks! For some reason, after implementing these changes and running my app for a little bit, I’m now getting this error…


Figured it out: my app was going over resource limits. Thanks for the help!


Sweet!


I am getting the same error. It was working fine until yesterday.
requirements.txt:
streamlit
openai
llama-index
nltk

Permission denied: '/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/_static/nltk_cache/corpora'

LookupError:


Resource stopwords not found.

Please use the NLTK Downloader to obtain the resource:

import nltk

nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

Attempted to load corpora/stopwords

Searched in:

- '/home/adminuser/venv/lib/python3.9/site-packages/llama_index/core/_static/nltk_cache'