Error when loading PDF file with LlamaIndex

Summary

I am trying to load a PDF file with llama-index, but I think it can’t create the directories it needs to index the file when the app runs on streamlit-cloud.

Steps to reproduce

**Code**

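# Imports assumed for the llama-index release in use here (top-level `llama_index` package layout)
from pathlib import Path

import streamlit as st
from llama_index import ServiceContext, VectorStoreIndex, download_loader
from llama_index.llms import OpenAI

@st.cache_resource(show_spinner=False)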
def load_data():
    with st.spinner(text="Loading and indexing PDF! This should take 1-2 minutes."):
        PDFReader = download_loader("PDFReader")

        loader = PDFReader(custom_path="local_dir")  # tried this custom_path solution, didn't work
        data_file = Path(__file__).parent / "data" / "myfile.pdf"
        docs = loader.load_data(file=data_file)
        service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.5, system_prompt="You are an expert on the LFJCC biographies and your job is to answer biographical questions. Assume that all questions are related to the LFJCC Board biographies. Keep your answers based on facts – do not hallucinate features."))
        index = VectorStoreIndex.from_documents(docs, service_context=service_context)
        return index

index = load_data()

Expected behavior:
Load and index the PDF

Actual behavior:

Throws error:

File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/mount/src/llama_jcc_app/pdf_app.py", line 57, in <module>
    index = load_data()
            ^^^^^^^^^^^
File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
    return cached_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 242, in __call__
    return self._get_or_create_cached_value(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adminuser/venv/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mount/src/llama_jcc_app/pdf_app.py", line 48, in load_data
    PDFReader = download_loader("PDFReader")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adminuser/venv/lib/python3.11/site-packages/llama_index/readers/download.py", line 117, in download_loader
    os.makedirs(dirpath)
File "<frozen os>", line 225, in makedirs

Debug info

streamlit version 1.26.0
python version 3.11.4

Requirements file

streamlit
openai
llama-index
nltk

Additional information

I tried this solution about adding the custom_path param, but it didn’t work. The app works fine on localhost.

Also, I can’t check the logs because the ‘Manage App’ link is missing from the lower right. But I’m pretty sure the function cannot create the directories it needs on the streamlit-cloud path.
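
One way to test that suspicion without access to the logs is to probe whether a directory can be created at a given location. The sketch below uses only the standard library; the helper name `can_create_dirs` is made up for illustration and is not part of Streamlit or llama-index.

```python
import os
import tempfile
from pathlib import Path


def can_create_dirs(parent) -> bool:
    """Return True if a new subdirectory can be created under `parent`."""
    parent = Path(parent)
    if not parent.is_dir():
        return False
    try:
        # mkdtemp picks a unique name, so no existing folder is touched.
        probe = tempfile.mkdtemp(prefix=".write_probe_", dir=parent)
    except OSError:
        # Covers permission errors and read-only file systems.
        return False
    os.rmdir(probe)  # clean up the empty probe directory
    return True
```

Calling `can_create_dirs(Path.cwd())` inside the app and showing the result with `st.write` would tell you whether the working directory is writable. The same check could be pointed at the directory that `download_loader` tries to create, which the traceback suggests lives somewhere under the installed `llama_index` package.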

Hey @Amit_Indap,

Unfortunately, I ran into a similar issue trying to use LlamaHub connectors with Community Cloud – it generally won’t work because your app won’t be able to download the file to the working directory of Community Cloud. You’d need to either run the app locally or host it with another platform in order to successfully use LlamaHub connectors, sadly.

Hi @Caroline, thanks for the reply. That’s a bummer. I recently followed this blog post about using LlamaIndex to make a chatbot for streamlit docs, and it deployed fine. It’s using SimpleDirectoryReader.

I’m still very new to LLMs and streamlit, but maybe I can try SimpleDirectoryReader instead?

Got it to work with SimpleDirectoryReader and including pypdf in requirements.txt!

Yes, you can definitely use SimpleDirectoryReader if you just store the data for the knowledge base in the app repo itself. Glad to hear it’s working! :tada:
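
For anyone landing here later, a minimal sketch of the SimpleDirectoryReader approach could look like the following. It assumes the PDFs sit in a `data` folder committed to the app repo, that `pypdf` is listed in requirements.txt, and that the installed llama-index release still exposes `SimpleDirectoryReader` and `VectorStoreIndex` from the top-level `llama_index` package. The helper names `list_pdfs` and `build_index` are made up for illustration.

```python
from pathlib import Path


def list_pdfs(data_dir):
    """Return the PDF files that sit directly in `data_dir`, sorted by name."""
    return sorted(p for p in Path(data_dir).iterdir() if p.suffix.lower() == ".pdf")


def build_index(data_dir):
    """Index every file in `data_dir` with SimpleDirectoryReader."""
    if not list_pdfs(data_dir):
        raise FileNotFoundError(f"no PDF files found in {data_dir}")
    # Imported lazily so the helper above works without llama-index installed.
    # Assumption: the top-level `llama_index` import layout of the release
    # discussed in this thread.
    from llama_index import SimpleDirectoryReader, VectorStoreIndex

    docs = SimpleDirectoryReader(input_dir=str(data_dir)).load_data()
    return VectorStoreIndex.from_documents(docs)
```

In the app from the question, `build_index(Path(__file__).parent / "data")` would take the place of the `download_loader` and `loader.load_data` calls inside `load_data()`. Nothing is downloaded at runtime, so no directory has to be created on the host.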


I think you can actually get it to work by passing `custom_path='.'` when calling download_loader(), because that will just write to the current directory, rather than trying to write to wherever it writes by default.
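
Note that the code in the question passes `custom_path` to the `PDFReader(...)` constructor, not to `download_loader()`, which may be why it had no effect there. A rough sketch of the suggestion, assuming the installed llama-index release exports `download_loader` from the top-level package and accepts a `custom_path` keyword, might look like this. It uses a subfolder of the working directory instead of `'.'` itself; the folder name and both helper names are made up for illustration, and this has not been verified on Community Cloud.

```python
from pathlib import Path


def loader_cache_dir(base=None):
    """Create (if needed) and return a folder for downloaded loader code."""
    root = Path(base) if base is not None else Path.cwd()
    cache = root / "llamahub_modules"  # hypothetical folder name
    cache.mkdir(parents=True, exist_ok=True)
    return cache


def get_pdf_reader(base=None):
    """Fetch the PDFReader class, storing its code under a writable folder."""
    # Imported lazily; assumes `download_loader` takes `custom_path`.
    from llama_index import download_loader

    return download_loader("PDFReader", custom_path=str(loader_cache_dir(base)))
```

Inside `load_data()`, `PDFReader = get_pdf_reader()` followed by `loader = PDFReader()` would replace the first two statements of the original code.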


Oh that’s awesome, I’ll have to try that!