Uploaded pdf not parsing (Returns blank output) through llama in streamlit

johnsnow09 · September 18, 2024, 7:55am

I uploaded a PDF file with st.file_uploader() from Streamlit and am trying to parse it with LlamaParse(). The app is currently running on localhost.

The issue is that I get an empty list [] as output when I pass the uploaded_file object from the code below.

I am not sure whether the returned upload object can be fed to LlamaParse directly. The same PDF parses fine when I pass LlamaParse() a path to the file on my local machine instead of the object from uploaded_file = st.file_uploader().

It looks like the file never reaches LlamaParse. Is there something I need to do with uploaded_file before feeding it into LlamaParse()?

I have also been through the link below and tried StringIO, but I am still stuck with errors or no results.

https://discuss.streamlit.io/t/expected-str-bytes-or-os-pathlike-object-not-

I have been stuck on this for days and have also posted it on SO. I really need help, as this will otherwise jeopardize the whole project.

sample code:

import streamlit as st
import pandas as pd
import os
from io import StringIO
import nest_asyncio 

nest_asyncio.apply()
os.environ["LLAMA_CLOUD_API_KEY"] = my_secret_key
key_input = my_secret_key

from llama_parse import LlamaParse

############### Setting Configuration ###############

st.set_page_config(page_title="Pdf Parsing",
                    layout='wide',
                    initial_sidebar_state="expanded")

# title
st.markdown("<h1 style='text-align: center; font-size: 70px;color: black;letter-spacing: 4px;'>PDF Parsing</h1>", unsafe_allow_html=True)

st.write("checkpoint1")
with st.container(border=True):

    st.write("checkpoint2")
    st.session_state.uploaded_file = st.file_uploader('Choose your .pdf file to upload', type="pdf")

    if st.session_state.uploaded_file is not None:
        st.write(st.session_state.uploaded_file.name)
        st.success("File is uploaded")

        # https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
        # https://discuss.streamlit.io/t/expected-str-bytes-or-os-pathlike-object-not-uploadedfile-for-pdf-file/10436
        st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))

if st.session_state.uploaded_file:

    # st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
    #                                 ).load_data(st.session_state.uploaded_file)

    st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(st.session_state.uploaded_file_stringio)

    st.write('checkpoint3')
    st.write(st.session_state.doc_parsed)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 11: invalid start byte
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 35, in <module>
    st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))

Version info

Python 3.12.4
streamlit 1.38.0
llama_parse 0.5.0
pandas 2.2.2

Goyo · September 18, 2024, 8:35am

According to the documentation I found you can pass a file object or just bytes. Trying to decode the bytes to text doesn’t make sense, since a pdf file is made of much more than encoded text.
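
To illustrate the point above: a PDF is binary data, so decoding it as UTF-8 will usually fail, while wrapping the raw bytes in io.BytesIO gives a file-like object with no decoding step. This is only a minimal sketch. The sample bytes are made up to mimic the start of a PDF and are not a valid document, and is_utf8_text is a hypothetical helper, not part of any library.

```python
from io import BytesIO

# Made-up bytes that mimic the start of a PDF: an ASCII header followed by
# a line of high-bit bytes. This is NOT a real, parseable PDF.
sample = b"%PDF-1.7\n%\xaa\xab\xac\xad\n"


def is_utf8_text(data: bytes) -> bool:
    """Return True only if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# BytesIO wraps the bytes as a binary file-like object; nothing is decoded.
buffer = BytesIO(sample)
header = buffer.read(5)
```

StringIO only accepts str, which is why it forces a decode first; BytesIO avoids that step entirely.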

johnsnow09 · September 18, 2024, 10:31am

Thank you for the quick reply.

1. When I pass the uploaded PDF file directly, I get an empty list [] as output. Code: st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input).load_data(st.session_state.uploaded_file)
2. When I use StringIO on the uploaded file, I get an error, and as I understand from your reply this step is not required.
3. If I follow the llama_github documentation and use SimpleDirectoryReader(), what path do I need to give for the Streamlit uploaded file?

I have tried the code below, based on SimpleDirectoryReader():

parser = LlamaParse(
    api_key=key_input,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

Attempt

    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
    "./st.session_state.uploaded_file", file_extractor=file_extractor
    ).load_data()

Error: ValueError: Directory ./st.session_state.uploaded_file does not exist.

Attempt

    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
    st.session_state.uploaded_file, file_extractor=file_extractor
    ).load_data()

Error:

Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 80, in <module>
    documents = SimpleDirectoryReader(
                ^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\llama_index\core\readers\file\base.py", line 259, in __init__
    if not self.fs.isdir(input_dir):
           ^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 134, in isdir
    path = self._strip_protocol(path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 233, in _strip_protocol
    if path.startswith("file://"):
       ^^^^^^^^^^^^^^^

I am not sure what else I can do. Can you please elaborate on what you are expecting me to do here?
Thanks !!

Goyo · September 18, 2024, 10:59am

Try passing the bytes st.session_state.uploaded_file.getvalue() instead.
Decoding the file through StringIO doesn’t make any sense to me, and the error is expected (the file is not made of just utf-8-encoded text).
Uploaded files are kept in memory. If you want to read them from the file system you need to save them first.
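
One way to "save them first" is to write the uploaded bytes to a temporary file and hand its path to whatever needs a file on disk. The sketch below is an assumption about how that could look, not code from this thread: save_to_temp_pdf is a hypothetical helper, and in a Streamlit app the data argument would be uploaded_file.getvalue(). The file is closed before anything else opens it, because on Windows a named temporary file that is still open often cannot be opened a second time by name.

```python
import os
import tempfile


def save_to_temp_pdf(data: bytes) -> str:
    """Write bytes to a temporary .pdf file and return its path.

    delete=False keeps the file on disk after the handle is closed, so
    another reader can open it by name. The caller must remove it.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
    # The with-block has closed (and flushed) the file at this point.
    return tmp.name


# Stand-in bytes for the demo; not a valid PDF.
path = save_to_temp_pdf(b"%PDF-1.7\n")
try:
    with open(path, "rb") as fh:
        round_trip = fh.read()
finally:
    os.remove(path)  # clean up the temporary file ourselves
```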

johnsnow09 · September 18, 2024, 11:16am

I tried the two pieces of code below and couldn’t get results:

The first returns an empty list instead of the parsed list of documents:

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(st.session_state.uploaded_file.getvalue())

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

Output:

checkpoint3
[]

Tried below:

st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue())

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(st.session_state.uploaded_file_stringio)

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

Error:

TypeError: initial_value must be str or None, not bytes
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 39, in <module>
    st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue())
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Goyo · September 18, 2024, 1:05pm

Well, that seems to be what your LlamaParse instance can do with your file. I have never used LlamaParse, so I don’t know what to expect or what could be wrong.

As a last resort, compare that to what you get by passing the path to the file instead of the file itself.

parser.load_data("path/to/your/file.pdf")

johnsnow09 · September 18, 2024, 2:35pm

I already parsed this PDF document a few days back using LlamaParse without Streamlit, and it returned 20 elements, one for each of the 20 pages. It is only when using Streamlit that I face issues.
In Streamlit it gives me an instant result (an empty list), as if it had nothing to parse, whereas it takes some time to process when I run LlamaParse without Streamlit.

See this screenshot:

johnsnow09 · September 19, 2024, 9:55am

I think the issue is that it requires a file in .pdf format, so passing a variable/object instead of a .pdf path doesn’t work. Maybe if I host this app somewhere, first save the uploaded file, and then give that path to LlamaParse, it could work. I could be wrong here, as I don’t have much experience with this.

Goyo · September 19, 2024, 11:57am

That is not the comparison you need.

- A few days back is enough for something to change and for you to forget or just not notice.
- With / without streamlit can mean different environments. It is not uncommon for people to inadvertently switch environments, more so when running code using streamlit, ipython, jupyter, vscode…
- The parser is instantiated with an extra argument in the notebook version.
- …

You want to compare the result of passing the file name and the bytes in the same execution of the same program, using the same parser.

parsed_1 = parser.load_data(st.session_state.uploaded_file.getvalue())
parsed_2 = parser.load_data("path/to/your/file.pdf")

Passing the file name should have the same effect as passing the contents of the file. But the best way to know for sure is comparing the results while keeping everything else the same.
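
The comparison described above can be wrapped in a small helper so that both calls run back to back with the same loader. This is only a sketch: compare_inputs is a hypothetical name, and load_data stands for any callable that accepts either a path or raw bytes and returns a list, such as parser.load_data.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def compare_inputs(load_data: Callable[[Any], List[Any]], pdf_path: str) -> Dict[str, int]:
    """Call the same loader with a file path and with that file's bytes.

    Returns the number of items each call produced, so a difference
    between the two input styles shows up immediately.
    """
    by_path = load_data(pdf_path)
    by_bytes = load_data(Path(pdf_path).read_bytes())
    return {"path": len(by_path), "bytes": len(by_bytes)}
```

In the app it could be called as compare_inputs(parser.load_data, "path/to/your/file.pdf") and the returned counts shown with st.write.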

johnsnow09 · September 19, 2024, 12:54pm

Actually, I ran both Streamlit and the parser 6 days back, on 13th Sep, and created this Stack Overflow post at that time:
Just now I ran the parser again without Streamlit, and it gives results.

Just now I ran the parser within Streamlit with the code below:

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(st.session_state.uploaded_file.getvalue())

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

streamlit output:

I am still getting no results when I run it in Streamlit, whereas I get results when I work without Streamlit.

Can someone from the Streamlit team please check this with any of their PDFs?

Goyo · September 19, 2024, 1:06pm

Since you seem focused on streamlit vs without streamlit, what if you use the file name in the streamlit application?

st.session_state.doc_parsed = LlamaParse(
    result_type="markdown",
    api_key=key_input
).load_data("path/to/your/file.pdf")

johnsnow09 · September 19, 2024, 2:57pm

Well, this was a good idea to try, and it worked. As you suggested, I loaded the file from my local directory in Streamlit instead of using the uploaded file:

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(r"V:\my_path\2024-08-03_VINEET-tests.pdf")

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

P.S:

So now, how can I solve the puzzle of parsing a file uploaded by each user?

Without user upload I won’t be able to do it in production with streamlit.

Please help !!!

johnsnow09 · September 20, 2024, 1:11pm

Hi.
Since it worked when passing the file path directly to the parser, I followed this streamlit_link about creating temporary files and tried the code below, but I am still getting blank results. Do you think anything else can be tried, or could any modification to this step help?

from tempfile import NamedTemporaryFile

if st.session_state.uploaded_file:

    with NamedTemporaryFile(suffix=".pdf") as temp:
        temp.write(st.session_state.uploaded_file.getvalue())
        temp.seek(0)
        st.write(temp.name)

        st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(temp.name)
        st.write("step completed")
        st.write(st.session_state.doc_parsed)

Output

C:\Users\vin\AppData\Local\Temp\tmpffwd7k9o.pdf

step completed

[]

Appreciate any sort of help!!

Goyo · September 20, 2024, 6:38pm

I was unable to test this until right now.

When I pass a file-like object (the uploaded file)

st.session_state.doc_parsed = LlamaParse(
    result_type="markdown", api_key=key_input
).load_data(st.session_state.uploaded_file)

I get an empty list as well, along with a message in the terminal:

Error while parsing the file '<bytes/buffer>': file_name must be provided in extra_info when passing bytes

Passing extra_info={"file_name": "_"} to load_data, as suggested by the message, fixed it for me. As far as I can tell any non-empty string works. This is absent from the examples and I have no idea why it is needed.
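
Putting the fix together, a hedged sketch of parsing uploaded bytes might look like the following. The helper names extra_info_for and parse_pdf_bytes and the fallback name "upload.pdf" are assumptions made for illustration. The only parts taken from this thread are the LlamaParse constructor arguments and the extra_info={"file_name": ...} argument to load_data. Passing the real upload name instead of "_" is a choice, not a requirement stated anywhere here.

```python
from typing import Any, Dict, List


def extra_info_for(file_name: str) -> Dict[str, str]:
    """Build the extra_info dict, falling back to a placeholder name.

    The fallback "upload.pdf" is an arbitrary, hypothetical default.
    """
    return {"file_name": file_name or "upload.pdf"}


def parse_pdf_bytes(data: bytes, file_name: str, api_key: str) -> List[Any]:
    """Parse in-memory PDF bytes with LlamaParse, supplying a file name."""
    # Imported lazily so the helper above works without llama_parse installed.
    from llama_parse import LlamaParse

    parser = LlamaParse(result_type="markdown", api_key=api_key)
    return parser.load_data(data, extra_info=extra_info_for(file_name))
```

In the Streamlit app from the first post, this could be called as parse_pdf_bytes(st.session_state.uploaded_file.getvalue(), st.session_state.uploaded_file.name, key_input). If the result is still an empty list, the terminal running Streamlit is the place to look for the parser's error message.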

johnsnow09 · September 20, 2024, 7:37pm

Thanks aloootttttt. This really solved the issue for me as well.

It was my mistake not to look at the terminal, since the app only showed an empty list. Your attention to detail really helped.

Really Really Appreciate your help. Thanks again

system · September 22, 2024, 7:37pm

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.
