How to enable raw string literal 'r' and binary format 'rb' during pdf upload/read?

Hello!

I am trying to add a PDF upload option in Streamlit v1.3.0 as part of my NLP project, written in Python 3.9.5 (JupyterLab kernel).

This is the Python code I use to read the PDF file from disk:

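import PyPDF2 as pdf

# Open the PDF in binary mode; the raw string keeps the backslashes literal
file = open(r"C:\Users\<path>\Documents\ebook.pdf", 'rb')
pdf_reader = pdf.PdfFileReader(file)

# Concatenate the extracted text of every page
text = ''
for i in range(pdf_reader.numPages):
    pageObj = pdf_reader.getPage(i)
    text += pageObj.extractText()
print(text)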

With my current streamlit code, I’m able to upload the pdf file:

import PyPDF2 as pdf
import streamlit as st

uploaded_file = st.file_uploader("Choose a file", type="pdf")

if uploaded_file is not None:
    # The uploaded file is passed to the reader directly, without open()
    pdf_reader = pdf.PdfFileReader(uploaded_file)

    text = ''
    for i in range(pdf_reader.numPages):
        pageObj = pdf_reader.getPage(i)
        text += pageObj.extractText()
    st.write(text)

The issue is that I’m not sure how to apply the raw string literal and the binary read mode during this PDF upload/read, i.e. the ‘r’ prefix and the ‘rb’ mode used with Python’s open function:

file = open(r"C:\Users\<path>\Documents\ebook.pdf", 'rb')

Any idea on how to achieve this in streamlit pdf reading?

I tried searching this forum for the same and found this post helpful. But it does not mention much about string literals.

Based on what I see, the PdfFileReader class (link) and the st.file_uploader widget (link) have no parameters (if I’m not wrong) for setting ‘r’ or ‘rb’.

In this case, I’m unsure how to continue. It would be quite beneficial to understand more about this subject. Any assistance or pointers are greatly appreciated!

Thank you,

I found a solution for the raw string literal part (but not for the binary format). We just have to use ‘/’ instead of r’\path’ in the path; my guess is that the file uploader widget in Streamlit takes care of this internally.

file = open("C:/Users/path/ebook.pdf", 'rb')

The question about reading in binary format is still open.
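
On the raw string part: as far as I can tell, the r prefix only changes how backslashes inside a string literal are read in the source code, so there is nothing equivalent to switch on for a file that arrives through an upload widget. A minimal sketch, assuming the same placeholder path as above, that compares the three spellings:

```python
from pathlib import PureWindowsPath

# Three ways to spell the same (placeholder) Windows path in source code
raw = r"C:\Users\path\ebook.pdf"        # raw literal: backslashes kept as typed
escaped = "C:\\Users\\path\\ebook.pdf"  # normal literal: each backslash escaped
forward = "C:/Users/path/ebook.pdf"     # forward slashes need no escaping

# The raw and the escaped literal produce the identical string object content
same_string = raw == escaped

# PureWindowsPath treats both separators alike, so the paths compare equal
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)

print(same_string, same_path)
```

Both comparisons should come out true, which suggests the choice between r"..." and '/' only matters when a path is typed into the code by hand.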

Hi @billysilly -

In the examples section of st.file_uploader, you’ll see various ways of using the data provided after it is uploaded. If I were to guess, I suspect the real issue is that you need to call pdf_reader = pdf.PdfFileReader(uploaded_file.getvalue()), which will provide the data as raw bytes.
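
To make this a bit more concrete: as far as I know, PdfFileReader expects a file path or a seekable file-like object, so the raw bytes from getvalue() would likely need to be wrapped in io.BytesIO before being passed in. A minimal sketch using only the standard library; as_binary_stream is a hypothetical helper name, and fake_upload is a stand-in for the object returned by st.file_uploader, which the Streamlit docs describe as a subclass of BytesIO:

```python
import io


def as_binary_stream(uploaded):
    """Return a seekable binary stream from raw bytes or a BytesIO-like upload.

    Hypothetical helper for illustration; not part of Streamlit or PyPDF2.
    """
    if isinstance(uploaded, (bytes, bytearray)):
        data = bytes(uploaded)
    else:
        # Assumes the object exposes getvalue(), as BytesIO does
        data = uploaded.getvalue()
    return io.BytesIO(data)


# Stand-in for an uploaded PDF; real uploads come from st.file_uploader
fake_upload = io.BytesIO(b"%PDF-1.4 example bytes")

stream = as_binary_stream(fake_upload)
header = stream.read(5)  # binary reads return bytes, like open(path, 'rb')
stream.seek(0)           # rewind so a PDF reader would start at the beginning

# In the app this would become (assumption, not run here):
# pdf_reader = pdf.PdfFileReader(as_binary_stream(uploaded_file))
```

Since the upload is already held in memory as binary data, no 'rb' flag should be needed; the stream behaves like a file opened in binary mode.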

Best,
Randy

Thanks Randy @randyzwitch! I’ll go through the docs and work on your suggestion. I will update you soon.