How to use pymupdf to read a pdf after uploading that via st.file_uploader()?

Hi everybody, this is a how-to question.
I tried to use PyMuPDF to read a PDF after uploading it via st.file_uploader(), but it gives me this error:

RuntimeError: cannot open <streamlit.uploaded_file_manager.UploadedFile object at 0x0000021208CC1CA8>: Invalid argument

Traceback:

File "d:\users\user\anaconda3\lib\site-packages\streamlit\script_runner.py", line 324, in _run_script
    exec(code, module.__dict__)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 230, in <module>
    main()
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 197, in main
    txt = read_pdf_with_fitz(docx_file)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 94, in read_pdf_with_fitz
    with fitz.open(file) as doc:
File "C:\Users\USER\AppData\Roaming\Python\Python37\site-packages\fitz\fitz.py", line 3523, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))

Code:

import fitz  # this is PyMuPDF
import streamlit as st

def read_pdf_with_fitz(file):
    with fitz.open(file) as doc:  # this line raises the RuntimeError
        text = ""
        for page in doc:
            text += page.getText()
        return text

pdf = st.file_uploader("", type=['pdf'])
result = read_pdf_with_fitz(pdf)

PS: this is not the exact code, but it's pretty much it, and the error comes from the fitz.open() line.

Yes, I know I can use PyPDF2 or pdfplumber for this, and I am currently using pdfplumber to read the file. But I prefer PyMuPDF because my project is related to NLP: the other packages extract the PDF text poorly, so I don't get the desired output, while PyMuPDF gives me better results. So if anybody can show me how to read a PDF file with PyMuPDF after uploading it, that would be very helpful 🙏.

Python version: 3.7
Streamlit version: 0.69.2


From a brief reading of the PyMuPDF docs, it appears that you are passing the BytesIO buffer from Streamlit as the filename argument (the first positional argument), when you should be passing it via the stream argument:

doc = fitz.open(stream=mem_area, filetype="pdf")

I tried it, but it gives me this error:
ValueError: bad type: 'stream'
This is the traceback:

File "d:\users\user\anaconda3\lib\site-packages\streamlit\script_runner.py", line 324, in _run_script
    exec(code, module.__dict__)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 230, in <module>
    main()
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 197, in main
    txt = read_pdf_with_fitz(docx_file)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 94, in read_pdf_with_fitz
    with fitz.open(stream=file, filetype="pdf") as doc:
File "C:\Users\USER\AppData\Roaming\Python\Python37\site-packages\fitz\fitz.py", line 3505, in __init__
    raise ValueError("bad type: 'stream'")

Hello @Soumyadip_Sarkar, I think you were missing the read() call, which returns the file contents as bytes that PyMuPDF can then consume.
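
As a small illustration of that point, here is a sketch that uses io.BytesIO as a stand-in for the uploaded file (an assumption: the upload object behaves like an in-memory binary buffer, and the byte string is made up). It shows that read() yields plain bytes, and that a second read() returns nothing until the buffer is rewound:

```python
import io

# Stand-in for the object returned by st.file_uploader(); assumed here to
# behave like an in-memory binary buffer. The content is made-up example data.
uploaded = io.BytesIO(b"%PDF-1.4 example bytes")

data = uploaded.read()          # raw bytes, the kind of value stream= expects
print(type(data).__name__)      # bytes

# read() consumes the buffer, so a second call returns nothing ...
print(uploaded.read())          # b''

# ... until the position is rewound.
uploaded.seek(0)
print(uploaded.read() == data)  # True
```

So if the same upload is read in more than one place, it is probably safest to read it once and pass the bytes around.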

For future reference, the following works:

import fitz
import streamlit as st

uploaded_pdf = st.file_uploader("Load pdf: ", type=['pdf'])

if uploaded_pdf is not None:
    with fitz.open(stream=uploaded_pdf.read(), filetype="pdf") as doc:
        text = ""
        for page in doc:
            text += page.getText()
        st.write(text) 

I'm not sure the fitz.open() context manager always closes the file, as I got an AttributeError: 'Document' object has no attribute 'isClosed' error, so I also closed the document manually:

import fitz
import streamlit as st

uploaded_pdf = st.file_uploader("Load pdf: ", type=['pdf'])

if uploaded_pdf is not None:
    doc = fitz.open(stream=uploaded_pdf.read(), filetype="pdf")
    text = ""
    for page in doc:
        text += page.getText()
    st.write(text) 
    doc.close()
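
If this is needed in several places, the same idea could be wrapped in a helper. This is only a sketch under some assumptions: to_pdf_bytes and extract_text are hypothetical names, the upload is assumed to be either bytes or a file-like object, and page.get_text() is the spelling used by newer PyMuPDF releases (the version in this thread uses page.getText() instead):

```python
import io


def to_pdf_bytes(source):
    """Return the raw bytes of an upload (hypothetical helper)."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if hasattr(source, "getvalue"):
        # BytesIO-like objects: getvalue() ignores the current read position.
        return source.getvalue()
    if hasattr(source, "read"):
        return source.read()
    raise TypeError(f"cannot get bytes from {type(source).__name__}")


def extract_text(source):
    """Extract plain text from a PDF given as bytes or a file-like object."""
    import fitz  # PyMuPDF; imported here so to_pdf_bytes() works without it

    doc = fitz.open(stream=to_pdf_bytes(source), filetype="pdf")
    try:
        # Older PyMuPDF releases spell this page.getText().
        return "".join(page.get_text() for page in doc)
    finally:
        doc.close()  # close explicitly instead of relying on the context manager
```

In the app this would be called as st.write(extract_text(uploaded_pdf)) inside the "if uploaded_pdf is not None:" check.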

How can I extract images from a pdf of images uploaded via streamlit