How to use pymupdf to read a pdf after uploading that via st.file_uploader()?

Soumyadip_Sarkar · November 18, 2020, 8:30pm

Hi everybody, this is a sort of how-to question.
I tried to use pymupdf to read a pdf after uploading it via st.file_uploader(), but it's giving me this error:

RuntimeError: cannot open <streamlit.uploaded_file_manager.UploadedFile object at 0x0000021208CC1CA8>: Invalid argument

Traceback:

File "d:\users\user\anaconda3\lib\site-packages\streamlit\script_runner.py", line 324, in _run_script
    exec(code, module.__dict__)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 230, in <module>
    main()
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 197, in main
    txt = read_pdf_with_fitz(docx_file)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 94, in read_pdf_with_fitz
    with fitz.open(file) as doc:
File "C:\Users\USER\AppData\Roaming\Python\Python37\site-packages\fitz\fitz.py", line 3523, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))

Code:

import fitz  # this is pymupdf
import streamlit as st

def read_pdf_with_fitz(file):
    # file is the UploadedFile object here, not a path; this call raises the RuntimeError
    with fitz.open(file) as doc:
        text = ""
        for page in doc:
            text += page.getText()
        return text

pdf = st.file_uploader("", type=['pdf'])
result = read_pdf_with_fitz(pdf)

PS: it's not the exact code, but it's pretty much it, and the error was coming from the fitz.open() line.

Yes, I know I can use pyPDF2 or pdfplumber to do that, and I am currently using pdfplumber to read the file. I prefer pymupdf because my project is related to NLP: the other packages extract the pdf text poorly, so I don't get the desired output, while pymupdf gives me better results. So if anybody can show me how to read a pdf file using pymupdf after uploading it, it would be very helpful🙏.

python version:3.7
streamlit version: 0.69.2

randyzwitch · November 18, 2020, 9:11pm

From a brief reading of their docs, it appears that you are passing the BytesIO buffer from Streamlit as the filename argument (the first positional parameter), when you should be passing it in the stream argument:

doc = fitz.open(stream=mem_area, filetype="pdf")

Soumyadip_Sarkar · November 18, 2020, 10:46pm

I tried it, but it's giving me this error:
ValueError: bad type: 'stream'
This is the traceback:

File "d:\users\user\anaconda3\lib\site-packages\streamlit\script_runner.py", line 324, in _run_script
    exec(code, module.__dict__)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 230, in <module>
    main()
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 197, in main
    txt = read_pdf_with_fitz(docx_file)
File "D:\Documents\My_projects\Project Resume Analyzer\resume_st.py", line 94, in read_pdf_with_fitz
    with fitz.open(stream=file, filetype="pdf") as doc:
File "C:\Users\USER\AppData\Roaming\Python\Python37\site-packages\fitz\fitz.py", line 3505, in __init__
    raise ValueError("bad type: 'stream'")

andfanilo · December 13, 2020, 10:14am

Hello @Soumyadip_Sarkar, I think you were missing the read() call: it returns the uploaded file's contents as bytes, which pymupdf can then consume through the stream argument.

For future reference, the following works:

import fitz
import streamlit as st

uploaded_pdf = st.file_uploader("Load pdf: ", type=['pdf'])

if uploaded_pdf is not None:
    with fitz.open(stream=uploaded_pdf.read(), filetype="pdf") as doc:
        text = ""
        for page in doc:
            text += page.getText()
        st.write(text)

I’m not sure the fitz.open() context manager always closes the file, as I got an AttributeError: 'Document' object has no attribute 'isClosed' error, so I also tried a version that closes the document manually:

import fitz
import streamlit as st

uploaded_pdf = st.file_uploader("Load pdf: ", type=['pdf'])

if uploaded_pdf is not None:
    doc = fitz.open(stream=uploaded_pdf.read(), filetype="pdf")
    text = ""
    for page in doc:
        text += page.getText()
    st.write(text) 
    doc.close()
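
One thing to watch with read(): it advances the buffer's position, so if the same uploaded object is read a second time it may return empty bytes. Below is a minimal sketch of a helper that avoids this, assuming the uploaded object is either raw bytes or a file-like object (BytesIO-style). The names uploaded_to_bytes and extract_text are hypothetical helpers, not part of Streamlit or pymupdf, and the get_text fallback assumes newer pymupdf releases name the method get_text() instead of getText().

```python
import io


def uploaded_to_bytes(uploaded):
    """Return the full contents of an upload as bytes (hypothetical helper).

    Accepts bytes/bytearray or a file-like object such as io.BytesIO.
    """
    if isinstance(uploaded, (bytes, bytearray)):
        return bytes(uploaded)
    if hasattr(uploaded, "getvalue"):
        # getvalue() returns everything regardless of the read position
        return uploaded.getvalue()
    # Generic file-like object: rewind first, then read everything
    uploaded.seek(0)
    return uploaded.read()


def extract_text(uploaded):
    """Extract the text of every page of an uploaded PDF (hypothetical helper)."""
    import fitz  # pymupdf, imported lazily so the helper above works without it

    doc = fitz.open(stream=uploaded_to_bytes(uploaded), filetype="pdf")
    try:
        parts = []
        for page in doc:
            # Assumption: newer pymupdf exposes get_text(); older ones getText()
            get_text = getattr(page, "get_text", None) or page.getText
            parts.append(get_text())
        return "".join(parts)
    finally:
        # Close explicitly instead of relying on the context manager
        doc.close()
```

In the app this would be called as extract_text(uploaded_pdf) inside the "if uploaded_pdf is not None:" branch, with the result passed to st.write().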

ANAND_VERMA · March 5, 2021, 2:33am

How can I extract images from a pdf of images uploaded via Streamlit?
