How do I upload a .pdf file in Streamlit and then process it further to extract the information?
Hello @Gyanaranjan_pathi, welcome to the Streamlit forums
- On the uploading part, you can use Streamlit's file_uploader to display a file uploader on your app, as such:
import streamlit as st

uploaded_file = st.file_uploader('Choose your .pdf file', type="pdf")
if uploaded_file is not None:
    df = extract_data(uploaded_file)
- Then your PDF upload will be available as a file-like object (an UploadedFile, which is a subclass of BytesIO) in the uploaded_file variable, so to extract data from the PDF you will need a Python library that can read a PDF from a file-like object.
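Since the upload behaves like a binary stream, you can inspect it before handing it to a parser. Here is a minimal sketch, assuming you only want a quick check that the file starts with the usual `%PDF-` header. The helper name `looks_like_pdf` is made up for this example, and `io.BytesIO` stands in for the uploaded file:

```python
import io


def looks_like_pdf(file_obj) -> bool:
    """Return True if the stream starts with the PDF header bytes."""
    file_obj.seek(0)
    header = file_obj.read(5)
    # Rewind so that a PDF library can read the stream from the start afterwards
    file_obj.seek(0)
    return header == b"%PDF-"


# io.BytesIO stands in for the object returned by st.file_uploader
fake_upload = io.BytesIO(b"%PDF-1.7\nrest of the file")
print(looks_like_pdf(fake_upload))
```

The rewind matters: once a stream has been read, the next reader starts where the previous one stopped unless you call seek(0).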
I used pdfplumber to extract tables from PDFs in one of my Streamlit apps. pdfplumber.open accepts a file-like object (older releases used pdfplumber.load for this), so you can do:
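import pdfplumber

def extract_data(feed):
    data = []
    # pdfplumber.open accepts a path or a file-like object (older releases used pdfplumber.load)
    with pdfplumber.open(feed) as pdf:
        for page in pdf.pages:
            data.append(page.extract_tables())
    return data  # list of tables per page; build more code to turn this into a dataframe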
There are several other libraries, such as camelot, tabula-py or pdfminer.six. I had to test a few of them for my use case before going with pdfplumber, so you may need to do the same depending on the info you need to extract!
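As a possible next step after extracting tables with pdfplumber, here is a rough sketch that flattens the per-page results into a list of row dictionaries. It assumes the first row of each table is a header row, which will not hold for every PDF, and the name `tables_to_records` is made up for this example:

```python
def tables_to_records(pages_tables):
    """Flatten per-page tables into a list of {header: cell} dictionaries.

    pages_tables is a list with one entry per page; each entry is a list of
    tables, and each table is a list of rows (lists of cell values).
    Assumption: the first row of every table holds the column names.
    """
    records = []
    for page_tables in pages_tables:
        for table in page_tables:
            if len(table) < 2:
                continue  # nothing but a header (or an empty table)
            header, *rows = table
            for row in rows:
                records.append(dict(zip(header, row)))
    return records


# Hand-written sample in the same nested shape: two pages, the second without tables
sample = [
    [[["name", "qty"], ["apple", "3"], ["pear", "5"]]],
    [],
]
records = tables_to_records(sample)
```

If pandas is installed, pandas.DataFrame(records) should then give you a dataframe to display with st.dataframe.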
Hope this helps
Thank you @andfanilo
@andfanilo, I came across this discussion while looking for PDF file upload and analysis. I am working on PDF files using "pdfminer.six". I could not find anything in the documentation about loading a file (the st.file_uploader object) the way you described for pdfplumber.
Any suggestions on handling PDF files with the pdfminer.six library in a Streamlit app would be very helpful. Thanks :)
I don't have a lot of experience with pdfminer.six, but at least the following seems to work with Streamlit's file uploader:
import pdfminer
from pdfminer.high_level import extract_pages
import streamlit as st

st.write(pdfminer.__version__)
uploaded_file = st.file_uploader("Choose a file", type="pdf")
if uploaded_file is not None:
    for page_layout in extract_pages(uploaded_file):
        for element in page_layout:
            st.write(element)
Hope this can serve as a good starting point.
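If you only want the text rather than every layout element, one option is to keep the elements that expose a get_text() method, as pdfminer.six text boxes do. A small sketch under that assumption; `collect_text` is a made-up helper, and the fake classes only exist so the example runs without a PDF:

```python
def collect_text(pages):
    """Join the text of every element that offers a get_text() method."""
    chunks = []
    for page_layout in pages:
        for element in page_layout:
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                chunks.append(get_text())
    return "".join(chunks)


class FakeTextBox:
    """Stand-in for a pdfminer.six text element."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    """Stand-in for a non-text element such as a figure or line."""


fake_pages = [[FakeTextBox("Hello\n"), FakeFigure()], [FakeTextBox("World\n")]]
text = collect_text(fake_pages)
```

In the app you would call it as collect_text(extract_pages(uploaded_file)) and show the result with st.text.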
Fanilo
How can I extract images from a PDF of images?
Hi @andfanilo, I am working on a use case of extracting tabular data from PDF files, and I am using camelot as the table extraction library. How can I take the PDF uploaded through st.file_uploader() and pass it to camelot? As far as I understand from the camelot documentation, camelot.read_pdf() only accepts a file path as input.
Hello @santosh_boina
If it absolutely requires a filepath and not a file-like object, you could try to write the uploaded file to a temporary folder and provide camelot with the path to said file, then destroy the temporary file at the end of the job. You can copy the following bit of code:
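A minimal sketch of that temporary-file approach, using only the standard library. The helper name `with_temp_pdf` and its `process` argument are made up for this example; `process` is whatever function needs a real path, for instance camelot.read_pdf:

```python
import os
import tempfile


def with_temp_pdf(data: bytes, process):
    """Write data to a temporary .pdf file, call process(path), then delete the file."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
    try:
        tmp.write(data)
        tmp.close()  # close first so other code can reopen the path, also on Windows
        return process(tmp.name)
    finally:
        if not tmp.closed:
            tmp.close()
        os.unlink(tmp.name)  # destroy the temporary file at the end of the job


def read_back(path):
    """Toy processing step that just reads the bytes back from the path."""
    with open(path, "rb") as f:
        return f.read()


result = with_temp_pdf(b"%PDF-1.4 example", read_back)
```

In the Streamlit app this would look something like tables = with_temp_pdf(uploaded_file.getvalue(), camelot.read_pdf), assuming camelot is installed.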
Hope this gets you started!
Fanilo
Hello @santosh_boina, you might have better luck with this bit of code instead, which fixes a bug in the former:
How to render a pdf file in streamlit?
import base64
import streamlit as st

def show_pdf(file_path):
    with open(file_path, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode("utf-8")
    pdf_display = F""
    st.markdown(pdf_display, unsafe_allow_html=True)
    print("Done")

show_pdf('C:/Users/Tarun/Downloads/SOFTWARE ENGINEERING NOTES.pdf')
I tried this, but the PDF does not display and there is no error message either. Please help.
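The markup assigned to pdf_display is not visible in the post above. A common way to do this kind of embedding is an iframe whose src is a base64 data URI, so here is a sketch along those lines. The helper name `pdf_iframe_html` and the default width and height are assumptions for this example, and whether the browser actually renders the PDF may depend on the browser and the file size:

```python
import base64


def pdf_iframe_html(pdf_bytes: bytes, width: int = 700, height: int = 1000) -> str:
    """Build an <iframe> tag that embeds the PDF bytes as a base64 data URI."""
    encoded = base64.b64encode(pdf_bytes).decode("utf-8")
    return (
        f'<iframe src="data:application/pdf;base64,{encoded}" '
        f'width="{width}" height="{height}" type="application/pdf"></iframe>'
    )


html = pdf_iframe_html(b"%PDF-1.4 example")
```

In the app you would pass the result to st.markdown(html, unsafe_allow_html=True), with the bytes coming either from open(file_path, "rb").read() or from uploaded_file.getvalue().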
Did you find a solution?
No, it doesn't work.
Well, pdfplumber.load just doesn't work.
You can simply upload a PDF file and open it using pdfplumber.open:
import pdfplumber
import streamlit as st

uploaded_file = st.file_uploader("Choose a file")
if uploaded_file is not None:
    st.success("Uploaded the file")
    with pdfplumber.open(uploaded_file) as file:
        all_pages = file.pages
        st.write(all_pages[0].extract_text())  # you can print and check the data from any page in the pdf
Here is an example of how I used the pymupdf4llm library to convert the contents of a PDF file into markdown format. I hope this helps!
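import os
import tempfile

import pymupdf4llm
import streamlit as st

# pdf file upload
pdf_file = st.file_uploader('Upload a PDF file', type=['pdf'])
if pdf_file is not None:
    bytes_data = pdf_file.read()
    # create a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
        tmp_file.write(bytes_data)
        temp_file_path = tmp_file.name
    # convert once the file is closed, so all bytes are flushed to disk
    md_text = pymupdf4llm.to_markdown(temp_file_path)
    os.remove(temp_file_path)  # clean up the temporary file
    st.markdown(md_text)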