Extracting data from PDF display

Hi everyone,

I'm trying to display a fillable PDF form on a webpage using Streamlit and extract the filled fields from the form. I'm currently using the following code to display the PDF file in an iframe using st.markdown:

import streamlit as st
import base64
import io
import json
import PyPDF2

# Read the template once; these bytes are what the iframe displays
with open("template.pdf", "rb") as f:
    pdf_data = f.read()

b64 = base64.b64encode(pdf_data).decode("utf-8")
pdf_display = f'<iframe src="data:application/pdf;base64,{b64}" width="700" height="1000" type="application/pdf"></iframe>'
st.markdown(pdf_display, unsafe_allow_html=True)

if st.button("extract"):
    # Get the form fields from the PDF file
    # (PdfReader needs a path or a file-like object, not raw bytes)
    pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_data))
    # get_fields() returns None when the PDF has no form fields
    fields = pdf_reader.get_fields() or {}
    # Convert the fields to a dictionary
    fields_dict = {}
    for field in fields:
        fields_dict[field] = fields[field].get("/V")
    # Save the fields as a JSON file
    with open("fields.json", "w") as f:
        json.dump(fields_dict, f)

However, I'm having trouble extracting the filled fields from the PDF form. I believe this is because the PDF file is being displayed in an iframe using st.markdown, and the filled fields are not being captured in the pdf_data variable.
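
A small sketch of why this is likely the case, using only the standard library: the data URI handed to the iframe is a one-way copy of the template bytes, so whatever is typed into the browser's PDF viewer stays in the viewer and never flows back into pdf_data. The byte string below is a stand-in for a real PDF, and build_data_uri / decode_data_uri are illustrative helper names, not Streamlit APIs.

```python
import base64


def build_data_uri(pdf_bytes: bytes) -> str:
    """Encode PDF bytes as the data URI used for the iframe's src."""
    b64 = base64.b64encode(pdf_bytes).decode("utf-8")
    return f"data:application/pdf;base64,{b64}"


def decode_data_uri(data_uri: str) -> bytes:
    """Recover the bytes embedded in a base64 data URI."""
    _, _, payload = data_uri.partition(",")
    return base64.b64decode(payload)


# Stand-in for open("template.pdf", "rb").read(); not a real PDF.
template_bytes = b"%PDF-1.7 placeholder template bytes"
uri = build_data_uri(template_bytes)

# Reading the iframe's src back only yields the bytes that were put in,
# i.e. the empty template, regardless of what the viewer shows on screen.
round_tripped = decode_data_uri(uri)
print(round_tripped == template_bytes)
```

The same reasoning would apply to any script that fetches iframe.src: it decodes the original data URI rather than the viewer's edited state.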

I would appreciate any help on how to extract the filled fields from the PDF form displayed in the iframe using Streamlit or any other tools. If there are any other potential solutions besides Streamlit, I would be happy to explore those as well.

Thank you in advance for your help!

I hope this helps! Let me know if you have any further questions.

Running locally; will be deployed.
Streamlit 1.31.0 and Python 3.11.7

Hello @Learner12,

Here's a basic example of how you might start converting a PDF form into a Streamlit web form:

import streamlit as st
import json

# Example form fields
name = st.text_input("Name")
age = st.number_input("Age", step=1)
gender = st.selectbox("Gender", ["Male", "Female", "Other"])
feedback = st.text_area("Feedback")

if st.button("Submit"):
    form_data = {
        "Name": name,
        "Age": age,
        "Gender": gender,
        "Feedback": feedback,
    }
    # Process the data as needed
    st.write("Form Submitted Successfully!")
    # Optionally, save the data to a file
    with open("form_data.json", "w") as f:
        json.dump(form_data, f)

Hope this helps!

Kind Regards,

P.S. Let's connect on LinkedIn!

I am trying to use a fillable PDF form because there are a lot of PDFs, and each PDF form has additional information that goes with the input fields, so I can't simply use st.text_input() for all the form fields.
I need some way to either extract the data from the iframe once the form has been filled out, or to download the form when the user clicks a button and then extract the fields from the downloaded form with PyPDF2. The problem I'm having is that when I try to download the iframe content, it downloads the original empty form file instead of capturing the filled form.
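
The second route (download, then extract with PyPDF2) could be sketched roughly as below. This is only a sketch under assumptions: the user saves the filled form from their PDF viewer and uploads it back through st.file_uploader, and the viewer actually writes the values into the saved file, which not every browser viewer does. field_values, read_filled_form, and render_upload_page are hypothetical helper names.

```python
import io


def field_values(fields):
    """Map each field name to its "/V" value as a string (None if unset)."""
    values = {}
    for name, field in (fields or {}).items():  # get_fields() may return None
        value = field.get("/V")
        values[name] = None if value is None else str(value)
    return values


def read_filled_form(pdf_bytes):
    """Parse the uploaded PDF bytes and return {field name: value}."""
    # Imported lazily so field_values() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return field_values(reader.get_fields())


def render_upload_page(st):
    """Hypothetical page body; `st` is the imported streamlit module."""
    uploaded = st.file_uploader("Upload the filled form", type="pdf")
    if uploaded is not None and st.button("extract"):
        st.write(read_filled_form(uploaded.getvalue()))
```

In the app this would be called as render_upload_page(st) after import streamlit as st. Passing the module in is just a design choice that keeps the two helpers importable and testable without Streamlit running.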

This is what I tried to capture the data from the iframe:

const iframe = document.querySelector('iframe');
        // Use the `removeAttribute()` method to remove the `allow-scripts` attribute
        // iframe.removeAttribute('allow-scripts');

        // Use the `removeAttribute()` method to remove the `allow-same-origin` attribute
        // iframe.removeAttribute('allow-same-origin');
        const pdfUrl = iframe.src;

        // Use the `fetch` method to download the PDF file as a blob
        fetch(pdfUrl)
            .then(response => response.blob())
            .then(blob => {
            // Replace the filename with the name you want to give to the PDF file
            const filename = 'filled_form.pdf';

            // Use the `URL.createObjectURL` method to create a URL for the blob
            const url = URL.createObjectURL(blob);

            // Use the `download` attribute to download the PDF file
            const link = document.createElement('a');
            link.download = filename;
            link.href = url;

            // Use the `URL.revokeObjectURL` method to release the URL