Company
Flowspark
## 💼 Title
Streamlit (Python) developer to build an interactive PDF text extraction application
Location
Remote
Job Description
We are seeking an experienced Streamlit (Python) developer to build an interactive PDF text extraction application that allows users to visualize the relationship between PDF documents and their extracted textual content. The application will feature a dual-column interface with real-time interactive highlighting capabilities between the original document and extracted text.
Core Requirements and Functionality
The project is already in progress; the developer will join to implement specific features.
PDF Processing and Text Extraction
Develop a Streamlit application that exclusively accepts PDF file uploads.
Implement text extraction functionality from PDF documents using PyMuPDF or similar libraries, including OCR for scanned pages.
Build capability to extract structured data (JSON fields) from PDF documents.
Support multi-page PDF processing with appropriate UI considerations.
Interactive Dual-Column Interface
Create a two-column layout: the left column displays the original PDF, the right column shows the extracted text.
Implement bidirectional interactive highlighting: when text is selected in either column, the corresponding text in the other column is automatically highlighted.
Ensure visual consistency and responsiveness across different PDF layouts and content types.
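For candidates scoping the highlighting work, the matching logic behind the two columns can be sketched roughly as follows. This is a minimal illustration, not the project's existing code. It assumes each word arrives with a bounding box, as in the leading fields of the tuples returned by PyMuPDF's `page.get_text("words")`. The helper names (`build_index`, `boxes_for_selection`, `span_for_point`) are hypothetical, and wiring selection events into Streamlit is left out.

```python
from typing import List, Optional, Tuple

# Hypothetical word record: (x0, y0, x1, y1, text). These mirror the first
# five fields PyMuPDF yields per word from page.get_text("words").
Word = Tuple[float, float, float, float, str]
Box = Tuple[float, float, float, float]


def build_index(words: List[Word]) -> Tuple[str, List[int]]:
    """Join words with single spaces and record each word's start offset."""
    starts: List[int] = []
    parts: List[str] = []
    pos = 0
    for *_, text in words:
        starts.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(parts), starts


def boxes_for_selection(
    words: List[Word], starts: List[int], sel_start: int, sel_end: int
) -> List[Box]:
    """Text -> PDF: boxes of every word overlapping [sel_start, sel_end)."""
    boxes: List[Box] = []
    for (x0, y0, x1, y1, text), start in zip(words, starts):
        if start < sel_end and start + len(text) > sel_start:
            boxes.append((x0, y0, x1, y1))
    return boxes


def span_for_point(
    words: List[Word], starts: List[int], x: float, y: float
) -> Optional[Tuple[int, int]]:
    """PDF -> text: character span of the word under point (x, y), if any."""
    for (x0, y0, x1, y1, text), start in zip(words, starts):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return start, start + len(text)
    return None
```

Keeping one shared offset index means both highlight directions read from the same mapping, so the PDF overlay and the text pane cannot drift apart.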
User Experience
Design an intuitive interface with clear upload mechanisms and processing indicators.
Implement effective error handling for invalid files, processing failures, and similar cases.
Create a responsive design that maintains functionality across different screen sizes.
Develop clear documentation for using the application.
Technical Skills Required
Proficient in Python programming with demonstrated Streamlit application development experience.
Familiarity with NLP and LLM tools, e.g., GROBID.
Experience with PDF and OCR processing libraries such as PyMuPDF (fitz) and Tesseract.
Strong understanding of document processing and text extraction techniques.
Experience with JSON/CSV formatting and data-matching logic.
Clear, proactive communication.