Read Word Docs in Streamlit

Hi, is there a way to read, display and edit Word documents within Streamlit?

(The available Python packages can only create and modify Word .docx files in the background, but how can I show a properly formatted Word file within Streamlit?)

Thanks in advance

Cheers

Not sure if you ever got an answer for this. One suggested approach is to convert the Word document (.docx) to Markdown programmatically, which Streamlit can then render (for example with st.markdown).

One approach:

import docx    # provided by the python-docx package
import PyPDF2

def convert_docx_to_markdown(input_file, output_file):
    # Only paragraph text is extracted; headings, tables and images are not preserved
    doc = docx.Document(input_file)
    paragraphs = [p.text for p in doc.paragraphs]
    # A blank line between paragraphs keeps them separate when rendered as Markdown
    markdown_text = '\n\n'.join(paragraphs)
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(markdown_text)
    print("Conversion successful!")

def convert_pdf_to_markdown(input_file, output_file):
    with open(input_file, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        text = ''
        for page in pdf_reader.pages:
            # extract_text() can return nothing for image-only pages
            text += (page.extract_text() or '') + '\n\n'
    # Write the extracted text as-is; markdown.markdown() would turn it into HTML
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(text)
    print("Conversion successful!")

# Usage example

input_file = 'input.docx'  # Replace with the path to your Word document or PDF file
output_file = 'output.md'  # Replace with the desired output Markdown file

if input_file.lower().endswith('.docx'):
    convert_docx_to_markdown(input_file, output_file)
elif input_file.lower().endswith('.pdf'):
    convert_pdf_to_markdown(input_file, output_file)
else:
    print("Unsupported file format.")
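
The python-docx conversion above keeps only the paragraph text, so headings and bullet lists come out as plain lines. As a rough sketch of how some structure could be kept, the helper below maps a paragraph's Word style name to a Markdown prefix. The function names are made up for illustration, and it assumes the document uses the default style names ("Heading 1" to "Heading 6", "List Bullet"); custom or localized styles would need their own mapping.

```python
import re

def paragraph_to_markdown(style_name, text):
    """Turn one paragraph (Word style name + text) into a Markdown line."""
    # "Heading 1" .. "Heading 6" become "#" .. "######"
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match:
        return "#" * int(match.group(1)) + " " + text
    # "List Bullet", "List Bullet 2", ... become bullet items
    if style_name.startswith("List Bullet"):
        return "- " + text
    # Anything else is treated as a normal paragraph
    return text

def paragraphs_to_markdown(paragraphs):
    """Join (style_name, text) pairs into Markdown, skipping empty paragraphs."""
    lines = [paragraph_to_markdown(style, text)
             for style, text in paragraphs if text.strip()]
    return "\n\n".join(lines)
```

With python-docx the pairs could be built as [(p.style.name, p.text) for p in doc.paragraphs], and the returned string can be passed to st.markdown to display it in the app. Tables, images and inline formatting such as bold are still ignored here.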

Another approach:

import subprocess

def convert_to_markdown(input_file, output_file):
    try:
        # check=True raises CalledProcessError if pandoc exits with an error
        subprocess.run(['pandoc', '-s', input_file, '-o', output_file], check=True)
        print("Conversion successful!")
    except FileNotFoundError:
        print("Pandoc is not installed or not in the system path.")
    except subprocess.CalledProcessError as e:
        print("Pandoc failed with exit code", e.returncode)
    except Exception as e:
        print("An error occurred during conversion:", str(e))

# Usage example

input_file = 'input.docx'  # Replace with the path to your Word document (pandoc cannot read PDF input)
output_file = 'output.md'  # Replace with the desired output Markdown file

convert_to_markdown(input_file, output_file)
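
If the goal is to show the result in Streamlit rather than save a file, pandoc can also write to standard output when no -o option is given. A minimal sketch along those lines, assuming pandoc 2.0 or later is on the PATH (for the gfm output format); the function names here are made up for illustration:

```python
import shutil
import subprocess

def build_pandoc_command(input_file):
    """Build a pandoc call that prints GitHub-Flavored Markdown to stdout."""
    # -f docx: read Word input; -t gfm: write GitHub-Flavored Markdown
    # No -o option, so pandoc writes the result to standard output
    return ["pandoc", "-f", "docx", "-t", "gfm", input_file]

def docx_to_markdown_text(input_file):
    """Run pandoc and return the converted Markdown as a string."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not in the system path")
    result = subprocess.run(
        build_pandoc_command(input_file),
        check=True,            # raise CalledProcessError if pandoc fails
        capture_output=True,   # collect stdout instead of printing it
        encoding="utf-8",      # pandoc emits UTF-8 text
    )
    return result.stdout
```

The returned string can then be handed to st.markdown in the Streamlit script. Images embedded in the Word file will not show up this way, since the Markdown only refers to them.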