Website Text Extractor

This app may need some enhancements and may contain errors. For example, the text is not rendered very well in the PDF.

Documentation: Website Text Extractor

The provided code is a Python script that extracts the text from a web page and provides a user interface for the extraction process. It uses the third-party libraries requests, BeautifulSoup, streamlit, PyPDF2, and reportlab, plus the standard-library modules io and re.

Dependencies

Make sure you have the following libraries installed before running the code:

  • requests
  • beautifulsoup4
  • streamlit
  • PyPDF2
  • reportlab

You can install these dependencies using pip:

pip install requests beautifulsoup4 streamlit PyPDF2 reportlab

Code Explanation

Importing Required Libraries

import requests
from bs4 import BeautifulSoup
import streamlit as st
from io import BytesIO
import re
from PyPDF2 import PdfWriter
from reportlab.pdfgen import canvas

The code begins by importing the necessary libraries. These libraries are used for making HTTP requests, parsing HTML content, creating a user interface, working with PDF files, and manipulating strings.

Extracting Text from a Website

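def extract_text_from_website(url):
    # Fetch the page; the timeout keeps the app from hanging on a slow server
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # turn HTTP errors such as 404 into exceptions
    soup = BeautifulSoup(response.content, 'html.parser')
    # Some pages have no <title> (or an empty one), so fall back to a default
    title = soup.title.string.strip() if soup.title and soup.title.string else "untitled"
    text = soup.get_text(separator=' ')
    # Drop lines that are empty or contain only whitespace
    text = "\n".join(line for line in text.splitlines() if line.strip())
    return text, title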

This function extract_text_from_website(url) takes a URL as input and returns the extracted text and title of the webpage. It uses the requests library to send a GET request to the specified URL and retrieves the HTML content. The BeautifulSoup library is then used to parse the HTML and extract the webpage title and all the text content. The extracted text is processed to remove blank lines and returned along with the webpage title.
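The blank-line filtering step can be tried on its own, without any network access. The following is a minimal sketch; the helper name clean_text and the extra whitespace-collapsing step are assumptions added for illustration and are not part of the original script.

```python
import re

def clean_text(raw_text):
    """Drop blank lines and collapse runs of spaces/tabs inside each line.

    clean_text is a hypothetical helper; the original script only drops
    blank lines and leaves the spacing inside each line untouched.
    """
    cleaned_lines = []
    for line in raw_text.splitlines():
        # Collapse internal runs of spaces and tabs, then trim both ends
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # keep only lines that still contain text
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)

sample = "Title  \n\n   \n  First   paragraph \n\t\nSecond"
print(clean_text(sample))
```

Collapsing whitespace is optional, but it tends to help with text produced by get_text(separator=' '), which can leave long runs of spaces where markup used to be.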

Streamlit Web Interface

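def main():
    st.title("Website Text Extractor")
    url = st.text_input("Enter the URL of the website:")
    if st.button("Extract Text"):
        if url:
            try:
                extracted_text, webpage_title = extract_text_from_website(url)
                st.success("Text extraction successful!")
                st.text_area("Extracted Text:", value=extracted_text, height=400)
                pdf_bytes = BytesIO()
                pdf_writer = PdfWriter()  # created but never used below
                c = canvas.Canvas(pdf_bytes)
                # drawString() renders a single line only, so write the text
                # line by line and start a new page at the bottom margin.
                # Lines wider than the page are still not wrapped.
                text_obj = c.beginText(50, 800)
                text_obj.setFont("Helvetica", 12)
                for line in extracted_text.splitlines():
                    if text_obj.getY() < 50:
                        c.drawText(text_obj)
                        c.showPage()
                        text_obj = c.beginText(50, 800)
                        text_obj.setFont("Helvetica", 12)
                    text_obj.textLine(line)
                c.drawText(text_obj)
                c.save()
                pdf_bytes.seek(0)
                # Replace characters that are not allowed in filenames
                file_name = re.sub(r'[\\/:"*?<>|]+', '_', webpage_title) + ".pdf"
                st.download_button("Download", data=pdf_bytes, file_name=file_name, mime="application/pdf")
            except Exception as e:
                st.error("An error occurred during text extraction.")
                st.error(str(e))
        else:
            st.warning("Please enter a URL.")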

The main() function is the entry point of the script. It uses the streamlit library to create a web interface. The web interface consists of a title and a text input field where the user can enter the URL of the website they want to extract text from.

When the user clicks the “Extract Text” button, the code inside the if st.button("Extract Text"): block is executed. First, it checks if a URL has been entered. If a URL is provided, the extract_text_from_website(url) function is called to extract the text and webpage title. The extracted text is then displayed in a text area using st.text_area().

Next, a PDF file is generated containing the extracted text. The reportlab library is used to create a PDF canvas and write the text onto it. A PdfWriter object from PyPDF2 is also instantiated, but it is never used, so the PDF comes entirely from reportlab. The resulting PDF is stored in a BytesIO object. The filename for the PDF is derived from the webpage title by replacing any invalid characters with underscores using a regular expression.
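A reportlab canvas does not wrap text on its own, which is one likely reason the PDF output looks poor for long pages. One possible approach is to wrap and paginate the text before drawing it. The sketch below uses only the standard library; the function name paginate and the default values for max_chars and lines_per_page are assumptions (rough estimates for 12 pt Helvetica on an A4 page), not values taken from the original script.

```python
import textwrap

def paginate(text, max_chars=80, lines_per_page=50):
    """Wrap long lines and group the result into pages (lists of lines).

    paginate is a hypothetical helper. max_chars and lines_per_page are
    rough estimates and would need tuning for a real font and page size.
    """
    wrapped = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for an empty line, so keep a blank instead
        wrapped.extend(textwrap.wrap(line, width=max_chars) or [""])
    # Slice the wrapped lines into fixed-size pages
    return [wrapped[i:i + lines_per_page]
            for i in range(0, len(wrapped), lines_per_page)]

pages = paginate("word " * 100, max_chars=40, lines_per_page=5)
print(len(pages), "pages")
```

Each line of each page could then be written with a reportlab text object (canvas.beginText() and textLine()), with canvas.showPage() called between pages.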

Finally, a download button is displayed using st.download_button(), allowing the user to download the generated PDF file. If an exception is raised during extraction or PDF generation, the error is shown with st.error(). If no URL was entered, st.warning() asks the user to provide one.
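The filename handed to the download button comes from the page title, so the sanitizing step is worth checking in isolation. This is a minimal sketch that reuses the script's character class; the function name safe_pdf_name and the fallback for empty titles are assumptions added here.

```python
import re

def safe_pdf_name(title, fallback="untitled"):
    """Turn a page title into a PDF filename.

    safe_pdf_name and the fallback value are hypothetical additions; the
    regular expression is the one used in the script.
    """
    # Replace each run of characters that filenames commonly disallow
    name = re.sub(r'[\\/:"*?<>|]+', '_', title).strip()
    # An empty title would otherwise produce a file called just ".pdf"
    return (name or fallback) + ".pdf"

print(safe_pdf_name("Python 3: What's New?"))
```

Because the pattern ends in +, a run of several invalid characters becomes a single underscore rather than one underscore per character.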

Running the Script

if __name__ == "__main__":
    main()

This conditional statement checks whether the script is being run directly (as opposed to being imported as a module) and calls the main() function to start the web interface.

Usage

  1. Ensure that the required dependencies are installed.
  2. Run the script using streamlit run script_name.py (running it with plain python does not start the Streamlit server).
  3. Access the web interface in your browser at the provided URL (usually http://localhost:8501).
  4. Enter the URL of the website you want to extract text from in the provided input field.
  5. Click the “Extract Text” button to initiate the extraction process.
  6. The extracted text will be displayed in a text area.
  7. Click the “Download” button to download the extracted text as a PDF file.

Note: The script assumes a valid URL is entered, and the webpage allows scraping of its content. Some websites may have protections in place to prevent scraping, which may cause the script to fail or retrieve incomplete text.
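Since the script assumes a valid URL, a lightweight check before the request can give the user a clearer message. The sketch below uses the standard library's urllib.parse; the helper name looks_like_http_url is an assumption, and the check covers only the shape of the string, not whether the site exists or permits scraping.

```python
from urllib.parse import urlparse

def looks_like_http_url(url):
    """Return True if url has an http(s) scheme and a host part.

    looks_like_http_url is a hypothetical helper; it does not contact
    the server or consult robots.txt.
    """
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

for candidate in ["https://example.com/page", "example.com", "ftp://example.com"]:
    print(candidate, looks_like_http_url(candidate))
```

In main(), such a check could run before extract_text_from_website(url), with st.warning() shown when it returns False.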
