Streamlit with Selenium scraping is ignoring @st.cache

Summary

  1. @st.cache not working as expected.
  2. I can’t figure out how to show a default text ‘Select your asset:’ instead of the first item from the dataframe in an st.selectbox.
  3. I can’t figure out how to get scraped URLs ready for user selection.

Long version:
I am trying to create a scraping script in Python using Selenium. I created two .py files that work the way they should, and now I’m bringing them together into one Streamlit app (dapp02.py) as Python functions (“Scrape_Dubai_URLs.py” becomes “def get_urls()” and “Scrape_Properties.py” becomes “def scrape_properties()”).
[dapp02.py] (Exelium/dapp02.py at main · arjunpsk/Exelium · GitHub)
[Scrape_Dubai_URLs.py] (Exelium/Scrape_Dubai_URLs.py at main · arjunpsk/Exelium · GitHub)
[Scrape_Properties.py] (Exelium/Scrape_Properties.py at main · arjunpsk/Exelium · GitHub)

The first function, def get_urls(filename), goes to a URL (let’s call it “URL_Base” for the description of this problem) and scrapes all the links (called “url_list” in the code) from the page.

The idea behind this is to show the scraped list to the user so they can choose the link they want to scrape from.
When I first run the Streamlit app from the terminal, the function is invoked and Selenium does what it’s supposed to.
The st.selectbox('Select your asset:', url_list) shows the list in the app, and while it only shows the 'Name' column from the returned url_list, that is definitely cleaner than showing 'Name', 'URL' and 'Properties'.

However, after this point is where I can’t for the life of me figure out how to solve some of the issues I’ve encountered.

For instance, def get_urls(filename) needs to run only once, so I added “@st.cache(suppress_st_warning=True)” before the function. This doesn’t work for some reason: if the user selects anything else from the selectbox after the initial run, the code starts again from the top, ignoring @st.cache. This results in the same scraping repeating over and over instead of the app simply showing the cached selectbox items.

Secondly, I can’t figure out how to show a default value, ‘Select your asset:’, instead of the first value returned by get_urls().

Finally, I can’t work out how to pass the user-selected URL from the selectbox into the scrape_properties() function.

I’ve spent a fair amount of time on this with no solution. I would greatly appreciate any and all help to solve this.

Steps to reproduce

Code snippet:

import streamlit as st
import pandas as pd
import numpy as np
import csv
import subprocess
import sys
from st_aggrid import AgGrid
import os

from array import array
from IPython.display import display
from datetime import datetime

# Import selenium webdriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By

# Waiting
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
service = ChromeService(executable_path = r'./Driver/chromedriver')

url = "https://www.bayut.com/for-sale/property/abu-dhabi/"

header = st.container()
cont_url = st.container()
cont_properties = st.container()

@st.cache(suppress_st_warning=True)
def get_urls(filename):
    driver = webdriver.Chrome(service=service, options=options)
    wait = WebDriverWait(driver, 1)
    st.write("Cache miss: Hey Arjun, Function ran even though @st.cache is defined.")
# try:    
    driver.get(url)
    driver.maximize_window() # For maximizing window
    
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "f1ab04e0"))).find_element(By.CLASS_NAME, "_44977ec6").click()

    locations = []

    listings = wait.until(EC.presence_of_element_located((By.CLASS_NAME , "b7a55500"))).find_elements(By.CLASS_NAME, "_1c4ffff0") # works
    for listing in listings:

        wait.until(EC.presence_of_element_located((By.CLASS_NAME, '_7afabd84')))

        List_dict={
        'Name':listing.find_element(By.CLASS_NAME, '_9878d019').text,
        'URL':listing.find_element(By.CLASS_NAME, '_9878d019').get_attribute("href"),
        'Properties':listing.find_element(By.CLASS_NAME, '_1f6ed510').text
        }

        locations.append(List_dict)

    url_list = pd.DataFrame(locations, columns=['Name', 'URL', 'Properties'])
    url_list.to_csv(filename, mode='w', index=False, header=True)
    driver.close()
    driver.quit()
    return url_list # this is the addition for the streamlit app.

with cont_url:
    st.title('Properties for sale')

    url_list = get_urls('Location_URLs_.csv')
    st.write("Calling Arjun's get_urls() function.")
    option = st.selectbox('Select your asset:', url_list) 
    st.write('You selected:', option)
    st.table(url_list) # to see what the url_list is returning.

Debug info

  • Python - version 3.10.8
  • Streamlit - version 1.13.0
  • conda - version 22.9.0
  • selenium - version 4.5.0
  • chrome - Version 106.0.5249.119 (Official Build) (arm64)
  • Mac OS Monterey 12.2.1

Requirements file

  • ipython==8.5.0
  • numpy==1.23.4
  • pandas==1.5.1
  • python-dotenv==0.21.0
  • selenium==4.5.0
  • streamlit==1.13.0
  • undetected_chromedriver==3.1.6

Links

dapp02.py
Scrape_Dubai_URLs.py
Scrape_Properties.py

Can you try st.experimental_memo instead of st.cache?

Docs here: st.experimental_memo - Streamlit Docs

You’re the best. You just saved me so much time with st.experimental_memo. The caching issue is solved: I can now select from the selectbox and it doesn’t re-run the script from the start. Thanks a lot.

Do you have any tips for:

  • How to show a default text ‘Select your asset:’ instead of the first item from the scraped dataframe in the st.selectbox, as seen in the screenshot?
  • How to take the user’s selectbox input and run the URL that corresponds to the selected Name through the next function, def scrape_properties()?

Thanks once again.
Much respect,
A.

@ThisThatThenArjun

Here’s one way to add an extra placeholder item to a selectbox, and to get the corresponding row and url from the selected item:

df = pd.DataFrame(
    {
        "first_column": [1, 2, 3, 4],
        "second_column": [10, 20, 30, 40],
        "url": [
            "https://www.google.com",
            "https://www.streamlit.io",
            "https://www.python.org",
            "https://www.pandas.pydata.org",
        ],
    }
)

options = ["Select a thing"] + list(df["first_column"])

thing = st.selectbox("Select something", options)

if thing != "Select a thing":
    row = df[df["first_column"] == thing]
    st.write(row)
    url = row["url"].values[0]
    st.write(url)
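Applied to the scraped dataframe, the same pattern maps the selected Name back to its URL and hands it to the next step. A minimal, pandas-only sketch (the rows, the scrape_properties stub, and the hard-coded choice are placeholders standing in for the real url_list, scraper, and st.selectbox value):

```python
import pandas as pd

# Mirrors the columns returned by get_urls(); the rows here are made up.
url_list = pd.DataFrame(
    {
        "Name": ["Dubai Marina", "Downtown Dubai"],
        "URL": [
            "https://www.bayut.com/for-sale/property/dubai/dubai-marina/",
            "https://www.bayut.com/for-sale/property/dubai/downtown-dubai/",
        ],
        "Properties": ["5,000", "3,200"],
    }
)

PLACEHOLDER = "Select your asset:"
options = [PLACEHOLDER] + list(url_list["Name"])  # what the selectbox shows

def scrape_properties(url):
    # Stand-in for the real Selenium scraper; just echoes its target here.
    return f"scraping {url}"

choice = "Dubai Marina"  # in the app this would come from st.selectbox

if choice != PLACEHOLDER:
    # Look up the URL for the chosen Name, then pass it to the scraper.
    selected_url = url_list.loc[url_list["Name"] == choice, "URL"].iloc[0]
    result = scrape_properties(selected_url)
```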

Thank you, @blackary, It works beautifully.