Summary
- @st.cache not working as expected.
- I can't figure out how to show a default prompt, "Select your asset:", in an st.selectbox instead of the first item from the dataframe.
- I can't figure out how to pass the scraped, user-selected URL on to the next scraping function.
Long version:
I am trying to create a scraping script written in Python using Selenium. I created two .py files that work the way they should, and now I'm trying to bring them together into one Streamlit app (dapp02.py) as Python functions: "Scrape_Dubai_URLs.py" became "def get_urls()" and "Scrape_Properties.py" became "def scrape_properties()".
[dapp02.py] (Exelium/dapp02.py at main · arjunpsk/Exelium · GitHub)
[Scrape_Dubai_URLs.py] (Exelium/Scrape_Dubai_URLs.py at main · arjunpsk/Exelium · GitHub)
[Scrape_Properties.py] (Exelium/Scrape_Properties.py at main · arjunpsk/Exelium · GitHub)
The first function, def get_urls(filename), goes to a URL (let's call it "URL_Base" for this description) and scrapes all the links (called "url_list" in the code) from the page.
The idea is to show the scraped list to the user so they can choose the link they want to scrape from.
When I first run the Streamlit app from the terminal, the function is invoked and Selenium does what it's supposed to.
The st.selectbox('Select your asset:', url_list) shows the list in the app, and while it only shows the "Name" column from the returned url_list, that is definitely cleaner than showing "Name", "URL" and "Properties".
However, after this point I can't for the life of me figure out how to solve some of the issues I've encountered.
For instance, def get_urls(filename) only needs to run once, so I added @st.cache(suppress_st_warning=True) before the function. This doesn't work for some reason: if the user selects anything else from the selectbox after the initial run, the script starts again from the top and the function re-runs, ignoring @st.cache. The same scraping repeats over and over instead of the cached selectbox items simply being shown.
Secondly, I can't figure out how to show a default prompt, "Select your asset:", instead of the first value returned by get_urls().
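One workaround I have seen suggested on this forum is to prepend a placeholder string to the options list and treat it as "nothing selected"; sketched here with made-up names standing in for the scraped ones:

```python
# Hypothetical names standing in for url_list['Name']
names = ["Dubai Marina", "Downtown Dubai"]

# Prepend a sentinel so the first (default) option is a prompt, not a real asset.
PLACEHOLDER = "Select your asset:"
options = [PLACEHOLDER] + names

# In the app this would look like:
#   option = st.selectbox("Asset", options)
#   if option != PLACEHOLDER:
#       ...only then run the next scraping step...
```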
Finally, I can't work out how to pass the user-selected URL from the selectbox into the scrape_properties() function.
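Since the selectbox only shows the "Name" column, I assume the URL has to be looked up from the returned DataFrame before calling scrape_properties(); a sketch with made-up data and URLs:

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_urls()
url_list = pd.DataFrame([
    {"Name": "Dubai Marina", "URL": "https://example.com/dubai-marina", "Properties": "1,234"},
    {"Name": "Downtown Dubai", "URL": "https://example.com/downtown", "Properties": "987"},
])

option = "Downtown Dubai"  # what st.selectbox would return

# Look up the URL for the selected name, then hand it to the scraper:
selected_url = url_list.loc[url_list["Name"] == option, "URL"].iloc[0]
# scrape_properties(selected_url)  # hypothetical next step
```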
I've spent a fair amount of time on this with no solution. I would greatly appreciate any and all help.
Steps to reproduce
Code snippet:
import streamlit as st
import pandas as pd
import numpy as np
import csv
import subprocess
import sys
from st_aggrid import AgGrid
import os
from array import array
from IPython.display import display
from datetime import datetime
# Import selenium webdriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
# Waiting
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
service = ChromeService(executable_path=r'./Driver/chromedriver')
url = "https://www.bayut.com/for-sale/property/abu-dhabi/"
header = st.container()
cont_url = st.container()
cont_properties = st.container()
@st.cache(suppress_st_warning=True)
def get_urls(filename):
    driver = webdriver.Chrome(service=service, options=options)
    wait = WebDriverWait(driver, 1)
    st.write("Cache miss: Hey Arjun, Function ran even though @st.cache is defined.")
    # try:
    driver.get(url)
    driver.maximize_window()  # for maximizing the window
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "f1ab04e0"))).find_element(By.CLASS_NAME, "_44977ec6").click()
    locations = []
    listings = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "b7a55500"))).find_elements(By.CLASS_NAME, "_1c4ffff0")  # works
    for listing in listings:
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, '_7afabd84')))
        List_dict = {
            'Name': listing.find_element(By.CLASS_NAME, '_9878d019').text,
            'URL': listing.find_element(By.CLASS_NAME, '_9878d019').get_attribute("href"),
            'Properties': listing.find_element(By.CLASS_NAME, '_1f6ed510').text
        }
        locations.append(List_dict)
    url_list = pd.DataFrame(locations, columns=['Name', 'URL', 'Properties'])
    url_list.to_csv(filename, mode='w', index=False, header=True)
    driver.close()
    driver.quit()
    return url_list  # this is the addition for the Streamlit app.
with cont_url:
    st.title('Properties for sale')
    url_list = get_urls('Location_URLs_.csv')
    st.write("Calling Arjun's get_urls() function.")
    option = st.selectbox('Select your asset:', url_list)
    st.write('You selected:', option)
    st.table(url_list)  # to see what url_list is returning.
Debug info
- Python - version 3.10.8
- Streamlit - version 1.13.0
- conda - version 22.9.0
- selenium - version 4.5.0
- chrome - Version 106.0.5249.119 (Official Build) (arm64)
- Mac OS Monterey 12.2.1
Requirements file
- ipython==8.5.0
- numpy==1.23.4
- pandas==1.5.1
- python-dotenv==0.21.0
- selenium==4.5.0
- streamlit==1.13.0
- undetected_chromedriver==3.1.6