Running ETL Jobs using Streamlit

Hello Streamlit Community,

I wanted to share one way I’ve used Streamlit to help me in my day-to-day job.
I have a number of jobs created on the ETL server (MSSQL), and these jobs need to be executed manually on demand. While I like developing solutions, I do not like manually running processes, which is why I was looking for a simple way to make this “manual job execution” self-serve and hand it off to a non-programmer.

Enter Streamlit. Using fewer than 100 lines of Python code, I created a simple app which allows the user to run jobs across multiple clients and view the progress of any job.

If anyone has a similar need and would like more details/code, please let me know.

Thanks!

6 Likes

Would be great to see a screenshot of how you implemented this, cheers!

Hi Negmat,
This is interesting; one of my colleagues needs a similar application, and a few more details would be a great starting point for him.
Kevin

Hi Rob and Kevin,

Would it help if I provided the full code (just copy/pasted in here)?

Thanks!

2 Likes

Hi Kevin,

Here are the design details: We develop ETL processes in SSIS. SSIS projects are deployed to the SSISDB database on the ETL server. SQL Server Agent jobs are created to run a master SSIS package (within the SSIS project), which in turn runs all the other SSIS packages. As the jobs run, in addition to doing the data processing, audit information is also written to the database.

I use Streamlit essentially to run these jobs and view their progress. I have a configuration file in Excel which contains information on clients and corresponding jobs (along with SQL Server connection information). The configuration information feeds the drop-downs so a user can choose which jobs to start or stop. We have a view sitting on top of the audit tables, and this is what is used to view job progress (essentially showing which SSIS packages have run, are in progress, or have errored out).

I am happy to provide more details or code.

Thank you!

1 Like

I would love to see the code! :eyes:

Hi Sergio,

Below is the code:


import streamlit as st
import pandas as pd
import sqlalchemy
import urllib.parse

general_folder = "C:\\Web Apps"
config_file_name = "Client Data Processing Configuration.xlsx"
config_file_location = general_folder + "\\" + config_file_name

# Load the config file into a dataframe
xls = pd.ExcelFile(config_file_location)
config_df = pd.read_excel(xls, "Jobs")

st.title("Self Serve Data Processing")

# Identify a client
client_list = list(set(config_df["Client Name"]))
client = st.selectbox("Client", client_list)

# Identify the client job to run
config_df_client = config_df[config_df["Client Name"] == client]
data_processing_jobs_for_client = list(config_df_client["Data Processing Job"])
data_processing_job = st.selectbox("Data Processing Job", data_processing_jobs_for_client)

# Identify the client and ETL database servers and databases
client_db_server = list(config_df_client["Database Server"])[0]
client_database = list(config_df_client["Database Name"])[0]
etl_db_server = list(config_df_client["ETL Server"])[0]
etl_database = "msdb"

def get_connection(db_server, database):
    # Database connection (Windows authentication)
    connection_string = (
        r'Driver={SQL Server Native Client 11.0};'
        r'Server=' + db_server + ';'
        r'Database=' + database + ';'
        r'Trusted_Connection=yes;'
    )
    params = urllib.parse.quote(connection_string)
    engine = sqlalchemy.create_engine(
        "mssql+pyodbc:///?odbc_connect=%s" % params
    )
    connection = engine.raw_connection()
    return connection

client_connection = get_connection(client_db_server, client_database)
etl_connection = get_connection(etl_db_server, etl_database)

if st.button("Start Job"):
    # sp_start_job (in msdb) kicks off the SQL Server Agent job
    etl_connection.execute(
        "SET NOCOUNT ON; EXEC sp_start_job N'" + data_processing_job
        + "'; SET NOCOUNT OFF"
    )
    st.success("The job has started successfully")

if st.button("Stop Job"):
    # sp_stop_job stops the running SQL Server Agent job
    etl_connection.execute(
        "SET NOCOUNT ON; EXEC sp_stop_job N'" + data_processing_job
        + "'; SET NOCOUNT OFF"
    )
    st.success("The job has stopped successfully")

if st.button("View Job Progress"):
    # The audit view shows which SSIS packages have run, are in progress, or have errored out
    sql = "SELECT * FROM [audit].[PackageStatusForLatestBatch] ORDER BY [Package Start Time]"
    audit_view = pd.read_sql_query(sql, client_connection)
    st.dataframe(audit_view)

Things to note:

  1. PackageStatusForLatestBatch is a view built on top of the audit tables which shows SSIS package run progress. Those audit tables are populated by the SSIS packages. If you’d like to know how that works, that is a separate discussion on the SSIS Management Framework. Alternatively, you can replace it with your own way of identifying/showing progress.

  2. There is a configuration Excel file (“Client Data Processing Configuration.xlsx”), which contains information on clients and jobs per client, as well as database server information. This Excel file has 1 tab (Jobs), and within this tab the columns are: Client Name, Data Processing Job, Database Server, Database Name, and ETL Server (an example row is shown below).
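For illustration, a row in the Jobs tab might look like this (all values below are made up):

Client Name | Data Processing Job | Database Server | Database Name | ETL Server
Client 1 | Client 1 - Nightly Load | SQLSRV01 | Client1_DW | ETLSRV01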

Hope this helps.

Any questions, let me know.

Thanks!

5 Likes

Thanks for sharing! This is great.

Thanks for sharing! I am working on a similar project, but I need to load various third-party data files from Excel/CSV/Google Sheets.

For each of the third-party data files, I have a Python script that picks up the data file from the respective folder (either a shared drive or Google Drive) and loads it into the respective SQL table. The Python scripts I created to load the data files into the SQL tables work perfectly.

I created a small Streamlit Web App to load the data when we receive the data file from the third party.

How do I execute the corresponding Python script when a client name is selected from the Streamlit web app’s client dropdown box?

Thank you


Hi Deepap - thank you for your question!

The answer to your question depends on how “configurable” you want your solution to be. For example, in the simplest case (not configurable), you could store the selected “Client” in the client variable and then say:

if client == "Client 1":
    # run the Client 1 Python script/function you'd like here
    ...
elif client == "Client 2":
    # run the Client 2 Python script/function you'd like here
    ...
# and so on for each client

If you want it to be more configurable/dynamic, then you need to store the mapping between your clients and scripts in a configuration file/table, and then look up and execute the script on client selection.
For example, I have one solution where I created my scripts within functions and then used the “eval” function to execute those functions on demand.
I have another solution where I associated clients with ETL jobs within a configuration table, and have Streamlit start jobs on demand based on the client.
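To make the configurable approach concrete, here is a minimal sketch. It assumes a hypothetical configuration workbook (“Client Scripts Configuration.xlsx”) with a “Client Scripts” tab mapping a “Client Name” column to a “Function Name” column, and load functions defined in the same script:

import streamlit as st
import pandas as pd

# Hypothetical load functions -- replace the bodies with your real load logic
def load_client_1_data():
    st.write("Loading Client 1 files...")

def load_client_2_data():
    st.write("Loading Client 2 files...")

# Hypothetical configuration: one row per client, mapping it to a function name
script_config_df = pd.read_excel("Client Scripts Configuration.xlsx", "Client Scripts")

client = st.selectbox("Client", list(set(script_config_df["Client Name"])))

if st.button("Load Data"):
    # Look up the function name for the chosen client and execute it via eval
    client_rows = script_config_df[script_config_df["Client Name"] == client]
    function_name = list(client_rows["Function Name"])[0]
    eval(function_name)()
    st.success("Load complete")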

I hope this gives you some ideas and if you have any more questions, please let me know.

Thanks!

1 Like

Thank you so much for your reply. I am trying the non-configurable method first. Once I am able to execute the Python script, which should read and clean the data files and then load them into the SQL table, I’d like to use your configurable/dynamic method to select more clients’/vendors’ data files. Do you know where I can find some literature/examples regarding the eval function, or where I can find the solution you mentioned that uses it?

Thanks

Hi Deepap,

Below is how I use this function:

eval(python_function_name)(parameter_1, parameter_2)

Assuming you have a table with a client -> Python function name mapping, then once the client is chosen, you can look up the function name and execute it using the eval function.

The only constraint in this approach is that the parameters cannot vary from function to function.
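As a rough illustration (the function names, folders, and table names below are all made up):

def load_excel_files(source_folder, target_table):
    print("Loading", source_folder, "into", target_table)

def load_csv_files(source_folder, target_table):
    print("Loading", source_folder, "into", target_table)

# Looked up from your client -> function name configuration
python_function_name = "load_csv_files"

# This works only because every mapped function accepts the same parameters
eval(python_function_name)("C:\\Data\\Client1", "dbo.Client1Staging")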

Do you plan to use a database table or an Excel file as your configuration reference?

Thanks!

Thanks for the reply.
I’m getting this error - “TypeError: eval() arg 1 must be a string, bytes or code object”
Attaching a screenshot. Please let me know what mistake I am making here with eval() when calling my Python script. If this works, I am planning to create a table to store the client names and the corresponding Python scripts.


Thanks

Hi Deepap,

Are you calling a function defined within the same script, or passing the name of a separate Python script to the eval function?

Thanks!

It’s a separate Python script which picks up the files from a shared drive and loads them into a SQL table.

Hi Deepap,

The way I’ve designed the application, everything lives in a single Python script. There is a main function and other functions, and within the main function I use eval to call the other functions.

I have not used eval to call other Python scripts.

Thanks!

Mr. Negmat_Mulodzhanov, thank you for the help.
Yes, for my use case I need to call a function from another Python script outside the current script.
If you know of any way, please let me know and I will work on a possible solution.
Again, thanks for your time and help.

Hi Deepap,

I have not tried it, but I see online that if you create a function within another script, then after you import that script you can call the function. I am guessing it should work through eval as well, but you’d have to try it to confirm.
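Something along these lines might work (the module and function names below are hypothetical):

import importlib

# client_loaders.py would contain, e.g., def load_client_1_data(): ...
loader_module = importlib.import_module("client_loaders")

# Looked up from your client -> function name configuration
function_name = "load_client_1_data"

# getattr fetches the function from the imported module by name,
# giving the same dynamic dispatch without passing a string to eval
load_function = getattr(loader_module, function_name)
load_function()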

Thanks!