Hello Community, I am currently building a simple app which does the following:
- The user uploads one or more Excel files using the File Uploader.
- The uploaded files are converted into data frames and each file is validated against several criteria, e.g. the column names must not differ from those of the original template file.
- If all validation steps are passed, the records are updated or inserted into a SQLite file; otherwise the user receives an error message explaining which validation step was violated.
I am currently working on step 2 (data validation) and decided to test the following:
- Upload 3 files at the same time where 1 file is corrupted (column names changed) and the two others are not.
- Upload only 1 corrupted file (column names changed).
- Upload only 1 file that is not corrupted.
- Upload 1 corrupted file (column names changed) and 1 that is not corrupted, and so on…
Unfortunately the test does not work all the time. I would say it works only around 90% of the time, and of course it should work 100% of the time, otherwise the users will not trust the application…
What exactly am I testing?
I am testing that a file cannot be processed if its column names are not identical to the agreed column names. Nevertheless, from time to time I receive the error message: “ValueError: Can only compare identically-labeled DataFrame objects”.
Please find below my code:
    # Libraries
    import streamlit as st
    import pandas as pd

    with st.form("my-form", clear_on_submit=True):
        # Test multiple uploads
        uploaded_files_xlsx = st.file_uploader("Upload your XLSX file", type=["xlsx"], accept_multiple_files=True)
        submitted = st.form_submit_button("UPLOAD!")

    if uploaded_files_xlsx is not None:
        file_names = []
        dfs = []
        for f in uploaded_files_xlsx:
            # Add file names
            file_names.append(f.name)
            # Read and add all dfs
            data = pd.read_excel(f)
            dfs.append(data)

        # hard coded needed cols
        cols = ['Col_1', 'Col_2', 'Col_3', 'Col_4', 'Col_5', 'Col_6',
                'Col_7', 'Col_8', 'Col_9', 'Col_10', 'Col_11', 'Col_12']

        # Validate structure of the input
        corrupted_structure = []
        st.write("Following data was uploaded:")
        for df, fn in zip(dfs, file_names):
            st.write(fn)
            if len(df.columns) != len(cols) or df.columns.tolist() != cols:
                corrupted_structure.append(fn)
                file_names.remove(fn)
                dfs.remove(df)

        # Inform about corrupted structure
        if corrupted_structure:
            st.write("##### The input template for following files was altered:")
            for n, i in enumerate(corrupted_structure):
                st.write("- " + i)
            st.write("##### Possible Reasons:")
            st.write("- Additional columns were added. \n"
                     "- Certain columns were omitted. \n"
                     "- Name of the columns were changed.")
            st.write("Please adjust accordingly and upload the data again.Otherwise files cannot be processed.")

        if dfs:
            st.write("The following dfs are still available: ")
            for n, i in enumerate(file_names):
                st.write("- " + i)
        else:
            st.write("No dfs in dfs")
    else:
        st.write("Upload your data")
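The structure check from the code above can also be tried outside Streamlit, which makes it easier to test in isolation. The following is only a minimal sketch: the helper name `split_by_structure`, the constant `EXPECTED_COLS` and the file names are hypothetical and not part of the app. It builds two new lists instead of removing items from `dfs` and `file_names` while looping over them.

```python
import pandas as pd

# Same twelve names as the hard-coded `cols` list in the app.
EXPECTED_COLS = [f"Col_{i}" for i in range(1, 13)]


def split_by_structure(named_dfs, expected_cols):
    """Partition (file name, DataFrame) pairs into valid pairs and corrupted names.

    Hypothetical helper: nothing is removed from a list during iteration,
    and DataFrames are never compared with each other.
    """
    valid, corrupted = [], []
    for name, df in named_dfs:
        # List equality already covers both the number and the order of columns.
        if df.columns.tolist() == expected_cols:
            valid.append((name, df))
        else:
            corrupted.append(name)
    return valid, corrupted


# Two in-memory frames standing in for uploaded files (names are made up).
good = pd.DataFrame(columns=EXPECTED_COLS)
bad = pd.DataFrame(columns=["Wrong"] + EXPECTED_COLS[1:])

valid, corrupted = split_by_structure(
    [("bad.xlsx", bad), ("good.xlsx", good)], EXPECTED_COLS
)
print([name for name, _ in valid], corrupted)
```

In the app, the pairs could presumably be built from `f.name` and `pd.read_excel(f)` for each uploaded file.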
I have already tried a lot of different things, like changing the logic and using functions, but nothing seems to work. Could there be some kind of “caching” issue, even though the form is cleared on every submit (clear_on_submit=True)?
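One way to check whether caching is involved at all is to try to reproduce the error in plain pandas, without Streamlit. The sketch below assumes that the error comes from the `dfs.remove(df)` call in the posted code; the frames `a` and `b` are made-up stand-ins for an intact and an altered upload. `list.remove` looks for the first element that is the same object as, or compares equal to, its argument, so it can end up evaluating `==` between two DataFrames with different labels.

```python
import pandas as pd

a = pd.DataFrame({"Col_1": [1], "Col_2": [2]})   # intact template
b = pd.DataFrame({"Other": [1], "Col_2": [2]})   # one column renamed

# Case 1: the altered frame is NOT first in the list.
# remove() compares `a == b` before it ever reaches b itself.
dfs = [a, b]
try:
    dfs.remove(b)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__, error)

# Case 2: the altered frame IS first in the list.
# The identity check matches immediately, so no DataFrame comparison happens.
dfs2 = [b, a]
dfs2.remove(b)
print(len(dfs2))
```

If this reading is right, the failure would depend on the position of the altered file in the upload rather than on any cache, which would fit a test that passes most of the time but not always.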
Any help will be highly appreciated.