Key Error in Pandas df when trying to 'drop' rows

Hi,
I’m doing some data analysis with pandas, matplotlib, seaborn, streamlit.
When running locally, everything works great. When deploying to the Streamlit Cloud, numerous issues popped up (technically, Streamlit Cloud + Github (link to repo below) + Google Drive, as it’s a large file). So I’ve been working with ChatGPT and fixed a lot of issues but finally got stuck on a KeyError.

This is the line of code with issues:

df.drop(df[df['CONTRIBUTING_FACTOR_VEHICLE_1'].isin(['Unspecified', 'Other Vehicular'])].index, inplace=True)

Python underlines just this part: ‘CONTRIBUTING_FACTOR_VEHICLE_1’

Without providing too much information right now, can anyone spot what could be the issue? (full script at bottom).

  1. deployed on Community Cloud

  2. If your app is deployed:
    (https://apptrafficaccidentanalysis-yewxq4rseyvphhq6rrjhdo.streamlit.app/)

  3. GitHub - mike-ua/Streamlit_Traffic_Accident_Analysis

  4. KeyError: ‘CONTRIBUTING_FACTOR_VEHICLE_1’

  5. It looks like the Streamlit Cloud runs Python 3.11, but can’t find which version of Streamlit it’s running.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import streamlit as st


# Part I: create a horizontal bar chart based on finding the top 12 causes (by count of occurances) of traffic accidents to display in a Streamlit app 

# Create a direct download URL
file_id = '1DW_oMscvurVmphPDuVYLQ2r0C9co59d8'
download_url = f'https://drive.google.com/uc?export=download&id={file_id}'

# Read the CSV file from Google Drive
df = pd.read_csv(download_url, sep=';', header=0, encoding='utf-8')

# Drop multiple rows based on a list of values
df.drop(df[df['CONTRIBUTING_FACTOR_VEHICLE_1'].isin(['Unspecified', 'Other Vehicular'])].index, inplace=True)

# Count the occurrences of each contributing factor
factor_counts = df['CONTRIBUTING_FACTOR_VEHICLE_1'].value_counts()

# Get the top 12 contributing factors
top_factors = factor_counts.head(12)

# Convert the Series to a DataFrame for easier plotting with Seaborn
top_factors_df = top_factors.reset_index()
top_factors_df.columns = ['Contributing Factor', 'Count']

# Streamlit app setup
st.title('Top 12 Contributing Factors for Traffic Accidents')
st.header('New York (2013 - 2023)', divider="gray")

# Create the bar plot
fig, ax = plt.subplots(figsize=(20, 15))
sns.barplot(x='Count', y='Contributing Factor', data=top_factors_df, palette='viridis_r', ax=ax)

# Set the title and labels
#ax.set_title('Top 12 Contributing Factors for Traffic Accidents in New York (2013 - 2023)')
ax.set_xlabel('Total Accidents')
ax.set_ylabel('Contributing Factor')

# Rotate the x labels for better readability
plt.xticks(rotation=0)

# Display the plot in Streamlit
st.pyplot(fig)

# Cite source
st.text('source: https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes')


st.header("Can 'Self-Driving Cars' help reduce accidents? ")
st.markdown("**Specifically:** which categories of accidents could be reduced the most?")
st.markdown("To simplify, I grouped the 12 contributing factors into two categories: ")
st.markdown("1. Technical / Skills / Other, or, **EMOTION NEUTRAL** events, which include:")
st.markdown("- 'Backing Unsafely', 'Fatigued/Drowsy' and 'Driver Inexperience'")
st.markdown("2. Driver Attitude, or **EMOTION FUELED** events, which include:")
st.markdown("- 'Failure to Yield Right-of-Way', 'Following Too Closely', 'Passing or Lane Usage Improper', 'Passing Too Closely' , 'Turning Improperly', 'Unsafe Lane Changing', 'Traffic Control Disregarded', and 'Unsafe Speed' ")
st.markdown("'Driver Inattention/Distraction' is displayed separately for comparison.")


# Part II: group top 'contributing factors' and display as a Matplotlib pie chart on the same Streamlit app

# Create a direct download URL
file_id = '1DW_oMscvurVmphPDuVYLQ2r0C9co59d8'
download_url = f'https://drive.google.com/uc?export=download&id={file_id}'

# Read the CSV file from Google Drive
df = pd.read_csv(download_url)

# Drop rows which do not list a 'contributing factor'
df.drop(df[df['CONTRIBUTING_FACTOR_VEHICLE_1'].isin(['Unspecified', 'Other Vehicular'])].index, inplace=True)

# Define a dictionary to map original categories to new groups, including 'Driver Inattention/Distraction'
category_map = {
    'Failure To Yield Right-Of-Way': 'Driver Attitude',
    'Following Too Closely': 'Driver Attitude',
    'Passing Or Lane Usage Improper': 'Driver Attitude',
    'Passing Too Closely': 'Driver Attitude',
    'Turning Improperly': 'Driver Attitude',
    'Unsafe Lane Changing': 'Driver Attitude',
    'Traffic Control Disregarded': 'Driver Attitude',
    'Unsafe Speed': 'Driver Attitude',
    'Backing Unsafely': 'Technical/Skills',
    'Fatigued/Drowsy': 'Technical/Skills',
    'Driver Inexperience': 'Technical/Skills',
    'Driver Inattention/Distraction': 'Driver Inattention/Distraction',
}

# Standardize the column by trimming spaces and converting to title case
df['CONTRIBUTING_FACTOR_VEHICLE_1'] = df['CONTRIBUTING_FACTOR_VEHICLE_1'].str.strip().str.title()

# Filter the DataFrame to only include rows with contributing factors in the category_map keys
df_filtered = df[df['CONTRIBUTING_FACTOR_VEHICLE_1'].isin(category_map.keys())]

# Replace the contributing factor categories with the new groups
df_filtered['Grouped Factor'] = df_filtered['CONTRIBUTING_FACTOR_VEHICLE_1'].replace(category_map)

# Count the occurrences of each grouped factor
grouped_factor_counts = df_filtered['Grouped Factor'].value_counts()

# Convert the Series to a DataFrame for easier plotting
grouped_factors_df = grouped_factor_counts.reset_index()
grouped_factors_df.columns = ['Grouped Factor', 'Count']

# Using grouped_factors_df setup debug statements
labels = grouped_factors_df['Grouped Factor']
sizes = grouped_factors_df['Count']

# Print the debug statements to verify values in Python match SQL statements
print("Labels:", labels)
print("Sizes:", sizes)
print("Total:", sum(sizes))
print("Percentages:", [size / sum(sizes) * 100 for size in sizes])


# Plot the pie chart
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=['#40867f','#c3bd44','#434976'])
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.


# Display the plot in Streamlit
st.pyplot(fig)

st.header("Conclusion")
st.markdown("Clearly, 'Self-Driving Cars' can help reduce accidents due to 'Technical / Skills / Other' issues (157,182 accidents) however at 14.6% it is not a very significant portion of the total")
st.markdown("Similarly, 'Driver Inattention/Distraction' (417,718 accidents) can be reduced, this time however by a more substantial amount or 38.7%")
st.markdown("However, 'Driver Attitude' (504,948 accidents) is now the largest group at 46.8%")
st.markdown("Should the focus be on this group of accidents (**:red[e.g. 'I own the road, get out of my way!']**) to help promote the adoption of 'Self-Driving Cars'?")


st.header('Data Analysis Tools Used')
st.markdown('**Dataset:** approximately 2 million rows of information (a .csv file, approx. 450MB)')
st.markdown('**SQLite / DBeaver:** database / frontend for exploring the dataset with SQL')
st.markdown('**Python:** transforming the .csv into a .db file, creating the basic script, debugging')
st.markdown('**Pandas, Matplotlib:** Data manipulation, calculations, pie chart visualization')
st.markdown('**Seaborn:** creating the main visualization')
st.markdown('**Streamlit:** Python framework to share the script, e.g. web app')
st.markdown('**Github:** developer platform for hosting the script')
st.markdown('**Google Drive:** large file storage for the script to access the data')

Hey Mike,

Looks like the data is not being parsed correctly. When I pull the df these are the columns that are returned:

Index(['<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html',
       ' charset=utf-8"/><style nonce="wBDJ3M6WJKYRHIP9_Kmekw">.goog-link-button{position:relative',
       'color:#15c', 'text-decoration:underline',
       'cursor:pointer}.goog-link-button-disabled{color:#ccc',
       'text-decoration:none', 'cursor:default}body{color:#222',
       'font:normal 13px/1.4 arial,sans-serif',
       'margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px',
       'text-align:center}#uc-dl-icon{display:inline-block', 'margin-top:16px',
       'padding-right:1em', 'vertical-align:top}#uc-text{display:inline-block',
       'max-width:68ex',
       'text-align:left}.uc-error-caption,.uc-warning-caption{color:#222',
       'font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c',
       'text-decoration:none}.uc-name-size a:visited{color:#61c',
       'text-decoration:none}.uc-name-size a:active{color:#d14836',
       'text-decoration:none}.uc-footer{color:#777', 'font-size:11px',
       'padding-bottom:5ex', 'padding-top:5ex',
       'text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc',
       'width:100%}.goog-inline-block{position:relative',
       'display:-moz-inline-box',
       'display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1DW_oMscvurVmphPDuVYLQ2r0C9co59d8">Crash_data_NY_update3.csv</a> (240M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="1DW_oMscvurVmphPDuVYLQ2r0C9co59d8"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="6725128a-7d26-4411-9b61-73da2ae59820"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>'],
      dtype='object')

Maybe check that piece again locally and make sure the df you are constructing aligns with what you expect to see.

Hi Siavash,

Yea, I had ‘parsing’ issues earlier (in this ‘cloud’ deployment, not locally), so I looked at the file locally and it has all three boxes checked for separator (tab, comma and semicolon) so first tried a regex (sep=r’[\t,;]) that didn’t work, so just picked one, ‘sep=’;’ and that seemed to resolve the ‘parsing’ error, but also added the encoding and header params to be safe.

The code you pasted in looks like CSS/html (?) so yes, something is not correct. I’ve never used Google Drive as storage for a data file for a script in a Github repo so I suspect that’s where the main problem lies - I’m venturing into new territory. I believe I allowed permissions on the Google Drive file correctly (‘anyone with the link can view’). But I don’t know what else needs to be done for a Google Drive file to be used by a script in Github. Maybe I did not write the code correctly for downloading the Google Drive URL?

Previously I deployed a different script on Github but the data file it was using was much smaller so it was directly in a Github repo, all worked fine, no errors.

Locally, all works good.
Thanks!

I suspect I many be reaching the limits of a ‘free’ Google Drive account - the file has a little over 1.1 million rows and 29 columns :roll_eyes: