All in one Data Analysis app

This project is an interactive web application built using Streamlit, designed for data analysis and causation analysis. Users can upload datasets, perform data preprocessing, explore data visually, conduct regression and classification analysis, perform extensive data analysis, and explore causal relationships in their data.

To see the app running visit
GitHub - nstepka/Causation

Key Features

Data Uploading and Preprocessing

  • CSV File Upload: Easily upload your CSV datasets.

Preprocessing - Feature Enginerring

  • Data Preview: Get a quick overview of the uploaded data.
  • Handling Missing Values: Deal with missing data using various techniques.
  • Currency and Percentage Changes: Convert currency and percentage columns into numeric formats.
  • Column Removal: Drop unnecessary columns from the dataset.
  • Data Transformation: Apply a range of data transformations.
  • Data Encoding: Encode categorical variables using one-hot encoding.
  • Time Series Features: Convert date/time columns and create new features.
  • Binning/Bucketing: Discretize numeric variables into bins.
  • Custom Feature Engineering: Create new features based on custom logic.
  • Aggregate Columns: Perform aggregations and generate new columns.
  • Create Binary Flags: Generate binary flags based on specified conditions.
  • If you want to test all features, use listings.csv.

Data Exploration

  • Boxplot Visualization: Visualize the distribution of numeric data with boxplots.
  • Binary Distribution: Examine binary distribution using count plots.
  • Feature Comparison Graphs: Compare two variables with scatter and line plots.
  • Feature Importance: Assess feature importance for regression and classification tasks.
  • If you want to explore before using the regression model use df_selected1.csv

Regression Analysis

  • Regression Models: Train and compare various regression models, including Gradient Boosting Regressor, Random Forest Regressor, Linear Regression, and Decision Tree Regressor.
  • Model Evaluation: Compare models using R-squared scores and Mean Absolute Error (MAE).
  • Feature Importance: Visualize feature importance.
  • Heat Map: Display a heatmap to visualize correlations.
  • Prediction vs. Actual: Compare predictions against actual values.
  • Residual Plot: Examine residuals to assess model performance.
  • df_selcted1.csv works out of the box if you target price.

Classification Analysis

  • Classification Models: Train and compare various classification models.
  • Model Evaluation: Evaluate models using accuracy, precision, recall, and F1-score.
  • Feature Importance: Visualize feature importance.
  • Heat Map: Display a heatmap to visualize correlations.
  • Prediction vs. Actual: Compare predictions against actual labels.
  • Use churn.csv. You will have to do some feature engineering for best results. Target churn as the feature of interest

Extensive Data Analysis

  • Dataset: Explore the Iris dataset for clustering, dimensionality reduction, and feature selection.
  • Clustering: Visualize clusters in 3D space using K-Means clustering.
  • Dimensionality Reduction: Reduce dimensionality using PCA and visualize in 2D space.
  • Feature Selection: Select top features based on F-values and p-values.
  • This was designed to bring in the hello world of data, the iris data set. For out-of-the-box use, use iris.csv

Time Serires ARIMA Analysis

  • Time Series Analysis with ARIMA: Conduct time series analysis, including ACF and PACF plots, ARIMA model fitting, model diagnostics, forecasting, and model evaluation.
  • Works well with the ParkData_5years.csv file.

Causality Analysis

  • Define Relationships: Define causal relationships between variables using a graphical interface or upload a DOT file.
  • Create Causal Model: Create a causal model based on defined relationships and estimate causal effects.
  • Estimation: Estimate causal effects and assess their significance.
  • Run Refutation Tests: Perform refutation tests to test the reliability of causal estimates.

-Works out of the box if you bring in df_selcted1.csv and bring in in the causal graph section. Target price as the outcome and accommodates as the treatment

Decision Tree Analysis

This section provides an in-depth analysis using Decision Trees, which can be used for both classification and regression tasks. Decision Trees are versatile and interpretable machine learning models that are particularly useful for understanding the decision-making process of the model.

Classifier -

  • Data: To perform classification analysis, you can use the provided dataset DTClassiferWillBuy.csv. This dataset is suitable for predicting whether customers will make a purchase based on various features.
  • Model Training: Choose features and the target variable from the dataset. You can customize the hyperparameters for the Decision Tree Classifier, such as max depth, min samples split, min samples leaf, and max features. These parameters allow you to fine-tune the model’s performance.
  • Model Evaluation: After training the model, you can assess its accuracy and generate a classification report. The classification report includes important metrics such as precision, recall, F1-score, and more to evaluate the classifier’s performance.
  • Visualizations: Visualize the Decision Tree structure to understand how the model makes decisions. This visualization can help you interpret the model’s behavior and identify important features.
  • Use DTClassiferWillBuy.csv; you will have to one hot encode.

Regressor -

  • Data: For regression analysis, you can use the dataset DTRegressionLoan.csv. This dataset is suitable for predicting numerical values, such as loan amounts or prices, based on selected features.
  • Model Training: Similar to the classifier, select features and the target variable. Adjust hyperparameters like max depth, min samples split, min samples leaf, and max features for the Decision Tree Regressor to fit your specific regression task.
  • Model Evaluation: Evaluate the regressor’s performance by calculating metrics like Mean Squared Error (MSE) and R-squared (R²) score. These metrics help measure how well the model predicts numerical outcomes.
  • Visualizations: Visualize the Decision Tree structure for regression to gain insights into how the model splits and predicts values. This visualization can aid in understanding the model’s decision-making process.
  • Use DTRegressionLoan.csv, check features for one hot encoding.

Feel free to explore the Decision Tree Analysis section using the provided datasets. You can train, evaluate, and visualize Decision Trees for both classification and regression tasks, gaining valuable insights into your data.

Save your model

  • Model Saving: After all your hard work of feature engineering exploring the data and the models you can save your current csv file.

Great work @nstepka . I have found a bug when performing one hot encoding.

1 Like

Thank you for bringing that to my attention. The issue you are having, I believe, is that there are some rows in the Gender column that are most likely null.

I have implemented a null check informing you to check for missing values before encoding to make sure you cant get that error

Thanks @nstepka . I have a small suggestion that to add any success message for every task completed.

1 Like

Excellent app.
A quick youtube video explaining with a dataset would be great help too.

1 Like

I thought so, too!
The app has been updated a lot since, but here are a few older versions with a video.

I’ll think about redoing it with the newest version.

1 Like

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.