This is the first part of a project series that applies advanced analytics, including data visualization and Machine Learning predictions, to car fuel consumption.
For this project, I used the "Car Fuel Consumption per gas fuel type" dataset from Kaggle. The Kaggle dataset context poses four questions:
- Question #1: Which gas type consumes the most? E10 or SP98?
- Question #2: How much does it consume?
- Question #3: It consumes 0.4 liters more with E10 gas, isn't it?
- Question #4: Which of the two fuels is cheaper, E10 or SP 98?
All these questions are answered in the notebook of this repository.
- Performing data maintenance or cleaning: Duplicates, null-values, outliers
- String operations: Normalizing to lowercase, converting numeric values to categorical ones and vice versa
- Feature Engineering: Deriving new features, such as 'consumption rate', from distance and consumed gas
- Relational model transformation: For future project steps
- Answering questions from our Exploratory Data Analysis: The main goal of this repository
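The first steps of the list above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the column names, the decimal-comma handling, and in particular the definition of `consume_rate` as `consume / distance` are assumptions made for the example. Outlier handling and the numeric/categorical conversions are left out.

```python
import pandas as pd


def clean_measurements(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper; column names are assumed."""
    df = raw.copy()
    # String operations: normalize column names to lowercase.
    df.columns = df.columns.str.strip().str.lower()
    # Assumption: numbers may be stored as text with decimal commas ("12,3").
    for col in ("distance", "consume"):
        as_text = df[col].astype(str).str.replace(",", ".", regex=False)
        df[col] = pd.to_numeric(as_text, errors="coerce")
    # Data cleaning: drop exact duplicates and rows missing key values.
    df = df.drop_duplicates()
    df = df.dropna(subset=["distance", "consume", "gas_type"])
    # Feature engineering: assumed definition of the consumption rate.
    df["consume_rate"] = df["consume"] / df["distance"]
    return df.reset_index(drop=True)


# Tiny made-up sample, only to show the function running.
raw = pd.DataFrame(
    {
        "Distance": ["28", "12,3", "12,3", None],
        "Consume": ["5", "4,2", "4,2", "4,5"],
        "Gas_Type": ["E10", "SP98", "SP98", "E10"],
    }
)
cleaned = clean_measurements(raw)
print(cleaned)
```

The duplicate row and the row with a missing distance are dropped, so the sample ends up with two rows.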
We can observe that the consumption rate (liters per km), viewed alongside the recorded car speeds, is higher for the gas type SP98:
gas_type | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
E10 | 138.0 | 0.304058 | 0.125085 | 0.04 | 0.22 | 0.295 | 0.3975 | 0.68 |
SP98 | 197.0 | 0.321421 | 0.126652 | 0.09 | 0.24 | 0.310 | 0.4000 | 0.73 |

Mean by gas type:

gas_type | consume_rate | speed |
---|---|---|
E10 | 0.304058 | 43.289855 |
SP98 | 0.321421 | 41.497462 |

Median by gas type:

gas_type | consume_rate | speed |
---|---|---|
E10 | 0.295 | 42.0 |
SP98 | 0.310 | 41.0 |
- E10 has a mean consumption rate of 0.304 and a median of 0.295.
- SP98 has a mean consumption rate of 0.321 and a median of 0.310.
Both gas types show similar patterns in terms of distribution and spread (std), with SP98 having a slightly higher average consumption rate compared to E10.
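Summary tables like the ones above are typically produced with a pandas `groupby`. The sketch below uses a small made-up frame with the same column names as the tables (`gas_type`, `consume_rate`, `speed`); the values are illustrative, not the real measurements.

```python
import pandas as pd

# Made-up sample; the real notebook works on the cleaned dataset.
df = pd.DataFrame(
    {
        "gas_type": ["E10", "E10", "SP98", "SP98"],
        "consume_rate": [0.2, 0.4, 0.3, 0.5],
        "speed": [40, 44, 38, 42],
    }
)

grouped = df.groupby("gas_type")

# count, mean, std, min, quartiles and max of the rate per gas type
summary = grouped["consume_rate"].describe()

# mean and median of rate and speed per gas type
means = grouped[["consume_rate", "speed"]].mean()
medians = grouped[["consume_rate", "speed"]].median()

print(summary)
print(means)
print(medians)
```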
Here is the reason why I created a relational-ish table df:
The previous image shows the merge of the gas consumption records into the treated dataset, which lets us check whether E10 consumes 0.4 liters more than SP98. Grouping by gas type gives the mean and the median of that consumption:

Mean consumption by gas type:

gas_type | consume |
---|---|
E10 | 4.781159 |
SP98 | 4.668020 |

Median consumption by gas type:

gas_type | consume |
---|---|
E10 | 4.7 |
SP98 | 4.6 |
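It consumes 0.4 liters more with E10 gas, isn't it? The answer is: not necessarily. In this data, E10 is higher by only about 0.11 liters on average (4.78 vs. 4.67) and by 0.1 liters at the median (4.7 vs. 4.6).
The result also depends on the number of experiments per gas type (138 records for E10 and 197 for SP98). That is why I did the Feature Engineering in the first place: creating the 'consume_rate' column makes it possible to tell more accurately which gas type consumes the most.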
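The 0.4-liter claim can be checked directly against the mean and median consumption tables above. The snippet below only redoes that arithmetic with the reported values; the variable names are illustrative.

```python
# Values copied from the mean and median consumption tables.
reported = {
    "mean": {"E10": 4.781159, "SP98": 4.668020},
    "median": {"E10": 4.7, "SP98": 4.6},
}
claimed_gap = 0.4  # liters, from Question #3

# Difference E10 - SP98 for each statistic, rounded to 3 decimals.
gaps = {
    stat: round(values["E10"] - values["SP98"], 3)
    for stat, values in reported.items()
}

for stat, gap in gaps.items():
    print(f"{stat}: E10 - SP98 = {gap:.3f} liters (claimed: {claimed_gap})")
```

Both gaps come out well below the claimed 0.4 liters.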
From the Kaggle Dataset context:
- E10 is sold for €1.38
- SP98 is sold for €1.46
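In this analysis, we compare the mean and median prices paid for two types of gasoline: E10 and SP98. The figures below match the mean and median consumption of each gas type multiplied by its price per liter.
The mean price is the average of all prices paid for each type of gasoline. Here are the results: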
- Mean price of E10: 6.598
- Mean price of SP98: 6.815
This means that, on average, the price paid for E10 is lower than the price paid for SP98.
Is E10 cheaper than SP98? Yes, on average, E10 is cheaper than SP98.
The median price is the middle value in a set of data when it is ordered from lowest to highest. Here are the results:
- Median price of E10: 6.486
- Median price of SP98: 6.716
This means that if we order all the prices paid for each type of gasoline, the central value for E10 is lower than the central value for SP98.
Is E10 cheaper than SP98? Yes, according to the median prices, E10 is cheaper than SP98.
Both the mean and median price analyses indicate that E10 is generally cheaper than SP98. This information can be useful for consumers looking for more economical options when choosing the type of gasoline for their vehicles.
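The reported prices can be reproduced by multiplying the consumption figures by the per-liter prices from the Kaggle context. That relationship is inferred from the numbers, not stated in the notebook, so treat the sketch below as a consistency check with illustrative names.

```python
# Per-liter prices in euros, from the Kaggle dataset context.
price_per_liter = {"E10": 1.38, "SP98": 1.46}

# Consumption values from the mean and median consumption tables.
mean_consume = {"E10": 4.781159, "SP98": 4.668020}
median_consume = {"E10": 4.7, "SP98": 4.6}


def cost(consume_by_gas):
    """Consumption times price per liter, rounded to 3 decimals."""
    return {
        gas: round(consume_by_gas[gas] * price_per_liter[gas], 3)
        for gas in price_per_liter
    }


mean_cost = cost(mean_consume)
median_cost = cost(median_consume)

print("mean cost:", mean_cost)
print("median cost:", median_cost)
```

With these inputs E10 comes out cheaper on both measures: its slightly higher consumption is more than offset by the lower price per liter.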
After this Exploratory Data Analysis, the next experiment will be training Machine Learning models for gas consumption predictions. We export the 'pre-processed' dataset as follows:
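A minimal sketch of such an export, assuming the folder layout of this repository (`data/pre_processed/pre_processed_gas_df.csv`). The helper name `export_pre_processed` and the sample frame are hypothetical, and the example writes into a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_pre_processed(df: pd.DataFrame, project_root) -> Path:
    """Write the frame to data/pre_processed/ under the given root."""
    out_path = (
        Path(project_root) / "data" / "pre_processed" / "pre_processed_gas_df.csv"
    )
    # Create the target folders if they do not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the CSV file.
    df.to_csv(out_path, index=False)
    return out_path


# Stand-in for the pre-processed dataset.
sample = pd.DataFrame({"gas_type": ["E10", "SP98"], "consume": [4.7, 4.6]})

with tempfile.TemporaryDirectory() as tmp:
    saved = export_pre_processed(sample, tmp)
    reloaded = pd.read_csv(saved)

print(reloaded)
```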
- pandas: For data manipulation and analysis.
- numpy: For mathematical operations and array manipulation.
- matplotlib.pyplot: For data visualization.
- seaborn: For statistical data visualization.
- re: For regular expression operations.
- scipy: For scientific and technical computing.
└── project
    ├── data
    │   ├── raw
    │   │   └── measurements.csv
    │   └── pre_processed
    │       └── pre_processed_gas_df.csv
    ├── notebooks
    │   └── main.ipynb
    └── README.md
- Do you have any hypothesis?
- Can you make any kind of prediction: regression and/or classification?
- Obtain related data by web scraping or with APIs.
- Load the processed information into a database
For further information, reach me at [email protected]