Tuesday, 31 January 2023

Python pandas Data Exploration and Visualization

 

Data Exploration: Techniques for Exploring and Summarizing Data

Data exploration is an important step in the data analysis process: it helps you understand the structure, distribution and relationships of the data before further analysis. This stage is where potential outliers, missing values, trends and patterns are first identified. In this article, we will look at some common techniques for exploring and summarizing data, including descriptive statistics and data visualization.

  1. Descriptive Statistics

Descriptive statistics summarize the central tendency and dispersion of the data. The following are some of the commonly used descriptive statistics measures:
  • Mean: The average value of the data.
  • Median: The middle value of the sorted data (the average of the two middle values when the number of values is even).
  • Mode: The most frequently occurring value in the data.
  • Range: The difference between the highest and the lowest values in the data.
  • Variance: The average of the squared differences from the mean.
  • Standard Deviation: The square root of the variance.

Example code in Python:


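import numpy as np
from statistics import mode

# Define the data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mean
mean = np.mean(data)
print("Mean:", mean)
# output -> Mean: 5.5

# Median
median = np.median(data)
print("Median:", median)
# output -> Median: 5.5

# Mode (every value here occurs exactly once, so on Python 3.8+
# statistics.mode returns the first value it encounters)
mode_value = mode(data)  # do not reuse the name "mode", it would shadow the function
print("Mode:", mode_value)
# output -> Mode: 1

# Range (peak to peak: max - min); avoid the name "range", it is a built-in
data_range = np.ptp(data)
print("Range:", data_range)
# output -> Range: 9

# Variance (population variance, ddof=0 by default)
variance = np.var(data)
print("Variance:", variance)
# output -> Variance: 8.25

# Standard Deviation (population standard deviation, ddof=0 by default)
std_dev = np.std(data)
print("Standard Deviation:", std_dev)
# output -> Standard Deviation: 2.8722813232690143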
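
Since the title mentions pandas, the same summary can also be sketched with a pandas Series. This is a minimal sketch, assuming pandas is installed; the names s, summary, sample_var and population_var are only illustrative. One thing to watch: pandas computes the sample variance (ddof=1) by default, while np.var and np.std above compute the population versions (ddof=0), so ddof=0 has to be passed to reproduce the numbers above.

```python
import pandas as pd

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s = pd.Series(data)  # illustrative name

# count, mean, std, min, quartiles and max in a single call
summary = s.describe()
print(summary)

# pandas defaults to the sample variance (ddof=1)
sample_var = s.var()
# ddof=0 gives the population variance, matching np.var(data)
population_var = s.var(ddof=0)

print("Sample variance:", sample_var)
print("Population variance:", population_var)
```

The two variances differ because the sample version divides the sum of squared differences by n - 1 instead of n.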
  2. Data Visualization

Data visualization is a powerful tool for exploring and summarizing data. It helps you understand the data better and uncover patterns and trends that are hard to see in raw numbers. Some common data visualization techniques are:
  • Line Plot: A line plot is used to represent continuous data over time.
  • Scatter Plot: A scatter plot is used to visualize the relationship between two variables.
  • Histogram: A histogram represents the distribution of the data.
  • Box Plot: A box plot represents the distribution of the data and highlights any outliers.

Example code in Python using the Matplotlib library:

import matplotlib.pyplot as plt
 
# Define the data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 

# Line Plot

plt.plot(data)

plt.title("Line Plot")
plt.show()
 

# Scatter Plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)
plt.title("Scatter Plot")
plt.show()
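
A scatter plot shows the relationship between two variables visually. As a rough numerical companion, the Pearson correlation coefficient can be computed with np.corrcoef. The following is a minimal sketch using the same x and y as the scatter plot; the name r is only illustrative.

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# np.corrcoef returns a 2x2 correlation matrix;
# the entry at [0, 1] is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print("Correlation:", r)
```

Here every y value is exactly twice the matching x value, so the points lie on a straight line and the coefficient is 1 (up to floating-point rounding).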


 
# Histogram
plt.hist(data, bins=5)
plt.title("Histogram")
plt.show()


You can experiment with the "bins" value, which sets how many equal-width intervals the data is grouped into.
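
To see the effect of "bins" without opening a plot window, the counts can be printed with np.histogram, which is what plt.hist uses to bin the data. This is a small sketch; the bin counts tried here (2, 3 and 5) and the name counts_by_bins are arbitrary choices for illustration.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

counts_by_bins = {}
for bins in (2, 3, 5):
    # counts: number of values in each bin; edges: the bin boundaries
    counts, edges = np.histogram(data, bins=bins)
    counts_by_bins[bins] = counts.tolist()
    print(bins, "bins ->", counts.tolist(), "edges:", edges.tolist())
```

With this data, 5 bins put two values in each bin, while 3 bins give counts of 3, 3 and 4, so the same data can look flat or slightly skewed depending on the choice.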
 
# Box Plot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
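
The data above contains no outliers, so the box plot shows none. By default, Matplotlib's boxplot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below reproduces that rule with NumPy on hypothetical data (the article's values plus one extreme value, 30, added only for illustration).

```python
import numpy as np

# Hypothetical data: the values used above plus one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])

# First and third quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these bounds are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Bounds:", lower, upper)
print("Outliers:", outliers.tolist())
```

Passing the same values to plt.boxplot should draw 30 as a separate point beyond the upper whisker.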



In conclusion, data exploration is an important step in the data analysis process. Descriptive statistics condense the data into a few numbers, while data visualization reveals the distribution, relationships and outliers that those numbers alone can hide.


Acknowledgements

This article was researched and written with the help of ChatGPT, a language model developed by OpenAI.

Special thanks to ChatGPT for providing valuable information and examples used in this article.

