Data Visualization: A Beginner's Guide to Creating Wonders

Introduction

A substantial part of exploratory data analysis (EDA) involves visualizing data to derive business insights. Numbers alone can give us the results, but as the size of the data grows it becomes hard to interpret them at a glance. Data visualization helps overcome this hindrance. Visualization is powerful: when you know what you want to find, it will show it to you. So be sure to decide what you want to see before you start plotting. Here, let’s look at how to apply data visualization techniques to extract as much as possible of the information hidden in the given data.

A high level Overview of commonly used Data…

Data can be broadly classified into quantitative and qualitative data. Quantitative data is numeric: it consists of numbers that are measured objectively, without bias. A few examples of quantitative data are size measurements such as height, weight, length and area, the marks of a student, the salary of an employee, and climatic measurements such as temperature and air quality index.

Qualitative data is categorized or classified based on characteristics or traits. It is mostly descriptive in nature, like smell, taste or texture, and it is non-numeric.

Quantitative and qualitative data can be further classified as discrete, continuous, nominal, ordinal, binomial, etc. But instead of diving deep into these categories, I would like to keep the classification at this level, which is enough for our present purpose of understanding how to use data visualization techniques effectively.

Types of charts

When it comes to visualization with Python, Matplotlib and Seaborn are the two most extensively used packages, so the examples given here are based on them.

Some of the most frequently used data visualization graphs are bar graphs, line graphs, pie charts, histograms, probability density function and cumulative distribution function graphs, box plots, contour plots and pair plots. The choice of graph depends heavily on the data in hand. Let’s discuss how and where to use these graphs to visualize as much information as possible at a glance.

1. Bar charts.

Bar charts can be used when there is a need to compare data across categories or specified intervals. Depending on the need, one can switch between horizontal and vertical bar charts. To compare different items and also show the composition of each item, a stacked bar chart can be used.

The grouped bar chart below shows data from the Haberman cancer survival data set: based on the number of active lymph nodes, the survival % and the death % within 5 years are compared. It can be inferred at first sight that the death rate increases with the number of active lymph nodes.
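A grouped bar chart of this kind can be sketched with Matplotlib roughly as follows. This is not the code behind the original figure: the node groups and percentages are made-up values for illustration only, and the variable names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

# Made-up percentages for illustration only (not the real Haberman figures).
node_groups = ["0", "1-5", "6-10", "11+"]
survived_pct = np.array([85.0, 70.0, 55.0, 40.0])
died_pct = 100.0 - survived_pct

x = np.arange(len(node_groups))  # one slot on the x-axis per node group
width = 0.4                      # width of each bar

fig, ax = plt.subplots()
# Shift the two series left and right of each slot so the bars sit side by side.
bars_survived = ax.bar(x - width / 2, survived_pct, width, label="Survived 5+ years")
bars_died = ax.bar(x + width / 2, died_pct, width, label="Died within 5 years")
ax.set_xticks(x)
ax.set_xticklabels(node_groups)
ax.set_xlabel("Active lymph nodes")
ax.set_ylabel("Patients (%)")
ax.legend()
# Call plt.show() or fig.savefig("bars.png") to view the result.
```

For a stacked bar chart, one would instead draw both series at the same x positions and pass bottom=survived_pct to the second ax.bar call.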

2. Line Graphs

A line graph helps to understand the progress or trend of the data over a period of time, which may be of short or long duration. It is highly effective for tracking short-term changes in the data. Multiple lines can be drawn to understand the progression or trend of more than one field.

The line graph below shows the trend in the age groups of patients admitted up to 1969. The trend shows that the number of patients admitted falls exponentially with increasing age.
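As a minimal sketch of drawing more than one trend line on the same axes, assuming made-up yearly counts (the two age groups and all the numbers are hypothetical, not taken from the Haberman data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

# Made-up yearly counts, for illustration only.
years = list(range(1960, 1970))
under_50 = [12, 14, 13, 15, 17, 16, 18, 19, 21, 22]
over_50 = [20, 19, 19, 17, 16, 16, 14, 13, 12, 10]

fig, ax = plt.subplots()
# One call to ax.plot per field; each call adds one line to the axes.
ax.plot(years, under_50, marker="o", label="Age under 50")
ax.plot(years, over_50, marker="s", label="Age 50 and over")
ax.set_xlabel("Year of operation")
ax.set_ylabel("Patients admitted")
ax.set_title("Admissions per year by age group (made-up data)")
ax.legend()
# Call plt.show() or fig.savefig("lines.png") to view the result.
```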

3. Pie charts

Pie charts are used to describe the composition of categories in the data. A pie chart represents numbers as percentages, and all the segments together add up to 100%. Pie charts are especially helpful for understanding the composition of a target variable.

The pie chart below shows the percentage composition of the target variable of the well-known iris classification data set.
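A pie chart like this can be sketched as below. The iris data set contains 50 samples of each of its three species, so the counts are typed in directly instead of being loaded from a file; the variable names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

# The iris data set has 50 samples for each of its three species.
species_counts = {"setosa": 50, "versicolor": 50, "virginica": 50}

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    list(species_counts.values()),
    labels=list(species_counts.keys()),
    autopct="%1.1f%%",  # print each slice's share of the total as a percentage
)
ax.set_title("Composition of the iris target variable")
# Call plt.show() or fig.savefig("pie.png") to view the result.
```

Because the classes are perfectly balanced, every slice is labelled 33.3%. An imbalanced target variable would show up immediately as unequal slices.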

4. Scatter plots

Scatter plots can be helpful to visualize the relationship between any two variables. The output can be interpreted to understand the degree of correlation between the given variables. Scatter plots can also be used to understand the linear separability of the data which is illustrated below.

The graph below shows the relationship between petal length and petal width for the three species of the iris data set. A positive correlation is observed in all three species, and the species are almost linearly separable based on petal length and petal width.
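The idea can be sketched with synthetic points standing in for the iris measurements. The three groups, their means and the noise levels below are assumptions chosen only so that each group shows a positive correlation; they are not the real iris values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical mean petal length (cm) for three synthetic groups.
groups = {"species A": 1.5, "species B": 4.3, "species C": 5.6}

fig, ax = plt.subplots()
correlations = {}
for name, mean_length in groups.items():
    length = rng.normal(mean_length, 0.3, size=50)
    # Width grows with length plus a little noise, giving a positive correlation.
    width = 0.3 * length + rng.normal(0.0, 0.05, size=50)
    ax.scatter(length, width, label=name, alpha=0.7)
    # Pearson correlation between the two variables within this group.
    correlations[name] = np.corrcoef(length, width)[0, 1]

ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
ax.legend()
# Call plt.show() or fig.savefig("scatter.png") to view the result.
```

Plotting each group with its own ax.scatter call gives every species its own colour, which is what makes the separation between groups visible.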

When more than two variables have to be compared, pair plots can be used to visualize higher-dimensional data easily. But as the number of dimensions increases, say beyond 10, more complex techniques like PCA and t-SNE can be used.

5. Heat maps

Heat maps plotted from correlation values are an alternative to pair plots. On a scale of 0–1, with 1 being the strongest correlation, the graph below shows the relationships between the different variables in the Haberman cancer survival data set. Note that a correlation coefficient can in general range from -1 to 1, with negative values indicating an inverse relationship.
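A correlation heat map can be sketched with pandas and seaborn as follows. The data frame here is filled with random numbers under the column names age, year, nodes and status, purely as a stand-in for the real data set, so the correlations it shows are meaningless. The colour scale is set to the full -1 to 1 range, which is my choice and may differ from the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 200
# Random stand-in data; the values are synthetic, not the Haberman records.
df = pd.DataFrame({
    "age": rng.integers(30, 84, size=n),
    "year": rng.integers(58, 70, size=n),
    "nodes": rng.poisson(4, size=n),
    "status": rng.integers(1, 3, size=n),
})

corr = df.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots()
# annot=True writes the correlation value inside each cell.
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
# Call plt.show() or fig.savefig("heatmap.png") to view the result.
```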

Data visualization can be sub-divided into

1. Univariate (1 variable) analysis

2. Bivariate (2 variable) analysis

3. Multivariate (more than 2 variables) analysis.

Most univariate visualization is aimed at deriving statistical inferences. Let’s continue our discussion of which graphs to use and where.

6. Histograms

Histograms can be used to visualize the frequency distribution of the values in a data set. A histogram shows how often the values fall into each interval (bin). Histograms help to understand the skewness and, to some extent, the type of distribution (normal, log-normal, etc.).

The graph below shows that the distribution of nodes is highly skewed to the right, implying that very few of the patients admitted have more than 10 nodes.
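A right-skewed histogram of this kind can be sketched as below, assuming synthetic node counts drawn from an exponential distribution (an arbitrary choice made only to produce a long right tail, not a model of the real data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
# Synthetic, right-skewed node counts (not the real Haberman values).
nodes = rng.exponential(scale=4.0, size=300).astype(int)

fig, ax = plt.subplots()
# ax.hist returns the count in each bin, the bin edges and the drawn bars.
counts, bin_edges, bars = ax.hist(nodes, bins=20, edgecolor="black")
ax.set_xlabel("Number of active lymph nodes")
ax.set_ylabel("Number of patients")

# In a right-skewed sample the long tail pulls the mean above the median.
mean_nodes = nodes.mean()
median_nodes = np.median(nodes)
# Call plt.show() or fig.savefig("hist.png") to view the result.
```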

7. Distribution plots

Distribution plots, provided by seaborn, are an advanced version of histograms. They combine matplotlib's hist function with seaborn’s kdeplot() and rugplot(). A distribution plot displays a histogram together with a KDE curve as a probability density function, so it shows the frequency of occurrence and also helps to understand the shape of the distribution.

The distribution plots below show the distribution of ages of patients who survived and of those who died within 5 years of the surgery. It can be seen that patients aged 45–60 years show a higher probability of dying after the surgery than of surviving.

8. Box plots

Box plots, also called box-and-whisker plots, are used when there is a need to understand the summary (descriptive) statistics of a variable: the quartiles (1st, 2nd (median) and 3rd), the minimum and maximum, and the outliers. A box plot serves as a one-stop point for understanding these basic statistical terms in a single plot.

This box plot explains the summary statistics of age and year for the patients who survived more than 5 years (status 1) and those who died within 5 years (status 2) of the surgery.

For example, the ages of the patients who survived beyond 5 years have a median of 52 years, a minimum of 30 years and a maximum of 78 years, and the middle 50% of ages (the interquartile range, IQR) lies between 43 and 60 years. The other quartile ranges can be inferred in the same way.
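To make the link between a box plot and these summary statistics concrete, here is a minimal sketch. It uses a toy sample made of just the five summary values quoted above, not the real patient ages, and computes the quartiles with NumPy alongside the plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

# Toy sample built from the five summary values in the text (not real data).
ages = np.array([30, 43, 52, 60, 78])

# Quartiles: 25th, 50th (median) and 75th percentiles.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

fig, ax = plt.subplots()
ax.boxplot(ages)  # box spans Q1 to Q3, with a line at the median
ax.set_xticklabels(["status 1"])
ax.set_ylabel("Age (years)")
# Call plt.show() or fig.savefig("box.png") to view the result.
```

For this toy sample the computed values are Q1 = 43, median = 52, Q3 = 60 and IQR = 17. By default Matplotlib draws the whiskers out to the most extreme points within 1.5 times the IQR of the box and marks anything beyond that as an outlier.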

9. Violin Plots

Violin plots are an extension of box plots with some advanced options: a violin plot holds a rotated probability density curve on either side. Violin plots are useful when the data distribution is multi-modal (has more than one peak), as they can show the different peaks, their positions and their relative amplitudes.

Here, the violin plots help in visualizing the summary statistics, as a box plot does, and through the PDFs they also help in visualizing the multiple peaks present in the ages of patients who died within 5 years (status 2) of surgery.
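The way a violin plot reveals more than one peak can be sketched with Matplotlib's violinplot. The two samples are synthetic: the first has a single peak, and the second is deliberately built from two clusters so that it is bimodal. The means and spreads are assumptions, not values from the data set.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
# Synthetic ages: one unimodal group and one deliberately bimodal group.
ages_status1 = rng.normal(52, 10, size=200)
ages_status2 = np.concatenate([
    rng.normal(45, 4, size=60),  # first peak
    rng.normal(65, 4, size=60),  # second peak
])

fig, ax = plt.subplots()
# Each violin is a mirrored density curve; showmedians adds a median line.
parts = ax.violinplot([ages_status1, ages_status2], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["status 1", "status 2"])
ax.set_ylabel("Age (years)")
# Call plt.show() or fig.savefig("violin.png") to view the result.
```

A box plot of the second group would show only one box, hiding the two clusters that the violin shape makes visible.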

10. Probability density function (PDF) and cumulative distribution function (CDF) plots

PDF and CDF are other forms to represent and understand the distribution of a random variable in terms of likelihoods and cumulative functions respectively.

The specialty of the probability density function (PDF) is that it describes the relative likelihood of a random variable taking values in a given range. PDF curves can be used by stock managers in analyzing financial data, where there is a need to estimate the possible returns that a stock may yield in the future.

The cumulative distribution function (CDF) represents the probability that the random variable (say X) takes a value less than or equal to x, that is, F(x) = Pr(X ≤ x). It takes values between 0 and 1.

Computation of PDF and CDF:

Histograms can be used to construct probability distributions, and PDFs in turn can be used to create cumulative distribution functions (CDFs), which add up the probabilities of occurrence cumulatively. A CDF always starts at zero and ends at 100%.

The PDF can be generated from a histogram, and the CDF from the PDF, with simple code. The data considered here is the Haberman cancer survival data set, and the code computes the PDF and CDF of the ages of patients admitted to the hospital. The numpy package is used for the computation (import numpy as np).

The graph below shows the PDF and CDF of the ages of patients admitted. The CDF curve shows that roughly 90% of the people admitted are less than 65 years of age, and the PDF curve shows that fewer than roughly 10% are between 70 and 80 years of age.
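The computation described here can be sketched as follows. This is my own reconstruction, not the original code of this post, and it runs on synthetic ages because the data file is not included here. Strictly speaking, dividing the bin counts by their total gives the probability of falling in each bin rather than a density, which is sufficient for reading off percentages as done in this section.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
# Synthetic ages standing in for the age column of the data set.
ages = rng.normal(52, 11, size=306)

# Step 1: histogram, i.e. the number of patients in each of 10 age bins.
counts, bin_edges = np.histogram(ages, bins=10)

# Step 2: PDF, the fraction of patients falling in each bin (sums to 1).
pdf = counts / counts.sum()

# Step 3: CDF, the running total of the PDF, ending at 1 (100%).
cdf = np.cumsum(pdf)

fig, ax = plt.subplots()
# Plot each value against the right edge of its bin.
ax.plot(bin_edges[1:], pdf, label="PDF")
ax.plot(bin_edges[1:], cdf, label="CDF")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Probability")
ax.legend()
# Call plt.show() or fig.savefig("pdf_cdf.png") to view the result.
```

To read the plot, pick an age on the x-axis: the height of the CDF curve there is the fraction of patients at or below that age.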

Conclusion

This post doesn’t lay down any hard and fast rules for the usage of these plots, but recommends best practices that can be helpful in EDA. To conclude:

· To compare data, bar charts

· To know the trend in data, line charts

· To know the composition of the data, pie charts

· To know the distribution, histograms, box plots, violin plots, and PDF and CDF plots

· To know the relations between two variables, scatter plots and heat maps

Thanks for your patience. I hope I have done some justice to my first technical blog. I bet you’ll have something to say! If you find this helpful, spread the word by sharing it with a friend. Feel free to encourage me with your creative suggestions for improvement at jssuriyakumar@gmail.com. You are most welcome.

References: Wikipedia; Python Data Science Handbook by Jake VanderPlas.

For any queries related to datasets, or coding, feel free to contact me @ jssuriyakumar@gmail.com.

ML Data Associate | Amazon