Statistics For Data Science
Statistics topics
➢ Types of Data
➢ Graphical Representation
➢ Frequency Distribution
➢ Sample and Population
➢ Central Tendency
➢ Measure of Dispersion
What is Data?
Data is any information collected through observation for the purpose of analysis. It may consist of numerical values or descriptive characteristics, and it is broadly divided into 2 categories:
1) Categorical Data
2) Numerical data
Categorical Data
Categorical data describes characteristics rather than quantities: each value is one of a fixed set of options, and each option has a meaning. Examples include language preference, food menu choices, movie ratings, and most of the options we see while filling in forms.
It is of 2 types: Nominal and Ordinal.
Nominal Data
Nominal data consists of options whose order does not matter, for example:
What is your mother tongue?
English, Hindi, Telugu, Tamil
Ordinal Data
Ordinal data is much the same as nominal data, with the difference that the order of the options matters.
Example:- Rating of the movies
* * * * *
* * * *
* * *
* *
*
Numerical Data
1. Discrete Data
Discrete data consists of quantities that can be counted, and it can be represented in a discrete (ungrouped) frequency distribution.
Example:- Number of people using different mobile phone brands in your locality
Mobile Brand No. of Users
Samsung 1 Million
RealMe 0.7 Million
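A discrete frequency distribution like the one above is built by counting how often each distinct value occurs. The sketch below is only an illustration: the list of survey responses (including the brand "Apple") is made up, and `Counter` from Python's standard library does the tallying.

```python
from collections import Counter

# Hypothetical survey responses: the phone brand each respondent uses.
responses = ["Samsung", "RealMe", "Samsung", "Apple", "RealMe", "Samsung"]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(responses)

# Print the distribution, most frequent brand first.
for brand, count in frequency.most_common():
    print(f"{brand:<10}{count}")
```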
2. Continuous Data
Continuous data represents measurements: its values cannot be counted, only measured, or they vary too much to list individually, so they are grouped into intervals.
Example:- Marks scored by 50 students in a test out of 50 marks
Marks Scored No of Students
0-10 5
10-20 14
20-30 18
30-40 9
40-50 4
TOTAL 50
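Grouping raw values into class intervals can be sketched as follows. The raw marks are hypothetical (they are not the data behind the table above), and the convention that each interval includes its lower limit and excludes its upper limit, with the top score of 50 placed in the last class, is an assumption made for this example.

```python
# Hypothetical raw marks (out of 50) for a small class.
marks = [3, 12, 18, 25, 27, 29, 31, 38, 44, 50, 10, 20]

width = 10
# Class intervals 0-10, 10-20, ..., 40-50.
intervals = [(low, low + width) for low in range(0, 50, width)]
grouped = {interval: 0 for interval in intervals}

for mark in marks:
    # A mark of 10 goes to 10-20; the top score 50 is kept in the last class.
    index = min(mark // width, len(intervals) - 1)
    grouped[intervals[index]] += 1

for (low, high), count in grouped.items():
    print(f"{low}-{high}: {count}")
```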
Frequency Distributions
Data can be organized into 5 types of frequency distribution:
I) Discrete Frequency Distribution
II) Grouped Continuous Frequency Distribution
III) Cumulative Frequency Distribution
IV) Relative Frequency Distribution
V) Relative Cumulative Frequency Distribution
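The cumulative, relative, and relative cumulative distributions can all be derived from the plain frequencies. As a rough sketch, the code below reuses the marks table from the continuous data example (frequencies 5, 14, 18, 9, 4); the variable names are chosen only for illustration.

```python
from itertools import accumulate

intervals = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [5, 14, 18, 9, 4]
total = sum(frequencies)

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))
# Relative frequency: each frequency as a fraction of the total.
relative = [f / total for f in frequencies]
# Relative cumulative frequency: each running total as a fraction of the total.
relative_cumulative = [c / total for c in cumulative]

for row in zip(intervals, frequencies, cumulative, relative, relative_cumulative):
    print(*row)
```

The last cumulative value always equals the total number of observations, and the last relative cumulative value is always 1.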
The graphical representation to use depends on the type of frequency distribution.
Graphical Representations
There are 4 graphical representations commonly used in data science:
I) Pie Chart: represents the data pictorially as a circle divided into sectors.
II) Bar Graph & Histogram
A bar graph is used for discrete data, where the data set is not continuous, for example education spending by individual counties. A histogram is used for continuous grouped data, where the class intervals have continuous limits, for example marks scored by students out of 100 grouped in intervals of 10 each.
III) Scatter Plots (correlations)
A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram that uses Cartesian coordinates to display values for, typically, two variables in a set of data. It is used when a dataset needs to be represented in terms of 2 factors.
Correlation
The scatter diagram method is the simplest way to study the correlation between two variables. Each pair of values is plotted on a graph as a dot, giving as many points as there are observations. The degree of correlation is then judged from how the points are scattered.
The degree to which the variables are related depends on how the points are scattered over the chart. The more widely the points are scattered, the lower the degree of correlation between the variables. The closer the points lie to a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.
The following types of scatter diagram indicate the degree of correlation between variable X and variable Y.
1. Perfect Positive Correlation (r = +1)
2. Perfect Negative Correlation (r = -1)
3. High Degree of +Ve Correlation (r = + High)
4. High Degree of –Ve Correlation (r = – High)
5. Low Degree of +Ve Correlation (r = + Low)
6. Low Degree of –Ve Correlation (r = – Low)
7. No Correlation (r = 0)
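The text does not say how r is computed. Assuming it refers to Pearson's correlation coefficient (the usual choice), a minimal sketch looks like this. The function name `pearson_r` and the sample data are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations.

    Assumes neither variable is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rising = pearson_r(x, [2, 4, 6, 8, 10])    # points on a rising straight line
falling = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling straight line
scattered = pearson_r(x, [2, 5, 1, 4, 3])  # points scattered widely
print(rising, falling, scattered)
```

Points that lie exactly on a rising or falling line give r = +1 or r = -1, while the widely scattered points give a value close to 0.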
Central tendency
A measure of central tendency is a single value that describes how a group of data clusters around a central value. We have learnt about methods of representing data graphically and in tabular form. Such representations exhibit certain characteristics or salient features of the data.
We have also studied various methods of finding a representative value of the given data. This value is called the central value of the data, and the various methods for finding the central value are known as the measures of central tendency.
The measures of central tendency are the mean (arithmetic mean), median and mode. They give us one single figure that represents the entire data, i.e., one single figure around which the observations are concentrated.
In other words, measures of central tendency give us a rough idea of where the observations are centered.
Mean
The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the number of numbers.
Mode
The "mode" is the value that occurs most often. If no number in the list is repeated, then there is no mode for the list.
Median
The "median" is the "middle" value in the list of numbers. To find the median, your numbers have to be listed in numerical order from smallest to largest, so you may have to rewrite your list before you can find the median.
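The three measures can be tried out with Python's standard `statistics` module. The data values below are hypothetical. The helper `modes` is also made up for this example: it follows the convention in the text that a list with no repeated value has no mode, which is not how `statistics.mode` behaves.

```python
import statistics
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is repeated, so there is no mode
    return [value for value, count in counts.items() if count == highest]

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]  # hypothetical data

print(statistics.mean(values))    # sum of the values divided by how many there are
print(statistics.median(values))  # middle value after sorting
print(modes(values))              # value(s) that occur most often

# With an even number of values, the median is the average of the two middle values.
print(statistics.median([1, 2, 3, 4]))
```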
Measure of Dispersion
The central values alone do not give a complete idea of a distribution, because they do not tell us how far the observations vary from the central value. To interpret the data better, we also need to know how the observations are scattered, or how closely they are bunched around a central value. Two or more distributions can have the same central value and still differ widely in how they are formed, as discussed below.
Consider following three distributions:
(i) 1, 5, 9, 13, 17
(ii) 3, 6, 9, 12, 15
(iii) 7, 8, 9, 10, 11
All three have the same mean and median (9), but the spread of the data points differs widely between the distributions.
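This can be checked directly. The sketch below computes the mean and median of the three distributions, and uses the range (largest value minus smallest value) as a simple indicator of spread. The dictionary layout is chosen only for illustration.

```python
import statistics

distributions = {
    "(i)": [1, 5, 9, 13, 17],
    "(ii)": [3, 6, 9, 12, 15],
    "(iii)": [7, 8, 9, 10, 11],
}

summary = {}
for label, data in distributions.items():
    summary[label] = {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),  # largest minus smallest value
    }
    print(label, summary[label])
```

The mean and median are 9 in every case, while the range shrinks from 16 to 12 to 4.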
It follows from the above discussion that the central values (mean, median, mode) are not sufficient to give complete information about a distribution. The variability in the observations tells us more about the data, so variability is another factor that must be studied in statistics. As with the central value, a single number can describe the variability of a distribution. This number is called the dispersion of the distribution, and the various methods of measuring it are called the measures of dispersion.
As discussed above, dispersion measures the variation in the values of a variable: it measures how scattered the observations in a distribution are around the central value.
Following are commonly used measures of dispersion:
(i) Range
(ii) Quartile deviation
(iii) Mean deviation
(iv) Standard deviation
In data science, standard deviation and variance are used most often, so students should know them well.
Standard Deviation & Variance
Standard deviation is one of the most important measures and is used extensively in statistics and probability. It is denoted by SD or 𝜎 (sigma).
Variance is equal to the square of the SD and is represented by 𝜎² (sigma squared).
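The text gives no formula, so the sketch below assumes the usual population definition: variance is the mean of the squared deviations from the mean, and SD is its square root. The function names are made up for illustration, and the data is distribution (i) from the dispersion example.

```python
from math import sqrt

def population_variance(data):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def population_sd(data):
    """Standard deviation: the square root of the variance."""
    return sqrt(population_variance(data))

data = [1, 5, 9, 13, 17]
variance = population_variance(data)  # deviations -8, -4, 0, 4, 8
sd = population_sd(data)
print(variance, sd)
```

The standard library offers the same calculations as `statistics.pvariance` and `statistics.pstdev`. When the data is a sample rather than the whole population, the sum is usually divided by N - 1 instead, which is what `statistics.variance` and `statistics.stdev` do.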
Sample and Population
Population:- It is the entire group of people or things that we want to study.
Sample:- It is the part of the population from which we actually collect data or make observations.
Sample Question
A factory overseer selects 50 iPhones at random from those produced that week at the factory, to test their strength.
Sample:- the 50 iPhones selected.
Population:- all iPhones produced at the factory that week.
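Drawing a random sample from a population can be sketched in a few lines. Everything specific here is an assumption made for illustration: the weekly output of 1000 phones, the serial number format, and the fixed seed used to make the selection repeatable.

```python
import random

# Hypothetical serial numbers standing in for every phone produced that week.
population = [f"PHONE-{i:04d}" for i in range(1, 1001)]

rng = random.Random(42)  # fixed seed so the same sample is drawn on every run
# Simple random sample without replacement: no phone is picked twice.
sample = rng.sample(population, k=50)

print(len(population), len(sample))
print(sample[:5])
```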