Statistics For Data Science


Statistics topics

➢ Types of Data

➢ Graphical Representation

➢ Frequency Distribution

➢ Sample and Population

➢ Central Tendency

➢ Measure of Dispersion

What is Data?
Data is any information collected through observation for the purpose of analysis. It can consist of numerical values or descriptive characteristics, and it is broadly divided into 2 categories:
1) Categorical Data
2) Numerical data

Categorical Data

Categorical data describes characteristics, where each option carries a meaning of its own: language preference, items on a food menu, movie ratings, and most of the options we choose from when filling in forms.

It is of 2 types: Nominal and Ordinal.

Nominal Data

Nominal data consists of options whose order does not matter, for example:

What is your mother tongue?

English, Hindi, Telugu, Tamil

Ordinal Data

Ordinal data is much the same as nominal data, with one difference: the order of the options matters.

Example:- Rating of the movies
* * * * *
* * * *
* * *
* *
*
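The difference between the two types can be sketched in a few lines of Python. This is only an illustration: the rank mapping `star_rank` and the list of ratings are made up for the example, while the languages come from the question above.

```python
# Nominal data: categories with no inherent order, so only equality checks
# and counts are meaningful. Values taken from the mother-tongue example.
languages = ["English", "Hindi", "Telugu", "Tamil"]
same_language = languages[1] == "Hindi"

# Ordinal data: categories with a meaningful order. Mapping each label to a
# rank (an illustrative choice) lets us compare and sort them.
star_rank = {"*": 1, "**": 2, "***": 3, "****": 4, "*****": 5}
ratings = ["***", "*****", "*", "****", "***"]  # made-up ratings

sorted_ratings = sorted(ratings, key=star_rank.get)  # lowest to highest
best_rating = max(ratings, key=star_rank.get)

print(sorted_ratings)
print(best_rating)
```

Sorting the languages would only give alphabetical order, which says nothing about the data itself, whereas sorting the ratings by rank is meaningful.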

Numerical Data

Discrete Data

Discrete data consists of quantities that can be counted, and it can be represented in a discrete (ungrouped) frequency distribution.

Example:- Number of people using different mobile phone brands in your locality

Mobile Brand    No. of Users
Samsung         1 Million
RealMe          0.7 Million
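A discrete frequency distribution is simply a tally of how often each value occurs. As a minimal sketch, assuming we had the individual survey responses (the short list below is invented for illustration and is not the data behind the table), the standard library's `Counter` builds the tally:

```python
from collections import Counter

# Made-up survey responses: the phone brand each respondent uses.
responses = ["Samsung", "RealMe", "Samsung", "Samsung", "RealMe", "Samsung"]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(responses)

# Print the distribution, most frequent brand first.
for brand, count in frequency.most_common():
    print(f"{brand:<10}{count}")
```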

Continuous Data

Continuous data represents measurements: its values cannot be counted, only measured, and they can take any value within a range. Because such values vary so much, they are usually grouped into intervals.

Example:- Marks scored by 50 students in a test out of 50 marks

Marks Scored    No. of Students
0-10            5
10-20           14
20-30           18
30-40           9
40-50           4

TOTAL           50

Frequency Distributions

There are 5 types of frequency distribution into which data can be organized:

I) Discrete Frequency Distribution

II) Grouped Continuous Frequency Distribution

III) Cumulative Frequency Distribution

IV) Relative Frequency Distribution

V) Relative Cumulative Frequency Distribution

Different graphical representations are used depending on the type of frequency distribution.
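The cumulative and relative distributions can all be derived from a plain frequency table. Here is a small sketch using the marks table from the Continuous Data example; the variable names are my own choice:

```python
from itertools import accumulate

# Class intervals and frequencies from the marks table above.
intervals = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [5, 14, 18, 9, 4]

total = sum(frequencies)                      # total number of students
cumulative = list(accumulate(frequencies))    # running total of frequencies
relative = [f / total for f in frequencies]   # share of each class in the total
relative_cumulative = [c / total for c in cumulative]

print(f"{'Marks':<8}{'Freq':>6}{'Cum':>6}{'Rel':>7}{'RelCum':>8}")
for row in zip(intervals, frequencies, cumulative, relative, relative_cumulative):
    print("{:<8}{:>6}{:>6}{:>7.2f}{:>8.2f}".format(*row))
```

The last cumulative frequency always equals the total, and the relative frequencies always add up to 1, which makes a quick sanity check for any table built this way.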

Graphical Representations

There are 4 graphical representations commonly used in data science:

I) Pie Chart -- represents the data pictorially in the form of a circle.
II) Bar Graph & Histogram
A bar graph is used for discrete data, where the values are not continuous, for example education spending by individual counties. A histogram is used for continuous grouped data, where the class intervals have continuous limits, such as marks scored by students out of 100 grouped in intervals of 10 each.

III) Scatter Plots (correlations)

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram that uses Cartesian coordinates to display values for, typically, two variables in a set of data. Scatter plots are used when a dataset needs to be represented in terms of 2 variables or factors.

Correlation

The Scatter Diagram Method is the simplest method to study the correlation between two variables. Each pair of values of the two variables is plotted on a graph as a dot, giving as many points as there are observations. The degree of correlation is then judged by looking at how the points are scattered.

The degree to which the variables are related to each other depends on the manner in which the points are scattered over the chart. The more widely the plotted points are scattered over the chart, the lower the degree of correlation between the variables. The closer the plotted points lie to a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.

The following types of scatter diagrams tell about the degree of correlation between variable X and variable Y.

1. Perfect Positive Correlation (r = +1)

2. Perfect Negative Correlation (r = –1)

3. High Degree of +Ve Correlation (r = + High)

4. High Degree of –Ve Correlation (r = – High)

5. Low Degree of +Ve Correlation (r = + Low)

6. Low Degree of –Ve Correlation (r = – Low)

7. No Correlation (r = 0)
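To put a number on r, one common choice is Pearson's correlation coefficient. I am assuming that is the "r" meant here, since the article does not name a formula. The function `pearson_r` and the sample values below are illustrative only:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                            # made-up values
r_positive = pearson_r(x, [2, 4, 6, 8, 10])    # points on a rising line
r_negative = pearson_r(x, [10, 8, 6, 4, 2])    # points on a falling line
r_weak = pearson_r(x, [2, 1, 4, 3, 3])         # loosely scattered points

print(r_positive, r_negative, round(r_weak, 3))
```

Points lying exactly on a rising or falling straight line give the two perfect cases, while the loosely scattered points give a value somewhere between 0 and +1.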

Central tendency

A measure of central tendency is a single value that describes the way in which a group of data clusters around a central value. We have learnt about methods of representing data graphically and in tabular form. Such representations exhibit certain characteristics or salient features of the data.

We have also studied various methods of finding a representative value of the given data. This value is called the central value for the given data, and the various methods for finding the central value are known as the measures of central tendency.

The measures of central tendency are mean (arithmetic mean), median and mode. We have learnt that the measures of central tendency give us one single figure that represents the entire data, i.e., they give us one single figure around which the observations are concentrated.

In other words, measures of central tendency give us a rough idea of where the observations are centered.

Mean

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the number of numbers.

Mode

The "mode" is the value that occurs most often. If no number in the list is repeated, then there is no mode for the list.

Median

The "median" is the "middle" value in the list of numbers. To find the median, your numbers have to be listed in numerical order from smallest to largest, so you may have to rewrite your list before you can find the median.
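All three measures are available in Python's built-in `statistics` module. A quick sketch, with a made-up list of numbers:

```python
import statistics

scores = [7, 3, 9, 3, 8]  # made-up values for illustration

mean_value = statistics.mean(scores)      # (7 + 3 + 9 + 3 + 8) / 5
median_value = statistics.median(scores)  # middle of the sorted list 3, 3, 7, 8, 9
mode_value = statistics.mode(scores)      # the value that occurs most often

# With an even count, the median is the average of the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mean_value, median_value, mode_value, even_median)
```

Note that `statistics.median` sorts the values internally, so the list does not have to be rewritten in order first.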

Measure of Dispersion

But the central values are inadequate to give us a complete idea of the distribution, as they do not tell us the extent to which the observations vary from the central value. In order to interpret the data better, we should also have an idea of how the observations are scattered, or how closely they are bunched around a central value. Two or more distributions can have the same central value and still differ widely in how they are formed, as discussed below.

Consider following three distributions:

(i) 1, 5, 9, 13, 17

(ii) 3, 6, 9, 12, 15

(iii) 7, 8, 9, 10, 11

All three have the same mean and median (9), but the spread of the data points differs widely from one distribution to the next.
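This is easy to check for yourself. The sketch below takes the three distributions from the text and computes the mean, the median and the range (largest value minus smallest value) for each; the dictionary layout is just one convenient way to organize it:

```python
import statistics

# The three distributions from the text.
distributions = {
    "i": [1, 5, 9, 13, 17],
    "ii": [3, 6, 9, 12, 15],
    "iii": [7, 8, 9, 10, 11],
}

summary = {}
for name, values in distributions.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),  # a simple measure of spread
    }
    print(name, summary[name])
```

The mean and median come out identical for all three, while the range shrinks from the first distribution to the third.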

It follows from the above discussion that the central values (mean, mode, median) are not sufficient to give complete information about a distribution. Variability in the values of the observations gives us better information about the data. So, variability is another factor that needs to be studied in statistics. As with the central value, we can use a single number to describe the variability of a distribution. This single number is called the dispersion of the distribution, and the various methods of determining or measuring dispersion are called the measures of dispersion.

As discussed above, dispersion is the measure of variation in the values of the variable. It measures the degree of scatteredness of the observations in a distribution around the central value.

Following are commonly used measures of dispersion:

(i) Range

(ii) Quartile deviation

(iii) Mean deviation

(iv) Standard deviation
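As a rough sketch, here are all four measures computed for distribution (i) above. Two assumptions to flag: quartiles can be computed under several conventions, and this uses the default method of `statistics.quantiles`; and the standard deviation shown is the population form.

```python
import statistics

values = [1, 5, 9, 13, 17]  # distribution (i) from the text
mean = statistics.mean(values)

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

# Quartile deviation: half the gap between the third and first quartiles.
# Quartile conventions differ; this uses the default method of statistics.quantiles.
q1, _, q3 = statistics.quantiles(values, n=4)
quartile_deviation = (q3 - q1) / 2

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(v - mean) for v in values) / len(values)

# Standard deviation (population form): root of the mean squared deviation.
standard_deviation = statistics.pstdev(values)

print(value_range, quartile_deviation, mean_deviation, standard_deviation)
```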

In the study of data science, standard deviation and variance are used most often, so students must be well versed in them.

Standard Deviation & Variance

This is one of the most important topics and is used extensively in statistics and probability. Standard deviation is denoted by SD or 𝜎 (sigma).

Variance is equal to the square of the SD and is represented by 𝜎² (sigma squared).
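The article does not spell out a formula, so here is the usual population definition as a sketch: variance is the mean of the squared deviations from the mean, and SD is its square root. The helper `population_variance` is my own illustrative name; the values are distribution (ii) from the Measure of Dispersion section.

```python
import math
import statistics

def population_variance(values):
    """Mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

values = [3, 6, 9, 12, 15]  # distribution (ii) from the text

variance = population_variance(values)  # sigma squared
std_dev = math.sqrt(variance)           # sigma

# The standard library gives the same population figure...
assert math.isclose(variance, statistics.pvariance(values))
# ...and a sample version that divides by n - 1 instead of n.
sample_variance = statistics.variance(values)

print(variance, std_dev, sample_variance)
```

Which version to use depends on whether the numbers are the whole population or only a sample drawn from it.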

Sample and Population

Population:- It is the entire group of people or things we want to study.
Sample:- It is the part of the population from which we actually collect data or make observations.

Sample Question

A factory overseer selects 50 iPhones at random from those produced that week at the factory, to test their strength.

Sample:- The 50 iPhones selected.

Population:- All iPhones produced at the factory that week.
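Drawing a random sample like this can be sketched with Python's `random` module. The weekly output of 10,000 phones and the serial-number scheme are assumptions made up for the example; the question itself gives no such figure.

```python
import random

# Hypothetical population: serial numbers of every phone made this week.
# The weekly output of 10,000 units is an assumption for illustration.
population = list(range(1, 10_001))

rng = random.Random(42)               # fixed seed so the draw is repeatable
sample = rng.sample(population, 50)   # 50 distinct units chosen at random

print(len(population), len(sample))
```

`sample` picks without replacement, so no phone is tested twice, and every phone in the population has the same chance of being chosen.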