Statistics (Descriptive, Inferential) & Probability for [Machine Learning]
#Definition
Statistics is a form of mathematical analysis that uses quantified models, representations, and synopses for a given set of experimental data or real-life studies.
Types of Statistics & Probability
a: Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations, graphs, or tables.
b: Inferential statistics make inferences and predictions about a population based on a sample of data taken from the population in question.
a1: Probability is simply how likely something is to happen. Whenever we’re unsure about the outcome of an event, we can talk about the probabilities of certain outcomes — how likely they are. The analysis of events governed by probability is called statistics.
#a: List of Topics in Descriptive Statistics
[Range, MAD, Variance, Standard Deviation, Coefficient of Variation, Covariance, Skewness, Kurtosis, Correlation Coefficients (Karl Pearson's, Spearman's Rank)]
- Range: The range, the difference between the largest value and the smallest value, is the simplest measure of variability in the data.
Formula:
Range = highest observation − lowest observation
- MAD: Mean absolute deviation (MAD) gives a sense of how "spread out" the values in a data set are around the mean (for example, as seen on a scatter plot). Formula:
MAD = Σ |xᵢ − mean| / n
- Variance: In statistics, variance measures variability from the average or mean. Formula:
Variance (σ²) = Σ (xᵢ − mean)² / n
- Standard Deviation: The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance. If the data points are further from the mean, there is a higher deviation within the data set; thus, the more spread out the data, the higher the standard deviation. Formula:
Standard Deviation (σ) = √Variance
- Coefficient of Variation: The coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data series around the mean. In finance, the coefficient of variation allows investors to determine how much volatility, or risk, is assumed in comparison to the amount of return expected from investments. Formula:
CV = Standard Deviation / Mean
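As a rough sketch, the dispersion measures above can be computed with NumPy; the numbers below are an arbitrary toy sample used only for illustration:

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7, 9, 5])        # arbitrary toy sample

data_range = data.max() - data.min()              # Range = highest - lowest
mad = np.mean(np.abs(data - data.mean()))         # Mean absolute deviation
variance = data.var(ddof=0)                       # Variance (population formula)
std_dev = data.std(ddof=0)                        # Standard deviation = sqrt(variance)
cv = std_dev / data.mean()                        # Coefficient of variation

print(data_range, mad, variance, std_dev, cv)
```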
- Covariance: Covariance provides insight into how two variables are related to one another. More precisely, covariance refers to the measure of how two random variables in a data set will change together; in machine learning it is important for understanding the relationship of a feature with the target variable. Formula:
Cov(X, Y) = Σ (xᵢ − mean of X)(yᵢ − mean of Y) / n
- Skewness: Skewness describes how asymmetrical a distribution is; a distribution that is not symmetrical is said to be skewed, either to the right or to the left.
Right-skewed (positively skewed): for right-skewed data, Mean > Median > Mode
Left-skewed (negatively skewed): for left-skewed data, Mean < Median < Mode
For a perfect normal distribution, Mean = Median = Mode
- Kurtosis: Kurtosis is a measure of the shape of the curve of a frequency distribution; it describes whether the bell is peaked, flat, or normal (the sharpness of the peak of the curve).
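A quick way to check skewness and kurtosis in practice is scipy.stats; the exponential sample below is just an assumed example of right-skewed data:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)   # right-skewed toy data

print(skew(sample))       # positive value -> right-skewed (Mean > Median > Mode)
print(kurtosis(sample))   # excess kurtosis; 0 corresponds to a normal bell curve
```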
- Correlation coefficient: The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. Formula:
r = Cov(X, Y) / (σ(X) × σ(Y))
- [Karl Pearson’s]: Pearson’s correlation is utilized when you have two quantitative variables and you wish to see if there is a linear relationship between those variables. Your research hypothesis would represent that by stating that one score affects the other in a certain way. The strength and direction of the relationship are reflected in the size and sign of r.
- [Spearman’s Rank]: Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.
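A minimal sketch of covariance and both correlation coefficients, assuming a hypothetical feature/target pair (the values are made up):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # hypothetical feature
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])    # hypothetical target

print(np.cov(x, y)[0, 1])        # covariance of x and y

r, _ = pearsonr(x, y)            # Pearson's r: strength of the linear relationship
rho, _ = spearmanr(x, y)         # Spearman's rho: strength of the rank (monotonic) relationship
print(r, rho)
```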
b: List of Topics in Inferential Statistics
[Interquartile Range (IQR), Statistical Estimation (Point Estimates, Interval Estimates), Hypothesis Testing (T-Test, Z-Test), Error Types in Hypothesis Testing, Central Limit Theorem (CLT), Degrees of Freedom, P-Value, One-Tail and Two-Tail Tests, Chi-Square]
- Interquartile Range (IQR): With the IQR we can get an accurate picture of the spread even when outliers are present (in IQR-based box plots, outliers are effectively identified and set aside).
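A short NumPy sketch of the IQR and the usual 1.5 × IQR box-plot fences, using a made-up sample with one obvious outlier:

```python
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 45])    # 45 is an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(iqr, outliers)    # the outlier does not distort the IQR the way it distorts the range
```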
- Central Limit Theorem (CLT): The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger. A key aspect of the CLT is that the average of the sample means approaches the population mean, and the standard deviation of the sample means approaches the population standard deviation divided by the square root of the sample size.
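The CLT is easy to see in a small simulation; the exponential population and the sample size below are assumptions chosen only to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=3.0, size=100_000)   # clearly non-normal population
n = 50                                                  # sample size

# Means of many repeated samples of size n
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

print(np.mean(sample_means), population.mean())             # close to the population mean
print(np.std(sample_means), population.std() / np.sqrt(n))  # close to sigma / sqrt(n)
```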
- Statistical Estimation: Here we are trying to find the values of population parameters based on a sample.
Point Estimates: In statistics, point estimation involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter.
Interval Estimates: In statistics, interval estimation is the use of sample data to calculate an interval of possible values of an unknown population parameter; this is in contrast to point estimation, which gives a single value. Jerzy Neyman identified interval estimation as distinct from point estimation.
- One-tail vs. two-tail tests: A two-tailed test uses both the positive and negative tails of the distribution; in other words, it tests for the possibility of positive or negative differences. A one-tailed test is appropriate if you only want to determine whether there is a difference between groups in a specific direction.
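A small sketch of a point estimate and a 95% confidence interval for the population mean, assuming a made-up normal sample and using scipy's t-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=10, size=40)    # hypothetical sample

point_estimate = sample.mean()                    # point estimate of the population mean
sem = stats.sem(sample)                           # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=point_estimate, scale=sem)
print(point_estimate, (ci_low, ci_high))          # 95% interval estimate
```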
- P-Value: The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.
- Degrees of Freedom: When estimating a population parameter from a sample you typically end up with n − 1 degrees of freedom, where n is the sample size. Another way to say this is that the number of degrees of freedom equals the number of observations minus the number of required relations among the observations (e.g., the number of parameter estimates).
- Hypothesis Testing: In hypothesis testing we try to answer certain questions regarding the population. Most of the time we have certain assumptions about population parameters, and we use sample data to test them.
Null hypothesis: the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena or no association among groups
Alternate hypothesis: the alternative hypothesis is the position that something is happening; a new theory is preferred instead of the old one.
Testing
#T-Test: We use a t-test when we do not have the population standard deviation and have a small sample size (n < 30).
#Z-test: A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large. It can be used to test hypotheses in which the test statistic is assumed to follow a normal distribution. A z-statistic, or z-score, is a number representing the result of the z-test.
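As a rough illustration of a one-sample t-test (population standard deviation unknown, small sample), with made-up data and a null hypothesis that the population mean is 50:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=8, size=25)     # small sample, sigma unknown

# H0: population mean = 50, H1: population mean != 50 (two-tailed)
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)    # reject H0 at the 5% level if p_value < 0.05
```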
- Chi-square: It often happens that features in a dataset are related to each other; such features do not add much information when building a machine learning model, and it is best to remove one of the related features. The chi-squared test of independence is one of the ways to determine this.
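A minimal sketch of the chi-squared test of independence on a hypothetical contingency table of two categorical features (the counts are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],
                  [20, 40]])    # counts of feature A categories vs feature B categories

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)    # a small p-value suggests the two features are not independent
```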
- Error types in hypothesis testing: A Type 1 error (false positive) is, for example, when you are not pregnant but the doctor says you are. A Type 2 error (false negative) is when you are pregnant but the doctor says you are not, which can be very risky. Depending on the case, we have to decide whether it is more important to reduce Type 1 or Type 2 errors.
a1: List of Topics in Probability
[Classical & Frequentist Approach, Conditional Probability, Independent & Dependent Events, Bayes Theorem, Random variable, Binomial Distribution, Normal Distribution, Standard Normal Distribution]
- Base Relation:
Union (∪): the combination of all the elements of the sets A & B.
Intersection (∩): the combination of all the common elements of the sets.
Complement: the elements 'not' in the set.
Mutually Exclusive: two sets are mutually exclusive if their intersection is a null set.
- Sample space: the set of all possible unique outcomes (the unique values from the population data).
- Event: An event is defined as a subset of the sample space.
- Classical & Frequentist Approach [S: sample space, A: event, P: probability, N: number of counts]:
Classical: P(A) = N(A) / N(S)
Frequentist: P(A) = N(A) / n, where n is the number of trials performed.
- Conditional Probability: Conditional probability is the probability of one event occurring given some relationship to one or more other events; we find the probability of an event under a given condition: P(event A | condition B) = P(A ∩ B) / P(B).
- Independent & Dependent Events: In probability, two events are independent if the incidence of one event does not affect the probability of the other event. If the incidence of one event does affect the probability of the other event, then the events are dependent.
- Bayes Theorem: Bayes' theorem is derived from conditional probability; with it we can find the probability of event A given that event B has already occurred (one event has already happened, and we want the probability of the other): P(A | B) = P(B | A) × P(A) / P(B).
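A worked sketch of Bayes' theorem with invented numbers (a disease-screening example, purely illustrative):

```python
# Assumed numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)    # about 0.16, despite the "95% accurate" test
```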
- Random variable: A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes
- Binomial Distribution: the frequency distribution of the possible number of successful outcomes in a given number of trials, in each of which there is the same probability of success.
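A quick sketch of the binomial distribution with scipy.stats, assuming 10 trials with success probability 0.5 (e.g., 10 fair coin flips):

```python
from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(3, n, p))                  # P(exactly 3 successes)
print(binom.cdf(3, n, p))                  # P(at most 3 successes)
print(binom.mean(n, p), binom.var(n, p))   # mean = n*p, variance = n*p*(1-p)
```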
- Normal Distribution: The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean, so the right side of the center is a mirror image of the left side. The total area under the normal curve is 1, and for it Mean = Median = Mode.
- Standard Normal Distribution: We can convert any normal distribution to a standard normal distribution using the following steps: first subtract the mean, then divide by the standard deviation [Mean = 0, Standard deviation = 1].
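The standardization steps above in code, assuming an arbitrary normal sample: subtract the mean, then divide by the standard deviation, and the result has mean 0 and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=70, scale=12, size=1000)    # any normal distribution

z = (x - x.mean()) / x.std()                   # convert to the standard normal scale
print(round(z.mean(), 3), round(z.std(), 3))   # approximately 0 and 1
```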