1.2
Basic Statistical Concepts
Introduction
In this subtopic, we will explore fundamental tools and ideas that form the backbone of statistical analysis, as follows:
1.2.1 Graphical Description of Variability
1.2.2 Probability Distributions
1.2.3 Mean, Variance, and Expected Values
1.2.4 Sampling and Sampling Distributions
Learning Outcome (LO)
Upon completion of this lesson, you should be able to:
LO2: Apply key statistical concepts to analyze and interpret data.
1.2
Basic Statistical Concepts
In this topic, we consider experiments to compare two conditions (sometimes called treatments). These are often called simple comparative experiments. We begin with an example of an experiment performed to determine whether two different formulations of a product give equivalent results. The discussion leads to a review of several basic statistical concepts, such as graphical description of variability, probability distributions, mean, variance, and expected values, and sampling and sampling distributions.
An engineer is studying the formulation of a Portland cement mortar. He has added a polymer latex emulsion during mixing to determine if this impacts the curing time and tension bond strength of the mortar. The experimenter prepared 10 samples of the original formulation and 10 samples of the modified formulation.
Each of the observations in the Portland cement experiment described above would be called a run, with the two different formulations regarded as two treatments, or as two levels of the factor formulation. Notice that the individual runs differ, so there is fluctuation, or noise, in the observed bond strengths. This noise is usually called experimental error or simply error. It is a statistical error, meaning that it arises from variation that is uncontrolled and generally unavoidable.
The presence of error or noise implies that the response variable, tension bond strength, is a random variable.
A random variable may be either discrete or continuous.
1.2.1 Graphical Description of Variability
We often use simple graphical methods to assist in analyzing the data from an experiment. These methods are useful for summarizing the information in a sample of data:
The dot diagram
The box plot (or box-and-whisker plot)
The histogram
The dot diagram
The dot diagram is a very useful device for displaying a small body of data (say, up to about 20 observations). It enables the experimenter to see quickly the general location or central tendency of the observations and their spread or variability. For example, in the Portland cement tension bond experiment, the dot diagram reveals that the two formulations may differ in mean strength but that both formulations produce about the same variability in strength.
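As a small illustration, a text-based dot diagram can be sketched in a few lines of Python. The function name `dot_diagram` and the sample values below are hypothetical, not the actual Portland cement measurements.

```python
from collections import Counter

def dot_diagram(values):
    """Render a simple text dot diagram: one row per distinct value,
    with one dot per observation at that value."""
    counts = Counter(values)
    return "\n".join(f"{v:6.2f} | " + "." * counts[v] for v in sorted(counts))

# Hypothetical bond-strength readings (illustrative only)
sample = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
print(dot_diagram(sample))
```

Each row's length immediately shows how often a value occurred, so location and spread can be judged at a glance.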
The histogram
If the data are fairly numerous, the dots in a dot diagram become difficult to distinguish and a histogram may be preferable. Figure 2.2 presents a histogram for 200 observations on the metal recovery, or yield, from a smelting process.
The histogram shows the central tendency, spread, and general shape of the distribution of the data. Recall that a histogram is constructed by dividing the horizontal axis into bins (usually of equal length) and drawing a rectangle over the jth bin with the area of the rectangle proportional to $n_j$, the number of observations that fall in that bin. The histogram is a large-sample tool. When the sample size is small, the shape of the histogram can be very sensitive to the number of bins, the width of the bins, and the starting value for the first bin. Histograms should not be used with fewer than 75–100 observations.
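The bin counts $n_j$ described above can be computed directly. The sketch below assumes equal-width bins over the observed range (one of several reasonable conventions); `histogram_counts` is an illustrative name, not a standard library function.

```python
def histogram_counts(data, num_bins):
    """Divide [min, max] into num_bins equal-width bins and count
    the number of observations n_j falling in each bin."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for y in data:
        # Clamp so the maximum value falls in the last bin rather than past it
        j = min(int((y - lo) / width), num_bins - 1)
        counts[j] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 7, 8, 9, 10]
print(histogram_counts(data, 3))
```

Because the bins here have equal width, areas are proportional to the counts themselves, which is the usual plotted quantity.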
The box plot (or box-and-whisker plot)
The box plot (or box-and-whisker plot) is a very useful way to display data. A box plot displays the minimum, the maximum, the lower and upper quartiles (the 25th percentile and the 75th percentile, respectively), and the median (the 50th percentile) on a rectangular box aligned either horizontally or vertically. The box extends from the lower quartile to the upper quartile, and a line is drawn through the box at the median. Lines (or whiskers) extend from the ends of the box to (typically) the minimum and maximum values.
Figure 2.3 presents the box plots for the two samples of tension bond strength in the Portland cement mortar experiment. This display indicates some difference in mean strength between the two formulations. It also indicates that both formulations produce reasonably symmetric distributions of strength with similar variability or spread.
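The five numbers a box plot displays can be computed as follows. Quartile conventions vary between software packages; this sketch uses Tukey's hinges (median of each half, including the middle point in both halves when $n$ is odd), and the function names are illustrative.

```python
def median(xs):
    """Median of a list (the 50th percentile)."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum
    (quartiles computed as Tukey's hinges)."""
    xs = sorted(data)
    half = (len(xs) + 1) // 2
    return (xs[0], median(xs[:half]), median(xs), median(xs[-half:]), xs[-1])

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))
```

The box spans the two quartiles, the line inside it sits at the median, and the whiskers reach out to the minimum and maximum.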
1.2.2 Probability Distributions
The probability structure of a random variable, say y, is described by its probability distribution. If y is discrete, we often call the probability distribution of y, say p(y), the probability mass function of y. If y is continuous, the probability distribution of y, say f(y), is often called the probability density function for y.
Figure 2.4 illustrates hypothetical discrete and continuous probability distributions. Notice that in the discrete case (Fig. 2.4a) it is the height of the function $p(y_j)$ that represents probability, whereas in the continuous case (Fig. 2.4b) it is the area under the curve $f(y)$ associated with a given interval that represents probability.
The properties of probability distributions may be summarized quantitatively as follows:

If $y$ is discrete: $0 \le p(y_j) \le 1$ for all values of $y_j$; $P(y = y_j) = p(y_j)$ for all values of $y_j$; and $\sum_{\text{all } y_j} p(y_j) = 1$.

If $y$ is continuous: $f(y) \ge 0$; $P(a \le y \le b) = \int_a^b f(y)\,dy$; and $\int_{-\infty}^{\infty} f(y)\,dy = 1$.
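These properties can be checked numerically. The sketch below verifies that a discrete pmf sums to 1 (using a fair die as the example) and that a simple pdf, $f(y) = 2y$ on $[0, 1]$, integrates to 1 via a midpoint-rule approximation; the helper `integrate` is a quick hand-rolled quadrature, not a library routine.

```python
# Discrete example: pmf of a fair six-sided die
pmf = {y: 1 / 6 for y in range(1, 7)}
assert all(0 <= p <= 1 for p in pmf.values())  # each probability lies in [0, 1]
print(sum(pmf.values()))  # total probability, 1 up to floating-point rounding

# Continuous example: pdf f(y) = 2y on [0, 1], integrated by the midpoint rule
def integrate(f, a, b, n=10_000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(lambda y: 2 * y, 0.0, 1.0)
print(round(total, 6))  # area under the pdf, approximately 1
```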
1.2.3 Mean, Variance, and Expected Values
The mean, $\mu$, of a probability distribution is a measure of its central tendency or location. Mathematically, we define the mean as

$$\mu = \begin{cases} \int_{-\infty}^{\infty} y\, f(y)\,dy & y \text{ continuous} \\ \sum_{\text{all } y} y\, p(y) & y \text{ discrete} \end{cases}$$

We may also express the mean in terms of the expected value, or the long-run average value, of the random variable $y$ as $\mu = E(y)$, where $E$ denotes the expected value operator.
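For a discrete distribution, the mean is just the probability-weighted sum of the values, which is short enough to write out directly; `expectation` below is an illustrative helper, shown for a fair six-sided die.

```python
# Expected value of a discrete random variable: mu = sum of y * p(y) over all y
def expectation(pmf):
    return sum(y * p for y, p in pmf.items())

die = {y: 1 / 6 for y in range(1, 7)}
mu = expectation(die)
print(mu)  # the long-run average of a fair die roll, 3.5
```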
The variability or dispersion of a probability distribution can be measured by the variance, defined as

$$\sigma^2 = \begin{cases} \int_{-\infty}^{\infty} (y-\mu)^2 f(y)\,dy & y \text{ continuous} \\ \sum_{\text{all } y} (y-\mu)^2 p(y) & y \text{ discrete} \end{cases}$$

Note that the variance can be expressed entirely in terms of expectation, because $\sigma^2 = E[(y-\mu)^2] = E(y^2) - \mu^2$. Finally, the variance is used so extensively that it is convenient to define a variance operator $V$ such that $V(y) = E[(y-\mu)^2] = \sigma^2$.
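The identity $E[(y-\mu)^2] = E(y^2) - \mu^2$ can be confirmed numerically for a discrete distribution. The sketch below again uses a fair die as the example; `expectation` is an illustrative helper that takes an optional function of $y$.

```python
# E[g(y)] for a discrete pmf; g defaults to the identity, giving the mean
def expectation(pmf, g=lambda y: y):
    return sum(g(y) * p for y, p in pmf.items())

die = {y: 1 / 6 for y in range(1, 7)}
mu = expectation(die)

var_def = expectation(die, lambda y: (y - mu) ** 2)    # V(y) = E[(y - mu)^2]
var_alt = expectation(die, lambda y: y ** 2) - mu ** 2  # E(y^2) - mu^2
print(var_def, var_alt)  # both forms give the same value (35/12 for a fair die)
```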
1.2.4 Sampling and Sampling Distributions
i) Random Samples, Sample Mean, and Sample Variance
The objective of statistical inference is to draw conclusions about a population using a sample from that population. Most of the methods that we will study assume that random samples are used. A random sample is a sample that has been selected from the population in such a way that every possible sample has an equal probability of being selected. Statistical inference makes considerable use of quantities computed from the observations in the sample. We define a statistic as any function of the observations in a sample that does not contain unknown parameters. For example, suppose that $y_1, y_2, \ldots, y_n$ represents a sample. Then the sample mean

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

and the sample variance

$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

are both statistics. These quantities are measures of the central tendency and dispersion of the sample, respectively. Several properties are required of good point estimators. Two of the most important are the following:
1. The point estimator should be unbiased. That is, the long-run average or expected value of the point estimator should be equal to the parameter that is being estimated.
2. An unbiased estimator should have minimum variance. This property states that the minimum variance point estimator has a variance that is smaller than the variance of any other unbiased estimator of that parameter.
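The $n-1$ divisor in $S^2$ is exactly what makes it unbiased, and for a tiny population this can be verified exhaustively rather than by simulation. The sketch below averages $S^2$ over every size-2 sample drawn with replacement from the hypothetical population $\{1, 2, 3\}$ and compares the result with the population variance $\sigma^2$.

```python
from itertools import product

def sample_mean(ys):
    return sum(ys) / len(ys)

def sample_variance(ys):
    """S^2 with the n-1 divisor, the unbiased estimator of sigma^2."""
    ybar = sample_mean(ys)
    return sum((y - ybar) ** 2 for y in ys) / (len(ys) - 1)

# Unbiasedness check by exhaustive enumeration over all 9 size-2 samples
population = [1, 2, 3]
mu = sample_mean(population)
sigma2 = sum((y - mu) ** 2 for y in population) / len(population)  # = 2/3
samples = list(product(population, repeat=2))
avg_s2 = sum(sample_variance(s) for s in samples) / len(samples)
print(sigma2, avg_s2)  # the average of S^2 over all samples equals sigma^2
```

Dividing by $n$ instead of $n-1$ would make the average come out smaller than $\sigma^2$, i.e. a biased estimator.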
ii) Degrees of Freedom
The number of degrees of freedom of a sum of squares is equal to the number of independent elements in that sum of squares. For example, $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ consists of the sum of squares of the $n$ elements $y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y}$. These elements are not all independent because $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$; in fact, only $n - 1$ of them are independent, implying that $SS$ has $n - 1$ degrees of freedom.
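The constraint $\sum (y_i - \bar{y}) = 0$ is easy to see numerically: the deviations from the sample mean always sum to zero, so any $n-1$ of them determine the last one. The data below are arbitrary illustrative values.

```python
def residuals(ys):
    """Deviations of each observation from the sample mean."""
    ybar = sum(ys) / len(ys)
    return [y - ybar for y in ys]

ys = [3.0, 5.0, 7.0, 9.0]
r = residuals(ys)
print(sum(r))  # the deviations sum to zero (up to rounding)

# Knowing any n-1 deviations determines the remaining one:
last = -sum(r[:-1])
print(last, r[-1])  # these agree, so only n-1 deviations are free
```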
iii) The Normal Sampling Distribution
Often we are able to determine the probability distribution of a particular statistic if we know the probability distribution of the population from which the sample was drawn. The probability distribution of a statistic is called a sampling distribution. One of the most important sampling distributions is the normal distribution. If $y$ is a normal random variable, the probability distribution of $y$ is

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2}, \qquad -\infty < y < \infty$$

where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance.
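The normal density is straightforward to evaluate from the formula for any mean and variance; `normal_pdf` below is an illustrative helper using only the standard library.

```python
import math

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# At its mean, the standard normal density equals 1/sqrt(2*pi), about 0.3989
print(normal_pdf(0.0))
```

Note that the curve is symmetric about $\mu$, so `normal_pdf(mu + c)` and `normal_pdf(mu - c)` are always equal.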
Many statistical techniques assume that the random variable is normally distributed.
The central limit theorem is often the justification for assuming approximate normality. It states, essentially, that the sum of $n$ independent and identically distributed random variables (with finite variance) is approximately normally distributed. In many cases this approximation is good for very small $n$, say $n < 10$, whereas in other cases large $n$ is required, say $n > 100$.
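A quick simulation illustrates the theorem: sums of uniform random variables, suitably standardized, behave like standard normal draws. The sketch below fixes a seed so it is reproducible; the sample size and number of replications are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def standardized_sum(n):
    """Sum of n Uniform(0,1) draws, shifted and scaled to mean 0, variance 1.
    Uniform(0,1) has mean 1/2 and variance 1/12."""
    s = sum(random.random() for _ in range(n))
    return (s - n * 0.5) / (n / 12) ** 0.5

draws = [standardized_sum(12) for _ in range(20_000)]
# The empirical mean and standard deviation should be close to 0 and 1
print(round(statistics.mean(draws), 2), round(statistics.stdev(draws), 2))
```

A histogram of `draws` would show the familiar bell shape even though each underlying variable is uniform, not normal.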
Now, let’s proceed to the next subtopic for further exploration.
- Dr. Nurulhuda
Subtopic 1.2
SDE USM
Created on September 5, 2024