6.4.3.1. Single Exponential Smoothing

Exponential smoothing weights past observations with exponentially decreasing weights to forecast future values

This smoothing scheme begins by setting $S_2$ to $y_1$, where $S_i$ stands for the smoothed observation, or EWMA (exponentially weighted moving average), and $y$ stands for the original observation. The subscripts refer to the time periods, $1, \, 2, \, \ldots, \, n$. For the third period, $S_3 = \alpha y_2 + (1-\alpha) S_2$; and so on. There is no $S_1$; the smoothed series starts with the smoothed version of the second observation.

For any time period $t$, the smoothed value $S_t$ is found by computing $$ S_t = \alpha y_{t-1} + (1-\alpha)S_{t-1} \,\,\,\,\,\,\, 0 < \alpha \le 1 \,\,\,\,\,\,\, t \ge 3 \, . $$ This is the basic equation of exponential smoothing and the constant or parameter $\alpha$ is called the smoothing constant.
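The recursion can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `single_exponential_smoothing` and the choice to pad the result with `None` in place of the nonexistent $S_1$ are assumptions made here for readability.

```python
def single_exponential_smoothing(y, alpha):
    """Return [None, S_2, ..., S_n] for observations y = [y_1, ..., y_n].

    Uses S_2 = y_1 and S_t = alpha*y_{t-1} + (1-alpha)*S_{t-1} for t >= 3.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    if len(y) < 2:
        raise ValueError("need at least two observations")
    s = [None, y[0]]              # index 0 is period 1 (no S_1); S_2 = y_1
    for t in range(3, len(y) + 1):
        # y[t - 2] is y_{t-1} (0-based list); s[-1] is S_{t-1}
        s.append(alpha * y[t - 2] + (1 - alpha) * s[-1])
    return s


smoothed = single_exponential_smoothing([71, 70, 69, 68], alpha=0.1)
```

With this layout, `smoothed[t - 1]` holds $S_t$, so the list lines up with the observations period by period.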

Note: There is an alternative approach to exponential smoothing that replaces $y_{t-1}$ in the basic equation with $y_t$, the current observation. That formulation, due to Roberts (1959), is described in the section on EWMA control charts. The formulation here follows Hunter (1986).

Setting the first EWMA

The first forecast is very important

The initial EWMA plays an important role in computing all the subsequent EWMAs. Setting $S_2$ to $y_1$ is one method of initialization. Another way is to set it to the target of the process.

Still another possibility would be to average the first four or five observations.

It can also be shown that the smaller the value of $\alpha$, the more important the selection of the initial EWMA becomes. The user would be wise to try a few methods (assuming the software has them available) before finalizing the settings.
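The three initialization options, and the reason a small $\alpha$ makes the choice matter more, can be sketched as follows. Repeatedly substituting the basic equation shows that $S_t$ places weight $(1-\alpha)^{t-2}$ on the starting value $S_2$, so that weight fades slowly when $\alpha$ is small. The helper names `initial_ewma` and `initial_weight`, the default `k=4`, and the target value of 70 are hypothetical choices made only for this illustration.

```python
def initial_ewma(y, method="first", target=None, k=4):
    """Pick a starting value S_2 using one of the three options in the text."""
    if method == "first":          # S_2 = y_1
        return y[0]
    if method == "target":         # S_2 = process target
        if target is None:
            raise ValueError("method='target' requires a target value")
        return target
    if method == "mean":           # S_2 = average of the first k observations
        first = y[:k]
        return sum(first) / len(first)
    raise ValueError(f"unknown method: {method}")


def initial_weight(alpha, t):
    """Weight that S_t places on the starting value S_2: (1 - alpha)**(t - 2)."""
    return (1 - alpha) ** (t - 2)


y = [71, 70, 69, 68, 64, 65]       # illustrative observations
# target=70 is a made-up process target, used only to exercise that branch
starts = {m: initial_ewma(y, m, target=70) for m in ("first", "target", "mean")}
# Weight still carried by the starting value at period 12, for three alphas
weights = {a: initial_weight(a, t=12) for a in (0.1, 0.5, 0.9)}
```

Comparing the entries of `weights` shows how much more of the starting value survives after ten updates when $\alpha = 0.1$ than when $\alpha = 0.9$.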

Why is it called "Exponential"?

Expand basic equation

Let us expand the basic equation by first substituting for $S_{t-1}$ in the basic equation to obtain $$ \begin{eqnarray} S_t & = & \alpha y_{t-1} + (1-\alpha) \left[ \alpha y_{t-2} + (1-\alpha) S_{t-2} \right] \\ & = & \alpha y_{t-1} + \alpha (1-\alpha) y_{t-2} + (1-\alpha)^2 S_{t-2} \, . \end{eqnarray} $$

Summation formula for basic equation

By substituting for $S_{t-2}$, then for $S_{t-3}$, and so forth, until we reach $S_2$ (which is just $y_1$), it can be shown that the expanded equation can be written as: $$ S_t = \alpha \sum_{i=1}^{t-2} (1-\alpha)^{i-1} y_{t-i} + (1-\alpha)^{t-2} S_2 \, , \,\,\,\,\, t \ge 2 \, . $$

Expanded equation for $S_5$

For example, the expanded equation for the smoothed value $S_5$ is: $$ S_5 = \alpha \left[ (1-\alpha)^0 y_{5-1} + (1-\alpha)^1 y_{5-2} + (1-\alpha)^2 y_{5-3} \right] + (1-\alpha)^3 S_2 \, . $$
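As a quick numerical check, the summation formula can be compared against the recursive basic equation. The sketch below uses $\alpha = 0.3$ and five illustrative observations; the function names are assumptions introduced here, not part of any library.

```python
import math


def smooth_recursive(y, alpha, t):
    """S_t from the basic equation, starting at S_2 = y_1 (y[0] is y_1)."""
    s = y[0]
    for period in range(3, t + 1):
        s = alpha * y[period - 2] + (1 - alpha) * s
    return s


def smooth_expanded(y, alpha, t):
    """S_t from the summation formula with i running from 1 to t-2."""
    s2 = y[0]
    # y[t - i - 1] is y_{t-i} in the 0-based list
    total = sum((1 - alpha) ** (i - 1) * y[t - i - 1] for i in range(1, t - 1))
    return alpha * total + (1 - alpha) ** (t - 2) * s2


y = [71, 70, 69, 68, 64]           # illustrative observations y_1..y_5
alpha = 0.3
agree = all(
    math.isclose(smooth_recursive(y, alpha, t), smooth_expanded(y, alpha, t))
    for t in range(2, 6)
)
```

For $t = 5$ the summation inside `smooth_expanded` has exactly the three terms shown in the expanded equation for $S_5$.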

Illustrates exponential behavior

This illustrates the exponential behavior. The weights, $\alpha(1-\alpha)^t$, decrease geometrically, and their sum approaches unity as $t$ grows, as shown below using a property of geometric series: $$ \alpha \sum_{i=0}^{t-1} (1-\alpha)^i = \alpha \left[ \frac{1-(1-\alpha)^t}{1-(1-\alpha)} \right] = 1 - (1-\alpha)^t \, . $$ The summation formula shows that an observation's contribution to the smoothed value $S_t$ shrinks with each additional period of age.

Example for $\alpha = 0.3$

Let $\alpha = 0.3$. Observe that the weights $\alpha(1-\alpha)^t$ decrease exponentially (geometrically) with time.

	Value	weight

last	$y_1$	0.2100
	$y_2$	0.1470
	$y_3$	0.1029
	$y_4$	0.0720
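The weights in the table, and the geometric-series identity for their partial sums, can be reproduced with a short sketch. The variable and function names are illustrative assumptions.

```python
import math

alpha = 0.3

# Weights alpha * (1 - alpha)**k for k = 1..4, rounded as in the table.
weights = [round(alpha * (1 - alpha) ** k, 4) for k in range(1, 5)]


def partial_weight_sum(alpha, t):
    """alpha * sum_{i=0}^{t-1} (1 - alpha)**i, computed term by term."""
    return alpha * sum((1 - alpha) ** i for i in range(t))


# Pair each term-by-term sum with the closed form 1 - (1 - alpha)**t.
checks = {t: (partial_weight_sum(alpha, t), 1 - (1 - alpha) ** t) for t in (1, 5, 20)}
```

Each pair in `checks` should agree to floating-point precision, and the values move toward 1 as $t$ increases.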

What is the "best" value for $\alpha$?

How do you choose the weight parameter?

The speed at which the older responses are dampened (smoothed) is a function of the value of $\alpha$. When $\alpha$ is close to 1, dampening is quick and when $\alpha$ is close to 0, dampening is slow. This is illustrated in the table below.

---------------> towards past observations

$\alpha$	$(1-\alpha)$	$(1-\alpha)^2$	$(1-\alpha)^3$	$(1-\alpha)^4$

0.9	0.1	0.01	0.001	0.0001
0.5	0.5	0.25	0.125	0.0625
0.1	0.9	0.81	0.729	0.6561

We choose as the best value for $\alpha$ the one that results in the smallest MSE.

Example

Let us illustrate this principle with an example. Consider the following data set consisting of 12 observations taken over time:

Time	$y_t$	$S(\alpha=0.1)$	Error	Error squared

1	71
2	70	71	-1.00	1.00
3	69	70.9	-1.90	3.61
4	68	70.71	-2.71	7.34
5	64	70.44	-6.44	41.47
6	65	69.80	-4.80	23.04
7	72	69.32	2.68	7.18
8	78	69.58	8.42	70.90
9	75	70.43	4.57	20.88
10	75	70.88	4.12	16.97
11	75	71.29	3.71	13.76
12	70	71.67	-1.67	2.79

The sum of the squared errors (SSE) is 208.94. The mean of the squared errors (MSE) is SSE/11 = 19.0, where 11 is the number of errors (periods 2 through 12).
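The table can be recomputed with a sketch like the one below, which takes the error at period $t$ to be $y_t - S_t$, as in the table. Because the code carries full precision instead of squaring rounded errors, its totals can differ slightly from the tabulated ones. The variable names are illustrative.

```python
y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
alpha = 0.1

rows = []                      # (period, y_t, S_t, error, squared error)
s = y[0]                       # S_2 = y_1
for t in range(2, len(y) + 1):
    if t > 2:
        # y[t - 2] is y_{t-1}; s on the right-hand side is S_{t-1}
        s = alpha * y[t - 2] + (1 - alpha) * s
    error = y[t - 1] - s       # y_t - S_t
    rows.append((t, y[t - 1], s, error, error ** 2))

sse = sum(r[4] for r in rows)
mse = sse / len(rows)          # 11 errors, periods 2 through 12

for t, obs, s_t, err, sq in rows:
    print(f"{t:2d} {obs:3d} {s_t:6.2f} {err:6.2f} {sq:6.2f}")
print(f"SSE = {sse:.2f}, MSE = {mse:.2f}")
```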

Calculate for different values of $\alpha$

The MSE was again calculated for $\alpha = 0.5$ and turned out to be 16.29, so in this case we would prefer an $\alpha$ of 0.5. Can we do better? We could apply the proven trial-and-error method. This is an iterative procedure beginning with a range of $\alpha$ between 0.1 and 0.9. We determine the best initial choice for $\alpha$ and then search between $\alpha - \Delta$ and $\alpha + \Delta$. We could repeat this perhaps one more time to find the best $\alpha$ to 3 decimal places.
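The trial-and-error search described above might be sketched as a coarse-to-fine grid search. This is one possible reading of the procedure, not a prescribed algorithm: the function names, the nine-point grid, the number of rounds, and the use of the grid spacing as $\Delta$ are all assumptions. Values computed this way at full precision need not match the rounded figures quoted in the text exactly.

```python
def mse_for_alpha(y, alpha):
    """Mean squared error y_t - S_t over t = 2..n, with S_2 = y_1."""
    s = y[0]
    total = 0.0
    for t in range(2, len(y) + 1):
        if t > 2:
            s = alpha * y[t - 2] + (1 - alpha) * s
        total += (y[t - 1] - s) ** 2
    return total / (len(y) - 1)


def refine_alpha(y, lo=0.1, hi=0.9, points=9, rounds=3):
    """Coarse-to-fine trial-and-error search for the alpha with the smallest MSE."""
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = min(grid, key=lambda a: mse_for_alpha(y, a))
        # Narrow the window to best +/- step, keeping 0 < alpha <= 1.
        lo, hi = max(best - step, 1e-6), min(best + step, 1.0)
    return best


y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
best_alpha = refine_alpha(y)
best_mse = mse_for_alpha(y, best_alpha)
```

Each round shrinks the search window around the current best grid point, so adding rounds increases the number of decimal places to which $\alpha$ is resolved. A grid search of this kind can settle on a local minimum if the MSE curve has more than one dip.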

Nonlinear optimizers can be used

But there are better search methods, such as the Marquardt procedure. This is a nonlinear optimizer that minimizes the sum of squares of residuals. In general, most well-designed statistical software programs should be able to find the value of $\alpha$ that minimizes the MSE.

Sample plot showing smoothed data for 2 values of $\alpha$

Plot with raw data and smoothed data for alpha = .1 and alpha = .5