This vignette provides an example of model fitting using least squares spectral analysis via lowest frequency iteration (LSSA-LFI). This approach fits haromonic series iteratively to time-index data, which have an unknown frequency or frequencies, along the same lines as the Vanicek method (Vaníček 1969) or the Lomb-Scargle (Lomb 1976; Scargle 1982). But, it differs in the selection of the frequency used to the fit the model, in that the lowest frequency that elicits a peak, rather than the frequency with the highest power, is selected.
The goal is determine if two time-indexed observations are the result of the same data-generating process, in that they are both linearly dependent upon another process. For more details, see the paper, “Revisiting Babylonian Prices: Long-Run and Variable Length Equilibria, ca. 400-80 BCE” (currently under review).
We start by generating three sets of observations over time, of differing sizes. We will create observations from a random walk model, \(z_t = z_{t-1} + a + \epsilon_t\), where \(\epsilon_t\) is distributed \(\mathcal{N}(0,1)\), of a total size of 100 observations, at uniformly random intervals of time. Then, we create two series which are dependent upon \(z_t\), as \(x_{1t} = s_1 z_t + \epsilon_t\) and \(x_{2t} = s_2 z_t + \epsilon_t\). Finally, the third set will be generated independently, \(x_{3t} = x_{3-t-1} + \epsilon_t\). From these three datasets, \(X_1\), \(X_2\), and \(X_3\), observations will be subsampled of random size. These example data will then all be standardized, and so the choice of \(s_1, s_2\) is obviated.
set.seed(9)
Z <- cumsum(rnorm(100))
X1 <- Z + rnorm(100)
X2 <- Z + rnorm(100)
X3 <- cumsum(rnorm(100))
t_ <- sort(runif(100,0,100))
X1 <- data.frame(t = t_, y = X1)
X2 <- data.frame(t = t_, y = X2)
X3 <- data.frame(t = t_, y = X3)
n1 <- sample(4:100, 1)
n2 <- sample(4:100, 1)
n3 <- sample(4:100, 1)
idx1 <- sort(sample(1:100, n1, replace = FALSE))
idx2 <- sort(sample(1:100, n2, replace = FALSE))
idx3 <- sort(sample(1:100, n3, replace = FALSE))
X1 <- X1[idx1, ]
X2 <- X2[idx2, ]
X3 <- X3[idx3, ]
X1$y <- (X1$y - mean(X1$y)) / sd(X1$y)
X2$y <- (X2$y - mean(X2$y)) / sd(X2$y)
X3$y <- (X3$y - mean(X3$y)) / sd(X3$y)A plot of three processes:
plot(X1, type = "l", xlim = c(0,100))
lines(X2, col = "red")
lines(X3, col = "blue")
legend("topright", legend = 1:3, lty = 1, col = c("black", "red", "blue"))These will be represented as continuous functions of time, \(y_1(t)\), \(y_2(t)\), and \(y_3(t)\). Notwithstanding the observations are missing at different times, for different series, we should want to detect that \(y_1(t)\) and \(y_2(t)\), are dependent upon some other process, but not \(y_3(t)\).
Least squares spectral analysis (LSSA) works by fitting the following
trigonometric series: \[
y(t) = \beta_0 + \sum_{i = 1}^k \left[ \beta_{1i} \sin \omega_i t +
\beta_{2i} \cos \omega_i t \right].
\] Where the frequency \(f_i = \omega_i
/ 2 \pi\) is unknown, the method developed by Vaníček (1969) computes the fit for a grid of
potential frequencies, which can be show in a periodogram in which the
power of each frequency expresses its goodness-of-fit. This is
accomplished with the LSSA() function, which takes a data
frame where time indices are in the first column and values are in the
second column:
Typically, the frequency with the highest peak would be chosen, as it would result in the best model fit, and then one could regress another set of terms on the residuals, and choose the highest frequency, and so on.
Selecting the lowest frequency that elicits a peak however ensures
that the dynamics of the longest period are being captured in the model
first. Then, progressing to selecting higher and higher frequencies,
more and more temportal detail being added by successive iterations. The
LSSA_LFI() function performs this routine for a specified
number of iterations (n_iter), returning the coefficients,
frequencies, residual sum of squares (RSS), and Akaike Information
Criterion (AIC). As the data are already standardized with a mean of 0,
the argument intercept = FALSE is used to avoid a constant
term being fit:
mod1 <- LSSA_LFI(X2, n_iter = 1, intercept = FALSE)
mod1
#> $coefs
#> [1] 0.6280522 0.2440574 0.0000000
#>
#> $freqs
#> [1] 0.026
#>
#> $rss
#> [1] 23.12855
#>
#> $aic
#> [1] 86.89368
#>
#> attr(,"class")
#> [1] "LSSA_LFI" "list"To generate the results of the model, the
LSSA_FI_model() function is used, again for a specified
number of iterations. Here, we can show the model fits of 1, 5, 10, 20,
and 30 iterations (30 iterations = 90 parameters in total, with 2
amplitudes and a frequency in each iteration; or, equivalently, an
amplitude, shift, and frequency on each iteration).
mod1 <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 1, intercept = FALSE)
mod5 <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 5, intercept = FALSE)
mod10 <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 10, intercept = FALSE)
mod20 <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 20, intercept = FALSE)
mod30 <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 30, intercept = FALSE)
plot(X2, xlim = c(0,100), ylim = c(-2.5,2.5), pch = 16, main = "y2(t)")
lines(y ~ t, data = mod1, type = "l", col = "azure3")
lines(y ~ t, data = mod5, col = "darkgrey")
lines(y ~ t, data = mod10, col = "grey")
lines(y ~ t, data = mod20, col = "azure4")
lines(y ~ t, data = mod30, col = "black")
legend("topright", legend = c(1,5,10,20,30), lty = 1, col = c("azure3", "azure4", "grey", "darkgrey", "black"))In order to assess an appropriate number of parameters, AIC is used. Given a least squares method is used, its computation is \[ \text{AIC} = 2 (3k + 2) + n \log(\hat{\sigma}^2) \] since there are three parameters in every iteration of the trigonometric series.
Choosing the model which yields the lowest AIC can be done with the
LSSA_LFI() function, inputing the number of iterations to
run. If the data do not support fitting a model of a given number of
parameters (i.e., trying to fit a 100-term model to a dataset with 30
observations), an NA value will be returned. Here, a model
of just 2 iterations yields the lowest AIC score. Again, for \(y_2(t)\):
mod2 <- LSSA_LFI(X2, n_iter = 100, intercept = FALSE)
which.min(mod2$aic)
#> [1] 2
mod2_fit <- LSSA_LFI_model(X2, t_ = seq(0,100, by = 1), n_iter = 2, intercept = FALSE)
plot(X2, xlim = c(0,100), ylim = c(-2.5,2.5), pch = 16, main = "y2(t)")
lines(y ~ t, data = mod2_fit, type = "l", col = "black")We can compare the model for \(y_1(t)\), which has more observations attested:
mod1 <- LSSA_LFI(X1, n_iter = 100, intercept = FALSE)
which.min(mod1$aic)
#> [1] 2
mod1_fit <- LSSA_LFI_model(X1, t_ = seq(0,100, by = 1), n_iter = 2, intercept = FALSE)
plot(X1, xlim = c(0,100), ylim = c(-2.5,2.5), pch = 16, main = "y1(t)")
lines(y ~ t, data = mod1_fit, type = "l", col = "black")The goal then is to check if a model which pools the observations of \(y_1(t)\) and \(y_2(t)\) together fits better when described by one set of parameters, rather than two sets. If the datasets are independent, then their parameters will be free to vary from one another.
This task is performed with the LSSA_LFI_candidates()
function, which takes a list of data frames, a number of iterations, and
whether to include an intercept. While the number of terms for each
dataset were 2, it is possible that a moderl with more terms fits the
pooled data better, and so it will be set to 100 to evaluate all options
up to that iteration. The intercept will again be set to
FALSE. It also includes a sets argument, in
which one specifies a list of partitions of the data to
evaluate. If no partition is given, by default the function evaluates
two situations: one in which the all datasets are independent, and one
in which all are dependent. For just two datasets, this consists of two
situations. But, as the number of datasets increase, the number of
possible partitions incease exponentially as they are dependent upon
Bell’s number (e.g., for six datasets, the total number of partitions
comes to 203).
For sets = NULL, the LSSA_LFI_candidates()
function returns one of two potential outcomes. The first is
[1], which is the set of all datasets pooled together,
indicating that linear dependence is a better fit. The second is
[2], indicating that independence is a better fit.
X <- list(X1, X2)
mod_cand <- LSSA_LFI_candidates(X, n_iter = 100, intercept = FALSE)
mod_cand
#> [1] 1The result of [1] indicates that there exists a linear
dependence for the two series (which is known in this example a
priori). Looking at the case of two independent datasets, the
result is that they are independent (result [2]):
X <- list(X1, X3)
mod_cand <- LSSA_LFI_candidates(X, n_iter = 100, intercept = FALSE)
mod_cand
#> [1] 2Of course, erroneous results can be returned: for more information on the statistical distribution of error rates, contingent upon sample size, see the appendix to the paper, “Revisiting Babylonian Prices.”
One can evaluate all paritions comprehensively (for three datasets, it would amount to six partitions), but this will be time-consuming as the numbers of datasets increase. Moreover, the situation is liable to the possibility of Simpson’s reversal, in which the results change on the basis of the addition, removal, or regrouping of datasets.
In order to mitigate the risk of a reversal, the function
LSSA_LFI_validated() provides validation of the
relationship between two datasets by including a third “attendant”
series. As the relationship between the two datasets to be evaluated
(either dependence or independence) should hold no matter what
additional dataset is introduced, this found runs a routine where the
consistency of the model selection process is evaluated. The input
requires a specific pair of indices, whose linear
relationship are of interest. Then, any remaining series in the input
list are included, one per iteration, as a third attedant series. If the
resulting model selection includes the pair within a linearly dependent
group, then, a value of 1 is returned for that iteration;
if not, a value of 0 is returned. The resulting mean
outcomes are proportional to the probability that there is linear
dependence almost surely, given the data at hand.
X <- list("y1" = X1, "y2" = X2, "y3" = X3)
mod_val <- LSSA_LFI_validated(X, pair = c(1, 2), n_iter = 100, intercept = FALSE)
mod_val
#> [1] 1The probability of linear dependence between \(y_1(t)\) and \(y_2(t)\) comes to 1.0 given just \(y_3(t)\) as a third attendant series..
It would be tedious to have to enter in all pairs of datasets, and so
the function LSSA_LFI_pairwise() function returns an
upper-triangular matrix of all probabilities of linear dependence
between all datasets in the input list. To reduce
computational time for this example, we will use just 10 iterations:
X <- list("y1" = X1, "y2" = X2, "y3" = X3)
mod_pair <- LSSA_LFI_pairwise(X, n_iter = 10, intercept = FALSE)
mod_pair
#> y1 y2 y3
#> y1 NA 1 0
#> y2 NA NA 0
#> y3 NA NA NAThis matrix shows that there exists almost surely linear dependence betwee \(y_1(t)\) and \(y_2(t)\), and that there is almost surely indpendence between \(y_1(t)\) and \(y_3(t)\) and \(y_2(t)\) and \(y_3(t)\).