A Guide to Lakhesis

This vignette provides a guide to the R package lakhesis. Please refer to the paper, “Lakhesis: Consensus Seriation via Iterative Regression of Partial Rankings for Binary Data,” for details.

Introduction

Seriation can be accomplished using a variety of methods. The main goals of the lakhesis package, which focus on binary matrices (i.e., presence/absence, 0/1), are to provide:

  1. an interactive graphical interface for investigators to select their own well seriated data and perform seriation via correspondence analysis;
  2. the means to establish a single consensus seriation from multiple, partial seriations via a critical coefficient of optimality.

While there are already a number of approaches and tools to perform seriations, lakhesis uses a technique of fitting CA scores to those of a curve produced by an ideal, well seriated “reference matrix” via a Procrustes method (center, scale, and rotate).

Since well seriated matrices are not a given for every data set, it is expedient for users to select their own seriations quickly via an interactive graphical interface, rather than slicing matrices and re-running commands via the console. Investigators can select multiple seriations, moreover, which can then be harmonized into a single consensus seriation.

The lakhesis package, named for the ancient Greek goddess who measured the thread of fate, provides this functionality, to explore and select well seriated sequences as “strands” that are then “threaded” together into a single consensus seriation. The package can either be installed from CRAN, or from GitHub, the latter of which hosts the most recent development version. To install from GitHub, the library devtools is needed:

library("devtools")
install_github("scollinselliott/lakhesis", dependencies = TRUE, build_vignettes = TRUE) 

This vignette showcases the primary functions of the package along with the Lakhesis Calculator, a shiny app whose interface offers the means to perform and critically examine seriated sequences.

Getting Started: Formatting Data

The Calculator is designed to import csv files directly from the user’s file directory. The csv file must be formatted in a “long” format where row and column incidences are listed in rows, pair by pair. To convert data that is already in the form of a matrix, the im_long() function can used. The "quattrofontanili" data included in the package can serve as an example. This data set comprises a matrix of 81 rows and 82 columns, recording the incidence of a particular artifact-type (column) in a tomb (row) from Early Iron Age necropoleis in Southern Etruria (see ?quattrofontanili).

library("lakhesis")
data("quattrofontanili")
qf_long <- im_long(quattrofontanili)

The data.frame qf_long can then be exported to a csv file using the write.table() function, in order to avoid writing row and column headers:

write.table(qf_long, file = "qf.csv", row.names = FALSE, col.names = FALSE, sep = ",")

The file qf.csv is then ready to be uploaded into the Lakhesis Calculator.

Pooling Observations

If one has multiple data sets to pool together, it is simply a matter of adding supplementary rows to the csv file. The im_merge() function assists with creating a single incidence matrix from two different incidence matrices, overriding any 0 values with 1s in the case of shared row and/or column elements. The merged matrix can then be converted using im_long() and exported, as above.

qf1 <- quattrofontanili[1:20, 1:40]
qf1 <- qf1[rowSums(qf1) != 0, colSums(qf1) != 0]

qf2 <- quattrofontanili[30:50, 20:60]
qf2 <- qf2[rowSums(qf2) != 0, colSums(qf2) != 0]

im_merge(qf1, qf2)

Procrustes-Fit Correspondence Analysis

The Lakhesis Calculator uses correspondence analysis (CA), a popular method used for seriation especially in the form of detrended correspondence analysis (DCA). Unlike DCA, lakhesis seriates observations according to a “reference curve,” since the plot of principal scores of well seriated rows and columns will have the form of an arch or horseshoe. Principal scores of the data are fit to the curve of the reference matrix using a Procrustes method (center, scale, rotate), with axes are properly labelled “Procrustes1” and “Procrustes2” since they no longer represent the chi-squared distance between row and columnt elements, separately. Procrustes fitting is done without landmark points by minimizing the squared Euclidean distance of scores to those of the reference matrix.

This task is accomplished with the function ca_procrustes(), which takes an incidence matrix as an input and returns an object procrustean that contains the reference and data scores. The reference matrix is generated internal to ca_procrustes() using the im_ref() function. The plot function automatically illustrates the fit of the points to the reference curve:

data("quattrofontanili")
qf_caproc <- ca_procrustes(quattrofontanili)
class(qf_caproc)
#> [1] "procrustean" "list"
print(qf_caproc)
#> Procrustes-fit CA scores of class procrustean list 
#>   $ref : principal scores: reference points
#>   $x : principal scores: row points
#>   $y : principal scores: column points
plot(qf_caproc)

In order to determine a seriation for both the row and column points, scores are projected onto the quadratic model which fits the reference points. Then, their ranking is taken from one end of the curve to another. The function ca_procrustes_ser() takes an incidence matrix as input, and produces as output a strand object, which comprises a list that includes data regarding the row and column score points, distance, and rankings, as well as the seriated matrix according those rankings.

qf_ser <- ca_procrustes_ser(quattrofontanili)
class(qf_ser)
#> [1] "strand" "list"
print(qf_ser)
#> Procrustes-fit CA of class: strand list 
#> Use summary() for more information.
summary(qf_ser)
#> Procrustes-fit CA of class: strand list 
#> 
#> Ranking of row elements: 
#> QF S7 QF M22 V la F 29/15 QF X15 QF W15 V la F 29/18 QF O16 QF O21 V la F 29/2 
#> QF AAMB V la F 33/1 QF M7 QF AAMA QF 18-19B V la F 29/5 QF BBMB QF BB16B 
#> QF CC14B QF CC15 V la F 29/20 QF CC17A QF AA12 QF II 15-16 QF EE16-17 QF CC12 
#> QF EE14-15 QF DD17 QF BB18A QF CCDD19 QF EE17-18B QF FF19B QF KKLL16 QF HH18 
#> QF FF18-19A QF GG16-17 QF MM14 QF II9-10 QF II13-14B QF KKLL18-19 QF LL18 
#> QF LL12-13 QF FF14-15 QF KK15 QF HH11-12 QF MM19-20 QF KKLL17 QF JJ12-13 
#> QF IIJJ19 QF II18-19 QF GGHH13 QF GGHH19 QF JJ18-19B QF JJ14-15 QF HH7-8 
#> QF JJ17-18 QF ZAA17-18 QF CCDD15-16 QF MM10-11 QF Z15 XV XIII QF JJ17 XIX XVI 
#> QF JJ8 XII I XVII VIII XXIV XX IV X VII IX VI V XI XVIII XIV XXIII
#> 
#> Ranking of column elements: 
#> Type 7 Type 4 Type 3 Type 8 Type 9 Type 2 Type 1 Type 6 Type 10 Type 5 Type 16 
#> Type 13 Type 15 Type 12 Type 22 Type C Type 11 Type 17 Type 19 Type 23 Type 14 
#> Type 26 Type 24 Type 21 Type 18 Type 27 Type 20 Type 31 Type 25 Type 29 Type 28 
#> Type 30 Type 32 Type 43 Type 44 Type 48 Type 51 Type 47 Type 54 Type 55 Type 39 
#> Type 38 Type 52 Type 36 Type 56 Type 33 Type 37 Type 42 Type 34 Type 45 Type 46 
#> Type 40 Type 50 Type 49 Type 35 Type 41 Type 53 Type 57 Type 58 Type 63 Type 61 
#> Type 59 Type 62 Type 60 Type 71 Type 65 Type 64 Type 66 Type 72 Type 69 Type 75 
#> Type 68 Type 77 Type 78 Type 67 Type 73 Type 76 Type 74 Type 70 Type 80 Type 81 
#> Type 79
#> 
#> $dat contains the following columns:
#>    $Procrustes1, $Procrustes2 : x,y coords for the Procrustes-fit CA principal scores
#>    $CurveIndex: index of nearest point on refrence curve to the Procrustes-fit CA score point
#>    $Distance: distance of principal score point to nearest point on reference curve
#>    $Rank: ranking of score points projected onto the reference curve
#>    $Type: "row" or "col"
#> $im_seriated: the seriated incidence matrix

Three different plot types are available for a strand object, set in the plot() function via the additional display argument:

  • "ca" produces a Procrustes-fit CA plot as above, with axes Procrustes1 and Procrustes2. This is the default setting.
  • "ref" produces a plot showing the distance of each row and column point from the reference curve. The indices of the reference curve are shown on the x axis while the distance of each point from the curve is shown on the y axis.
  • "im_seriated" produces a matrix plot of the seriated incidence matrix, showing the measure of concentration κ (on which see below).
plot(qf_ser, display = "ca")

plot(qf_ser, display = "ref")

plot(qf_ser, display = "im_seriated")

The resulting incidence matrix may be extracted from the strand by calling $im_seriated:

quattrofontanili.caproc <- qf_ser$im_seriated

Lakhesis Analysis

In the process of exploring matrices for seriated sequences, an investigator is liable to come across many sequences which appear to be well seriated. The question then arises as to how to harmonize or reconcile several partial, seriated sequences, some or many of which may disagree. This problem is non-trivial, analogous to other situations such as e.g. ranked-choice or preference voting, although not strictly identical.

The strategy adopted here, called a “Lakhesis” method, is to iterate simple linear regressions in an agglomerative fashion. The procedure consists of the followings steps:

  1. To initialize, perform pairwise linear regression on all strands (each strand must have at least 4 joint elements with at least one other strand), for row and column elements separately.
    1. For any two strands, select one as the dependent (y) and another as the independent (x) variate.
    2. Remove all elements for there is an NA value in either strand.
    3. Use the regression line, f(x) = β1x + β0, to project the ranks of the independent variate onto the dependent (thereby supplying rankings for any NA values in the dependent variate).
    4. If y ≠ f(x), the mean of y and f(x) is used for that element.
    5. The values of the dependent and regressed independent variates are re-ranked.
  2. The concentration coefficient κ (on which see below) is computed for each pairwise regression.
  3. The pair of strands is chosen that elicits the lowest concentration measure. The regression from this pair is then selected as the dependent variate for the next iteration.
  4. Perform pairwise regression with all remaining strands as the independent variate and the regression from the previous iteration, repeating Steps 1b-3.
  5. Repeat Step 4 until all strands have been regressed, resulting in a single consensus seration.

Linear regression has its virtue in that partial rankings which may be identical in their relative sequence but which may run in reverse will be regressed into the same order, merely with a negative slope.

Given that the order in which strands are regressed has a bearing on the resulting consensus seriation, the use of an optimality measure to determine that order is helpful. Especially if the number of strands were to exceed 10, all possible permutations of the order of strand regression could not be evaluated in a reasonable amount of time. Similarly, PCA (which was initially explored as a method of harmonizing rankings) would only work as a method of ranking imputation if there were at least four elements jointly shared across all strands, a restriction which was undesirable given a potential need to perform consensus seration on a large set of data in piecemeal fashion.

The lakhesize() Function

Computing a consensus seriation for a set of strands via the steps above is performed by the lakhesize() function, which takes as its input a strands object (i.e., a list of individual strands produced by ca_procrustes_ser()).

The lakhesize() function has the following options:

  • pbar Turns the progress bar on or off. Default is TRUE.

A predefined qfStrands data object is included in the package to illustrate functionality of the lakhesize() function, comprised of three strands from the quattrofontanili data:

data("qfStrands")
print(qfStrands)
#> List of 3 strands
L <- lakhesize(qfStrands, pbar = FALSE)
summary(L)
#> Lakhesis analysis of class: lakhesis list 
#> 
#> Ranking of row elements: 
#> XXIII, XIV, XVIII, XI, V, VI, IX, VII, X, IV, XX, VIII, XXIV, XVII, I, XII, 
#> QF JJ8, XVI, XIX, XIII, XV, QF CCDD15-16, QF ZAA17-18, QF HH7-8, QF MM19-20, 
#> QF KK15, QF KKLL17, QF JJ18-19B, QF MM14, QF JJ14-15, QF II13-14B, QF LL12-13, 
#> QF MM10-11, QF IIJJ19, QF GG16-17, QF KKLL18-19, QF II18-19, QF HH11-12, 
#> QF GGHH13, QF II9-10, QF FF18-19A, QF LL18, QF JJ12-13, QF FF14-15, QF GGHH19, 
#> QF JJ17-18, QF HH18, QF KKLL16, QF Z15, QF FF19B, QF JJ17, QF EE17-18B, 
#> QF CCDD19, QF EE14-15, QF DD17, QF BB18A, QF EE16-17, QF CC12, QF AA12, 
#> QF II 15-16, QF CC17A, V la F 29/20, QF CC15, QF CC14B, QF BB16B, QF BBMB, 
#> V la F 33/1, QF 18-19B, V la F 29/5, QF M7, V la F 29/2, QF AAMA, QF AAMB, 
#> QF X15, QF O16, QF O21, V la F 29/15, QF M22, QF W15, V la F 29/18, QF S7
#> 
#> Ranking of column elements: 
#> Type 79, Type 80, Type 81, Type 70, Type 74, Type 76, Type 73, Type 67, 
#> Type 77, Type 78, Type 68, Type 75, Type 69, Type 72, Type 66, Type 64, 
#> Type 65, Type 60, Type 71, Type 62, Type 59, Type 61, Type 63, Type 58, 
#> Type 57, Type 53, Type 41, Type 49, Type 50, Type 55, Type 56, Type 37, 
#> Type 52, Type 54, Type 42, Type 34, Type 47, Type 35, Type 45, Type 43, 
#> Type 51, Type 33, Type 36, Type 39, Type 38, Type 48, Type 44, Type 32, 
#> Type 40, Type 30, Type 46, Type 28, Type 29, Type 25, Type 31, Type 27, 
#> Type 18, Type 20, Type 21, Type 24, Type 26, Type 23, Type 17, Type 19, 
#> Type 14, Type 11, Type 12, Type C, Type 15, Type 5, Type 16, Type 13, Type 22, 
#> Type 6, Type 10, Type 1, Type 2, Type 8, Type 9, Type 3, Type 4, Type 7
#> 
#> Seriated incidience matrix in $im_seriated
#> Coefficients in $coef
#> kappa =  2.967933

The output is contained in a lakhesis object, which includes the following objects as described in the summary() above:

  • seriated sequences of the row and column elements according to the consensus seriation of the input strands
  • the seriated incidence matrix
  • coefficients of concentration and agreement (on which see below)

Critical Measures

The lakhesis package employs three methods to critically evaluate a seriated incidence matrix.

Concentration (κ)

In this package, a measure of concentration has been defined, κ, which is a loss measure with a minimum of 1. Concentration is based on the number of non-contiguous values of 1 attested in the seriated matrix across both columns and rows. The intervals of the index from the lowest-ranked 1 and that of the highest ranked 1 is summed for each row and column, and then divided by twice total sum of the matrix. A seriated matrix that has perfect concentration will thus have κ = 1, and should there occur 0s in between values of 1, the measure of κ will increase. This is performed using the function conc_kappa(), whose documentation contains more information, which takes as input an incidence matrix.

Here, we can can create an object that contains the matrix according to the consensus seriation via Lakhesis:

quattrofontanili.L <- L$im_seriated

Then, one can assess the concentration measure κ for each seriated matrix, first according to the original sequence, second according to Procrustes-fit CA, and third according to the Lakhesis method employed for three investigator-selected strands:

# the original seriation
conc_kappa(quattrofontanili)
#> [1] 4.770784
# using procrustes-fit ca
conc_kappa(quattrofontanili.caproc)
#> [1] 3.115202
# using lakhesis
conc_kappa(quattrofontanili.L)
#> [1] 2.967933

Lower concentration measures inidicate more optimal seriations. Here, consensus via Lakhesis has arrived at a more optimal seriation according to the concentration principle than either just running one seriation using Procrustes-fit CA or the original sequence.

Agreement (ρr2ρc2)

The agreement of each strand’s ranking with that of the resulting consensus seriation also serves as a useful diagnostic to determine if any one strand is in severe disagreement and should be removed before re-running Lakhesis. Agreement is defined as the product of Spearman’s rank correlation coefficient squared for the row and column elements between one seriated matrix and another. In the lakhesize() function, agreement is computed for each strand’s seriation with respect to the consensus seriation, using the function spearman_sq(), which removes NA values automatically and takes a numerical input of the rankings.

Deviance Test for Goodness-of-Fit

The use of a quadratic-logistic model for determining seriations has a long history in ecology, but as a measure for optimality it poses a challenge in that an optimally concentrated seriation would elicit NA results for a logistic regression, owing to the perfect separation of 0/1 values in the columns and rows.

While a deviance-based test for goodness-of-fit is often performed for (nested) models upon the same data, here it is used as an exploratory method to evaluate which specific row and column elements might not fit well, and which one may wish to exclude when prior to performing correspondence analysis. The function element_eval() takes a seriated matrix as its input and will return a list of two data.frames sorted from highest to lowest in terms of the resulting p value from a deviance test using the quadratic-logistic model. Any rows or columns which exhibit perfect separation are first assigned a result of NA. Then, the deviance test is run on the remaining rows and columns. Elements which contain higher p values will have a less optimal fit.

qf_eval <- element_eval(quattrofontanili)
head(qf_eval$RowFit)
#>              id     p.val
#> 32  QF II 15-16 0.9138881
#> 21 V la F 29/20 0.7370760
#> 45    QF KKLL16 0.6815412
#> 37      QF HH18 0.6549297
#> 18  V la F 29/5 0.6458685
#> 19    QF 18-19B 0.6458685
head(qf_eval$ColFit)
#>         id      p.val
#> 23 Type 23 0.28534518
#> 18 Type 18 0.26652412
#> 28 Type 28 0.26581357
#> 14 Type 14 0.21695105
#> 41 Type 41 0.11574604
#> 24 Type 24 0.09176663

Plotting Results

Plotting the results of lakhesize() can be accomplished by stipulating different options for the display argument in a plot() function, taking a lakhesis object as input.

  • "im_seriated" will produce a matrix plot of the consensus seriation, also displaying the concentration measure κ. This is the default.
  • "agreement" will produce a bar graph showing the measures of agreement of each strand with the consensus. Higher values indicate better agreement.
  • "concentration" will produce a bar graph of each strand’s concentration measure κ. Lower values indicate more optimal seriations.
plot(L, display = "im_seriated") 

plot(L, display = "agreement") 

plot(L, display = "concentration") 

These plots are incorporated into the Lakhesis Calculator, and are generated automatically.

Lakhesis Calculator

The commands above can all be used in the console. That said the functionality of the lakhesis package is best articulated via the interactive Lakhesis Calculator, written using shiny, which is initialized by the function LC():

LC()

This executes a shinyApp() function that will launch the calculator in the system browser. In the browser window, a sidebar is located on the left, while four panels are visible in the main dashboard:

  • Seriation Explorer
  • Consensus Seriation
  • Criteria
  • Modify

The calculator initializes with an ideally seriated reference matrix 20 × 20 in size, with row elements labeled “R” and column elements labeled “C.” This reference matrix is produced using the function im_ref():

im_ref( matrix(NA, nrow = 20, ncol = 20) )

Uploading Data

Before selecting the dataset to upload, one can choose whether or not the Calculator will automatically remove row and column elements that are attested only once. This is done in the top left sidebar with the radio button under Before Uploading. The option Use All Data will use the full data when generating the incidence matrix within the calculator. The option Remove Hapax will remove all rows and columns which have only one attestation, since these instances do not establish connective relationships with other rows and columns. This may be desired if one has a large matrix.

To upload data, click the Browse… button at the top of the sidebar under the Choose CSV file heading, and select the qf.csv file exported earlier in the vignette.

Identifying and Selecting Seriated Sequences

The plot in the upper left panel, Seriation Explorer, will automatically adjust from the initial default to display the correspondence analysis (CA) plot which has been Procrustes-fit to its reference curve. The Seriation Explorer contains two tabs, each of which display the same results, but in two different plot formats:

  • The CA-Procrustes Plot shows the typical CA plot, whose scores have been moved to fit the reference curve displayed in gray.
  • The Curve Plot shows the projections of those same points onto the reference curve, from one end to another. Values along the x-axis are the indices of points, which are used to derive a ranking of points, from left to right. The distance along the y-axis gives the squared Euclidean distance from the point to its orthogonal projection on the reference curve.

The ca_procrustes_ser() function is run automatically. The Seriation Explorer is an interactive panel, from which the investigator can select points, to re-run and re-fit correspondence analysis. Points are selected using either the CA plot or the curve reference plot with the mouse cursor via the brushedPoints() function in shiny. While the CA plot is useful for most selections, if the investigator wants to only include points which lie nearest to the reference curve, the curve plot facilitates selection of points within a chosen distance to the curve.

When points are selected, their color should automatically change. Note that the selection tool gathers points, and so if a particular shape of a cluster or grouping is to be selected, the brush tool can be used repeatedly to select multiple points. In order to deselect and start over, one double-clicks on the plot.

After selecting points of interest, the investigator then clicks Recompute with Selection. Correspondence analysis and Procrustes fitting will then be performed again on a matrix constructed only of the selected row and column elements, omitting any rows and columns which contain all zeros. The plots will automatically refresh. When one believes they have found a well seriated sequence, one clicks on the Save Seriation as Strand button. This will log the observed seriation as a strand inside a strands object (no changes occur to the plots), using the strand_add() function.

Creating a Consensus Seriation

After one has selected at least two strands, a consensus seriation can be produced. To do so, click on the Lakhesize Strands button in the sidebar, which will perform function lakhesize() on the selected strands. Plots will automatically appear in Consensus Seriation and Criteria panels. In the Consensus Seriation panel is shown a matrix plot that gives the size and concentration coefficient of the matrix according to the consensus (not all row and column elements in the input data must be included). In the bottom left Criteria panel, the coefficients of agreement and concentration of each strand are displayed on separate tabs.

Modifying Data

Deleting Strands

In the process of performing exploratory seriations, it might become clear that one or more strands are poorly seriated, and hence eliminating them might improve the resulting seriated matrix.

On the Modify panel to the bottom right, one can use the input box labelled Delete Strand (Indices Will Resort Automatically, Undo Will Re-Add as Last Strand), to type in the number of strand to be deleted. Pressing the Delete button will remove that strand and automatically readjust the the consensus seriation. The numbering of the strands automatically adjusts. One can add the deleted strand back in, by pressing the Undo button. The deleted strand is returned to the end of the strand list, and again the consensus seriation is adjusted automatically.

Suppressing Row/Column Elements

There likewise might be a consistent row or column element which appears to be poorly seriated, which the investigator would like to suppress from selection. Following the execution of Lakhesize Strands, one can press the Run Deviance Test button on the sidebar. The tabs marked “Deviance (Rows)” and “Deviance (Columns)” are now populated with a table of the top ten row and column elements which have the highest p~values per the hypothesis test. In the Modify panel, any elements can be entered into the input boxes marked Select Rows/Columns to Suppress, either by selecting them from a list or by typing. The Seriation Explorer will automatically remove the row or column elements from the plot. Suppressing row or column elements does not affect previously selected strands. To restore elements, simply delete them from the suppression list in the Modify panel.

Exporting Results

Exporting results can be done at any time following the command of Lakhesize Strands, either in the course of analysis or at the end. The Export Data button will download a list object in rds format which contains the following objects:

  • $consensus An object of the lakhesis class that contains the information on the consensus seriation (see above on the lakhesize() function)
  • $strands An object the strands class, the inputs used to produce the lakhesis class object

The default filename is automatically timestamped. The rds file can then be read back into R for further analysis.