This vignette provides a guide to
the R package lakhesis
. Please refer to the paper,
“Lakhesis: Consensus Seriation via Iterative Regression of Partial
Rankings for Binary Data,” for details.
Seriation can be accomplished using a variety of methods. The main
goals of the lakhesis
package, which focus on binary
matrices (i.e., presence/absence, 0/1), are to provide:
While there are already a number of approaches and tools to perform
seriations, lakhesis
uses a technique of fitting CA scores
to those of a curve produced by an ideal, well seriated “reference
matrix” via a Procrustes method (center, scale, and rotate).
Since well seriated matrices are not a given for every data set, it is expedient for users to select their own seriations quickly via an interactive graphical interface, rather than slicing matrices and re-running commands via the console. Investigators can select multiple seriations, moreover, which can then be harmonized into a single consensus seriation.
The lakhesis
package, named for the ancient Greek
goddess who measured the thread of fate, provides this functionality, to
explore and select well seriated sequences as “strands” that are then
“threaded” together into a single consensus seriation. The package can
either be installed from CRAN, or from GitHub, the
latter of which hosts the most recent development version. To install
from GitHub, the library devtools
is needed:
library("devtools")
install_github("scollinselliott/lakhesis", dependencies = TRUE, build_vignettes = TRUE)
This vignette showcases the primary functions of the package along
with the Lakhesis Calculator, a shiny
app whose interface
offers the means to perform and critically examine seriated
sequences.
The Calculator is designed to import csv
files directly
from the user’s file directory. The csv
file must be
formatted in a “long” format where row and column incidences are listed
in rows, pair by pair. To convert data that is already in the form of a
matrix, the im_long()
function can used. The
"quattrofontanili"
data included in the package can serve
as an example. This data set comprises a matrix of 81 rows and 82
columns, recording the incidence of a particular artifact-type (column)
in a tomb (row) from Early Iron Age necropoleis in Southern Etruria (see
?quattrofontanili
).
The data.frame
qf_long
can then be exported
to a csv
file using the write.table()
function, in order to avoid writing row and column headers:
The file qf.csv
is then ready to be uploaded into the
Lakhesis Calculator.
If one has multiple data sets to pool together, it is simply a matter
of adding supplementary rows to the csv
file. The
im_merge()
function assists with creating a single
incidence matrix from two different incidence matrices, overriding any 0
values with 1s in the case of shared row and/or column elements. The
merged matrix can then be converted using im_long()
and
exported, as above.
The Lakhesis Calculator uses correspondence analysis (CA), a popular
method used for seriation especially in the form of detrended
correspondence analysis (DCA). Unlike DCA, lakhesis
seriates observations according to a “reference curve,” since the plot
of principal scores of well seriated rows and columns will have the form
of an arch or horseshoe. Principal scores of the data are fit to the
curve of the reference matrix using a Procrustes method (center, scale,
rotate), with axes are properly labelled “Procrustes1” and “Procrustes2”
since they no longer represent the chi-squared distance between row and
columnt elements, separately. Procrustes fitting is done without
landmark points by minimizing the squared Euclidean distance of scores
to those of the reference matrix.
This task is accomplished with the function
ca_procrustes()
, which takes an incidence matrix as an
input and returns an object procrustean
that contains the
reference and data scores. The reference matrix is generated internal to
ca_procrustes()
using the im_ref()
function.
The plot
function automatically illustrates the fit of the
points to the reference curve:
data("quattrofontanili")
qf_caproc <- ca_procrustes(quattrofontanili)
class(qf_caproc)
#> [1] "procrustean" "list"
print(qf_caproc)
#> Procrustes-fit CA scores of class procrustean list
#> $ref : principal scores: reference points
#> $x : principal scores: row points
#> $y : principal scores: column points
plot(qf_caproc)
In order to determine a seriation for both the row and column points,
scores are projected onto the quadratic model which fits the reference
points. Then, their ranking is taken from one end of the curve to
another. The function ca_procrustes_ser()
takes an
incidence matrix as input, and produces as output a strand
object, which comprises a list
that includes data regarding
the row and column score points, distance, and rankings, as well as the
seriated matrix according those rankings.
qf_ser <- ca_procrustes_ser(quattrofontanili)
class(qf_ser)
#> [1] "strand" "list"
print(qf_ser)
#> Procrustes-fit CA of class: strand list
#> Use summary() for more information.
summary(qf_ser)
#> Procrustes-fit CA of class: strand list
#>
#> Ranking of row elements:
#> QF S7 QF M22 V la F 29/15 QF X15 QF W15 V la F 29/18 QF O16 QF O21 V la F 29/2
#> QF AAMB V la F 33/1 QF M7 QF AAMA QF 18-19B V la F 29/5 QF BBMB QF BB16B
#> QF CC14B QF CC15 V la F 29/20 QF CC17A QF AA12 QF II 15-16 QF EE16-17 QF CC12
#> QF EE14-15 QF DD17 QF BB18A QF CCDD19 QF EE17-18B QF FF19B QF KKLL16 QF HH18
#> QF FF18-19A QF GG16-17 QF MM14 QF II9-10 QF II13-14B QF KKLL18-19 QF LL18
#> QF LL12-13 QF FF14-15 QF KK15 QF HH11-12 QF MM19-20 QF KKLL17 QF JJ12-13
#> QF IIJJ19 QF II18-19 QF GGHH13 QF GGHH19 QF JJ18-19B QF JJ14-15 QF HH7-8
#> QF JJ17-18 QF ZAA17-18 QF CCDD15-16 QF MM10-11 QF Z15 XV XIII QF JJ17 XIX XVI
#> QF JJ8 XII I XVII VIII XXIV XX IV X VII IX VI V XI XVIII XIV XXIII
#>
#> Ranking of column elements:
#> Type 7 Type 4 Type 3 Type 8 Type 9 Type 2 Type 1 Type 6 Type 10 Type 5 Type 16
#> Type 13 Type 15 Type 12 Type 22 Type C Type 11 Type 17 Type 19 Type 23 Type 14
#> Type 26 Type 24 Type 21 Type 18 Type 27 Type 20 Type 31 Type 25 Type 29 Type 28
#> Type 30 Type 32 Type 43 Type 44 Type 48 Type 51 Type 47 Type 54 Type 55 Type 39
#> Type 38 Type 52 Type 36 Type 56 Type 33 Type 37 Type 42 Type 34 Type 45 Type 46
#> Type 40 Type 50 Type 49 Type 35 Type 41 Type 53 Type 57 Type 58 Type 63 Type 61
#> Type 59 Type 62 Type 60 Type 71 Type 65 Type 64 Type 66 Type 72 Type 69 Type 75
#> Type 68 Type 77 Type 78 Type 67 Type 73 Type 76 Type 74 Type 70 Type 80 Type 81
#> Type 79
#>
#> $dat contains the following columns:
#> $Procrustes1, $Procrustes2 : x,y coords for the Procrustes-fit CA principal scores
#> $CurveIndex: index of nearest point on refrence curve to the Procrustes-fit CA score point
#> $Distance: distance of principal score point to nearest point on reference curve
#> $Rank: ranking of score points projected onto the reference curve
#> $Type: "row" or "col"
#> $im_seriated: the seriated incidence matrix
Three different plot types are available for a strand
object, set in the plot()
function via the additional
display
argument:
"ca"
produces a Procrustes-fit CA plot as above, with
axes Procrustes1 and Procrustes2. This is the default
setting."ref"
produces a plot showing the distance of each row
and column point from the reference curve. The indices of the reference
curve are shown on the x axis
while the distance of each point from the curve is shown on the y axis."im_seriated"
produces a matrix plot of the seriated
incidence matrix, showing the measure of concentration κ (on which see below).The resulting incidence matrix may be extracted from the
strand
by calling $im_seriated
:
In the process of exploring matrices for seriated sequences, an investigator is liable to come across many sequences which appear to be well seriated. The question then arises as to how to harmonize or reconcile several partial, seriated sequences, some or many of which may disagree. This problem is non-trivial, analogous to other situations such as e.g. ranked-choice or preference voting, although not strictly identical.
The strategy adopted here, called a “Lakhesis” method, is to iterate simple linear regressions in an agglomerative fashion. The procedure consists of the followings steps:
NA
value in either
strand.NA
values in the
dependent variate).Linear regression has its virtue in that partial rankings which may be identical in their relative sequence but which may run in reverse will be regressed into the same order, merely with a negative slope.
Given that the order in which strands are regressed has a bearing on the resulting consensus seriation, the use of an optimality measure to determine that order is helpful. Especially if the number of strands were to exceed 10, all possible permutations of the order of strand regression could not be evaluated in a reasonable amount of time. Similarly, PCA (which was initially explored as a method of harmonizing rankings) would only work as a method of ranking imputation if there were at least four elements jointly shared across all strands, a restriction which was undesirable given a potential need to perform consensus seration on a large set of data in piecemeal fashion.
lakhesize()
FunctionComputing a consensus seriation for a set of strands via the steps
above is performed by the lakhesize()
function, which takes
as its input a strands
object (i.e., a list
of
individual strand
s produced by
ca_procrustes_ser()
).
The lakhesize()
function has the following options:
pbar
Turns the progress bar on or off. Default is
TRUE
.A predefined qfStrands
data object is included in the
package to illustrate functionality of the lakhesize()
function, comprised of three strands from the
quattrofontanili
data:
data("qfStrands")
print(qfStrands)
#> List of 3 strands
L <- lakhesize(qfStrands, pbar = FALSE)
summary(L)
#> Lakhesis analysis of class: lakhesis list
#>
#> Ranking of row elements:
#> XXIII, XIV, XVIII, XI, V, VI, IX, VII, X, IV, XX, VIII, XXIV, XVII, I, XII,
#> QF JJ8, XVI, XIX, XIII, XV, QF CCDD15-16, QF ZAA17-18, QF HH7-8, QF MM19-20,
#> QF KK15, QF KKLL17, QF JJ18-19B, QF MM14, QF JJ14-15, QF II13-14B, QF LL12-13,
#> QF MM10-11, QF IIJJ19, QF GG16-17, QF KKLL18-19, QF II18-19, QF HH11-12,
#> QF GGHH13, QF II9-10, QF FF18-19A, QF LL18, QF JJ12-13, QF FF14-15, QF GGHH19,
#> QF JJ17-18, QF HH18, QF KKLL16, QF Z15, QF FF19B, QF JJ17, QF EE17-18B,
#> QF CCDD19, QF EE14-15, QF DD17, QF BB18A, QF EE16-17, QF CC12, QF AA12,
#> QF II 15-16, QF CC17A, V la F 29/20, QF CC15, QF CC14B, QF BB16B, QF BBMB,
#> V la F 33/1, QF 18-19B, V la F 29/5, QF M7, V la F 29/2, QF AAMA, QF AAMB,
#> QF X15, QF O16, QF O21, V la F 29/15, QF M22, QF W15, V la F 29/18, QF S7
#>
#> Ranking of column elements:
#> Type 79, Type 80, Type 81, Type 70, Type 74, Type 76, Type 73, Type 67,
#> Type 77, Type 78, Type 68, Type 75, Type 69, Type 72, Type 66, Type 64,
#> Type 65, Type 60, Type 71, Type 62, Type 59, Type 61, Type 63, Type 58,
#> Type 57, Type 53, Type 41, Type 49, Type 50, Type 55, Type 56, Type 37,
#> Type 52, Type 54, Type 42, Type 34, Type 47, Type 35, Type 45, Type 43,
#> Type 51, Type 33, Type 36, Type 39, Type 38, Type 48, Type 44, Type 32,
#> Type 40, Type 30, Type 46, Type 28, Type 29, Type 25, Type 31, Type 27,
#> Type 18, Type 20, Type 21, Type 24, Type 26, Type 23, Type 17, Type 19,
#> Type 14, Type 11, Type 12, Type C, Type 15, Type 5, Type 16, Type 13, Type 22,
#> Type 6, Type 10, Type 1, Type 2, Type 8, Type 9, Type 3, Type 4, Type 7
#>
#> Seriated incidience matrix in $im_seriated
#> Coefficients in $coef
#> kappa = 2.967933
The output is contained in a lakhesis
object, which
includes the following objects as described in the
summary()
above:
The lakhesis
package employs three methods to critically
evaluate a seriated incidence matrix.
In this package, a measure of concentration has been defined, κ, which is a loss measure with a
minimum of 1. Concentration is based on the number of non-contiguous
values of 1 attested in the seriated matrix across both columns and
rows. The intervals of the index from the lowest-ranked 1 and that of
the highest ranked 1 is summed for each row and column, and then divided
by twice total sum of the matrix. A seriated matrix that has perfect
concentration will thus have κ = 1, and should there occur 0s in
between values of 1, the measure of κ will increase. This is performed
using the function conc_kappa()
, whose documentation
contains more information, which takes as input an incidence matrix.
Here, we can can create an object that contains the matrix according to the consensus seriation via Lakhesis:
Then, one can assess the concentration measure κ for each seriated matrix, first according to the original sequence, second according to Procrustes-fit CA, and third according to the Lakhesis method employed for three investigator-selected strands:
# the original seriation
conc_kappa(quattrofontanili)
#> [1] 4.770784
# using procrustes-fit ca
conc_kappa(quattrofontanili.caproc)
#> [1] 3.115202
# using lakhesis
conc_kappa(quattrofontanili.L)
#> [1] 2.967933
Lower concentration measures inidicate more optimal seriations. Here, consensus via Lakhesis has arrived at a more optimal seriation according to the concentration principle than either just running one seriation using Procrustes-fit CA or the original sequence.
The agreement of each strand’s ranking with that of the resulting
consensus seriation also serves as a useful diagnostic to determine if
any one strand is in severe disagreement and should be removed before
re-running Lakhesis. Agreement is defined as the product of Spearman’s
rank correlation coefficient squared for the row and column elements
between one seriated matrix and another. In the lakhesize()
function, agreement is computed for each strand’s seriation with respect
to the consensus seriation, using the function
spearman_sq()
, which removes NA
values
automatically and takes a numerical input of the rankings.
The use of a quadratic-logistic model for determining seriations has
a long history in ecology, but as a measure for optimality it poses a
challenge in that an optimally concentrated seriation would elicit
NA
results for a logistic regression, owing to the perfect
separation of 0/1 values in the columns and rows.
While a deviance-based test for goodness-of-fit is often performed
for (nested) models upon the same data, here it is used as an
exploratory method to evaluate which specific row and column elements
might not fit well, and which one may wish to exclude when prior to
performing correspondence analysis. The function
element_eval()
takes a seriated matrix as its input and
will return a list
of two data.frame
s sorted
from highest to lowest in terms of the resulting p value from a deviance test using
the quadratic-logistic model. Any rows or columns which exhibit perfect
separation are first assigned a result of NA
. Then, the
deviance test is run on the remaining rows and columns. Elements which
contain higher p values will
have a less optimal fit.
qf_eval <- element_eval(quattrofontanili)
head(qf_eval$RowFit)
#> id p.val
#> 32 QF II 15-16 0.9138881
#> 21 V la F 29/20 0.7370760
#> 45 QF KKLL16 0.6815412
#> 37 QF HH18 0.6549297
#> 18 V la F 29/5 0.6458685
#> 19 QF 18-19B 0.6458685
head(qf_eval$ColFit)
#> id p.val
#> 23 Type 23 0.28534518
#> 18 Type 18 0.26652412
#> 28 Type 28 0.26581357
#> 14 Type 14 0.21695105
#> 41 Type 41 0.11574604
#> 24 Type 24 0.09176663
Plotting the results of lakhesize()
can be accomplished
by stipulating different options for the display
argument
in a plot()
function, taking a lakhesis
object
as input.
"im_seriated"
will produce a matrix plot of the
consensus seriation, also displaying the concentration measure κ. This is the default."agreement"
will produce a bar graph showing the
measures of agreement of each strand with the consensus. Higher values
indicate better agreement."concentration"
will produce a bar graph of each
strand’s concentration measure κ. Lower values indicate more
optimal seriations.These plots are incorporated into the Lakhesis Calculator, and are generated automatically.
The commands above can all be used in the console. That said the
functionality of the lakhesis
package is best articulated
via the interactive Lakhesis Calculator, written using
shiny
, which is initialized by the function
LC()
:
This executes a shinyApp()
function that will launch the
calculator in the system browser. In the browser window, a sidebar is
located on the left, while four panels are visible in the main
dashboard:
The calculator initializes with an ideally seriated reference matrix
20 × 20 in size, with row elements
labeled “R” and column elements labeled “C.” This reference matrix is
produced using the function im_ref()
:
Before selecting the dataset to upload, one can choose whether or not the Calculator will automatically remove row and column elements that are attested only once. This is done in the top left sidebar with the radio button under Before Uploading. The option Use All Data will use the full data when generating the incidence matrix within the calculator. The option Remove Hapax will remove all rows and columns which have only one attestation, since these instances do not establish connective relationships with other rows and columns. This may be desired if one has a large matrix.
To upload data, click the Browse… button at the top
of the sidebar under the Choose CSV file heading, and
select the qf.csv
file exported earlier in the
vignette.
The plot in the upper left panel, Seriation Explorer, will automatically adjust from the initial default to display the correspondence analysis (CA) plot which has been Procrustes-fit to its reference curve. The Seriation Explorer contains two tabs, each of which display the same results, but in two different plot formats:
The ca_procrustes_ser()
function is run automatically.
The Seriation Explorer is an interactive panel, from which the
investigator can select points, to re-run and re-fit correspondence
analysis. Points are selected using either the CA plot or the curve
reference plot with the mouse cursor via the
brushedPoints()
function in shiny
. While the
CA plot is useful for most selections, if the investigator wants to only
include points which lie nearest to the reference curve, the curve plot
facilitates selection of points within a chosen distance to the
curve.
When points are selected, their color should automatically change. Note that the selection tool gathers points, and so if a particular shape of a cluster or grouping is to be selected, the brush tool can be used repeatedly to select multiple points. In order to deselect and start over, one double-clicks on the plot.
After selecting points of interest, the investigator then clicks
Recompute with Selection. Correspondence analysis and
Procrustes fitting will then be performed again on a matrix constructed
only of the selected row and column elements, omitting any rows and
columns which contain all zeros. The plots will automatically refresh.
When one believes they have found a well seriated sequence, one clicks
on the Save Seriation as Strand button. This will log
the observed seriation as a strand
inside a
strands
object (no changes occur to the plots), using the
strand_add()
function.
After one has selected at least two strands, a consensus seriation
can be produced. To do so, click on the Lakhesize
Strands button in the sidebar, which will perform function
lakhesize()
on the selected strands. Plots will
automatically appear in Consensus Seriation and Criteria panels. In the
Consensus Seriation panel is shown a matrix plot that gives the size and
concentration coefficient of the matrix according to the consensus (not
all row and column elements in the input data must be included). In the
bottom left Criteria panel, the coefficients of agreement and
concentration of each strand are displayed on separate tabs.
In the process of performing exploratory seriations, it might become clear that one or more strands are poorly seriated, and hence eliminating them might improve the resulting seriated matrix.
On the Modify panel to the bottom right, one can use the input box labelled Delete Strand (Indices Will Resort Automatically, Undo Will Re-Add as Last Strand), to type in the number of strand to be deleted. Pressing the Delete button will remove that strand and automatically readjust the the consensus seriation. The numbering of the strands automatically adjusts. One can add the deleted strand back in, by pressing the Undo button. The deleted strand is returned to the end of the strand list, and again the consensus seriation is adjusted automatically.
There likewise might be a consistent row or column element which appears to be poorly seriated, which the investigator would like to suppress from selection. Following the execution of Lakhesize Strands, one can press the Run Deviance Test button on the sidebar. The tabs marked “Deviance (Rows)” and “Deviance (Columns)” are now populated with a table of the top ten row and column elements which have the highest p~values per the hypothesis test. In the Modify panel, any elements can be entered into the input boxes marked Select Rows/Columns to Suppress, either by selecting them from a list or by typing. The Seriation Explorer will automatically remove the row or column elements from the plot. Suppressing row or column elements does not affect previously selected strands. To restore elements, simply delete them from the suppression list in the Modify panel.
Exporting results can be done at any time following the command of
Lakhesize Strands, either in the course of analysis or
at the end. The Export Data button will download a
list
object in rds
format which contains the
following objects:
$consensus
An object of the lakhesis
class
that contains the information on the consensus seriation (see above on
the lakhesize()
function)$strands
An object the strands
class, the
inputs used to produce the lakhesis
class objectThe default filename is automatically timestamped. The
rds
file can then be read back into R for further
analysis.