This tutorial assumes that you have collected data following one of the survey procedures documented in the MAISRC video, “Monitoring Zebra Mussels in Lakes”. Here, we describe how to analyze data collected using dependent double-observer surveys with and without distance sampling. These designs are sometimes referred to as “removal designs” (rather than dependent double-observer designs). We demonstrate these techniques using data collected on Lake Burgan, Minnesota, in the summer of 2018. Both survey designs allow you to estimate the probability of detecting mussels and thus obtain density estimates adjusted for imperfect detection. Here, we show how to use design-based estimators of density. We have an R package to accompany the tutorial; it includes all the functions and data needed to replicate our analyses. You can download the R package here and install it from source using the following code:
install.packages(path_to_file, repos = NULL, type="source") #path_to_file is a character string that points to
#the downloaded file, e.g., install.packages("ZebraTutorial_0.1.0.tar.gz", repos = NULL, type="source")
#if the file is in your working directory
If you use RStudio, you can also install the downloaded file by clicking the Packages tab -> Install -> Install from: Package Archive File (.tar.gz). Alternatively, you can install from GitHub using the following code:
library(devtools) #devtools takes a while to install, so install from source if you don't want to bother
install_github("troutinthemilk/Zeebs-Season2/Tutorial/ZebraTutorial", build_vignettes = TRUE)
The remainder of the tutorial assumes that you have been able to install this package. All the data and functions you will need to complete the tutorial are in this package. To access the data and functions in this package type:
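A minimal load call (the package name, `ZebraTutorial`, matches the archive file installed above):

```r
library(ZebraTutorial) #loads the tutorial's data and helper functions
```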
Estimating Density and Quantifying Uncertainty
Our goal is to estimate zebra mussel density in a lake and to quantify uncertainty associated with that estimate. To do this, we need the counts in each transect, denoted as n_i for the i-th transect, the transect length, l_i, and the width, w, to determine the surveyed area, a = w ∑ l_i. We estimate the density as:

D̂ = ∑ n_i / (a p̂),

where p̂ denotes the estimated probability of detecting a zebra mussel by at least one observer.
The degree of uncertainty in our density estimate will depend on the variance of the counts among the different transects, Var(n), and the variance in the estimated detection probability, Var(p̂).
Our estimate of p̂ will depend on the survey design.
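To make the estimator concrete, here is a small base-R sketch with made-up numbers (not the Lake Burgan data):

```r
#toy illustration of the design-based density estimator (made-up numbers)
n <- c(4, 7, 2, 5)            #counts on each transect
l <- c(13, 19, 27, 25)        #transect lengths (m)
w <- 1                        #transect width (m)
p.hat <- 0.8                  #assumed detection probability
a <- w * sum(l)               #surveyed area (m^2)
D.hat <- sum(n) / (a * p.hat) #density adjusted for imperfect detection
D.hat                         #about 0.27 mussels per square meter
```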
Dependent Double-Observer (no distance data)
To conduct the dependent double-observer survey, two divers survey 30-meter x 1-meter belts delineated using two lead lines laid in parallel. The primary diver swims first, marking all the mussels she/he sees within the belt. The secondary diver follows, looking for mussels that the first observer missed. We rotated the roles of primary and secondary diver at each transect.
The Data
Data should be entered in a manner that ensures reliability. For the datasets used here, we entered data using Google Forms, which can be accessed here. This software allowed us to check that no impossible values were entered (for example, transect lengths had to be numbers between 1 and 30 because of our design). We created a datasheet of detection events and a datasheet that described each transect. The variable “Transect number” is a unique identifier for each transect that links these sheets. Below we print the first few lines from the dataframe of detection events:
data(DoubleEncounter)
head(DoubleEncounter)
#> # A tibble: 6 x 3
#> `Observer name` `Transect number` size
#> <chr> <dbl> <dbl>
#> 1 Aislyn 2 1
#> 2 Aislyn 4 2
#> 3 Aislyn 3 0
#> 4 Aislyn 6 2
#> 5 Aislyn 6 1
#> 6 Aislyn 8 1
Each row contains data from a unique detection event, with the name of the observer (`Observer name`), the transect number (`Transect number`), and the number of zebra mussels in the cluster (`size`).
We then read in the transect data containing the primary and secondary observers, the transect length, and transect number. We create new columns in the transect data that denote the observers’ name and whether that observer was primary or secondary.
data(DoubleTransect)
DoubleTransect <- DoubleTransect %>% gather(key = observer, value = name, "primary", "secondary")
head(DoubleTransect)
#> # A tibble: 6 x 4
#> length `Transect number` observer name
#> <dbl> <dbl> <chr> <chr>
#> 1 13 1 primary Austen
#> 2 19 2 primary Aislyn
#> 3 27 3 primary Austen
#> 4 25 4 primary Aislyn
#> 5 24 5 primary Austen
#> 6 19 6 primary Aislyn
Now, we need to format the data for the analysis. We will be using the `removal` function in the R fisheries stock analysis (`FSA`) package to calculate the density estimates. This function estimates the detection probability and the variance in the counts for us; we can combine these to determine the variance in density. We have written the `create.removal` function to properly format the data and have included it in the file `HelperFuncs.R`, which we source here (making the function available to us in the current R session). Tutorials on the estimator used here are available at the FishR site.
The Analysis
Here we apply the `removal` function to the combined transect counts. This function calculates the conditional variance of the counts and the detection probability; we then combine these to calculate the variance in density.
#get the length of each transect
length.vec <- DoubleTransect %>% group_by(`Transect number`) %>% summarize(length = first(length))

#format the counts for FSA's removal function using the helper function
#described above (argument names here follow create.removal.Observer, used
#later in the tutorial, and may differ in the package version)
removal.list <- create.removal(transect.dat=DoubleTransect, obs.dat=DoubleEncounter)

#fit the removal model to the counts summed over all transects
remove.ests <- removal(removal.list, just.ests=TRUE)
Nhat      <- remove.ests[1] #estimated population size in the surveyed area
Nhat.SE   <- remove.ests[2] #and its standard error
bp.hat    <- remove.ests[5] #estimated detection probability
bpSE.phat <- remove.ests[6] #and its standard error

#combine variation in detection and variation in the conditional counts
#to get the density estimate and its standard error
Dhat <- Nhat/sum(length.vec$length) #transects are 1 m wide, so area = total length
Dhat.SE <- Dhat*sqrt((Nhat.SE^2/sum(removal.list)^2) + bpSE.phat/bp.hat^2)
We get a density estimate of 0.22 mussels per square meter with a standard error of 0.07, and a detection probability estimate of 0.81 with a standard error of 0.07.
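The estimator inside `removal` generalizes the classic two-pass removal calculation (Seber's two-pass estimator), which can be sketched in base R with made-up counts. This is an illustrative sketch, not necessarily the exact method `FSA` applies by default:

```r
#two-pass removal calculation (Seber's two-pass estimator) with made-up counts;
#FSA::removal generalizes this and also provides standard errors
c1 <- 40 #mussels detected by the primary observer
c2 <- 10 #additional mussels detected by the secondary observer
p.hat <- (c1 - c2) / c1   #estimated detection probability per observer
N.hat <- c1^2 / (c1 - c2) #estimated number present in the surveyed area
c(p = p.hat, N = N.hat)   #p = 0.75, N about 53.3
```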
Dependent Double-Observer Survey with Distance Data
The dependent double-observer survey with distance data is similar to the dependent double-observer survey described above. In this case, however, both divers swim along a single transect line (lead line), and whenever a diver detects a mussel or cluster of mussels, the diver also measures the perpendicular distance from the detected mussel(s) to the transect line. The secondary diver again follows the primary diver, looking for mussels that were missed by the primary diver. These distance-to-detection measurements are then used to model how detection probability declines with distance from the transect line, and this distance model is then used to correct the counts for imperfect detection. The distancesampling.org website provides tutorials on using the R package described here, as well as several other packages that can be used to do distance sampling with either single- or double-observer designs.
The Data
Here we read in the encounter data containing the observer (`Observer name`), the transect number (`Transect #`), the perpendicular distance of each mussel or cluster of mussels from the transect line (`distance`), and the cluster size (`size`):
data("DistanceEncounter")
head(DistanceEncounter)
#> # A tibble: 6 x 4
#> `Observer name` `Transect #` distance size
#> <chr> <dbl> <dbl> <dbl>
#> 1 Aislyn 1 0.64 1
#> 2 Aislyn 1 0.38 1
#> 3 Aislyn 2 0.3 1
#> 4 Aislyn 2 0.38 1
#> 5 Aislyn 2 0.4 1
#> 6 Aislyn 2 0.19 1
And now, we read in the transect data containing primary and secondary observers, the transect length, and transect number:
data("DistanceTransect")
head(DistanceTransect)
#> # A tibble: 6 x 4
#> `Primary observer` `Secondary observer` `Transect length`
#> <chr> <chr> <dbl>
#> 1 Austen Aislyn 11
#> 2 Aislyn Austen 17
#> 3 Austen Aislyn 29
#> 4 Aislyn Austen 25
#> 5 Austen Aislyn 26
#> 6 Aislyn Austen 18
#> `Transect number`
#> <dbl>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
# order by transect number
DistanceTransect <- DistanceTransect[order(DistanceTransect$"Transect number"),]
It is useful to construct some initial diagnostic plots to ensure data were entered correctly and assumptions of distance sampling are met (e.g., detection declines monotonically with distance from the transect line).
hist(DistanceEncounter$distance, col="black", border="white",
xlab='Distance of detection from transect (m)', main="")
Here we see a pattern consistent with our expectation: the number of detections declines with distance from the transect line.
The Analysis
We first prepare the data for use with the R package `mrds`, making use of a helper function, `create.removal.Observer`.
library(mrds) #load the mrds package
#> This is mrds 2.2.0
#> Built: R 3.5.2; ; 2019-04-25 23:38:01 UTC; unix
# Order the transects
DistanceEncounter <- DistanceEncounter[order(DistanceEncounter$`Transect #`),]
# Format distance data for use with mrds
DistanceEncounter <- DistanceEncounter %>%
mutate(detected = rep(1, dim(DistanceEncounter)[1]),
object=1:dim(DistanceEncounter)[1])
distance.rem.dat <- create.removal.Observer(transect.dat=DistanceTransect, obs.dat=DistanceEncounter)
Based on the relatively slow decline in detection with distance (Figure 1), we chose the hazard-rate model (`dsmodel=~cds(key="hr")`) to capture the drop-off of detection with distance. Alternative detection (key) functions in the `mrds` package are the half-normal (`hn`), the gamma (`gamma`), and the uniform (`unif`).
distance.model <- ddf(method="rem", dsmodel=~cds(key="hr"),
mrmodel=~glm(formula=~1),
data=distance.rem.dat,
meta.data=list(width=1))
print(summary(distance.model))
#>
#> Summary for rem.fi object
#> Number of observations : 79
#> Number seen by primary : 56
#> Number additional seen by secondary : 23
#> AIC : 99.13481
#>
#>
#> Conditional detection function parameters:
#> estimate se
#> (Intercept) 0.8898575 0.420271
#>
#> Estimate SE CV
#> Average primary p(0) 0.9152379 0.05050355 0.05518079
#>
#>
#>
#> Summary for ds object
#> Number of observations : 79
#> Distance range : 0 - 1
#> AIC : -27.39173
#>
#> Detection function:
#> Hazard-rate key function
#>
#> Detection function parameters
#> Scale coefficient(s):
#> estimate se
#> (Intercept) -0.7774321 0.1764449
#>
#> Shape coefficient(s):
#> estimate se
#> (Intercept) 1.037199 0.3349329
#>
#> Estimate SE CV
#> Average p 0.580408 0.0645321 0.111184
#>
#>
#> Summary for rem object
#>
#> Total AIC value = 71.74308
#> Estimate SE CV
#> Average p 0.5312114 0.06593617 0.1241242
#> N in covered region 148.7166908 21.72528452 0.1460850
The average probability of detecting a mussel by at least one observer is 0.53 with a standard error of 0.07 (the “Average p” line in the `rem` object summary above). Below, we overlay the estimated detection model on the histogram of detection distances recorded by the two observers.
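If you want to check the hazard-rate choice against the half-normal alternative mentioned above, you can refit the model and compare the AIC values reported in the summaries (a sketch that reuses the objects defined above; we have not run this comparison here):

```r
#sketch: refit with the half-normal key function and compare AICs
#(reuses distance.rem.dat from above; a smaller AIC indicates more support)
hn.model <- ddf(method="rem", dsmodel=~cds(key="hn"),
                mrmodel=~glm(formula=~1),
                data=distance.rem.dat,
                meta.data=list(width=1))
print(summary(hn.model)) #compare the AIC printed here with the hazard-rate fit above
```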
The variance in the total counts can be estimated using the `dht` function in the `mrds` package. We need to create two tables that link the survey effort and surveyed area to the actual count data. Below we create these tables, then run the `dht` function to get the density estimate and its standard error.
w <- 2 #width of the covered strip: the 1-m truncation distance on each side of the transect line
area <- data.frame(Region.Label=1, Area=sum(DistanceTransect$`Transect length`)*w)
counts <- table(DistanceEncounter$`Transect #`) #count in each transect
samples <- data.frame(Region.Label=rep(1, length(counts)), Sample.Label=1:length(counts), Effort=DistanceTransect$`Transect length`)
density.est <- dht(distance.model, region.table=area, sample.table=samples)
print(density.est)
#>
#> Summary for clusters
#>
#> Summary statistics:
#> Region Area CoveredArea Effort n k ER se.ER cv.ER
#> 1 1 602 602 301 79 15 0.2624585 0.05123375 0.1952071
#>
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 Total 148.7167 34.40232 0.2313279 93.07522 237.6213 26.80237
#>
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 Total 0.2470377 0.05714671 0.2313279 0.15461 0.3947198 26.80237
#>
#> Summary for individuals
#>
#> Summary statistics:
#> Region Area CoveredArea Effort n ER se.ER cv.ER
#> 1 1 602 602 301 93 0.3089701 0.07225861 0.2338693
#> mean.size se.mean
#> 1 1.177215 0.05332043
#>
#> Abundance:
#> Label Estimate se cv lcl ucl df
#> 1 Total 175.0715 46.35319 0.2647671 102.1349 300.0937 22.66678
#>
#> Density:
#> Label Estimate se cv lcl ucl df
#> 1 Total 0.2908165 0.07699866 0.2647671 0.1696593 0.4984946 22.66678
#>
#> Expected cluster size
#> Region Expected.S se.Expected.S cv.Expected.S
#> 1 Total 1.177215 0.06999613 0.05945908
#> 2 Total 1.177215 0.06999613 0.05945908
The output from `dht` provides two density estimates: the first is for the density of clusters, and the second is for the total density of individuals. We want the density of individuals; reading off the output table, we get an estimate of 0.29 mussels per square meter with a standard error of 0.08.
Resources
Book: Buckland et al. (2015) Distance sampling: methods and applications.
Tutorial: distancesampling.org. Tutorials on distance sampling in R.
Tutorial: FishR. Tutorial on depletion sampling in R.
Files used in this tutorial
Data and functions that are available in the ZebraTutorial R package.
Double-observer survey detections
Data sheets and data entry for conducting a new survey
Transect data sheet: used to record information associated with the surveyed transects.
Habitat data sheet: used to record habitat data along the transect. These data can be used to model variation in mussel density.
Dependent double-observer belt survey: used to record observations of zebra mussels in dependent double-observer belt surveys.
Dependent double-observer distance survey: used to record observations of zebra mussels in dependent double-observer distance surveys.