In ecology we are often interested in how similar two locations are with respect to their species composition. In other words, we often want to know how similar the species lists are at the two sites. Do the two sites have all the same species, completely different sets of species, or something in between? Patterns related to the similarity of species composition can provide information on the operation of important ecological processes, like competition and dispersal, and can also allow us to estimate the total number of species occurring in a large area based on a small number of samples.

There are a number of different ways of calculating the similarity of species composition among sites, but one of the classic and still most popular approaches is the Jaccard index. The Jaccard index is calculated simply as the number of species that are shared by the two sites divided by the total number of distinct species found across the two sites combined. To be precise,

`J = C / (S(A) + S(B) - C)`

where `J` is the Jaccard index, `C` is the number of species shared by the two sites, and `S(A)` and `S(B)` are the numbers of species at Site A and Site B, respectively. Another way of saying this (and one that is quite useful for this assignment) is that the Jaccard index is equal to the size of the intersection of the two species lists divided by the size of the union of the two species lists.

Determine the Jaccard similarity for all pairs of sites in a tallgrass prairie from Oklahoma published by McGlinn et al. 2010. The data you will need is in the species presence table. This dataset includes information on the scales at which species were present and their location within each plot (the Corner and Scale columns), but for the purposes of this analysis we are only interested in whether or not the species occurs within the plot.

Use sets as described below to answer this question. Calculate the similarities within each year. Save the results to a csv file where the first column is the year, the second column is the plot id for one of the two plots, the third column is the plot id for the other of the two plots, and the fourth column is the Jaccard similarity.

#### Using sets

There is only one requirement for how you go about answering this problem, and that is that you use sets to do so. Specifically, your solution must include a function that calculates the Jaccard similarity between a pair of sites when passed two sets as arguments. Each set is a species list for one site. This is one of the easiest to implement, most readable, and most computationally efficient ways to solve this problem.
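As a starting point, here is a minimal sketch of what such a function might look like. The function name `jaccard`, the species labels, and the choice to return `nan` for two empty species lists are all assumptions made for illustration; they are not part of the assignment or the dataset.

```python
def jaccard(species_a, species_b):
    """Return the Jaccard similarity of two sets of species names."""
    union = species_a | species_b          # every species seen at either site
    if not union:
        # Assumption: the index is undefined when both lists are empty
        return float("nan")
    shared = species_a & species_b         # species present at both sites
    return len(shared) / len(union)


# Made-up species lists for two hypothetical sites
site_a = {"sp_a", "sp_b", "sp_c"}
site_b = {"sp_a", "sp_c", "sp_d", "sp_e"}

similarity = jaccard(site_a, site_b)       # 2 shared / 5 total
```

Because sets discard duplicates, a species recorded several times in one plot (for example at different corners or scales) is still counted only once.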

#### Problem decomposition

When tackling a broad problem like this it is always important to think about how you are going to decompose the problem into manageable pieces. Take a few minutes to think about how you would approach this problem before following the steps outlined below. Sketch them out in a text file, or by writing out just the final commands you will use (i.e., none of the details, just the major function names and calls to the functions) in your program, or on a piece of paper (for those of you who are old school like me).
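One possible decomposition, offered only as a sketch, is: group the records into one species set per plot per year, compare every pair of plots within a year, and write the rows out. The function names and the `(year, plot, species)` record layout below are hypothetical, and the records are made up; the real table will need to be read and reshaped first.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def species_sets(records):
    """Group (year, plot, species) records into {year: {plot: set of species}}."""
    sets_by_year = defaultdict(lambda: defaultdict(set))
    for year, plot, species in records:
        sets_by_year[year][plot].add(species)
    return sets_by_year


def similarity_rows(sets_by_year):
    """Build [year, plot1, plot2, similarity] rows for every pair within a year."""
    rows = []
    for year in sorted(sets_by_year):
        plots = sets_by_year[year]
        for plot1, plot2 in combinations(sorted(plots), 2):
            rows.append([year, plot1, plot2, jaccard(plots[plot1], plots[plot2])])
    return rows


def write_rows(rows, handle):
    """Write the result rows to any open text handle as CSV."""
    csv.writer(handle).writerows(rows)


# Made-up records; the repeated entry mimics a species seen at two scales
records = [
    (2008, 1, "sp_a"), (2008, 1, "sp_b"), (2008, 1, "sp_a"),
    (2008, 2, "sp_b"), (2008, 2, "sp_c"),
    (2009, 1, "sp_a"),
    (2009, 2, "sp_a"),
]

rows = similarity_rows(species_sets(records))
buffer = io.StringIO()          # stands in for an open output file
write_rows(rows, buffer)
```

Using `itertools.combinations` yields each unordered pair of plots once, so a pair is not repeated in reverse order and no plot is compared with itself.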

Some measures of similarity, like the Jaccard index in Sets 1, are based only on the presence or absence of species. Another major class of similarity measures also includes information on the relative abundance of each species in the community. One of these measures is the Euclidean distance (strictly, the Euclidean distance measures difference rather than similarity, so the similarity measure is 1 - Euclidean distance), which is calculated as

`sqrt(sum((N1_i - N2_i)^2))`

where `N1_i` is the relative abundance of species `i` at site 1 and `N2_i` is the relative abundance of species `i` at site 2 (including zeros). Relative abundance is the number of individuals of a species divided by the total abundance of all species at the site.

#### Data

Use the data from McGlinn et al. 2010. We need some information on the relative prevalence of the different species at the different sites, so this time download the Cover table. Use the `cover` column as the measure of N (we often work with cover instead of number of individuals in plant communities). For this analysis we decide that, instead of keeping the years separate, we want to combine the data from all of the years to get a longer time-scale picture.

#### Using dictionaries

Write a function that calculates the Euclidean distance for two sites when passed two dictionaries as arguments. Each dictionary should hold the information on the species identity of all species occurring at a site and the associated total cover of that species.
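A minimal sketch of such a function is shown here, assuming each dictionary maps a species name to its total cover and that every site has a nonzero total. The names `relative_abundance` and `euclidean_distance` and the cover values are hypothetical.

```python
from math import sqrt


def relative_abundance(cover_by_species):
    """Convert {species: total cover} into {species: share of total cover}."""
    total = sum(cover_by_species.values())   # assumed to be greater than zero
    return {sp: cover / total for sp, cover in cover_by_species.items()}


def euclidean_distance(site1, site2):
    """Euclidean distance between two sites based on relative abundance."""
    rel1 = relative_abundance(site1)
    rel2 = relative_abundance(site2)
    all_species = set(rel1) | set(rel2)
    # .get(sp, 0.0) supplies the zero for a species missing from one site
    return sqrt(sum((rel1.get(sp, 0.0) - rel2.get(sp, 0.0)) ** 2
                    for sp in all_species))


# Made-up cover totals for two hypothetical sites
site1 = {"sp_a": 30.0, "sp_b": 10.0}
site2 = {"sp_a": 10.0, "sp_c": 10.0}

distance = euclidean_distance(site1, site2)
similarity = 1 - distance
```

Iterating over the union of the two key sets is what handles the "including zeros" part of the definition: a species present at only one site still contributes to the sum.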

Write a series of commands that takes the imported data and creates one dictionary for each site holding the species names and their associated abundances, where the abundance of a species is the sum of all of its values in the cover column at that site. Then pass all possible pairs of dictionaries to your function for calculating the Euclidean distance between each pair of sites. Save the results to a csv file where the first column is the plot id for one of the two plots, the second column is the plot id for the other of the two plots, and the third column is 1 minus the Euclidean distance (our abundance-based measure of similarity).
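The dictionary-building and pairing steps might be sketched as follows. The `(year, plot, species, cover)` record layout, the function names, and the values are assumptions for illustration; a compact distance function is included only so the example runs on its own.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def total_cover_by_plot(records):
    """Sum cover per species per plot, pooling all years together."""
    totals = defaultdict(lambda: defaultdict(float))
    for _year, plot, species, cover in records:
        totals[plot][species] += cover
    return totals


def euclidean_distance(site1, site2):
    """Distance on relative abundance; assumes each site has nonzero total cover."""
    total1, total2 = sum(site1.values()), sum(site2.values())
    species = set(site1) | set(site2)
    return sqrt(sum((site1.get(sp, 0.0) / total1 - site2.get(sp, 0.0) / total2) ** 2
                    for sp in species))


def similarity_rows(cover_by_plot):
    """Build [plot1, plot2, 1 - distance] rows for every pair of plots."""
    return [
        [p1, p2, 1 - euclidean_distance(cover_by_plot[p1], cover_by_plot[p2])]
        for p1, p2 in combinations(sorted(cover_by_plot), 2)
    ]


# Made-up records spanning two years
records = [
    (2008, 1, "sp_a", 20.0), (2009, 1, "sp_a", 10.0), (2008, 1, "sp_b", 10.0),
    (2008, 2, "sp_a", 10.0), (2009, 2, "sp_c", 10.0),
]

cover_by_plot = total_cover_by_plot(records)
rows = similarity_rows(cover_by_plot)
```

The resulting rows can be written out with `csv.writer(...).writerows(rows)`. Ignoring the year while accumulating is what combines the data from all years into one dictionary per plot.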