GeoAssignGui

GeoAssign

This web-based program assigns test individuals, according to their genotypes or metric traits, to groups (e.g. country of origin, population of origin) that are defined by reference individuals. The user therefore needs to upload one data file for the test individuals (“Test file”) and one data file for the reference individuals (“Reference file”). The assignment is done using a nearest neighbour approach (Degen et al. 2017, doi: http://dx.doi.org/10.1016/j.fsigen.2016.12.011).

Structure of the user interface

The user interface is subdivided into three parts:

Upload area for input files
Running the GeoAssign server requires two files. The first file (in the following called "test file") contains the data of the samples whose origin has to be verified or assigned. The second file (in the following called "reference file") contains the reference samples, which are already grouped into populations. You can upload your test and reference files by clicking the corresponding buttons in the upload area. The file format is described in the section "Input data files". As a third step, the input data type has to be selected: you can choose between genetic data and metric data such as isotopic ratio data.
Area for calculation details
Here, you can adjust the calculation details.
Percentile of most similar individuals (0.1 - 20%): This parameter controls the number of nearest neighbours, i.e. the reference individuals with the smallest genetic / metric distance to the test individual. It is given as a percentile of all reference individuals ordered by their genetic / metric distance to the test individual.
Minimum proportion of complete loci / traits (0.1 - 1): The genotype or metric trait data of test and reference individuals are often incomplete. Missing data are coded by “-1”. This parameter sets the minimum proportion of complete data an individual must have, and thus how much missing data is tolerated. E.g. if you select “0.8”, test individuals and reference individuals are included in the calculations only if they have data for at least 80% of their loci or metric traits.
Output file details area
The GeoAssign program has two output modes. By default, it only generates summary data on the assignment. If you uncheck "Print only summary files", one additional file is generated for each test sample. For more information about the output files, see the section "Output files".
The "Prefix for output files" field gives the user the opportunity to choose a prefix for all output files.
After completion of the calculation you get an email containing a download link for your results. Therefore, it is necessary to specify an email address.
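
The two calculation parameters can be illustrated with a minimal Python sketch. It is not the GeoAssign implementation: the function names, the rounding of the percentile to a whole number of neighbours (rounded up, at least one), and the rule that a locus counts as complete only when both alleles are present are assumptions made for illustration. The distances are taken as given, since the distance measure itself is defined in Degen et al. 2017.

```python
import math

MISSING = -1  # missing-data code used in the input files


def is_complete(item):
    """A trait is complete if it is not -1; a locus (allele pair) is
    treated as complete only if both alleles are present (assumption)."""
    if isinstance(item, tuple):
        return all(allele != MISSING for allele in item)
    return item != MISSING


def proportion_complete(items):
    """Fraction of loci / traits of one individual that are complete."""
    if not items:
        return 0.0
    return sum(1 for item in items if is_complete(item)) / len(items)


def passes_completeness(items, min_proportion=0.8):
    """True if the individual would be kept for the calculation."""
    return proportion_complete(items) >= min_proportion


def nearest_neighbours(distances, percentile=5.0):
    """distances: dict mapping reference ID -> distance to the test individual.
    Returns the IDs of the closest `percentile` percent of references.
    Rounding up and keeping at least one neighbour are assumptions."""
    k = max(1, math.ceil(len(distances) * percentile / 100.0))
    ranked = sorted(distances, key=distances.get)  # smallest distance first
    return ranked[:k]
```

For example, with five reference individuals and a percentile of 40, the sketch keeps the two references with the smallest distance to the test individual.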

Input data files

The input files are simple ASCII files. They can be created with any commonly used editor. (Note: The program requires a point "." as the decimal separator (e.g. "125.12"). If you are not using the UK or US regional configuration, make sure that your operating system supports this setting.) The file content consists of a header part and a data part. The header part provides a few numbers needed for the assignment calculation. The data part is line-based; each line represents one sample. Variables are separated by “,”. Missing data are indicated by “-1”. The format for the reference and the test file is identical and explained in the following figures:

**Figure 1:** Example input file with genetic data for diploid species. Note: Haploid data can be entered as homozygotes.

Note: For a successful GeoAssign calculation, it is essential that the header and the data header line of the test file and the reference file are identical!

Explanatory notes on the different fields:

Header: The name of the data set can be any string. The number of individuals has to be equal to the number of sample lines in this file. The number of diploid gene loci (for genetic data) or the number of quantitative traits has to correspond to the number of these fields in the file. Please note: in the case of genetic data, each gene locus consists of two alleles! The different alleles must be coded by numerical values. There is no restriction on the number of alleles. The highest number has to be specified in the header (see Figure 1).
Data header line: The order of the different data fields is fixed and has to be identical to the order in the example above. The number of loci / traits is not restricted. Each locus name has to be duplicated in the header line.
Data (Sample) line:

Sample ID: The Sample ID has to be a string consisting of alphabetic characters, numbers, underscores and whitespace.
Group of the sample: In the test file, this field denotes the assumed group membership of the individual. In the reference file, it gives the group membership of the individual.
Coordinates: The coordinates define the location of the individual. The notation has to be in decimal degrees for longitude and latitude!
Loci or traits: Alleles must be coded by numerical values. Missing data are indicated by “-1”. Each locus consists of two alleles. Haploid data can be entered as homozygotes. Other traits, like isotopic ratio, are single values.
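
The field rules above can be sketched as a small Python parser for one data line. This is only an illustration under stated assumptions: the function name and the returned dictionary are hypothetical, and the column order (Sample ID, group, longitude, latitude, then the loci or traits) is assumed here and must be checked against Figure 1, which defines the binding order.

```python
MISSING = -1  # missing-data code used in the input files


def parse_sample_line(line, n_loci, genetic=True):
    """Parse one comma-separated data line.

    Assumed column order (check against Figure 1):
    sample ID, group, longitude, latitude, then loci / traits.
    """
    fields = [field.strip() for field in line.split(",")]
    sample_id, group = fields[0], fields[1]
    longitude, latitude = float(fields[2]), float(fields[3])  # decimal degrees

    raw = fields[4:]
    expected = 2 * n_loci if genetic else n_loci  # two alleles per locus
    if len(raw) != expected:
        raise ValueError(
            f"{sample_id}: expected {expected} values, found {len(raw)}"
        )

    if genetic:
        alleles = [int(value) for value in raw]
        # Group consecutive values into (allele 1, allele 2) pairs per locus.
        data = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    else:
        # Metric traits are single values with "." as decimal separator.
        data = [float(value) for value in raw]

    return {
        "id": sample_id,
        "group": group,
        "longitude": longitude,
        "latitude": latitude,
        "data": data,
    }
```

A call such as parse_sample_line("Ind_01,PopA,10.25,53.5,120,124,-1,-1", n_loci=2) yields two allele pairs, the second of which is missing at both alleles.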

Output files

The programme computes four different types of output files. The file names are composed of the user-defined prefix and a suffix specific to each file type:

Name	Content
prefix_Index.csv	For each tested individual, the values of the assignment indices in the different reference groups. The index represents the proportion of the nearest neighbours weighted by the size of the reference group.
prefix_Ex_Prob.csv	For each tested individual, the values of the exclusion probabilities in the different reference groups. As an indicator of exclusion (outlier), the programme computes, for each reference group, the relative frequency of distances among pairs of reference individuals that are smaller than the distances between the test individual and the reference individuals.
prefix_Ind_Num.txt	An ASCII text file for each tested individual “Ind_Num” with the genetic/metric distances between the test individual and all reference individuals, ordered from the smallest to the largest distance.
prefix_Indentical_Genotypes.csv	This file is optional and only generated for genetic data. It includes a list of reference individuals with identical genotypes.
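
The assignment index described for prefix_Index.csv can be sketched in Python as follows. This is one possible reading of "proportion of the nearest neighbours weighted by size of the reference group", not the formula used by GeoAssign: here the share of nearest neighbours that belong to a group is divided by that group's share of all reference individuals. The function name and this weighting are assumptions; the definition in Degen et al. 2017 is authoritative.

```python
from collections import Counter


def assignment_index(neighbour_groups, reference_groups):
    """neighbour_groups: group label of each nearest neighbour.
    reference_groups: group label of every reference individual.
    Returns a dict mapping group -> index (assumed weighting, see text)."""
    if not neighbour_groups or not reference_groups:
        raise ValueError("need at least one neighbour and one reference")

    k = len(neighbour_groups)
    n = len(reference_groups)
    in_neighbours = Counter(neighbour_groups)
    in_reference = Counter(reference_groups)

    index = {}
    for group, size in in_reference.items():
        share_of_neighbours = in_neighbours[group] / k  # 0 if group is absent
        share_of_references = size / n
        index[group] = share_of_neighbours / share_of_references
    return index
```

Under this reading, a value above 1 means the group is over-represented among the nearest neighbours relative to its size in the reference data, and a value below 1 means it is under-represented.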

Further questions

If you have further questions or comments on the GeoAssign web server, please contact us.

GeoAssign:

Assignment of origin using genetic, phenotypic and geographic information
