philosophy.rmd


---
title: "Philosophy"
output:
  html_document:
    fig_cap: yes
    highlight: tango
    smooth_scroll: no
    theme: flatly
    toc: yes
    toc_float: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The idea of the [Typological Atlas of Daghestan](http://lingconlab.ru/dagatlas/) was to create a tool for the visualization of linguistic features typical of the languages of Daghestan.

Although the language sample underlying [TALD](http://lingconlab.ru/dagatlas/) has grown to include a number of neighboring languages (more on that in [Language sample](#the-language-sample) below), the core of the project remains Daghestan.

Our aim is to achieve maximum coverage of languages and dialects. Datasets on particular topics should be updatable, so that our map visualizations become incrementally more accurate. In order to make the resource updatable, contributors can choose to maintain the right to approve or reject any proposed updates or corrections, or they can choose to waive that right. Note that if you want to stay involved, this also entails a responsibility to review and reply to propositions within a certain time-frame. See [Step 5](steps.html) on how to propose changes to a dataset or chapter that is already published. 

Data on linguistic features is collected primarily from descriptive literature; as a result, the Atlas can also be helpful in bibliographical research.

# Datapoints

The initial approach of [TALD](http://lingconlab.ru/dagatlas/) assigned one value for a linguistic feature to each language, based on a representative doculect. 

This information could then be mapped onto generalized language datapoints, or all villages where the language in question is spoken.

Below are two possible visualizations for the same dummy feature: the initial consonant of various cognates meaning 'bridge'.

```{r, echo=FALSE,out.width="40%", out.height="20%",fig.cap="General language datapoints vs. village datapoints",fig.show='hold',fig.align='center'}

knitr::include_graphics(c("images/walsex.png","images/taldex.png"))
``` 

<center>
|language|feature                      |value|form |
|--------|-----------------------------|-----|-----|
|Avar    |Initial consonant of 'bridge'|ƛ'   |ƛ'o  |
|Khwarshi|Initial consonant of 'bridge'|t'   |t'eru|
|Karata  |Initial consonant of 'bridge'|ƛ'   |ƛ'eru|
</center>

A benefit of the visualization on the right, is that it shows the distribution and size of language communities more accurately. A drawback is that it leads to gross overgeneralization and erases dialectal differences, because it is not based on more data than the visualization on the left. 

All of the Avar villages, for example, are colored according to data from Standard Avar, while we know that 'bridge' in the Zaqatala dialect spoken in Northern Azerbaijan is pronounced *kːjo*. These villages should thus have a different value.

# Current approach

To improve the accuracy of our visualizations, we currently collect all attested values for a given feature, taking into account any idiom we have data on, including standard languages, dialects spoken in multiple villages and single-village idioms. 

**The visualization shows the most accurate level of granularity available for each village / point on the map.**

For example, the Andi language is spoken in 17 villages. There are 9 main villages, each of which has its own idiom. These can be divided into two main dialects: Upper Andi and Lower Andi. In addition, there are 8 villages for which we have no information on the variety spoken there.

Now let us look at a relatively straightforward linguistic feature like **Number of noun classes**, for which we have general data on both the Upper and Lower dialects, and more accurate information on several villages from the Upper group. One of these villages (Rikvani) even has a value that differs from the other varieties we have data for.

The table below summarizes the different values observed for the language, divided by type of idiom (the number of noun classes is indicated between brackets and each value is color-coded).

|Language|Toplevel dialect|Village|
|--------|----------------|-------|
|Andi <span style="color:Firebrick">●</span> (5)  |Upper Andi <span style="color:Firebrick">●</span>  (5)    |Rikvani <span style="color:Thistle">●</span> (6)|
|        |Lower Andi <span style="color:MistyRose">●</span> (3)    |       |

The diagram below shows the dialect grouping of Andi villages and their values for the noun class feature. At the center is the language as a whole: it has the same value as the Upper group and the eponymous village dialect of Andi, which are most representative of the language as a whole. On the map, the unclassified Andi villages (colored grey in the scheme below) will be colored according to the general language information.

```{r, echo=FALSE}

library(DiagrammeR)

DiagrammeR::grViz("digraph {
  
graph[layout = neato, rankdir = LR]

node [fontsize = 14,
      shape = oval, 
      style = filled, 
      fillcolor = Firebrick, 
      fontcolor = white,
      color = Firebrick]
        
# language      
      Andi

node [fontsize = 10]

# dialects
      Upper
      
node [shape = oval, 
      style = filled, 
      fillcolor = WhiteSmoke, 
      fontcolor = black,
      color = WhiteSmoke]
      Other
      
node [shape = oval, 
      style = filled, 
      fillcolor = MistyRose, 
      fontcolor = black,
      color = MistyRose]
      
      Lower

# Andi-Upper-Villages

node [fontsize = 8,
      shape = oval,
      style = filled, 
      fillcolor = Firebrick, 
      fontcolor = white,
      color = Firebrick]
Chanko
Zilo
Ashali
Andiv [label = 'Andi']
Gunkha
Gagatli

# Rikvani

node [shape = oval,
      style = filled, 
      fillcolor = Thistle, 
      fontcolor = black,
      color = Thistle]
Rikvani

# Lower villages

node [shape = oval, 
      style = filled, 
      fillcolor = MistyRose, 
      fontcolor = black,
      color = MistyRose]

Kvankhidatli
Muni

# Other villages

node [shape = oval, 
      style = filled, 
      fillcolor = WhiteSmoke, 
      fontcolor = black,
      color = WhiteSmoke]
Mekheturi
Shivor
Khando
Rushukha
Novogagatli
Aytkhan
Dzhugut
Tsibilta

edge [color = black, arrowhead = none]

Andi -> {Other, Upper, Lower}

Other -> {Mekheturi, Shivor, Rushukha, Khando, Tsibilta, Novogagatli, Aytkhan, Dzhugut}
Upper -> {Andiv, Rikvani, Gagatli, Zilo, Ashali, Chanko, Gunkha} 
Lower -> {Muni, Kvankhidatli}
    
}")


```

The coverage of the Atlas is far from sufficient because we lack data for many dialects, and we do not know the dialect affiliation of a large number of villages in the area. We compensate for this shortcoming by encoding the level of accuracy/granularity for each datapoint, and allowing the user to toggle which levels to display (see [Map visualization](#Map-visualization) below). Ideally, our datasets will be updated when new information becomes available.

# Map visualization{.tabset .tabset-fade .tabset-pills}

[TALD](http://lingconlab.ru/dagatlas/) currently offers three different map visualizations:

1. **Language and feature** shows the language affiliation (inner dot) and the value for the linguistic feature (outer dot) for each village.
2. **Data granularity** allows the user to show only certain levels of data accuracy for the feature. For example you can uncheck "language" to remove all the dots that were colored according to general information about the language in the absence of more accurate data. In this case it will remove all Andi villages that have no dialect classification.
3. **General datapoints** displays one datapoint for each language in the sample, showing the language affiliation (inner dot) and the value for the linguistic feature (outer dot) for each point.

You can click on a datapoint to view a pop-up window with the name of the language (with a link to the [Glottolog](https://glottolog.org) database), the village, the granularity of data used to color this datapoint, and the value.

```{r, echo=FALSE, message=FALSE}

# packages

library(tidyverse)
library(lingtypology)

# load data

feature <- read_tsv("dummy_data/dummy_feature.csv")
villages <- read_tsv("dummy_data/dummy_villages.csv")

# remove data not for mapping

feature <- feature[(feature$map == "yes"),]

# split feature data into dialect levels

feature_group <- feature %>%
  group_by(type) %>%
  group_split()

feature_tl <- data.frame(feature_group[[1]])
feature_v <- data.frame(feature_group[[2]])

feature_tl$granularity <- "toplevel dialect"
feature_v$granularity <- "village dialect"

# merge feature data with villages dataset

## create matching columns

colnames(feature_tl)[colnames(feature_tl) == "idiom"] <- "toplevel_dialect"
colnames(feature_v)[colnames(feature_v) == "idiom"] <- "village_dialect"

## toplevel dialect data

tlevel_villages <- merge(villages, feature_tl, by = "toplevel_dialect")
v_villages <- merge(villages, feature_v, by = "village_dialect")

# [this generates a lot of garbage columns]

## combine dialect data of different granularity

dialect_villages <- full_join(v_villages, tlevel_villages, by = "village")

# [not very convenient, data from different datasets is in different columns]

## do a simple rbind instead

dialect_villages2 <- rbind(v_villages, tlevel_villages)

dialect_villages3 <- dialect_villages2[!duplicated(dialect_villages2$village),]

# [this gives the desired result, but it's cumbersome and will fail 
# if we happen to have multiple villages with the same name in our set]

# IMPORTANT: SETS SHOULD BE MERGED IN THE RIGHT ORDER (HIGH GRAN - LOW GRAN)
# SO THAT THE DUPLICATES WITH THE HIGHEST GRANULARITY ARE KEPT

## and now for the villages for which we have no data

# [*adds column for villages that lack a dialect affiliation*]

### isolate general language data

feature_l <- feature %>%
  filter(genlang_point == "yes") %>%
  mutate(granularity = "language") %>%
  mutate(default_level = lang) %>%
  select(-idiom)

### create a set of unaffiliated villages

#lost_villages <- villages[villages$default_level == "yes",]

### make them match

### merge feature data and village set

lang_villages <- merge(villages, feature_l, by = "default_level")

### add to the refined set

alldata <- full_join(dialect_villages3, lang_villages, by = "village")

### и еще раз волшебный rbind

alldata2 <- rbind(dialect_villages3, lang_villages)
alldata3 <- alldata2[!duplicated(alldata2$village),]

```

## 1. Language and feature

```{r, echo=FALSE, message=FALSE}

map.feature(alldata3$lang.x,
            latitude = alldata3$lat,
            longitude = alldata3$lon,
            features = alldata3$lang.x,
            color = "#003366",
            title = "Language",
            label = alldata3$village,
            stroke.features = as.factor(alldata3$value),
            stroke.color = c("MistyRose", "Firebrick", "Thistle"),
            stroke.title = alldata3$feature[1],
            popup = paste("<b>Village:</b>", alldata3$village, "<br>",
                          "<b>Data:</b>", alldata3$granularity, "<br>",
                          "<b>Value:</b>", alldata3$value),
            zoom.control = T)

```

## 2. Data granularity

```{r, echo=FALSE, message=FALSE}

# единсвтенное, мне хотелось бы, чтобы можно было посмотреть в popup еще и название идиома рядом с его левелом, например: Upper, toplevel dialect, но не знаю как это реализовать

map.feature(alldata3$lang.x,
            latitude = alldata3$lat,
            longitude = alldata3$lon,
            features = as.factor(alldata3$value),
            color = c("MistyRose", "Firebrick", "Thistle"),
            width = 10,
            title = alldata3$feature[1],
            legend.position = "bottomleft",
            label = alldata3$village,
            control = alldata3$granularity,
            popup = paste("<b>Village:</b>", alldata3$village, "<br>",
                          "<b>Data:</b>", alldata3$granularity, "<br>",
                          "<b>Value:</b>", alldata3$value),
            zoom.control = T)

```

## 3. General datapoints

```{r, echo=FALSE, message=FALSE}

map.feature(feature_l$lang,
            features = feature_l$lang,
            color = "#003366",
            title = "Language",
            label = feature_l$lang,
            stroke.features = as.factor(feature_l$value),
            stroke.color = c("MistyRose", "Firebrick", "Thistle"),
            stroke.title = feature_l$feature[1],
            zoom.control = T,
            zoom.level = 7)

```

# The language sample

As mentioned earlier, [TALD](http://lingconlab.ru/dagatlas/) was originally conceived of as a resource about the languages of Daghestan. 

[The East Caucasian villages dataset](https://github.com/sverhees/master_villages) -- a dataset that contains a list of villages in the eastern Caucasus, their coordinates, and the languages spoken there -- was created as a basis for visualizations in the Atlas. Initially it covered all villages of Daghestan and some East Caucasian speaking communities in Georgia and Northern Azerbaijan.

Villages of Chechnya and Ingushetia, and several more communities in Azerbaijan and Georgia, were added later. It was relevant to include these extra datapoints outside of Daghestan for the development of areal hypotheses.

---

## Northeastern Daghestan

Villages located in the northeastern part of Daghestan are not displayed in the Atlas. This area was settled only relatively recently, and is much more ethnically mixed than the rest of the area, as you can see [here](https://sverhees.github.io/master_villages/maps_new.html). A large part of it remains virtually uncharted (though see Yuri Koryakov's efforts to close this gap [here](https://jirzik.livejournal.com/3002.html)), and is currently undergoing changes.

In addition, little to nothing is known about the varieties of the languages spoken there. In some cases we know the village of origin for most of its inhabitants (e.g. Novogagatli was founded by settlers from the Andi village Gagatli), but we do not know how well their dialect is preserved and to what extent their population is ethnically and linguistically homogeneous.

Therefore, we decided to exclude the newly settled region from our visualization (for now).

---

## List of languages

Below is a list of the languages included in our sample, grouped by language family and branch.

#### East Caucasian

* **Avar**
  * Avar

* **Andic**
  * Akhvakh
  * Andi
  * Bagvalal
  * Botlikh
  * Chamalal
  * Godoberi
  * Karata
  * Tindi
  
* **Tsezic**
  * Bezhta
  * Hinuq
  * Hunzib
  * Khwarshi
  * Tsez

* **Dargwa**
  * Standard Dargwa
  * *Chirag*
  * Itsari
  * Kaitag
  * Kubachi
  * Mehweb
  * Tanty
 
<font color = "dimgray">Dargwa is considered a single language with a number of highly divergent dialects by some, and a group of distinct but related languages by others. In our map visualizations with one dot per language, several Dargwa varieties are shown, because they are sufficiently divergent. However, since our data comes from descriptive literature, we are unfortunately bound to some extent to the post 1930s view of Dargwa as a single language. In our [villages dataset](https://sverhees.github.io/master_villages/maps_new.html#dialects), for example, a variety like Mehweb forms part of the Northern Dargwa dialect group. We cannot rearrange this underlying tree-structure too drastically for practical reasons. 

Unfortunately we do not have a full reference grammar for *Chirag* yet, so you can leave the row for Chirag empty for now.
</font color>   

* **Lak**
  * Lak

* **Lezgic**
  * Agul
  * Archi
  * Budukh
  * Kryz
  * Lezgian
  * Rutul
  * Tabasaran
  * Tsakhur
  * Udi

* **Khinalug**
  * Khinalug

* **Nakh**
  * Chechen
  * Ingush
  * Tsova-Tush (Batsbi)

#### Indo-European

* **Armenic**
  * Armenian

<font color = "dimgray">Armenian appeared in our sample due to a small village in northeastern Daghestan where the language is spoken. Although the village was lost after we removed this area from our visualizations, we decided to keep Armenian as an important language of the region.</font color>

* **Iranian**
  * Tat

#### Kartvelian

* Georgian

<font color = "dimgray">Georgian is spoken in the Qakh district of Azerbaijan, alongside Tsakhur and Azerbaijani. It is also an important neighboring language.</font color>

#### Turkic

* **Kipchak**
  * Kumyk
  * Nogai

* **Oghuz**
  * Azerbaijani