Syntax: The Language Form
Language processing in humans and computers: Part 2
Part 1 was:
Who are chatbots (and what are they to you)? Afterthoughts: Four elephants in a room with chatbots
This is Part 2:
1. Syntax is deep, semantics is arbitrary
- 5.1. Semantic context-sensitivity
- 5.2. Syntactic context-sensitivity
- 5.3. Communication is the process of sharing semantical contexts
Part 3:
Semantics: The Meaning of Language
Part 4:
Language as a Universal Learning Machine
1. Syntax is deep, semantics is arbitrary
People speak many languages. People who speak different languages generally don't understand each other. How is it possible to have a general theory of language?
Life, too, is diversified into many species, and different species generally cannot interbreed¹. Yet life is a universal capability of self-reproduction, and biology is a general theory of life.
General linguistics is based on Noam Chomsky's Cartesian assumption²: that all languages arise from a universal capability of speech, innate to our species. The claim is that all of our different languages share the same deep structures embedded in our brains. Since different languages assign different words to the same things, the semantic assignments of words to meanings are not a part of these universal deep structures. Chomskian general linguistics is mainly concerned with general syntax. It also studies (or it used to study) the transformations of the deep syntactic structures into the surface structures observable in particular languages, just like biology studies the ways in which the general mechanisms of heredity lead to particular organisms. Oversimplified a little, the Chomskian thesis implied that
- syntax is the main subject of modern linguistics, whereas
- semantics is studied in complementary ways in
  - philosophy of meaning, be it under the title of semiology, or in the many avatars of structuralism; and by different methods in
  - search engine engineering, information retrieval indices and catalogs, user profiling, and targeted advertising.
However, the difference between the pathways from deep structures to surface structures as studied in linguistics on one hand and in biology on the other is that
- in biology, the carriers of the deep structures of life are directly observable and empirically studied in genetics, whereas
- in linguistics, the deep structures of syntax are not directly observable but merely postulated, as Chomsky's Cartesian foundations, and the task of finding actual carriers is left to a future science.
This leaves the Cartesian assumption about the universal syntactic structures on shaky ground. The emergence of large language models may be a tectonic shift of that ground. Most of our early interactions with chatbots seem to suggest that the demarcation line between syntax and semantics may not be as clear as traditionally assumed.
To understand a paradigm shift, we need to understand the paradigm. To stand a chance of understanding large language models, we need a basic understanding of the language models previously developed in linguistics. In this lecture and in the next one, we parkour through the theories of syntax and of semantics, respectively.
2. Grammar
2.1. Constituent (phrase structure) grammars
Grammar is trivial in the sense that it was the first part of the trivium. The trivium and the quadrivium were the two main parts of the curriculum in medieval schools, partitioning the seven liberal arts that were studied. The trivium consisted of grammar, logic, and rhetoric; the quadrivium of arithmetic, geometry, music, and astronomy. Theology, law, and medicine were not studied as liberal arts because they were controlled by the Pope, the King, and the physicians' guilds, respectively. So grammar was the most trivial part of the trivium. At the entry point of their studies, the students were taught to classify words into 8 basic syntactic categories, going back to Dionysius Thrax in the 2nd century BCE: nouns, verbs, participles, articles, pronouns, prepositions, adverbs, and conjunctions. The idea of categories goes back to the first book of Aristotle's Organon³. The basic noun-verb scaffolding of Indo-European languages was noted still earlier, but Aristotle spelled out the syntax-semantics conundrum: What do the categories of words in the language say about the classes of things in the world? For a long time, partitioning words into categories remained the entry point of all learning. As the understanding of language evolved, the structure of language itself became the entry point.
Formal grammars and languages are defined in the next couple of displays. They show how the formalism works. If you don't need the details, skip them and move on to the main idea. The notations are explained among the notes⁴.
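For readers who prefer code to notation, here is a minimal Python sketch of the same idea, assuming the standard textbook definition of a phrase structure grammar as a set of terminals, a set of nonterminals, a start symbol, and rewrite rules. The toy lexicon, the category names, and the helper functions `expand` and `language` are made up for illustration and are not taken from the lecture. The grammar is deliberately non-recursive, so that the generated language is finite and can be listed in full.

```python
from itertools import product

# Hypothetical toy grammar: all names and rules below are illustrative assumptions.
TERMINALS = {"the", "dog", "cat", "sees", "sleeps"}    # the lexicon
NONTERMINALS = {"S", "NP", "VP", "Det", "N", "V"}      # syntactic categories
START = "S"                                            # the sentence category

# Each rule rewrites one nonterminal into a sequence of symbols.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["sleeps"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"]],
}


def expand(symbol):
    """Return every string of terminals (as a tuple of words) derivable from symbol."""
    if symbol in TERMINALS:
        return [(symbol,)]
    results = []
    for rhs in RULES[symbol]:
        # Expand each symbol on the right-hand side, then combine all choices.
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append(tuple(word for part in combo for word in part))
    return results


def language():
    """The language of the grammar: all sentences derivable from START."""
    return {" ".join(words) for words in expand(START)}


if __name__ == "__main__":
    for sentence in sorted(language()):
        print(sentence)
```

The point of the sketch is only that a grammar is a finite object (a few sets and rules), while the language is whatever the rules generate from the start symbol. A rule that mentions its own left-hand side, directly or indirectly, would make the language infinite, and this naive enumeration would no longer terminate.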



The idea of the phrase structure theory of syntax is to start from a lexicon as the set of terminals