As of now, this goal is very far off, and we are happy if we can make progress on smaller subtasks, even if they do not achieve perfect accuracy. The problem studied in this thesis is one such subtask, and can be described as follows: Given a large collection of written text in a given natural language, can a computer, without any specific knowledge about the language, extract a description of how words are conjugated in that language? The problem is often referred to as Unsupervised Learning of Morphology, but Automatic Induction of Morphology, Morpheme Discovery, Word Segmentation, Algorithmic Morphology, quantitative Morphsegmentierung (in German) and other variants have also been used.
Of these, Unsupervised Learning of Morphology (ULM) is fairly common and faces the least risk of misunderstanding, so it will be used throughout the present work. In the Computer Science tradition, the solution to a task such as this amounts to (a) providing a formal description of the problem in terms of sets, strings, logical conditions and the like, into which real-world instances are approximated, (b) providing a step-by-step description of a method, i.
Remarkably, in the s, long before Computer Science had matured as a field, and long before computers became practical to use, so-called structural linguists were asking for a solution of exactly the same kind to the ULM and related problems, but from a different perspective. The interest was not so much in putting computers to work as in learning how linguistic analysis could be understood, which has particular implications for linguistic theory and possibly child language acquisition.
As with most work in Language Technology, the present work will draw on experiences from both Computer Science and Linguistics, and hopefully contribute to all. The ULM problem is stated above in rather abstract terms. All these aspects will be elaborated on in the thesis. No knowledge at all of forms is to be supplied, but a small number of parameters and assumptions about suffix length can be tolerated, whereas running time is not a priority. Word-form analysis, or morphological analysis (see below), is generally the first step in computational analysis of natural language, and as such has a wide variety of LT applications, including Machine Translation, Document Categorization and Information Retrieval.
ULM can also serve to boost investigations in Linguistics, especially the subfields Quantitative Linguistics and Linguistic Typology, and potentially contribute to linguistic theory.
A legitimate question concerns the stipulation that distributional criteria alone should serve as the source of knowledge for the computer. Why can a little, or a lot, of human knowledge about a language not be hard-wired in order to describe how words are conjugated? This is indeed an option, and has been the way to handle the matter for virtually all languages committed to computational treatment, but it normally requires a lot of human effort. Roughly the amount of work of an MA thesis is needed to computationally implement conjugational patterns, and an unspecified but huge amount of work to list legal lexical items.
First, it would be a great benefit to rid ourselves of the human effort of implementing conjugational patterns for the next range of languages to receive computational treatment. Second, even for languages which have this already, along with huge lists of lexical items, open-domain texts will always contain a fair share of inflected, previously unknown words that are not in the lexicon (Forsberg et al.). There has to be a strategy for such out-of-dictionary words, and a ULM-solving algorithm is one possibility.
It could also turn out that the ULM problem cannot, in some sense, be solved without explicit human-derived linguistic knowledge. If such a proof, or a convincing argument, is found, this constitutes a resolution to the ULM problem as good as one which proves the existence of a ULM-solving algorithm.
2 Languages of the World

The work described in the second part of this thesis is in the area of Linguistics, here defined as the study of natural languages. More specifically, the work in this thesis falls in the subfield of Linguistic Typology, or the systematic study of the unity and variation of the languages of the world. As is well known, the everyday usage of the word language does not precisely correspond to this delineation, as other factors, such as attitudes or political power, play a role in forming the everyday status.
Then it is logically possible that A is mutually intelligible with B, and B is mutually intelligible with C, but A is not mutually intelligible with C. The traditional manner in which linguists have approached this situation is to say that there is no way to assign languages over A, B and C without getting into contradictions, given the concept of a language as a maximal set of mutually intelligible varieties: A, B and C cannot all be the same language, as A and C are not mutually intelligible.
If A and B form one language, then by the same token B and C should also form one language; but if A is the same language as B and B is the same as C, then A and C must be the same, yet they are not mutually intelligible!
For this reason, linguists have thought of the concept of language as being born with logical inconsistencies and, as a result, have declared it impossible to count the number of languages in the world. This traditional view is too narrow, and to claim that there is no meaningful way to count the number of languages is wrong.
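The non-transitivity at the heart of this argument can be made concrete with a small sketch. The variety names and the intelligibility judgements below are toy assumptions, and the helper functions are hypothetical. The sketch only illustrates the problem with the "maximal set" definition; it is not the interpretation proposed in Chapter V.

```python
from itertools import combinations, permutations

# Toy assumption: three speech varieties, with A-B and B-C judged
# mutually intelligible, but not A-C.
varieties = ["A", "B", "C"]
intelligible_pairs = {frozenset("AB"), frozenset("BC")}


def mutually_intelligible(x, y):
    """Symmetric, reflexive lookup of the pairwise judgement."""
    return x == y or frozenset((x, y)) in intelligible_pairs


def is_transitive(items):
    """True if intelligibility behaves like an equivalence relation."""
    for x, y, z in permutations(items, 3):
        if (mutually_intelligible(x, y) and mutually_intelligible(y, z)
                and not mutually_intelligible(x, z)):
            return False
    return True


def maximal_intelligible_sets(items):
    """Brute-force list of maximal sets whose members are pairwise intelligible."""
    found = []
    for size in range(len(items), 0, -1):  # largest candidate sets first
        for group in combinations(items, size):
            candidate = set(group)
            pairwise_ok = all(mutually_intelligible(x, y)
                              for x, y in combinations(group, 2))
            # Keep the candidate only if no larger accepted set contains it.
            if pairwise_ok and not any(candidate < kept for kept in found):
                found.append(candidate)
    return found


print(is_transitive(varieties))
print(maximal_intelligible_sets(varieties))
```

Under these toy judgements the relation is not transitive, and the two maximal sets overlap in variety B, so they do not partition the varieties into languages. That overlap is the contradiction the traditional view points to.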
In Chapter V, we give a novel, intuitively sound interpretation showing that it is possible to count the number of languages without any inconsistencies in any arrangement of speech varieties, as long as we assume that each pair of varieties can be decided to be mutually intelligible or not. In Linguistic Typology, cross-linguistic facts are noted and explanations are sought for non-random discrepancies.
Many different kinds of explanations could a priori be invoked: psycholinguistic, historical, cultural, etc. In Chapter VII we present a rigid definition and a thorough survey of facts on one aspect of human language, namely number bases in the numeral system. It is presumably the first such survey that is explicitly known to cover languages from every language family attested in the world, and thereby we are able to set the record straight in a number of open cases.
In Chapter VI we attempt to trace the emergence of the base-6 system in this area. Although the data is somewhat incomplete, there is evidence that the system came from yam counting. This is a cultural explanation, as the neighbouring non-base-6 languages do not rely on tuber cultivation for subsistence. Of these, many are on the path to extinction, in the sense that speakers, especially younger generations, are shifting to using another language, and consequently, as generations pass, no speakers at all will be left.
Languages today die at a much faster rate than languages diverge to become new languages. For a small group of people, the language is part of their identity, and while a few are happy to shift, most groups would like to maintain their language, and, if anything, be bilingual in another, bigger, language.
Language documentation, i.
Language documentation is, and has been, an extremely decentralised activity. It has been carried out by linguists, missionaries, travellers, anthropologists, administrators, etc., stationed at missions, colonial establishments, universities in the first world and universities in the third world, over the past several centuries.
There is no central record of which and how many languages have been described and to what level. From the perspective of science, the highest priority is poorly documented languages that are not genetically related to any better-documented language. Making such a list involves considerable bookkeeping work and a vast amount of analysing unclear cases, judging extinctness, and gauging relatedness of partly described, dubiously attested language varieties.
Springer-Verlag, Berlin. A naive theory of morphology and an algorithm for extraction. In Wicentowski, R. Association for Computational Linguistics. In Ng, H. A fine-grained model for language identification. A survey and classification of methods for mostly unsupervised learning of morphology. Singapore: ACL. Forsberg, M. Lexicon extraction from raw text data. In Salakoski, T. Automatic annotation of bibliographical references with target language. Counting languages in dialect continua using the criterion of mutual intelligibility. Journal of Quantitative Linguistics, 15(1). Whence the Kanum base-6 numeral system? Linguistic Typology, 13(2). Rarities in numeral systems. In Wohlgemuth, J. Mouton de Gruyter.
All the work in the present thesis is the sole and original work of the author, except Chapter III and the last section of Chapter 8. In Chapter III, the present author conducted the experiment, took part in discussions, wrote the related work section and did the proof of NP-completeness, whereas the design, description and implementation of the extraction tool was the work of Markus Forsberg and Aarne Ranta.

References

Bharati, A. Unsupervised improvement of morphological analyzer for inflectionally rich languages. Tokyo, Japan. Lexicon extraction from raw text data. Springer-Verlag, Berlin. A probabilistic model for guessing base forms of new words by analogy. In Gelbukh, A. Mikheev, A. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3). A naive theory of morphology and an algorithm for extraction.
A fine-grained model for language identification.

Restrictions: We consider only concatenative morphology and assume that the corpus comes already segmented on the word level. The problem, in practice and in theory, is relevant for information retrieval, child language acquisition, and the many facets of use of computational morphology in general.
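To make the two restrictions concrete, the following is a minimal sketch of what a purely concatenative view of a word-segmented corpus looks like. It is a naive illustration and not the algorithm developed in this thesis. The function name `suffix_candidates`, the parameter `max_suffix_len`, the toy corpus, and the heuristic of keeping a split only when the bare stem also occurs as a word are all assumptions made for this example.

```python
from collections import defaultdict


def suffix_candidates(words, max_suffix_len=3):
    """Map each candidate suffix to the set of stems it attaches to.

    Concatenative assumption: every word is treated as stem + suffix.
    Naive heuristic (an assumption of this sketch): a split is kept only
    when the bare stem also occurs as a word in the corpus.
    """
    vocabulary = set(words)
    stems_by_suffix = defaultdict(set)
    for word in vocabulary:
        # Try every suffix length up to the limit, leaving a non-empty stem.
        for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
            stem, suffix = word[:-k], word[-k:]
            if stem in vocabulary:
                stems_by_suffix[suffix].add(stem)
    return stems_by_suffix


# Toy corpus, already segmented on the word level (whitespace-separated).
corpus = "walk walks walked walking talk talks talked jump jumps".split()
candidates = suffix_candidates(corpus)

# Rank suffixes by the number of distinct stems they combine with.
ranked = sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for suffix, stems in ranked:
    print(suffix, sorted(stems))
```

No language-specific knowledge enters the sketch apart from the suffix-length limit, which is the kind of small parameter the problem statement tolerates. Non-concatenative phenomena such as stem-internal vowel changes are invisible to it by construction.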