Lexical Category

Use Your Mind and Learn to Write: The Problem of Producing Coherent Text

Michael Zock, Debela Tesfaye Gemechu, in Cognitive Approach to Natural Language Processing, 2017

7.6.2 Identification of the semantic seed words

As mentioned already, in order to reveal the proximity or potential relation between two or more sentences, we can try to identify the similarity between the respective constituent words. We need to be careful though. If we do this by taking into account only the similarity values of the respective (pair of) words, we may bias the analysis and get incorrect results.

There are several problems at stake. For example, the number of identical words does not necessarily imply relatedness or similarity. Actually, two sentences may be composed of exactly the same words and still mean quite different things; compare "Women without their men are helpless" versus "Men without their women are helpless". Given the fact that such cases are quite frequent in natural language, we decided not to rely on (all) the words occurring in a sentence, or to use a "bag of words" approach (sentence without stop words). We preferred to rely only on specific words, called seeds, to compare the similarity of different sentences. We consider seed words to be elements conveying the core meaning of a sentence. For example, for the two sentences above we could get the following seeds: (a) without (women, men); (b) without (men, women), which reveal quite readily their difference.

Our idea of choosing seed words seems all the more justified as different kinds of words (lexical categories) have different statuses: some words convey more vital information than others. Nouns and verbs are generally more important than adjectives and adverbs, and each of them normally conveys more vital information than any of the other parts of speech [13]. We assumed here that the core information of our sentences is conveyed via the nouns (playing different roles: subject, object) and the verb linking them (predicate). We further assumed that dependency information was necessary in order to carry out the next steps. In the following two sentences, (a) "Foxes hide underground" and (b) "Foxes hide their prey underground", a "bag of words" method or a simple surface analysis would not do, as neither of them reveals the fact that the object of hiding ("fox" vs. "prey") is different in each sentence, a fact that needs to be made explicit.

To avoid this problem, we used the dependency information produced by the parser, which allowed us to determine the role of the nouns (deep-subject, deep-object) and the predicate (verb) linking the two. For example, this reveals the fact that the following two sentences are somehow connected: "Foxes eat eggs" and "Foxes eat fruits". In both cases, the concept "fox" is connected to some object ("egg" vs. "fruit") via some predicate, the verb "eat". The core of these two sentences is identical. Both of them tell us something about the foxes' diet or eating habits (egg, fruits). Note that while this method does not reveal the nature of the link (diet, food), it does suggest that there is some kind of link (both sentences talk about the very same topic: food). Hence, syntactic information (part of speech, dependency structure) is precious as it allows us to identify potential seed words that will be useful for subsequent operations.
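To make the extraction step concrete, here is a minimal sketch of seed-triple identification using spaCy's dependency parser. This is our own illustration, not the chapter's implementation; the model name, the lemmatization, and the restriction to nsubj/dobj dependencies are simplifying assumptions.

```python
# Minimal sketch (not the chapter's implementation): extract a
# predicate(subject, object) seed triple from a sentence's dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def seed_triple(sentence):
    """Return (predicate, deep-subject, deep-object) of the main verb, if any."""
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c.lemma_ for c in token.children if c.dep_ == "nsubj"]
            objects = [c.lemma_ for c in token.children if c.dep_ == "dobj"]
            if subjects:
                return (token.lemma_, subjects[0], objects[0] if objects else None)
    return None

print(seed_triple("Foxes eat eggs"))    # expected: ('eat', 'fox', 'egg')
print(seed_triple("Foxes eat fruits"))  # expected: ('eat', 'fox', 'fruit')
```

Comparing triples rather than whole bags of words preserves argument order, which is precisely what distinguishes without (women, men) from without (men, women) above.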

URL: https://www.sciencedirect.com/science/article/pii/B978178548253350007X

Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications

Akhil Gudivada, ... Venkat N. Gudivada, in Handbook of Statistics, 2018

6 Morphology

Morphology deals with types of words and how words are formed; it investigates the internal structure of words. Words differ in form and meaning. Form refers to what a word sounds like when it is uttered. Words belong to lexical categories, also called parts of speech. Lexical categories are classes of words (e.g., noun, verb, preposition) that differ in how other words can be constructed out of them. For example, if a word belongs to the lexical category verb, new words can be formed from it by adding suffixes such as -ing and -able.

Lexical categories are of two kinds: open and closed. A lexical category is open if it readily admits new members. Nouns, verbs, adjectives, and adverbs are open lexical categories. In contrast, closed lexical categories rarely acquire new members. They include conjunctions (e.g., and, or, but), determiners (e.g., a, the), pronouns (e.g., he, she, they), and prepositions (e.g., of, on, under). The creation of different grammatical forms of words is called inflection.

A word may have a root part and an affix part. An affix can be a prefix or a suffix. Word roots and affixes are called morphemes. Words are compared based on form, meaning, and lexical category, which enables segmentation of words into smaller parts, the morphemes. Free morphemes are simple words that can be used by themselves. Bound morphemes, on the other hand, cannot stand by themselves; they must be attached to the stem of some other word.

Compounding is a process in which new words are formed from two or more independent words. Reduplication is the process of forming new words by doubling an entire free morpheme or part of it. New words can also be formed by making internal modifications to a morpheme, a process called alternation (e.g., man and men).
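As a rough illustration of these morphological operations, the sketch below uses NLTK (our tool choice, not one named by the authors); it assumes nltk and its WordNet data are installed.

```python
# Illustrative sketch: affix stripping vs. undoing inflection with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming crudely strips affixes to approximate the root morpheme.
print(stemmer.stem("reading"))   # 'read'
print(stemmer.stem("readable"))  # 'readabl' (stems need not be real words)

# Lemmatization undoes inflection, including alternations such as men -> man.
print(lemmatizer.lemmatize("men", pos="n"))      # 'man'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
```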

URL: https://www.sciencedirect.com/science/article/pii/S0169716118300208

The Way of the Engineer

Robert Charles Metzger, in Debugging by Thinking, 2004

13.4.3.1 Compensate for human psychology

There are at least three ways to compensate for human psychology to avoid transcription stage errors:

1. Overcome psychological set.

2. Maximize psychological distance.

3. Prevent typographical errors.

Overcome psychological set

Psychological "set" reduces the ability of programmers to find errors [Uz66]. In our context, set causes the brain to see what it expects to see, rather than what the eye actually perceives. A transcription error may be completely masked by the set of the author. He or she knows what the program is supposed to say and is incapable of seeing what it actually says.

One way to overcome your psychological set is to have another person read your program. That person won't come to the text with the same assumptions you have. If you're working alone, leave the workplace and do something that has nothing to do with programming. When you return, your change of venue will often have broken your set.

A second way to overcome your psychological set is to display the program in a different format than that in which you're accustomed to viewing it. Here is a list of different ways to look at a program listing:

Look at a paper listing instead of a video display.

Display one token per line.

Display a blank space after each token.

Display the listing with the comments deleted.

Display everything in reverse video.

Display different lexical categories in various formats. Lexical categories include keywords, numeric literals, string literals, user names, and those glyphs that aren't numbers or letters. Different formats include all uppercase, italics, boldface, different colors for each category, and so forth.

Display numbers in equivalent formats. Show them in hexadecimal notation, scientific notation, or spelled out in words.

Display the program nesting structure in varying colors.
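Two of the formats above, one token per line and per-category labeling, are easy to combine. The sketch below is one possible rendering using the Pygments library (our choice; the book, which predates it, prescribes no tool):

```python
# Sketch: print a C fragment one token per line, labeled by lexical category.
from pygments import lex
from pygments.lexers import CLexer

source = "int main(void) { return O; }  /* bug: letter O instead of digit 0 */"

for category, text in lex(source, CLexer()):
    if not text.isspace():
        print(f"{str(category):28} {text!r}")
```

Seeing the misread O isolated on its own line, with its category printed beside it, is exactly the kind of format change that can break a psychological set.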

A third way to overcome your psychological set is to listen to the program being read. You can ask another programmer to read it to you. You can also have a voice synthesis program read it to you. In both cases, read along as the program is read aloud. You will notice that other human readers will separate and group items differently than you do.

Maximize psychological distance

All misreadings aren't created equal. Information theory says that messages have a distance, computed by counting the number of positions in which they differ; this is known as the Hamming distance [SW49].

The distance between "word" and "work" is one. The distance between "word" and "warm" is two. To reduce the likelihood of transcription errors, maximize the distance between characters and words that may be confused.
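This distance is straightforward to compute; a minimal sketch:

```python
# Hamming distance: the number of positions at which two strings differ.
def distance(a: str, b: str) -> int:
    assert len(a) == len(b), "defined only for strings of equal length"
    return sum(x != y for x, y in zip(a, b))

print(distance("word", "work"))  # 1
print(distance("word", "warm"))  # 2
```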

The number of different characters is only a starting point. Some pairs of letters and digits are much easier for the brain to misread than others. Easily confused pairs include the following:

0 and O

k and x

i and 1

w and v

m and n

The positions of differing characters are also important. When words differ in the first or last positions, people are less likely to misread them. Consider how likely you might be to confuse the following pairs of words:

"word" versus "ward"

"word" "versus" "cord"

"word" versus "worm"

Prevent typographical errors

Errors in keying must be considered when selecting names for constructs in a program. Names that differ only in characters adjacent on the keyboard are more likely to be mistyped. Touch typists are more likely to mistype characters that are reached by the same fingers on opposite hands.
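One way to apply this advice mechanically is to screen candidate names for pairs that differ only by keyboard-adjacent characters. The sketch below is hypothetical (the adjacency map is partial and assumes a QWERTY layout):

```python
# Flag name pairs that differ in one position by keyboard-adjacent keys.
ADJACENT = {  # partial QWERTY neighbour map, enough for the demo
    "a": "qwsz", "s": "awedxz", "d": "serfcx",
    "e": "wsdr", "r": "edft", "t": "rfgy",
    "n": "bhjm", "m": "njk",
}

def risky_pair(a: str, b: str) -> bool:
    """True if a and b differ in exactly one position by adjacent keys."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and diffs[0][1] in ADJACENT.get(diffs[0][0], "")

print(risky_pair("rate", "rare"))  # True: t and r are adjacent keys
print(risky_pair("rate", "race"))  # False: t and c are not adjacent
```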

URL: https://www.sciencedirect.com/science/article/pii/B9781555583071500136

Linguistics: Prototype Theory

J.R. Taylor, in International Encyclopedia of the Social & Behavioral Sciences, 2001

4 Prototypes and Linguistic Categories

The prototype concept may be applied, not only to the study of word meanings, but to the very categories of linguistic description. Many linguistic concepts (including the very notion of 'language' and 'a language') turn out to have a prototype structure, in that they exhibit degrees of representativity, and often have fuzzy boundaries. Uncertainty as to what goes into a category and what does not pertains even to such basic notions as what constitutes a word (as opposed to a bound morpheme, clitic, or phrase), the part-of-speech categories (whether a particular item is a noun, verb, adjective, etc.), constituency (whether a particular string of words is a syntactic constituent), and syntactic relations (whether a particular constituent is a modifier or adjunct, the subject of a clause, or whatever).

Some linguistic theories propose to decide such issues by fiat. In practice, however, the linguist's strategy often reflects a prototype conception, even though this might not be explicitly acknowledged. When deciding the category status of a linguistic item, it is usual to apply a set of tests (Croft 1991). The tests are set up on the basis of what, intuitively, count as good examples of the category in question, whereby each of the tests diagnoses a property typical of the good examples. An item which passes all the tests is ipso facto a member of the category; an item which fails the tests is not a member. Sometimes, an item passes only some of the tests; it will have to be regarded as a marginal example, or even as an item of uncertain status.

Consider the parts of speech, perhaps better called 'lexical categories,' such as 'noun,' 'verb,' 'adjective,' 'preposition' (see Word Classes and Parts of Speech). These can be characterized either in semantic or in formal terms. Semantically, nouns prototypically designate categories of things, i.e., concrete, time-stable, spatially bounded objects (chair, cup, etc.); verbs designate events, involving rapid changes in state (explode, arrive); adjectives designate fairly stable properties of things (hot, young); while prepositions designate a relation, typically a spatial relation, between things (on, at). It is commonly agreed that lexical category cannot be reliably predicted from a word's semantics. In spite of their reference to rapid changes in state, explosion and arrival are nouns, just as good nouns, in fact, as chair and cup. This is because these words exhibit the full range of formal properties typical of the noun category, i.e., they pass all the syntactic tests for nounhood—they pluralize, they take the full range of determiners, they can be modified by an adjective, they can head a subject noun phrase, and so on. There are words which fail to exhibit the full range of properties. Groceries, in I bought some groceries, looks like the plural of a noun grocery—yet there is no such noun (*I bought a grocery), nor can groceries take numeral quantifiers (*five groceries). Syntactically, groceries is a somewhat marginal noun (though still a noun). However, there are words whose status is genuinely unclear. Consider saying, in without my saying a word. Being the complement of a preposition, my saying a word looks like a noun phrase, headed by saying, and saying looks like a noun, in that it takes possessive my. Other determiners, though, are not possible (without *the/*that/*a saying a word). Moreover, saying is followed by what looks like a direct object (a word), and the ability to take a direct object is a distinctly verb-like property, not at all a property of nouns.

Both semantic and formal criteria, then, fail to deliver clear-cut lexical categories. Nevertheless, there is an important sense in which the semantic prototypes have priority. Any language has to provide its speakers with the means to refer to 'things' and 'events,' 'properties' and 'relations,' and the semantic prototypes of the lexical categories, in English and other languages, are understood in such terms. Indeed, without the semantic prototypes, there would be no basis for recognizing the categories 'noun' and 'verb' across the different languages of the world.

URL: https://www.sciencedirect.com/science/article/pii/B0080430767030667

Connected Computing Environment

Staša Vujičić Stanković, ... Veljko Milutinović, in Advances in Computers, 2013

5.2.1 Overview of Resources Developed for the Processing of the Serbian Language

Numerous resources that can be used for the purposes of our research have been developed for the processing of Serbian, but a great number of them have to be modified or improved. Most of these resources were developed in the Unitex system [22], while some of them were adapted for the GATE system [23].

The Unitex system is an open-source corpus processing system based on automata-oriented technology, developed by Sébastien Paumier at the Institut Gaspard-Monge, University of Paris-Est Marne-la-Vallée, in 2001, and still in constant development. It is used all over the world for NLP tasks because it supports a number of different languages and multi-lingual processing tasks.

One of the main parts of the system is the set of electronic dictionaries of the DELA type (Dictionnaires Electroniques du Laboratoire d'Automatique Documentaire et Linguistique, or LADL electronic dictionaries), presented in Fig. 2. The system consists of the DELAS (simple forms DELA) and DELAF (DELA of inflected forms) dictionaries. The DELAS dictionaries are dictionaries of simple words in non-inflected forms, while the DELAF dictionaries contain all inflected forms of simple words. In addition, the system contains dictionaries of compounds, DELAC (DELA of compound forms), and dictionaries of inflected compound forms, DELACF (DELA of compound inflected forms).

Fig. 2. The DELA system of electronic dictionaries.

The morphological dictionaries in the DELA format were proposed in the Laboratoire d'Automatique Documentaire et Linguistique under the guidance of Maurice Gross. The DELA format is suitable for resolving problems of text segmentation and of morphological, syntactic, and semantic text processing. More details about the DELA format can be found in [24].

The morphological electronic dictionaries in the DELA format are plain text files. Each line in these files contains one word entry: the inflected form of the word, the lemma of the word, and some grammatical, semantic, and inflectional information.

An example entry from the English DELAF dictionary is "tables,table.N+Conc:p". The inflected form tables is mandatory; table is the lemma of the entry; N+Conc is the sequence of grammatical and semantic information (N denotes a noun, and Conc denotes that this noun is a concrete object); and p is an inflectional code indicating that the noun is plural.

These kinds of dictionaries are under development for Serbian by the NLP group at the Faculty of Mathematics, University of Belgrade. According to [21], the DELAS Serbian morphological dictionary of simple words currently contains 130,000 lemmas. Most of the lemmas from the DELAS dictionary belong to the general lexicon, while the rest belong to various kinds of simple proper names. The DELAF dictionary contains approximately 4,300,000 word forms with assigned grammatical categories. The DELAC and DELACF dictionaries contain approximately 10,500 and 54,000 lemmas, respectively.

A similar example entry from the Serbian dictionary of simple word forms, with the appropriate grammatical and semantic codes, is padao,padati.V+Imperf+It+Iref:Gms, where the word form padao is the masculine (m) singular (s) active past participle (G) of the verb (V) padati "to fall", which is imperfective (Imperf), intransitive (It), and irreflexive (Iref).
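The DELA entry format is simple enough to parse mechanically. The following helper is hypothetical (it is not part of Unitex) and only illustrates how the fields of the two example entries decompose:

```python
# Hypothetical helper: split a DELAF-style entry
# "inflected,lemma.Code+Code:infl[:infl...]" into its parts.
def parse_delaf(entry: str) -> dict:
    inflected, rest = entry.split(",", 1)
    lemma_and_codes, _, inflections = rest.partition(":")
    lemma, _, codes = lemma_and_codes.partition(".")
    return {
        "inflected": inflected,
        "lemma": lemma,
        "codes": codes.split("+"),  # grammatical and semantic information
        "inflections": inflections.split(":") if inflections else [],
    }

print(parse_delaf("tables,table.N+Conc:p"))
# {'inflected': 'tables', 'lemma': 'table', 'codes': ['N', 'Conc'], 'inflections': ['p']}
print(parse_delaf("padao,padati.V+Imperf+It+Iref:Gms"))
# {'inflected': 'padao', 'lemma': 'padati', 'codes': ['V', 'Imperf', 'It', 'Iref'], 'inflections': ['Gms']}
```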

Another type of resource developed for Serbian is the finite-state transducer. Finite-state transducers are used to perform morphological analysis and to recognize and annotate phrases in weather forecast texts with appropriate XML tags such as ENAMEX, TIMEX, and NUMEX, as we have explained before.

An example of a finite-state transducer graph for the recognition of temporal expressions and their annotation with TIMEX tags is presented in Fig. 3. This finite-state transducer graph can recognize the sequence "14.01.2012." from our weather forecast example text and annotate it with a TIMEX tag, so that it can be extracted in the form "DATE_TIME: 14.01.2012."

Fig. 3. Finite-state transducer graph for extracting and annotating temporal expressions.
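The transducer graph itself cannot be reproduced here, but its recognition behavior can be approximated with a regular expression, since the patterns recognized by finite-state machines are exactly the regular ones. A hedged sketch, not the authors' actual graph:

```python
# Approximate the date-recognition transducer with a regular expression
# and wrap each match in a TIMEX-style annotation.
import re

DATE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\.")

def annotate_timex(text: str) -> str:
    return DATE.sub(lambda m: f'<TIMEX type="DATE">{m.group(0)}</TIMEX>', text)

print(annotate_timex("Vremenska prognoza za 14.01.2012. najavljuje sneg."))
# ... za <TIMEX type="DATE">14.01.2012.</TIMEX> najavljuje sneg.
```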

Another system, more suitable for solving the IE (and CM) problem, is the free, open-source GATE (General Architecture for Text Engineering). The GATE system or some of its components have already been used for a number of different NLP tasks in multiple languages. The GATE system is an architecture and a development environment for NLP applications. GATE has been under development by the University of Sheffield NLP Group since 1995.

The GATE architecture is organized in three parts: Language Resources, Processing Resources, and Visual Resources. These components are independent, so that different types of users can work with the system. Programmers can work on the development of algorithms for NLP or customize the look of the visual resources for their needs, while linguists can use the algorithms and the language resources without any additional knowledge of programming.

Visual Resources, the graphical user interface, will remain in their original form for the purpose of this research. They will allow the visualization and editing of Language Resources and Processing Resources. Language Resources for this research comprise corpora of weather forecast texts in multiple languages and the weather forecast Concept Model that will be built. The required modifications of Processing Resources, especially the modifications needed to process Serbian texts, are presented below.

The IE task in GATE is embedded in the ANNIE (A Nearly-New Information Extraction) system. Detailed information about ANNIE can be found in [25]. The ANNIE system includes the following processing resources: Tokeniser, Sentence Splitter, POS (part-of-speech) Tagger, Gazetteer, Semantic Tagger, and Orthomatcher. Each of the listed resources produces annotations that are required by the next processing resource in the list:

Tokeniser splits text into tokens (words, numbers, symbols, white spaces, and punctuation). The Tokeniser can be used to process texts in different languages with little or no modification.

Sentence Splitter segments the text into sentences using cascades of finite-state transducers. It is also application- and language-independent.

POS Tagger assigns a part-of-speech tag (lexical category tag, e.g., noun, verb) in the form of an annotation to each word. The English version of the POS Tagger included in the GATE system is based on the Brill tagger. It is an application-independent but language-dependent resource, and has to be completely modified for Serbian.

Another language- and application-dependent resource is the Gazetteer, which contains lists of cities, countries, personal names, organizations, etc. The Gazetteer uses these lists to annotate the occurrences of the list items in the texts.

Semantic Tagger is based on the JAPE (Java Annotations Pattern Engine) language [26]. JAPE performs finite-state processing over annotations based on regular expressions, and an important characteristic is that it can use Concept Models (ontologies).

Semantic Tagger includes a set of JAPE grammars, where each grammar consists of a set of phases and each phase represents a set of rules. Each rule has a Left-Hand Side (LHS), which describes the annotation pattern to be recognized, usually by means of the Kleene regular expression operators, and a Right-Hand Side (RHS), which describes the action to be taken after the LHS recognizes the pattern, e.g., the creation of a new annotation. It is an application- and language-dependent resource.

Orthomatcher identifies relations between named entities found by the Semantic Tagger and produces new annotations based on these relations. It is an application- and language-independent resource.

The main goal is to develop the language- or application-dependent resources (Gazetteer, POS Tagger, and Semantic Tagger) for Serbian. The way we chose to solve this problem is to take the previously described resources developed for the Unitex system and adapt them for use in the GATE system. Modifying the Gazetteer lists is a simple process of translation from one language to the other. Building JAPE grammars with ontology support for the weather forecast domain requires the initial development of an appropriate sublanguage and Concept Model, which will be discussed in the following subsection as the subject of the authors' ongoing research. The problem of creating an appropriate POS Tagger is highly complex and will not be presented here in detail.

Briefly, we made a wrapper for Unitex, so that it can be used directly in the GATE system to produce electronic dictionaries for given weather forecast texts, together with a mechanism for generating the appropriate POS Tagger annotation for each word. This solution, although it showed a few disadvantages, could represent a good foundation for building Semantic Taggers based on Concept Models, and IE systems in general.

URL: https://www.sciencedirect.com/science/article/pii/B978012408091100004X

The Metaphysics of Natural Language(s)

Emmon Bach, Wynn Chao, in Philosophy of Linguistics, 2012

4.1 Things and Happenings

One of the peculiarities of the predicate calculus, from the point of view of natural languages, is that there is only one kind of predicate, so that the correspondents of dog and run have the same logical type and form. So also in PTQ: common nouns and intransitive verbs are syntactically distinct but map into the same type in the interpretation, sets of individuals or their intensional counterparts — individual concepts. Similarly for relational nouns, transitive verbs, and so on.

From the point of view of natural language, it seems that there is a pervasive difference between things and happenings, independently of how this is expressed in the syntactic and morphological categorizations of the languages in question.

In English, for example, there is an affinity between certain predicates and happenings:

(7) The war lasted seven years.

(8) Jonathan lasted seven years.

(9) The parade took place in the center of town.

(10) ?Harry took place in the 20th century.

(7) and (9) are perfectly ordinary sentences. With (8) we have to supply some understood amplification: 'as chairman' or the like. (10) is anomalous unless we understand Harry as the name of an event or come up with a metaphorical understanding.

There has been a long-standing discussion about the universality of "parts of speech," including much discussion about whether all languages distinguish nouns and verbs (and sometimes adjectives): [Sapir, 1921; Bach, 1968; Kinkade, 1983; Jacobsen, 1974; Jelinek and Demers, 1994; Jelinek, 1995; Demirdache and Matthewson, 1995; Baker, 2003; Evans and Osada, 2005]. A number of different questions are involved in these debates, among them the following:

(i) Does every language distinguish between nominal and verbal syntactic constructions?

(ii) Do all languages make a distinction between the lexical categories of noun and verb?

There is no doubt about the answer to the first question. Every language must be able to make truth-bearing constructions (assertions). And every language has a means to refer to an unlimited number of different entities. Quine's minimalist language [Quine, 1948] makes do with a single lexical class but must still include variables as a second category (as well as quantifiers and other syncategorematic signs).

Compare Sapir [1921: 119]:

Yet we must not be too destructive. It is well to remember that speech consists of a series of propositions. There must be something to talk about and something must be said about this subject of discourse once it is selected. This distinction is of such fundamental importance that the vast majority of languages have emphasized it by creating some kind of formal barrier between the two terms of the proposition. The subject of discourse is a noun. As the most common subject of discourse is either a person or a thing, the noun clusters about concrete concepts of that order. As the thing predicated of a subject is generally an activity in the widest sense of the word, a passage from one moment of existence to another, the form which has been set aside for the business of predicating, in other words, the verb, clusters about concepts of activity. No language wholly fails to distinguish noun and verb, though in particular cases the nature of the distinction may be an elusive one. It is different with the other parts of speech. Not one of them is imperatively required for the life of language.

(This passage occurs immediately after a paragraph disparaging the idea that parts of speech might be universal. Sapir opts for a language-particular view of such distinctions.)

The second question is altogether different. It is still debatable whether lexical classes like noun, verb, and adjective must be distinguished in every language. What is not in question is that these categories are part of a universal stock of available categories, and that nouns and verbs are specially connected to the syntactic categories associated with naming and referring, and predicating.

There is no doubt that we can refer to the same event using either nominal or verbal expressions. Further there must be logical connections between these references as shown by Terry Parsons' examples [Parsons, 1990:18]:

(11) In every burning, oxygen is consumed.

(12) Agatha burned the wood.

(13) Oxygen was consumed.

What we have considered in this section is related to a question that will come up below when we talk more directly about events and other species of eventualities (Section 4.4).

The pertinent question in the present context is whether there is a significant semantic contrast between common nouns and verbs. A positive answer was given some time ago by Gupta, who pointed out that nouns carry with them a principle of reidentification, so that it makes sense to apply the adjective same to a noun, whereas it is difficult to get that idea across in a purely verbal construction:

(14) Is that the same squirrel / man / explosion / burning that you saw?

(15) ??Did Superman and Clark Kent run identically / the same / samely?

(16) Did Harry run the same run / race / route as Sally?

(Compare [Gupta, 1980], also [Carlson, 1977; Baker, 2003].) Note that this notion of sameness (and difference) can be applied to kinds and ordinary individuals (see next section):

(17) Is that the same dog? Yes, it's a Basenji. / Yes, it's Fido.

URL: https://www.sciencedirect.com/science/article/pii/B9780444517470500068

Word Classes and Parts of Speech

M. Haspelmath, in International Encyclopedia of the Social & Behavioral Sciences, 2001

6 Theoretical Approaches

While the identification and definition of word classes was regarded as an important task of descriptive and theoretical linguistics by classical structuralists (e.g., Bloomfield 1933), Chomskyan generative grammar simply assumed (contrary to fact) that the word classes of English (in particular the major or 'lexical' categories noun, verb, adjective, and adposition) can be carried over to other languages. Without much argument, it has generally been held that they belong to the presumably innate substantive universals of language, and not much was said about them (other than that they can be decomposed into the two binary features [±N] and [±V]: [+N, −V] = noun, [−N, +V] = verb, [+N, +V] = adjective, [−N, −V] = adposition) (see Linguistics: Theory of Principles and Parameters).

Toward the end of the twentieth century, linguists (especially functionalists) became interested in word classes again. Wierzbicka (1986) proposed a more sophisticated semantic characterization of the difference between nouns and adjectives (nouns categorize referents as belonging to a kind, adjectives describe them by naming a property), and Langacker (1987) proposed semantic definitions of noun ('a region in some domain') and verb ('a sequentially scanned process') in his framework of Cognitive Grammar. Hopper and Thompson (1984) proposed that the grammatical properties of word classes emerge from their discourse functions: 'discourse-manipulable participants' are coded as nouns, and 'reported events' are coded as verbs.

There is also a lot of interest in the cross-linguistic regularities of word classes, cf. Dixon (1977), Bhat (1994) and Wetzer (1996) for adjectives, Walter (1981) and Sasse (1993a) for the noun–verb distinction, Hengeveld (1992b) and Stassen (1997) for non-verbal predication. Hengeveld (1992a) proposed that major word classes can either be lacking in a language (then it is called rigid) or a language may not differentiate between two word classes (then it is called flexible). Thus, 'languages without adjectives' (cf. Sect. 6) are either flexible in that they combine nouns and adjectives in one class (N/Adj), or rigid in that they lack adjectives completely. Hengeveld claims that besides the English type, where all four classes (V−N−Adj−Adv) are differentiated and exist, there are only three types of rigid languages (V−N−Adj, e.g., Wambon; V−N, e.g., Hausa; and V, e.g., Tuscarora), and three types of flexible languages (V−N−Adj/Adv, e.g., German; V−N/Adj/Adv, e.g., Quechua; V/N/Adj/Adv, e.g., Samoan).

The most comprehensive theory of word classes and their properties is presented in Croft (1991). Croft notes that in all the cross-linguistic diversity, one can find universals in the form of markedness patterns; universally, object words are unmarked when functioning as referring arguments, property words are unmarked when functioning as nominal modifiers, and action words are unmarked when functioning as predicates. While it is not possible to define cross-linguistically applicable notions of noun, adjective, and verb on the basis of semantic and/or formal criteria alone, it is possible, according to Croft, to define nouns, adjectives, and verbs as cross-linguistic prototypes on the basis of the universal markedness patterns.

For a sample of recent work on word classes in a cross-linguistic perspective, see Vogel and Comrie (2000), and the bibliography in Plank (1997). Other overviews are Sasse (1993b) and Schachter (1985); further collections of articles are Tersis-Surugue (1984) and Alpatov (1990).

URL: https://www.sciencedirect.com/science/article/pii/B0080430767029594

NLP-assisted software testing: A systematic mapping of the literature

Vahid Garousi, ... Michael Felderer, in Information and Software Technology, 2020

2.1 An overview of NLP

NLP is a prominent sub-field of both computational linguistics and artificial intelligence (AI). NLP covers the "range of computational techniques for analyzing and representing naturally-occurring texts […] for the purpose of achieving human-like language processing" [8]. Challenges in NLP usually involve speech recognition, natural-language understanding, and natural-language generation.

NL structures might be rule-based from a syntactic point of view, yet the complexity of semantics is what makes language understanding a rather challenging task. For instance, a study reported that the sentence "List the sales of the products produced in 1973 with the products produced in 1972." yielded 455 different semantic-syntactic parses [8]. This clearly demonstrates the problem for computational processing: while linguistic disambiguation is an intuitive skill in humans, it is difficult to convey all the small nuances that make up NL to a computer. For that reason, different sub-fields of NLP have emerged to analyze aspects of NLP from different angles. Those sub-fields include: (1) Discourse Analysis [9], which analyzes the discourse structure of text or other forms of communication; (2) Machine Translation [10], intended to translate a text from one human language into another, with popular tools such as "Google Translate"; and (3) Information Extraction (IE), which is concerned with extracting information from unstructured text utilizing NLP resources such as lexicons and grammars [11]. Among all NLP approaches, IE is often the most widely used in the software engineering context [12]. We thus provide an overview of concepts in IE.

IE is described by three dimensions: (1) the structure of the content, ranging from free text through HTML and XML to semi-structured NL; (2) the techniques used for processing the text; and (3) the degree of automation in the collecting, labeling, and extraction process. For structured text such as HTML or XML, information retrieval is delimited by the labels or tags, which can be extracted. Free text, however, requires a much more thorough analysis prior to any extraction. In the following, we briefly explain the concepts from IE which are relevant for this paper.

Morphology:

Part of Speech Tagger (POS): A form of grammatical tagging in which each word of a phrase (sentence) is classified according to its lexical category. The categories include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their subcategories.

Stemming: In stemming, derived words are reduced to their base or root forms. For example, the words "am, is, are" are converted to their root form "be".

Named-Entity Recognition (NER): NER assigns semantic types such as person, organization, or localization in a given text [13].

Syntax:

Constituency/Dependency Parsing: Although the terms are sometimes used interchangeably, Dependency Parsing focuses on the relationships between words in a sentence (in its simplest form, the classic subject-verb-object structure), whereas Constituency Parsing breaks a text into sub-phrases. Non-terminals in the constituency parse tree are types of phrases (noun or verb phrases), whereas the terminals are the words in the sentence, yielding a more nested parse tree.

Semantics:

Semantic Role Labeling (SRL): SRL is also called shallow semantic parsing. In SRL, labels are assigned to words or phrases in a sentence to indicate their semantic role in the sentence. These roles can be agent, goal, or result.

Word Sense Disambiguation: Detecting the correct meaning of an ambiguous word used in a sentence.

The above definitions should help the reader understand the NLP concepts, and their usage in software testing, when reading the rest of this paper. As a concrete example, we show in Fig. 1 the outputs of applying several NLP techniques to the following example NL requirement item: If the user enters valid user name and password, then the system should let the user log in. We have used two online tools to do this example analysis: www.corenlp.run and macniece.seas.upenn.edu:4004.

Fig. 1. Applying several NLP techniques on an example NL requirement item.
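For readers who want to reproduce a similar analysis programmatically, the sketch below applies POS tagging and dependency parsing to the same requirement item using spaCy, an assumed stand-in for the online tools named above (not the tooling the paper used):

```python
# POS tags and dependency relations for the example requirement item.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("If the user enters valid user name and password, "
          "then the system should let the user log in.")

for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```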

URL: https://www.sciencedirect.com/science/article/pii/S0950584920300744