Newsletter 2

No 2
May - 1995

The corpora used by CONTRAGRAM
Corpus research and the patterns of prétendre
Bilingual dictionaries and corpus research
Frequency data from corpus research: the case of beslissen
Book notice: Petra CAMPE (1994) Case, Semantic Roles, and Grammatical Relations: A Comprehensive Bibliography
Personalia of the supervisors

The corpora used by CONTRAGRAM

The Dutch language corpus currently used in the CONTRAGRAM research is the INL 5 Million Words Corpus '94 of the Instituut voor Nederlandse Lexicologie (Institute for Dutch Lexicology) in Leiden (The Netherlands). The INL corpus was compiled in 1994 under the direction of P. Van Sterkenburg of the INL. The corpus, which can be used on-line, contains 17 samples of text, 15 of which date from the period 1989-1994. It contains 5 million words taken from written-to-be-spoken as well as written-to-be-read texts. The first section includes newsreels and Queen's speeches; the second includes texts from newspapers, magazines on the environment, linguistics, politics and leisure, as well as books on the environment, linguistics, business and employment, and politics. A description of corpus development and corpus exploitation at the INL is given in Kruyt (1995).

Kruyt, J.G. (1995) Nationale tekstcorpora in internationaal perspectief. Forum der letteren 36, 1: 47-58.

The English language corpus currently used in the CONTRAGRAM research is the Lancaster-Oslo/Bergen (LOB) Corpus, which was compiled in the 1970s under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. It is a 1-million-word corpus containing 500 2,000-word text samples selected from texts printed in Great Britain in 1961. A full description of the corpus is given in Johansson et al. (1978), Hofland and Johansson (1982) or Johansson and Hofland (1989). We use the tagged version of the corpus available on the ICAME Collection of English Language Corpora CD-ROM (1991), distributed by the Norwegian Computing Centre for the Humanities.

Hofland, K. and S. Johansson (1982) Word frequencies in British and American English. Bergen: Norwegian Computing Centre for the Humanities/London: Longman.
Johansson, S. and K. Hofland (1989) Frequency analysis of English vocabulary and grammar. Vol. 1 Tag frequencies and word frequencies. Oxford: Clarendon Press/Oxford University Press.
Johansson, Stig, Geoffrey Leech and Helen Goodluck (1978) Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Oslo: Department of English, University of Oslo.

The French corpora used by CONTRAGRAM are:
1. the corpora compiled at the Department of French Linguistics of the University of Gent which comprise standardized versions and abstracts of scientific prose by J. Rostand (altogether 70,813 words) and of C. de Gaulle's speeches (1958-1965) as they can be found in Moreau R. and J.M. Cotteret (1969) (62,882 words). More information on these corpora can be found in Willems and Cuyvers (1975).
2. Engwall's corpus of extracts from 25 French novels published between 1962 and 1968, available on the microfiches that accompany Engwall (1984). Engwall's corpus totals half a million words.
3. all the articles published in the French newspaper Le Monde between January 1st and March 31st 1994, available on CD-ROM from Research Publication International. Together they contain over 5 million words.

Engwall, G. (1984) Vocabulaire du roman français (1962-1968). Stockholm: Norstedt.
Moreau, R. and J.M. Cotteret (1969) Recherches sur le vocabulaire du général de Gaulle, analyse statistique des allocutions radiodiffusées (1958-1965). Paris: Colin.
Willems, D. and H. Cuyvers (1975) Pour une analyse syntaxique automatique du français. Projet de reconnaissance des formes verbales en français moderne. Travaux de Linguistique 4: 63-86.

Corpus research and the patterns of prétendre

Bart Defrancq

Corpus research is currently enjoying something of a revival within linguistics after years of intuition- or competence-based research, and it may be useful to compare the results of these two approaches. To this end, let us have a look at the French verb prétendre, which has in the past been analysed by intuition-oriented researchers, and has now been 'revisited' by the CONTRAGRAM team. When we look up prétendre in Winfried Busse and Jean-Pierre Dubost's (henceforth B&D) Französisches Verblexicon, which we could consider to be one of the very few real verb valency dictionaries, we find the following 3 entries with 8 patterns:

prétendre1
N - V - à N
      - à Inf 1
prétendre2
N - V - que S Kj
      - Ø Inf 1
prétendre3
N - V - (à N) - N
              - que S Ind Ind/Kjneg/int
              - Ø Inf 1
N - V - N - Adj

The patterns under 'prétendre1' and 'prétendre2' are illustrated with attested examples; only one example lacks a reference. However, the examples originate from very heterogeneous material comprising both Molière and Le Monde (without further specification). The treatment of the two patterns under 'prétendre3' is different: while the different uses of indicative (Ind) and subjunctive (Kj) in the subordinate clause are, with one exception, extensively illustrated with corpus examples (Pilhes, Gide and Aragon), the examples illustrating the other patterns seem to be unattested. Since the authors make no mention of a source, we must assume that they have sprung from their intuitions.

The question is, however, whether these intuitions can be trusted, the answer to which will come from a systematic confrontation of their examples with examples taken from a corpus.

Such a confrontation reveals, first of all, the incompleteness of the proposed patterns. Consider the following example from Le Monde (January-March 1994): Qui peut prétendre à une plus grande efficacité? (LMCD9294R03ini). It appears from the context that the corresponding paraphrase is: Qui peut prétendre qu'il est plus efficace? At first sight, the pattern underlying this example is the first one of B&D's 'prétendre1'. In that case, however, we should identify it with B&D's Le Prince prétend au trône, even though the meaning is radically different (it more or less corresponds with Le Prince veut le trône). The only solution to this problem is to introduce two formally identical patterns N - V - à N, one under 'prétendre1' (trône), the other under 'prétendre3' (efficacité).

While the omission of a meaning a pattern can have is perhaps a rather innocent and easily amended shortcoming, the inclusion of an indirect object in the patterns under 'prétendre3' is more serious. B&D's examples seem to suggest that the indirect object is an optional constituent of the verb: (when the complement is a noun) Je prétends le contraire. Il m'a prétendu juste le contraire and (when the complement is a subordinate clause) Il m'a prétendu qu'il ne le savait pas. Je prétends que cela est juste. However, none of the corpus examples with indicative and subjunctive subclauses contains an indirect object. For the pattern with infinitive under 'prétendre3', which appears to share the indirect object with the other patterns there, there is only one intuition-based example, and it does not contain an indirect object: Il prétend être le fils de Guillaume II.

In fact, a search in the Le Monde corpus (January-March 1994) reveals that out of 322 occurrences of prétendre, not a single one contains an indirect object like the one described by B&D. It appears that prétendre disallows an indirect object, or at least an 'orthodox' one, because two examples were found of an alternative construction with devant, one of them being: Le premier qui ose prétendre devant moi que la Bourse reflète parfaitement l'état économique du pays... je crois que je le griffe violemment (LMCD4194R02mod). The absence of 'normal' indirect objects could be explained by the fact that prétendre, like its Dutch and English equivalents beweren and claim, which do not allow indirect objects either, stresses the veracity of the message and does not perspectivize the individual to whom the message is addressed. Another example of this type is nier-ontkennen-deny.
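The kind of search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used on the Le Monde CD-ROM: the names VERB, CLITIC, concordance and candidate_indirect_objects are hypothetical, the stem pattern is a crude substitute for lemmatisation, and the clitic test merely proposes candidates that still have to be checked by hand.

```python
import re

# Assumption: a crude stem pattern stands in for proper lemmatisation. It also
# matches nouns and adjectives such as "prétendant" or "prétendu".
VERB = re.compile(r"\bprétend\w*", re.IGNORECASE)

# Assumption: a dative clitic directly before the verb, or before one
# intervening auxiliary, marks a *candidate* indirect object. Subject "nous"
# and "vous" are caught as well, so every flagged hit needs manual inspection.
CLITIC = re.compile(
    r"\b(?:(?:me|te|lui|nous|vous|leur)\b|[mt]['’])\s*(?:\w+\s+)?$",
    re.IGNORECASE,
)


def concordance(text, width=30):
    """Return (left context, match, right context) for every hit in text."""
    hits = []
    for m in VERB.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


def candidate_indirect_objects(text):
    """Keep only the hits whose left context ends in a possible dative clitic."""
    return [hit for hit in concordance(text) if CLITIC.search(hit[0])]


# B&D's own examples, quoted in the article, serve as a toy input.
sample = "Il m'a prétendu juste le contraire. Je prétends que cela est juste."
hits = concordance(sample)
flagged = candidate_indirect_objects(sample)
for left, verb, right in flagged:
    print(f"{left}[{verb}]{right}")
```

A filter of this kind would miss full noun phrases introduced by à and the devant construction mentioned above, so it can only narrow down the set of occurrences to be read in context.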

These corpus findings should of course be verified using other types of corpora. It is not impossible that a corpus of oral texts does produce examples similar to the ones of B&D. But, even if such examples could be found, it must be clear that a description which explicitly refers to the presence of an indirect object does not really match the observable facts and that claims about the incompleteness and inaccuracy of corpus examples should be treated with care.

References

Busse, W. and J.-P. Dubost (1983) Französisches Verblexicon. Stuttgart.
Defrancq, B., F. Devos and D. Noël (in prep.) Réflexions sur l'approche contrastive: les verbes prétendre, claim et beweren. Manuscript.

Bilingual dictionaries and corpus research

Dirk Noël

Especially in the United Kingdom, compilers of explanatory dictionaries are increasingly making use of large language databases to ensure that users are presented with 'real English', i.e. with actually used meanings of actually used lexicogrammar, and their publishers go to great lengths to point this out to the potential buyers. As yet, this is less true of bilingual dictionaries. Only one of the three recently published English-French dictionaries devotes a whole paragraph of the preface to corpus research (Hachette Oxford 1994), another one is rather vague about a corpus in one single subclause (Collins Robert 1993), and the third makes no mention of a corpus at all (Larousse 1993). So how 'real' is the English they present?

A small investigation, reported on in Noël, Defrancq and Devos (in prep.), in which the use of the verb consider in the LOB corpus was compared with the treatment of this verb in the English-French sections of the above dictionaries, made clear a) that there is some difference of opinion on the conceptual area covered by consider and how this should be partitioned, and b) that these dictionaries present a very fragmentary picture of the grammatical potential of this verb. For maximal user-friendliness it seems important, however, that bilingual dictionaries present a) all the meanings a word can have, even if closely related meanings do not require different translations (but especially if they do), and b) all the grammatical patterns a word can enter into, especially when certain patterns are restricted to certain meanings. Users of Collins Robert and Hachette Oxford, for instance, would have more difficulty finding an appropriate translation for (1) than would users of Larousse, because unlike the latter the former do not make mention of the meaning of consider that can be paraphrased by 'to look at carefully'.

(1) He leaned away, considering her, his eyes teasing. (LOB P:Romance,love story P13:94)

Users of Larousse, on the other hand, would have more difficulty with a sentence like (2), because, unlike the other two dictionaries, Larousse does not exemplify the CONSIDER + that-clause pattern.

(2) For Kilbracken indeed he had great admiration, but considered that he was timid when it came to the crux. (LOB G:Belles lettres,biog G08:9)

Users of Hachette Oxford might be misled because the to consider sb/sth as sth pattern is only mentioned in connection with the sense glossed as 'envisage, contemplate', and not in connection with the sense it glosses as 'regard', which would lead to a correct interpretation of (3), but could lead to a misinterpretation of (4).

(3) Or, a Middlesex federation of labour parties working within a regional council covering the northern home counties might be considered as a possible solution. (LOB F:Popular lore F16:60)
(4) The importance has been previously stressed of considering the production, handling, storage, packaging and processing of food as links in one continuous chain of operations, the final objective of which is to provide the nation with food of the highest quality at the lowest economic price. (LOB H:Miscellaneous H10:32)

Another finding from our comparison of corpus data with English-French dictionaries is that the frequency of occurrence of the different meanings a word can have is not reflected in the organization of the dictionary entries. Often less frequent meanings are presented before more frequent ones. Greater user-friendliness can be achieved, however, if meanings (and grammatical patterns) are ordered in a way that is consistent with their frequency of use. That way entries would paint a truer picture of the conceptual area covered, and an additional bonus is that uses one is more likely to encounter can be retrieved faster. In this respect Larousse does a much better job than Collins Robert and Hachette Oxford, because it starts the entry for consider with its most frequent meaning, which it glosses 'believe' and which can be paraphrased 'have the opinion that somebody or something is something', whereas the other two end with this meaning.
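A frequency-based ordering of this kind is easy to derive once corpus tokens have been annotated for sense. The sketch below is purely illustrative: order_senses is a hypothetical helper, and the annotated list is an invented placeholder, not a count from the LOB corpus.

```python
from collections import Counter


def order_senses(annotated_tokens):
    """Return the sense glosses sorted from most to least frequent."""
    counts = Counter(annotated_tokens)
    return [sense for sense, _ in counts.most_common()]


# Hypothetical annotation: one sense gloss per corpus token of "consider".
# These labels are placeholders for illustration, not LOB results.
annotated = [
    "believe",
    "look at carefully",
    "believe",
    "envisage, contemplate",
    "believe",
    "envisage, contemplate",
]

entry_order = order_senses(annotated)
print(entry_order)
```

The same ordering could be applied to grammatical patterns within each sense, provided the tokens are annotated for pattern as well.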

Because many readers of Contragram are perhaps more likely to use an English-Dutch dictionary than an English-French one, a few words also about Van Dale (1989), part of a set widely considered to be the leading bilingual dictionaries in Belgium and the Netherlands. Unlike the English-French dictionaries mentioned above, Van Dale does not offer sense glosses, but merely proposes translations, apparently grouped on a semantic basis. The groups proposed for consider could be glossed 1) 'to think carefully about something; to contemplate a possibility', 2) 'have the opinion that', and 3) 'to bear in mind'. Users of Van Dale are therefore likely to have as much difficulty with sentences like (1) as users of Collins Robert and Hachette Oxford. Van Dale also offers some examples of possible patterns, but much more sparingly than its English-French counterparts. For instance, the first group of possible translations is not exemplified at all.

So, when considering bilingual dictionaries against a corpus of real language, it becomes clear a) that there are important gaps in both the semantics and the lexicogrammar they cover, and b) that the organization of entries does not match the logic of corpus data. An often-heard complaint from language teachers is that 'students can't use dictionaries' -- 'they just pick the first translation the dictionary offers without thinking about its appropriateness' -- but the fact is that dictionaries could do more to make it easier for them to make judgements about appropriateness. All the dictionaries we have examined already do a lot in this respect, but there is certainly room for improvement. What is needed is, first of all, greater explicitness about the meanings of entry words and about the relation between meaning and form, and second, an organization of entries on the basis of frequency. Corpus research could do much to help to fill the gaps and is a sine qua non to meet the second need. Compilers of bilingual dictionaries would therefore be well-advised to take corpora seriously.

References

Collins Robert (1993) The Collins Robert French Dictionary. (Third edition.) London and Paris: HarperCollins and Dictionnaires Le Robert.
Oxford Hachette (1994) The Oxford-Hachette French Dictionary. Oxford and Paris: Oxford University Press and Hachette Livre.
Harrap's (1980) Harrap's Standard French and English Dictionary. Edinburgh: Harrap.
Larousse (1993) French-English/English-French Dictionary: Unabridged. Paris: Larousse.
Larousse (1994) Standard French-English/English-French Dictionary. Paris: Larousse.
Noël, Dirk, Bart Defrancq and Filip Devos (in prep.) Considering bilingual dictionaries against a corpus: Do English-French dictionaries present 'real English'? Manuscript.
Van Dale (1989) Groot Woordenboek Engels-Nederlands. (Second edition.) Utrecht/Antwerpen: Van Dale Lexicografie.

Frequency data from corpus research: the case of beslissen

Filip Devos

One of the most general conclusions that can be drawn from Devos et al. (1995) regarding the lexicological analysis of the verb beslissen and its proto-translations in French and English, décider and decide, is that French décider and English decide both cover a much wider conceptual range than Dutch beslissen. They somehow seem to 'enclose' the Dutch verb. The analysis of the proto-translations indeed makes it necessary to consider other Dutch verbs as well, amongst which doen beslissen/overtuigen, bepalen and, most prominently, besluiten. For instance, if one looks at the infinitival complementation of décider (e.g. Ils ont enfin décidé de déménager) and decide (e.g. They finally decided to move), one finds that these verbs are rendered in Dutch by besluiten (e.g. Ze besloten eindelijk te verhuizen), which occurs in several other formal patterns as well. Besluiten even seems to take the lion's share of the conceptual range that is covered by the French and English verbs.

The frequency table below shows the distribution of the formal constructions that can be distinguished for beslissen and besluiten and their proto-translations in French and English. Use was made of the INL Corpus for Dutch, the LOB Corpus for English and a computerized corpus of our own for French.

The complementation of the verb beslissen with an infinitive clause in row 7 in the table is striking, all the more because Dutch grammars and dictionaries do not mention the construction. We do find some examples in our corpus, however: the 'te + infinitive' construction occurs in only 3% of the cases with beslissen, but in 63% of all cases with besluiten. In French and English, on the other hand, infinitival complements occur quite often with décider and decide, in 39% and 41% of all tokens respectively, whereas our French and English corpora contain no examples of conclure and conclude followed by an infinitive. This implies that the overlap between the two verbs is expressed at the formal level in Dutch.

In general, besluiten is much more frequently used than beslissen. It occurs 3.5 times more often than beslissen, while in French and English, this is precisely the other way round: conclure and conclude occur 5 to 6 times less frequently than décider and decide. As besluiten occurs more frequently and is hence more likely to be polysemous, this polysemy may account for the more limited conceptual range of beslissen in Dutch, as compared to its French and English equivalents.

Reference

Devos, Filip, Bart Defrancq and Dirk Noël (1995) Contrastive Verb Valency and Conceptual Structures in the Verbal Lexicon. (in preparation).

Frequency table
construction	BESLISSEN	BESLUITEN	DÉCIDER	CONCLURE	DECIDE	CONCLUDE
1 absolute use	15.91%	2.50%	13.25%	12.90%	2.65%	3.77%
2 +NP	9.85%	5.87%	22.88%	16.13%	7.95%	26.42%
3 +PP	34.09%	14.02%	11.44%	-	6.82%	5.66%
4 +NP +PP	-	2.17%	-	3.23%	-	1.88%
5 +NP +Pinf.	-	-	1.82%	-	-	-
6 +NP +ATTR	-	-	-	-	-	1.88%
7 +Pinf.	2.65%	62.72%	38.56%	-	40.91%	-
8 +Pfin.	37.50%	12.72%	12.05%	67.74%	41.67%	60.39%
8a dat/que/that	[27.27%]	[55.56%]	[75.00%]	[42.86%]	[54.55%]	[78.13%]
8b of/si/whether	[48.49%]	[12.82%]	[25.00%]	[-]	[1.81%]	[-]
8c wh-word	[22.22%]	[5.98%]	[-]	[-]	[30.00%]	[-]
8d reporting clause	[2.02%]	[25.64%]	[-]	[57.14%]	[13.64%]	[21.8%]

(Note: NP = noun phrase; PP = prepositional phrase; Pinf. = infinitive clause; Pfin. = finite clause; ATTR = attribute. The bracketed figures in rows 8a-8d are percentages of the row 8 total for each verb.)
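The internal consistency of the table can be checked with a short script. The sketch below copies the percentages of rows 1-8 from the table, encodes "-" as 0.0 (an assumption made only so that the columns can be summed), and uses hypothetical names (VERBS, TABLE, column_totals). It verifies that each verb's column adds up to 100% and lists the share of infinitival complements discussed in the text.

```python
# Rows 1-8 of the frequency table, in percent; 0.0 stands for "-".
VERBS = ["beslissen", "besluiten", "décider", "conclure", "decide", "conclude"]
TABLE = {
    "absolute use": [15.91, 2.50, 13.25, 12.90, 2.65, 3.77],
    "+NP":          [9.85, 5.87, 22.88, 16.13, 7.95, 26.42],
    "+PP":          [34.09, 14.02, 11.44, 0.0, 6.82, 5.66],
    "+NP +PP":      [0.0, 2.17, 0.0, 3.23, 0.0, 1.88],
    "+NP +Pinf.":   [0.0, 0.0, 1.82, 0.0, 0.0, 0.0],
    "+NP +ATTR":    [0.0, 0.0, 0.0, 0.0, 0.0, 1.88],
    "+Pinf.":       [2.65, 62.72, 38.56, 0.0, 40.91, 0.0],
    "+Pfin.":       [37.50, 12.72, 12.05, 67.74, 41.67, 60.39],
}


def column_totals(table):
    """Sum the percentages of every construction, verb by verb."""
    return [
        round(sum(row[i] for row in table.values()), 2)
        for i in range(len(VERBS))
    ]


totals = dict(zip(VERBS, column_totals(TABLE)))
infinitive_share = dict(zip(VERBS, TABLE["+Pinf."]))

for verb in VERBS:
    print(f"{verb:10s} total {totals[verb]:6.2f}%  +Pinf. {infinitive_share[verb]:5.2f}%")
```

Rows 8a-8d are left out because they subdivide row 8 rather than add to the column total.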

Book notice

Petra Campe (1994) Case, Semantic Roles, and Grammatical Relations: A Comprehensive Bibliography. Amsterdam: Benjamins. vii + 645 pp. (=Case and grammatical relations across languages, 1.) (ISBN 90 272 2811 6)

Dirk Noël

This new bibliographical tool, potentially of great interest to contrastive grammarians, is the first volume in a new series built around the work of the 'Case and Thematic Relations' Research Group of the University of Leuven. Four more titles are already announced: a two-volume collection of papers on the dative, one on the genitive, one on the nominative and accusative, and one on non-nuclear cases.

The book has four parts: a 12-page introduction, a 436-page 'alphabetical list', a 160-page subject index, and a 35-page 'language index and guide'. In the introduction Campe clarifies the coverage of the bibliography, pointing out the terminological confusion that plagues this particular research area, and explaining how the main sub-areas are treated in the subject index: morphological case is simply called case; cross-constituent relations or roles of the syntactic kind (like subject, direct object, indirect object, adjunct, etc.) are treated under grammatical relations and those of the semantic kind (like agentive, directional, patient, etc.) under semantic relations. Also covered are transitivity, valence, complementation, participation, as well as (morpho)syntactic processes like passivization, causativization, reflexivization, compound-forming, etc. and relevant pragmatic and 'textual considerations'. Included as well are publications that tackle the applicability of research into case and thematic relations to disciplines like foreign-language teaching, artificial intelligence and dictionary-compilation.

The 'alphabetical list' is a numbered list of 6643 publications listed alphabetically by author. The numbers do not fulfil a (cross-)referencing function, however, so the extra column needed to include them seems a waste of valuable space.

The subject index is long, detailed and structured, but unfortunately not always accurate and unambiguous. A small anthology of errors and difficulties: under the index entry GRAMMATICAL RELATION there are references to Keenan 86 and Langdon 89, but neither of these is in the alphabetical list; in the same entry Robinson 70-71 refers to a reprint and a translation of a 1968 publication which are not listed separately; under VALENCY we find a reference to Biese 76, but there is only a Biese 1928 and a Biese 1950, and... a Biere 1976; under ERGATIVITY there is a reference to Johnson 76 but there are no fewer than six Johnsons listed, two of whom have 1976 publications.

Finally, the 'language index' is not really an index, since it does not contain references to the list of publications. Campe explains that it is included 'to allow the reader to check if the language that interests him is discussed in the bibliography' (p. 12), but if readers are interested in, say, the dative in the Nilo-Saharan language Ik, they could, much more efficiently, go straight away to DATIVE in the subject index and look for Ik there. The 'language guide' informs them that Ik belongs to the group of East-Sudanic Kuliak languages, but is that the task of a bibliography? In other words, without the 'language index and guide' the book would not have been a worse bibliographical tool, but it would have been 35 pages less expensive.

Nitpicking aside, for research students and professional researchers alike who are embarking on new research projects, specialist bibliographies like this one are always gefundenes Fressen, because they allow one to have a bird's eye view of a particular research area and provide the novice with good places to start. The Leuven group should therefore be applauded for having made its groundwork available to the research community at large.

Personalia of the supervisors

Anne-Marie Simon-Vandenbergen

Studied Germanic languages at the University of Gent and graduated with a dissertation on Dutch equivalents of the English progressive. She obtained an M.A. degree in Linguistics at the University of Reading in 1976, and a PhD from the University of Gent in 1978, with a thesis on British newspaper headlines. She was appointed Professor of English language at the University of Gent in 1987, and is currently also head of the English department at the same university. Her publications are in the area of English grammar and register variation within a functional framework. She is co-editor (with Kristin Davidse and Dirk Noël) of the journal Functions of Language.

Luc De Grauwe

Studied Germanic philology at the University of Gent and graduated in 1970. From 1970 till 1976 he was a Research Associate of the National Fund for Scientific Research. In 1975 he obtained a PhD from the University of Gent with a thesis on the Wachtendonckse Psalmen. In 1976 he became assistant lecturer in the Department of German Linguistics, and since 1991 he has held the chair of German Linguistics. His lectures and publications are in the area of historical linguistics, etymology and Old Germanic languages, as well as lexicology, phraseology, word-formation and contrastive linguistics.

Johan Taeldeman

Graduated from the University of Gent in 1966. In 1976 he successfully submitted his PhD thesis, a generative-phonological study of the Kleit dialect, which he wrote while he was assistant lecturer in the Department of Dutch Linguistics of the same University. From 1977 to 1987 he was a Research Associate of the National Fund for Scientific Research, and since 1988 he has held the Gent chair of Dutch Linguistics. Supervisor of several research projects. National and international publications and lectures on phonology, morphology, word-formation, socio-linguistics and dialectology.

Dominique Willems

Dominique Willems (1948) read Romance philology at the University of Gent, where, upon graduating in 1966, she became assistant lecturer in the Department of French Linguistics. She obtained her PhD in 1975 and after several scholarly visits to France, Canada and the United States she became Professor and Chair of French Linguistics in Gent. She now supervises several research programmes in the areas of descriptive, contrastive and computational linguistics and takes part in international research on oral data and lexicology. She is also Director of the scientific review Travaux de Linguistique and is currently Dean of the Faculty of Arts.
