Newsletter 7

No 7
September - 1996



Corpus update

Dirk Noël

Contragram is proud to announce that the English section of the team has now come of age corpuswise with the acquisition by the University of Gent's Department of English of a licensed copy of the British National Corpus. As reported in the second issue of this newsletter, the corpus that was used up until recently as the primary source of the English data for Contragram's Contrastive Verb Valency Dictionary was the Lancaster-Oslo/Bergen (LOB) Corpus, available on the ICAME Collection of English Language Corpora CD-ROM (1991). Though this is still an excellent research tool, and at one time was even quite revolutionary, it has recently been outstripped by other corpus initiatives with which it can no longer compete in terms of size and scope. One of these is the British National Corpus (BNC), which was created by an academic-industrial consortium consisting of the Oxford University Press, the Longman Group Ltd, Chambers Harrap publishers, the Oxford University Computing Services, the Unit for Computer Research on the English Language of the University of Lancaster and the British Library Research and Development Department, and which forms the basis of the latest editions of the Longman Dictionary of Contemporary English and the Oxford Advanced Learner's Dictionary of Current English. With its 100 million words it is a hundred times larger than the LOB corpus, and whereas the LOB consists of "only" 500 samples of exclusively written texts, printed in Great Britain in 1961 (yes, 35 years ago already), the BNC contains extracts from 4124 modern British English texts of all kinds, both spoken and written, most of which do not date back further than 1975. (For more information on the composition of the BNC, see Burnard 1995 or the BNC Web site.)

Both the size of the BNC and the fact that the samples it contains are more recent than those in the LOB have important advantages for the Contragram research. Our descriptions of verb complements were meant to start from an analysis of between 200 and 250 concordances of each verb. This was a problem when using the LOB, for only the 100 or so most frequently occurring verbs have more than 200 tokens there. Verbs like ARGUE, ENABLE, PREVENT and PUSH, for instance, which do not belong to the top 100, occur only 97 times each. In the BNC, ARGUE occurs 14,864 times, ENABLE 10,237 times, PREVENT 10,557 times and PUSH 9,474 times. The corpus query software that is delivered with the BNC, SARA (SGML-Aware Retrieval Application), makes it very easy to compile a random selection of 250 concordances from these, doing almost all the work for you. It also makes it possible to move beyond the initial sample very quickly to search the whole corpus for specific patterns.
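The random-selection step can be imitated outside SARA. What follows is only a minimal sketch, assuming that the concordance lines for a verb have already been exported to a Python list; the function name sample_concordances, the seed and the placeholder data are hypothetical, and nothing here reflects how SARA itself works.

```python
import random

def sample_concordances(lines, k=250, seed=1996):
    """Return k concordance lines chosen at random without replacement.

    If k lines or fewer are available (the LOB situation for less
    frequent verbs), all of them are returned unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    if len(lines) <= k:
        return list(lines)
    return rng.sample(lines, k)

# Placeholder data standing in for the 14,864 BNC hits for ARGUE.
hits = [f"concordance line {i}" for i in range(14864)]
sample = sample_concordances(hits)
```

Fixing the seed is a design choice rather than a requirement: it lets two researchers work on the same 250 lines.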

The second advantage has to do with the choice of examples, an all-important stage in the compilation of dictionaries, especially those aimed at foreign language learners (see Fox 1988, Williams 1996). If you do not have very many examples of a particular verb to start with, then the chances are high that it will be especially difficult to find natural illustrations of its less frequent patterns that do what dictionary examples should do, i.e. to reinforce the meaning of the verb in a particular pattern by showing how it is actually used, in an appropriate context, with words that are normally associated with it (Fox 1988: 137). Real-life examples often make reference to a mysterious or untypical unspoken situational context or refer to a cultural context the learner is unlikely to know about (Williams 1996), but naturally the more examples you have to choose from the easier it is to avoid these. The more recent cultural context of the BNC texts, as compared to that of the LOB extracts, might not make the selection of examples easier (since it will be more difficult for the selector to spot the cultural bias), but it will allow the researcher to titivate their papers with slightly less dated examples than (1), e.g. (2).

(1)
No details of their mission have been disclosed, but it was reported earlier in Laos that prince Boun Oum was considering asking prince Souvanna Phouma to join his government. (LOB)

(2)
Home Office officials have admitted combing through dusty immigration files from the 1960s looking for evidence that Mr Clinton considered applying for British citizenship to avoid the Vietnam draft. (BNC)

References

  • Burnard, L. (1995) Users Reference Guide for the British National Corpus. Oxford: Oxford University Computing Services.
  • Fox, G. (1988) The case for examples. In J. M. Sinclair (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: HarperCollins Publishers. 137-149.
  • Williams, J. (1996) Enough said: The problem of obscurity and cultural reference in learner's dictionary examples. In M. Gellerstam, et al. (eds.) (1996) Euralex '96 Proceedings I-II: Part II. Göteborg: Göteborg University, Department of Swedish. 497-505.



Forum: The Birmingham Corpus Linguistics Group

Susan Hunston, University of Birmingham

Background

Corpus Linguistics has been a feature of research in the University of Birmingham since 1965 (Sinclair et al forthcoming). Since the early 1980s, research has been carried out in collaboration with COBUILD, a subsidiary of HarperCollins Publishers. The COBUILD project has focused on lexicography and the exploitation of a large corpus to produce English language teaching reference materials. The corpus group in the School of English works on analytic software (with Prof. J. Sinclair) and is extending its area of interest to historical corpora, literary corpora and the humanities generally under its new director, Dr G Barnbrook. Other corpus-based projects at the university include the development of parallel corpora as a translation aid, and the use of corpora in the teaching of English for Academic Purposes.

The main source of corpus information in Birmingham is the Bank of English, which now has over 300 million words of running text. It is composed of a wide range of different types of writing and speech, including books, newspapers and magazines, casual conversation and unscripted radio broadcasts, and includes data from Britain, the United States and Australia. The corpus consists of language which is general rather than technical and it does not include non-standard dialects. Researchers worldwide can gain access to a 50 million word sample of the Bank of English using the telnet protocol. The service is called CobuildDirect; further details may be obtained by e-mail to: direct@cobuild.collins.co.uk

In addition to the Bank of English, which is held at COBUILD, the corpus linguistics group holds a variety of multi-lingual corpora for the purpose of developing language-independent software.

Principles

One of the principles of the corpus linguistics group at Birmingham is that 'a corpus should be as large as possible, and should keep on growing' (Sinclair 1991:18). There are two main reasons for this. The first is that some words, and some usages of some words, are relatively infrequent, so that a corpus of many millions may be needed to get a reasonable number of instances of a particular word or phenomenon (Sinclair 1991:19). The second is that, as language varies considerably, according to speaker, situation and subject-matter, in a small corpus the irregularities, or variation, may mask the regularities. In other words, 'because of the vast range of variation there must be a large enough repository of text to prioritise the recurrent patterns in order to provide a basis for description' (Sinclair et al forthcoming). For these reasons, there is no upper limit set to the Bank of English corpus and the other corpora used by corpus linguistics at Birmingham.

Because of the emphasis on very large corpora, much of the software development in the corpus linguistics group is aimed at the efficient retrieval from a large corpus of only those instances of a word that are of interest to the researcher, and at the automatic processing of text. Various programmes have been developed which allow studies in collocation and sense disambiguation.
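To make the notion of a collocation study concrete, here is a minimal sketch of window-based collocate counting. It is an illustration only, not the Birmingham software: the function name collocates, the span parameter and the toy sentence are all assumptions made for this example.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts

# Toy text; a real study would run over millions of words.
tokens = "he was recovering from a blow while she was recovering from flu".split()
nearby = collocates(tokens, "recovering", span=1)
```

With a span of one word, the only collocates of recovering in the toy text are was and from, each counted twice.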

The principle of analysis that is predominant in the corpus linguistics group at Birmingham is based on Sinclair's notion of the 'idiom principle', and the notion that most words occur in ways that are highly patterned and, to a surprisingly large extent, predictable. The idiom principle states that 'a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments' (Sinclair 1991: 110). The idea that a single language choice typically constitutes much more than a single word generalises the common observation that 'lexical phrases' are an important part of the language user's repertoire into a statement about the language as a whole. It is an insight that incorporates collocation, semantic prosody and all kinds of language patterning. As Sinclair says, 'The new evidence suggests that grammatical generalizations do not rest on a rigid foundation, but are the accumulation of the patterns of hundreds of individual words and phrases' (Sinclair 1991: 100).

One of the key observations arising out of lexicographic studies of large corpora is that sense/meaning and pattern are so completely interrelated that they are to all intents and purposes the same thing. This is most clearly demonstrated when one sense of a word is associated with one pattern and another sense is associated with another. For example, the verb recover has (arguably) two senses - to get back and to get better - although literal and metaphorical variants increase the number of senses recorded in any dictionary. In the first sense, the verb is followed by an object, as in The police recovered the stolen goods, and in the second sense, it is followed by a prepositional phrase beginning with from, as in He was recovering from a blow on the head. To take another example, the senses of the noun face include part of the body, an expression, part of a mountain or building and an aspect of something. Each of these is associated with a particular pattern: with a possessive, my face; with an adjective, a sad face; with a prepositional phrase beginning with of, the north face of the Eiger; with the, an adjective, and a prepositional phrase beginning with of, the unacceptable face of capitalism.
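The link between pattern and sense for recover can be expressed as a toy rule. The sketch below is a deliberately crude heuristic written for this illustration (the function name recover_sense is hypothetical): it looks only at the word after the verb and assumes that anything other than from begins a noun group.

```python
import re

def recover_sense(clause):
    """Guess the sense of RECOVER from the word that follows it."""
    m = re.search(r"\brecover(?:s|ed|ing)?\b\s*(\w+)?", clause.lower())
    if m is None:
        return None              # the verb does not occur in the clause
    following = m.group(1)
    if following == "from":
        return "get better"      # verb followed by a phrase beginning with from
    if following is not None:
        return "get back"        # assumed to be a verb followed by an object
    return None                  # nothing follows the verb: no guess is made

first = recover_sense("The police recovered the stolen goods.")
second = recover_sense("He was recovering from a blow on the head.")
```

A real disambiguator would need a parser or at least part-of-speech tags; the point here is only that the pattern alone already separates the two example sentences.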

The work of Sinclair (e.g. Sinclair 1991) and of Francis (e.g. Francis 1993) emphasises the interrelation of lexis and grammar. As Sinclair points out, each word appears to operate as the centre of one or more more-or-less variable phraseologies: it is these 'phraseologies' that make up the most useful units of lexical-grammatical analysis. Francis (1993: 155) sets out the agenda: 'The end result will be that we will be able to specify all major lexical items in terms of their syntactic preferences, and all grammatical structures in terms of their key lexis and phraseology.' The grammar coding in the Collins Cobuild English Dictionary (1995) (hereafter CCED) and the work that has come out of that coding, described below, goes a considerable way to realising that aim.

Grammar coding in the CCED

Representing patterns

The basis of the system of coding in CCED is that it represents pattern, not structure. It aims to represent, in a simple string of words and abbreviations, the common environments of a word, and is only minimally analytical. To exemplify this, here are some possible clauses, together with a conventional functional analysis and the CCED coding:

He is President of the United States.
Conventional analysis: verb + complement
CCED coding: V n (verb group followed by noun group)

She drank a cup of coffee.
Conventional analysis: verb + object
CCED coding: V n

Pass me that pen.
Conventional analysis: verb + indirect object + direct object
CCED coding: V n n (verb group followed by two noun groups)

He hates me being a dancer.
Conventional analysis: verb + object
CCED coding: V n -ing (verb group followed by a noun group and a clause beginning with a present participle)

Everyone piled out of the car.
Conventional analysis: verb + adjunct
CCED coding: V out of n (verb group followed by a prepositional phrase beginning with out of)

In short, the code consists of a string of words or abbreviations which indicate word-classes, groups or clauses of a particular kind, together with actual lexical items, such as prepositions, where appropriate.
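Because a code is just a string of elements, it is easy to handle mechanically. The following sketch assumes a small gloss table built from the five examples above and treats any element not in that table as a literal lexical item; the names GLOSSES, describe and by_code are hypothetical and are not part of the CCED system.

```python
from collections import defaultdict

# Glosses taken from the examples above. Anything not listed is treated
# as a literal lexical item, which is an assumption of this sketch.
GLOSSES = {
    "V": "verb group",
    "n": "noun group",
    "-ing": "clause beginning with a present participle",
}

def describe(code):
    """Pair each element of a pattern code with its gloss."""
    return [(el, GLOSSES.get(el, "lexical item")) for el in code.split()]

# (clause, conventional analysis, pattern code) as given in the text.
EXAMPLES = [
    ("He is President of the United States.", "verb + complement", "V n"),
    ("She drank a cup of coffee.", "verb + object", "V n"),
    ("Pass me that pen.", "verb + indirect object + direct object", "V n n"),
    ("He hates me being a dancer.", "verb + object", "V n -ing"),
    ("Everyone piled out of the car.", "verb + adjunct", "V out of n"),
]

# Group the clauses by code: two different functional analyses share 'V n'.
by_code = defaultdict(list)
for clause, analysis, code in EXAMPLES:
    by_code[code].append(clause)
```

Grouping by code puts the President and coffee clauses together, although their conventional analyses differ.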

The pedagogic rationale

The first rationale for coding the entries to CCED in this way was to make the grammar coding more transparent for the second language learner. It was argued that what matters to the learner is that a verb is, for example, followed by a noun group. Whether that noun group is a complement or an object is of no importance. Similarly, if a verb, noun or adjective is commonly followed by a particular preposition, it is that fact that is important, not the structural role of the prepositional phrase. Reducing the degree of analysis involved in each coding made inconsistencies in coding less likely and resulted in a grammatical description that was simple to use without over-simplifying the language.

Using a fairly small number of transparent labels, even very complex information about the behaviour of a word can be given in the limited space available in a dictionary. Here, for example, are the codes for the adverb everywhere, with examples:

'n ADV' (adverb follows a noun group): Working people everywhere...
'ADV after v' (adverb follows a verb group, immediately or following another element): We went everywhere together
'be ADV' (adverb follows the verb be): Dust is everywhere
'from ADV' (adverb follows the preposition from): People come here from everywhere...
'ADV cl/group' (adverb is followed by a clause or a noun group or prepositional phrase): ...everywhere in the Algarve

An additional benefit was that words of all word-classes could be coded using the same system. For example, the verb decide, the noun suggestion and the adjective afraid are each sometimes followed by a that-clause, and the codes reflect this similarity: 'V that', 'N that' and 'ADJ that' respectively.

The theoretical rationale

There is more to the coding in CCED than simple pedagogic expediency, however. We would argue that a string of word, group and clause-type indicators is preferable theoretically as well as practically to the traditional labels of 'object', 'complement' etc. The advantages of the CCED-style coding become apparent when a large corpus allows the observation of most behaviours of most word-senses. In this situation, instead of selecting words which fit a particular notional structure, the grammarian must code each of these behaviours. Three main advantages can be noted.

Firstly, lengthy or intricate patterns that do not lend themselves easily to a functional/structural analysis can be represented without difficulty (see also the representations of face and everywhere above). We believe, for example, that there is no single satisfactory analysis of these examples, shown here with their CCED coding (see also Francis 1994):

He talked me into taking out a loan. 'V n into -ing'
It took me a further three hours to finish the job. 'it V n n to-inf' (to-inf = clause beginning with a to-infinitive)
They lied their way out of trouble. 'V way prep/adv' (prep/adv = a range of prepositions and adverbs)

Secondly, the CCED method of coding shows clearly the salience of patterns involving particular prepositions. These are difficult to describe usefully within the confines of a single functional system. Here are some examples, again with their CCED coding:

Forget about being friendly. 'V about -ing'
She complained of a headache. 'V of n'
Someone swapped the blank for a real bullet. 'V n for n'
She had reconciled herself to never seeing him again. 'V n to -ing'

Thirdly, whilst traditional analyses incorporate some useful insights (such as the difference between He made me a good husband and He made me a tea-cosy, both coded 'V n n' in the CCED system), they also demand that distinctions be made that are less than useful. Here are three examples of the pattern 'V n -ing', each with a different functional analysis:

He hates me being a dancer.
verb with object

They watched him cutting the grass.
verb with two objects, or object and object-complement

He passed the time smoking.
verb with adjunct

Although these distinctions are defensible, they do not give as much information about the verbs hate, watch and pass as the pattern 'V n -ing' does. The pattern is both a necessary and sufficient description of those verbs with that behaviour.

Pattern and meaning

One of the most significant insights arising from the grammar coding shown above is that words that share patterns tend also to share meanings. This is the basis of the new Collins Cobuild Grammar Patterns series, the first of which, dealing with verb complementation patterns, has recently been published (Francis et al 1996).

An example of a simple verb pattern is 'V between n'. It is not a common pattern: only 17 verbs in CCED (one with two senses) are coded as having this pattern. Most of these verbs, with the exception of choose and commute, fall into one of four meaning groups:

sorting out relationships: adjudicate, arbitrate, liaise, mediate;
recognising differences: differentiate, discriminate, distinguish;
doing or being two things alternately: alternate, flit, oscillate, vacillate, waver;
having a range of values: hover, oscillate, range, vary.
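The four groups can be held in a simple mapping and checked against the figure of 17 verbs. This is a sketch only: the variable and function names are hypothetical, and reading oscillate as the verb "with two senses" is an inference from its appearing in two groups, not something the text states outright.

```python
from collections import defaultdict

# Meaning groups for 'V between n' exactly as listed above.
V_BETWEEN_N = {
    "sorting out relationships": ["adjudicate", "arbitrate", "liaise", "mediate"],
    "recognising differences": ["differentiate", "discriminate", "distinguish"],
    "doing or being two things alternately": ["alternate", "flit", "oscillate", "vacillate", "waver"],
    "having a range of values": ["hover", "oscillate", "range", "vary"],
}
UNGROUPED = ["choose", "commute"]  # the two exceptions named in the text

def groups_by_verb(groups):
    """Invert the mapping: verb -> list of meaning groups it appears in."""
    index = defaultdict(list)
    for label, verbs in groups.items():
        for verb in verbs:
            index[verb].append(label)
    return dict(index)

index = groups_by_verb(V_BETWEEN_N)
all_verbs = set(index) | set(UNGROUPED)
multi = sorted(v for v, labels in index.items() if len(labels) > 1)

print(len(all_verbs))  # 17
print(multi)           # ['oscillate']
```

The same inverted index would work for larger patterns, where several verbs belong to more than one meaning group.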

A more complex pattern with more verbs is 'V n on n'. There is insufficient space here to give a full account of the pattern, but here are some of the meaning groups:

giving something to someone: bestow, confer, heap, lavish, press, settle;
doing something unpleasant to someone: blame, dump, foist, force, impose, inflict, palm off, perpetrate, spring, thrust;
speaking or writing about a particular topic: advise, compliment, consult, counsel, instruct, lecture, press, question;
putting something somewhere: cast, clip, cram, load, mount, perch, place, plunk, prop, put, set, sprinkle;
focusing attention or feelings: centre, concentrate, direct, fasten, fix, focus, pin, project, turn;
directing a weapon: fix, pull, train, turn;
striking one thing against another: bang, catch, drum, hammer, rap, slap, snag, strike, wipe;
touching or hitting someone on a part of their body: beat, catch, clap, clout, hit, pat, slap;
writing or etching something: carve, engrave, etch, imprint, inscribe, print, write;
basing one thing on another: base, build, ground, predicate;
gambling: bet, gamble, stake, wager;
spending, saving or wasting resources: blow, save, spend, waste;
modelling one thing on another: model, pattern.

Similar meaning groups emerge for each of the patterns noted in the CCED. Although at the moment only the verbs in CCED have been fully investigated in this way, the CCED database provides a resource to study all the word-classes in a similar manner.

Implications

As noted above, the observation that every lexical item has a describable pattern or patterns, and that patterns are restricted in the lexical items that occur in them, indicates that the traditional dichotomy between lexis and grammar cannot be upheld. The observation that words that share patterns tend also to share meaning indicates that there is no dichotomy between pattern and meaning.

We believe that the pattern notation, with suitable multi-lingual adaptations, provides a useful metalanguage for coding and classifying lexical items of any word class in any language, thus allowing cross-linguistic comparisons to be made.

References

  • Francis G 1993 'A corpus-driven approach to grammar: principles, methods and examples' in Baker M et al (eds) Text and Technology. Amsterdam: Benjamins.
  • Francis G 1994 'Grammar teaching in schools: what should teachers be aware of?' Language Awareness 3, 221-236
  • Francis G, Hunston S and Manning E 1996 Collins Cobuild Grammar Patterns 1: Verbs. London: HarperCollins.
  • Sinclair J 1991 Corpus Concordance Collocation. Oxford: Oxford University Press.
  • Sinclair J, Jakobs O and Lawson A forthcoming 'Computers and Humanities'.



Book review

Bart Defrancq

Fuchs, C. (1996) Les ambiguïtés du français. Collection L'essentiel français. Gap/Paris: Ophrys. 164 pp. (ISBN 2-7080-0772-6)

C. Fuchs's book is intended as a general introduction to the phenomenon of linguistic ambiguity, one addressed not only to those initiated in linguistics but also to a wider audience, since ambiguity sets its traps for everyone. The issues are therefore not treated in detail: the author has instead sought to offer as broad a survey as possible, covering the definition of ambiguity, its delimitation from other kinds of equivocation, and the strategies speakers use to avoid, create or resolve ambiguity. These theoretical questions are grouped in the first part of the book ('Caractériser l'ambiguïté', characterising ambiguity). The examples and the classification of the different types of ambiguity appear in the second part ('Classer les ambiguïtés', classifying ambiguities).

C. Fuchs defines ambiguity as the association of several mutually exclusive meanings with a single expression, at one and the same level of analysis. At all levels of analysis below the sentence ('énoncé phrastique'), ambiguities so defined are merely virtual, because they can be resolved at higher levels with the help of the linguistic context. Only when the plurality of meanings persists at the level of the sentence can one speak of actual ambiguity. Whether or not it is resolved at higher levels no longer affects its status. The reasons for this choice are not clearly set out (invoking the autonomy of the predicative relation at sentence level seems to exclude certain categories of complex sentences, even though there are several examples of that type), but one suspects that it is explained by a good dose of pragmatism.

The disambiguating role of context is subsequently put into perspective somewhat. First, the author shows that context, through the information it contributes, can have the effect of creating ambiguities. Second, several other factors make it possible to cancel ambiguity: on the one hand prosody (on which the author rightly insists) and punctuation, on the other extra-linguistic knowledge. It is notably thanks to the latter that the human receiver is able to resolve most ambiguities, or does not even notice them. C. Fuchs illustrates how the machine receiver, which lacks this knowledge, can find itself unable to cancel an apparent ambiguity. As to the question of how the human receiver brings this linguistic and extra-linguistic knowledge to bear in a case of ambiguity, the author refrains from taking a position in the debate between modularists and interactivists. On the sender's side, the emphasis is on the strategies used to avoid ambiguity or to create it deliberately.

In the second part of the book, C. Fuchs undertakes the classification of a large number of examples of ambiguity, proceeding from the most concrete to the most abstract. Ambiguities created at the morphological, lexical and syntactic levels are treated before predicative, semantic and pragmatic ambiguities. While the first levels are characterised by difficulties of segmentation and characterisation, at the other levels, predicative and semantic, the ambiguities are presented as relational in type. One is therefore somewhat surprised not to find this label applied to syntactic ambiguities, but this is no doubt because the author confines herself to what she calls "syntagmatic syntax". Moreover, without wishing to dwell on the details of a classification that has the merit of clearly representing the different types of ambiguity, it seems that certain types belonging to the pragmatic level differ only very slightly from certain semantic ambiguities. For example, the semantic ambiguity in un texte ne pose aucun problème ('a text poses no problem': it is not problematic to read / it raises no problematic question) does not differ substantially from the pragmatic ambiguity in Cette table, elle bouge ('This table, it moves': Is it wobbling? Does it need to be wedged? / Can it be moved? Is it movable?).

These few details do not substantially detract from the quality of the book as an introduction to the subject. Although some areas are only touched upon, the author always arouses the reader's curiosity by referring to more in-depth studies. She has moreover taken care to write in a simple style that is pleasant to read, which makes certain misprints stand out all the more sharply. It would be desirable to remove these in a second edition, which the book unquestionably deserves.

