Newsletter 3

No 3

September - 1995

The proto-equivalent
Do bilingual dictionaries suffer from a false friends syndrome?
Forum: Hans Paulussen (University of Namur) Compiling a trilingual parallel corpus

The proto-equivalent

Bart Defrancq

Preparatory research for the CONTRAGRAM Dutch-French-English Contrastive Verb Valency Dictionary (CVVD) has revealed that a contrastive study benefits enormously from a one-to-one comparison of the syntactic and collocational environments of single lexical items that are treated as 'prototypical equivalents' - prototypical, because total equivalence cannot be assumed to exist. The CVVD only offers alternative (less typical) translations when the so-called 'prototypical equivalent' cannot be used. This will give the user of the dictionary a better insight into the semantic and syntactic scope of each of the examined lexemes, because the full lexicogrammatical potential of each of the proto-equivalents in the three languages is represented. What the user will be confronted with could look something like the following figure, where aaaaaaaaaa, bbbbbbbbbb and ccccccccccc are the prototypical equivalents in L1, L2 and L3 respectively, and the lexemes preceded by 'º ' are necessary cross-references to other entries in the dictionary when the proto-equivalent cannot be used in one or two of the three contrasted languages.

The *CVVD* entry
Dutch	French	English
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa	bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb	ccccccccccc ccccccccccc ccccccccccc
aaaaaaaaaa aaaaaaaaaa	bbbbbbbbbb º þþþþþþþþþþ	º ©©©©©© cccccccccc
º ââââââââââ º ââââââââââ	bbbbbbbbbb bbbbbbbbbb	º çççççççççç cccccccccc

Traditional bilingual dictionaries, on the other hand, are based on a one-to-many relation between the items in the source language and those in the target language, and as a result they suffer from a considerable lack of systematicity. The typical scope description of bilingual dictionaries looks like this:

The bilingual dictionary entry
Source Language	Target Language
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa	ccccccccccc ccccc/çççç çççççççççç çççççççççç ©©©©©© ©©©©©©

Here none of the suggested alternatives takes precedence over any of the others. The most evident effect of this one-to-many approach on the macrostructure of the dictionary is that it will be divided into two separate volumes, an L1/L2 volume and an L2/L1 one. The effects on the microstructure of the lemmas are less visible but no less subversive. By allowing themselves to multiply the possible translations, bilingual lexicographers exempt themselves from examining the possibilities of any of them exhaustively. In their articles on the role of grammar and the problem of false friends in bilingual dictionaries, Dirk Noël et al. (1995) and Dirk Noël (Contragram 2 and 3) explore the consequences of the one-to-many approach of bilingual dictionaries.

Existing contrastive verb valency lexicons (like Apresjan and Pall (1982) and Schwarze (1985)) do little more than refine this one-to-many approach, their macrostructure being copied from those of traditional bilingual dictionaries.

Because the identification of prototypical equivalents (or 'proto-equivalents') is such an essential stage in the one-to-one approach of the Contragram CVVD - perhaps even more so than in other contrastive studies of the lexicon - it is worthwhile to reflect on the criteria employed in this identification. Since the least one can expect from a dictionary is that it offers translations for lexemes of a given language, the one-to-one approach will have to provide the most relevant one, disqualifying in some way all the other possible translations. Among the criteria proposed by Bouton (1976), those based on structural equivalence are to be ruled out, one of the purposes of the CVVD being to describe structural contrasts. Pragmatic equivalence, on the other hand, is difficult to pin down. This leaves us with semantic equivalence, for which we developed two parameters: a translational parameter and a frequency parameter.

1. The translational parameter defines the proto-equivalents as follows: the single lexemes in L2, L3... that can translate the widest range of structures the L1 lexeme can enter into are considered to be its proto-equivalents, and all other translations are qualified as non-proto-equivalents.

According to Noël et al. (1995), for instance, English consider displays 11 different patterns, of which 7 can be translated by French considérer, 3 by envisager and réfléchir. In other words, the translational scope of considérer in relation to consider is the largest, and considérer will be selected as proto-equivalent of consider.

2. As frequency data based on corpus research are playing an increasingly important role in the Contragram research, the proto-equivalent search procedure could also take into account the frequency criterion. The frequency criterion defines proto-equivalents as the single lexemes in L2, L3, ... that can be used to translate the patterns that occur most frequently in the L1 corpus.

Again according to Noël et al. (1995), more than 33% of the occurrences of consider have the meaning 'have the opinion that' and 17% mean 'to look at', two meanings that are translated almost exclusively with French considérer, which means that almost 1 out of 2 occurrences of consider will be translated with considérer. No other equivalent reaches this percentage. The frequency parameter, therefore, also decides in favour of considérer.

However, even though in most cases the two parameters yield identical results, as illustrated by consider/considérer, there are exceptions to the rule. Consider the Dutch verb voorstellen, which has 17 patterns. In the translation grid displayed in the table below, propose and imagine share first place: each can translate 6 patterns of voorstellen.

*voorstellen* vs. *propose/imagine*
Subject	voorstellen	1st compl.	2nd compl.	English equivalent
NP	voorstellen	Ø	Ø	introduce
NP NP	voorstellen zich voorstellen	NP NP	Ø	propose>amount>represent>introduce imagine
NP NP NP	voorstellen zich voorstellen zich voorstellen	NP NP NP	PP'aan' PP'van' PP'bij'	propose>introduce expect conceive
NP NP	voorstellen zich voorstellen	NP NP	als NP als NP	present>introduce imagine
NP NP	voorstellen voorstellen	NP NP	Adj als Adj	present present
NP	zich voorstellen	NP	Adv	imagine
NP NP	zich voorstellen voorstellen	(PP)	Pfin Pfin	imagine propose/suggest
NP	voorstellen	Ø	P"fin"	propose/suggest
NP	zich voorstellen	Ø	Pwh	imagine
NP NP	voorstellen zich voorstellen	(PP)	Pinf Pinf	propose/suggest imagine=propose

(Symbols used are: ">": more frequent than; "=": as frequent as; "/": synonymous)

The translational parameter would in this case be difficult to apply. How can one decide between propose and imagine? The frequency parameter can be of assistance here. It would clearly establish that the patterns of voorstellen translated with propose (some 47,8%) are far more frequent than the others (imagine less than 30%, amount to less than 10%, the others less than 5%). voorstellen and propose would then be considered as proto-equivalents. A thorough study of a bilingual corpus would probably confirm this proto-equivalence. This would in turn prove the frequency parameter to be more empirically relevant than the translational parameter.
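The tie between propose and imagine, and the way the frequency parameter resolves it, can be sketched in a few lines of Python. This is only an illustration, not the Contragram procedure itself. The list PATTERNS is a hand transcription of the English column of the table above (one entry per pattern of voorstellen, ignoring the ">" ordering), and the function names are hypothetical. For the frequency step only the figure quoted for propose is exact. The value for imagine is the upper bound "less than 30%" given in the text.

```python
from collections import Counter

# English candidates for each of the 17 patterns of "voorstellen",
# transcribed by hand from the table (ordering symbols are ignored).
PATTERNS = [
    ["introduce"],
    ["propose", "amount", "represent", "introduce"],
    ["imagine"],
    ["propose", "introduce"],
    ["expect"],
    ["conceive"],
    ["present", "introduce"],
    ["imagine"],
    ["present"],
    ["present"],
    ["imagine"],
    ["imagine"],
    ["propose", "suggest"],
    ["propose", "suggest"],
    ["imagine"],
    ["propose", "suggest"],
    ["imagine", "propose"],
]

# Shares quoted in the text; the value for "imagine" is an upper bound
# ("less than 30%"), not a measured figure.
FREQ_SHARE = {"propose": 47.8, "imagine": 30.0}


def translational_scope(patterns):
    """Count, for each candidate, how many L1 patterns it can translate."""
    return Counter(eq for pattern in patterns for eq in set(pattern))


def select_proto_equivalent(patterns, freq_share):
    """Apply the translational parameter; break ties with the frequency share."""
    scope = translational_scope(patterns)
    best = max(scope.values())
    tied = sorted(eq for eq, n in scope.items() if n == best)
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda eq: freq_share.get(eq, 0.0))


scope = translational_scope(PATTERNS)
print(scope["propose"], scope["imagine"])            # both cover 6 patterns
print(select_proto_equivalent(PATTERNS, FREQ_SHARE))  # frequency breaks the tie
```

Using the frequency share only as a tie-breaker is one possible way of combining the two parameters; the text itself does not prescribe a fixed order of application.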

Since the CVVD is meant to be a multidirectional description of the verbal lexicon, it is important that the linguistic instruments with which it is compiled are language independent. In the case of voorstellen / propose-imagine, the translational scope criterion could yield a different result depending on whether Dutch or English is chosen as L1: with Dutch as L1 the translational parameter makes it impossible to choose between propose and imagine, whereas with English as L1, leaving aside 3 idiomatic uses which result from an elliptic structure, propose can only be translated with voorstellen. The frequency criterion proves to be more stable in this respect. It will select propose as proto-equivalent of voorstellen and vice versa.

The arguments of empirical relevance and multidirectional stability seem to be in favour of the frequency criterion.

However, problems may arise with the frequency parameter as well. When one looks at the frequency breakdown of the verb beslissen in Contragram 2, p. 11, one will come to the correct conclusion that the patterns translated in French with décider constitute the majority of its uses (NP beslissen PP: 34%; NP beslissen Pfin: 37,5%; NP beslissen Pinf: 2,5%). The proto-equivalence between beslissen and décider is established. The problem is that when one takes into account the frequency breakdown of the Dutch verb besluiten (Contragram 2, p. 11), the pattern representing the majority of its uses (NP besluiten Pinf.: 62,7%) is also translated with décider in French, which would not only lead us to a many-to-one relationship between L1 and L2, but would also have disastrous consequences when taking French as L1 and Dutch as L2. For French conclure must almost always be translated by Dutch besluiten. The description would then again be language dependent. However, in cases like this the translational parameter allows us to correct the results of the application of the frequency parameter by showing that the translational scope of décider in relation to besluiten does not exceed the one pattern described above, while the translational scope of conclure consists of 4 structures (absolute use, NP besluiten NP, NP besluiten NP PP, NP besluiten Pfin).

It seems, therefore, that only the interaction between the two parameters can lead to felicitous results. However, their application should in our view be completed by a thorough investigation of multilingual corpora - like the one outlined in the present Contragram issue by Hans Paulussen - in order to establish definitively that the joint frequency of proto-equivalents in translation is in fact higher than the joint frequency of non-proto-equivalents. The study of D. A. Kibbee (1995) on the equivalence in translation between seem and sembler is an excellent example of such an approach.

References

Apresjan, J. and E. Pall (1982) Orosz ige - magyar ige. Vonzatok és kapcsolódások, köt. 1-2. Budapest: Tankönyvkiadó.
Bouton, L. F. (1976) The problem of Equivalence in Contrastive Analysis. IRAL 14, 2: 143-163.
Kibbee, D. A. (1995) Assertion/atténuation, subjectivité/objectivité en anglais et en français: 'seem'/ 'sembler'. In M. Ballard (ed.) Relations discursives et traduction. Lille: PUF. 73-87.
Noël, D., B. Defrancq and F. Devos (1995) Considering bilingual dictionaries against a corpus: Do English-French dictionaries present "real English"? Lexikos 5. (to appear)


Do bilingual dictionaries suffer from a false friends syndrome?

Dirk Noël

One of the spin-offs of the CONTRAGRAM Dutch-French-English Contrastive Verb Valency Dictionary (CVVD) is that it sets out whether, and to what extent, verbs that look alike and which are therefore possible 'false friends', actually need to be regarded as such. Because each of its entries fully confronts one single Dutch verb with one single French and one single English verb with which it shares most of its meanings, offering an exhaustive list of all their meanings and all the patterns that can realize each meaning (but also incorporating cross-references to synonyms), the CVVD very precisely spells out what the similarities and differences are between these - what we have termed - 'proto-equivalents'.

Naturally, the traditional bilingual dictionary's brief is more modest, but one of the things coming out of a comparison of general purpose dictionary entries with those in the CVVD is that the former's fragmentariness (see Bilingual dictionaries and corpus research in Contragram 2 and Noël et al. 1995) could create the erroneous impression that certain cross-linguistically related verbs have less in common, both in terms of their semantics and their syntactic patterning, than is actually the case.

A good example is the French-English pair considérer-consider. These share at least four meanings, viz. i) "have the opinion that" (e.g. 1-2), ii) "think carefully about/study something" (3-4), iii) "bear in mind" (5-6), and iv) "look carefully at sb/sth" (7-8).

(1) I fully understand and appreciate your desire not to give reasons in general, but on this occasion you might consider it worth your while to do so. (LOB H:Miscellaneous H19:63)
(2) Et, si l'on considère l'évolution comme inéluctable, ce serait au moins à tenter de maîtriser. (LMCD 1194R4voy)

(3) The more one looks at this, the more one feels that the thing which British sociologists need is to consider the implications of Weber's work for their own. (LOB G:Belles lettres,biog G67:49)
(4) Il s'agira de considérer les effets d'une réalité et non plus de mesurer les limites à apporter à des virtualités. (LMCD 1819412man)

(5) Nevertheless, cross draughts are so variable and unreliable that the assistance they may provide should not be considered when designing a system. (LOB J:Learned,scientific J75:8)
(6) Si l'on considère qu'environ 25.000 gènes sont exprimés au cours de la vie d'une plante, on peut aussi espérer, dans un avenir proche, disposer du premier catalogue complet des gènes d'une espèce végétale. (LMCD 1219413her)

(7) He paused to consider her. (LOB P:Romance,love story P08:159)
(8) Il me considérait, me disait un bonjour froid. (LMCD 2519416mor)

However, they are not always offered as translational equivalents in French-English/English-French dictionaries. Oxford-Hachette (1994), for instance, translates I consider her (to be) a good teacher with je pense que c'est un bon professeur rather than with je la considère comme un bon professeur, and only illustrates considérer in the part of the entry for consider which it glosses "regard" (our "have the opinion that") in combination with a finite clause: to consider that becomes considérer/estimer que. In the part of the entry glossed "take into account/bear in mind" considérer is not mentioned at all, and when you consider that, for example, receives the translation quand on songe que rather than the perfectly possible quand on considère que.

Similar omissions can be observed in Harrap's (1980) and Larousse (1993). In Harrap's the examples illustrating the "bear in mind" meaning of consider are rendered with avoir égard à, regarder à, montrer de la considération, ménager, tenir compte and penser que, and in the same part of the entry in Larousse considérer is only used in the translation of all things considered (tout bien considéré), the other examples using prendre en considération, tenir compte de and penser à. Other gaps are:
- the absence of considérer in the part of the entry in Larousse glossed "contemplate - face, picture, scene" (our "look carefully at sb/sth"): only examiner and observer are mentioned here;
- the absence of the considérer que construction in the list of examples illustrating the "have the opinion that" meaning in Harrap's: we consider that he ought to do it becomes à notre avis il doit le faire instead of nous considérons qu'il doit le faire.

Whether these omissions are simply the result of the selectiveness imposed on compilers of bilingual dictionaries, or whether they can be attributed to an exaggerated anxiety about false friends on their part, does not alter the fact that they result in an incomplete picture of the semantic and syntactic correspondence of pairs like consider-considérer, and this may lead the naive user to conclude that they are indeed, to a greater extent than is actually the case, false friends. Only a non-selective contrastive approach like the one adopted for the CVVD (see A Dutch-French-English contrastive verb valency dictionary in Contragram 1) can establish the true nature of their friendship.

References

Harrap's (1980) Harrap's Standard French and English Dictionary. Edinburgh: Harrap.
Larousse (1993) French-English/English-French Dictionary: Unabridged. Paris: Larousse.
Noël, D., B. Defrancq and F. Devos (1995) Considering bilingual dictionaries against a corpus: Do English-French dictionaries present "real English"? Lexikos 5. (to appear)
Oxford-Hachette (1994) The Oxford-Hachette French Dictionary. Oxford and Paris: Oxford University Press and Hachette Livre.


Compiling a trilingual parallel corpus

Hans Paulussen

In order to study the syntactic and semantic structure of the preposition and particle up/on and its prototypical equivalents in French (sur) and Dutch (op), we have compiled a trilingual parallel computer corpus of 2,000,000 words, in which all texts are aligned at paragraph level. Linked with this source corpus, a trilingual database TRIPTIC (TRIlingual Parallel Text Information Corpus) has been developed to facilitate the linguistic analysis of selected sample sentences. Below we give a brief description of the corpus content, the compilation process and the tagging structure.

1. Content

The corpus is divided into two subcorpora (fiction and non-fiction), each containing approximately 1,000,000 words. The non-fiction subcorpus consists of extracts from the UNESCO Courier (1993) and from the verbatim transcription of the plenary sessions of the Debates of the European Parliament (1993) (further called the Debates subcorpus). The fiction subcorpus consists of extracts from 12 original fiction texts, four per language. From each of these works, all published between 1982 and 1992, the first chunk of approximately 25,000 words (up to the end of a sentence) was selected. Twelve original extracts translated into two languages gives a total of 36 extracts.

2. Compilation

Thanks to the increasing automation of text creation, we have been able to obtain all non-fiction texts in electronic form. For the fiction part, on the other hand, we preferred scanning the printed version, instead of stripping an intricately encoded typesetting tape. In both cases, some cleaning was required.

2.1. Scanning

Scanning has arrived at a stage where it can compete with retyping a text. A professional typist's work is certainly far less error-prone than the average scanning output, but in both cases re-reading the print-out is necessary. Moreover, an OCR (Optical Character Reader) scans faster than a professional typist can type.

Scanning of textual material has improved a lot over the last ten years, although there are still a number of scanning errors. Typical examples are: the confusion between the letter "l" and the digit "1", the recognition of "m" instead of "rn", the interpretation of capital "D" as the combination "1)" as in "1)id you ...". A separate problem involves the recognition of hyphens at the end of lines: these were always interpreted as split words, which in 95% of cases was the right guess. The remaining 5% had to be recovered by hand. These errors could only be detected by carefully re-reading the output, which was done by three native speakers, one for each of the three languages. Fortunately, on the whole, the number of errors was quite small.
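The treatment of line-final hyphens described above can be illustrated with a small Python sketch. It is not the routine used by the scanning software. It simply applies the same rule (a hyphen at the end of a line always marks a split word), and the function name is hypothetical. The second call shows the kind of case that had to be recovered by hand.

```python
import re


def join_split_words(text):
    """Treat every hyphen at the end of a line as a word split and rejoin it."""
    # A word character, a hyphen, a line break, optional indentation,
    # then another word character: drop the hyphen and the line break.
    return re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)


# The usual case: the hyphen only marks a line break inside a word.
print(join_split_words("a trilingual paral-\nlel corpus"))

# The problematic case: a genuine hyphen that happens to fall at the line end
# is lost and has to be restored manually.
print(join_split_words("a well-\nknown author"))
```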

2.2. Cleaning

Though the non-fiction corpus was available in electronic form, there were a number of compatibility problems, as the texts were created with different word processors and/or on different types of computer. Each word processor uses its own typographic encoding system, incompatible with others. Moreover, file formats are very often not directly transferable between different computer platforms.

The only practical solution for maintaining transparency and compatibility consists in storing the texts as ordinary ASCII files, which in most word processors is called "text only". Unfortunately, in this case all character and some paragraph formatting is lost. In some cases, however, the character formatting gives important information, especially on prosody. Only when ambiguity occurred in the character formats have we retained the formatting information by changing it into single or double quotes, depending on the context. In addition to checking the character formats, some cleaning was carried out to strip redundant formatting material: e.g. extra blanks and tabs, indented margins, page breaks, etc. Most of the cleaning was done automatically, by using WordPerfect macros and SED and AWK procedures. (SED and AWK are string manipulation languages originally written for UNIX, but now also found on DOS and Macintosh platforms.)
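The kind of automatic cleaning listed above can be sketched as follows. The original work used WordPerfect macros and SED and AWK procedures. Python is used here purely for illustration, and the function name and the choice to represent a page break as a form feed character are assumptions. Blank lines are deliberately preserved (reduced to one), since they delimit paragraphs.

```python
import re


def clean_text(raw):
    """Strip redundant formatting from a plain-text file."""
    # Assumption: page breaks appear as form feed characters.
    text = raw.replace("\f", "\n")
    cleaned_lines = []
    for line in text.split("\n"):
        line = re.sub(r"[ \t]+", " ", line)   # collapse extra blanks and tabs
        cleaned_lines.append(line.strip())    # remove indentation and trailing blanks
    cleaned = "\n".join(cleaned_lines)
    # Keep at most one blank line between blocks of text.
    return re.sub(r"\n{3,}", "\n\n", cleaned)


print(clean_text("  The\tman   lies\f\n\n\n  on the bed. "))
```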

2.3. Paragraph alignment

With a view to contrastive research, the texts also had to be aligned for the three languages. The alignment was done at paragraph level, which means that each nth paragraph in one language is aligned with the same nth paragraph in the other two languages. In other words, we can select any paragraph in one language and automatically retrieve the corresponding paragraphs in the other two languages. Most of the alignment was done by hand, with the help of AWK procedures and word processing macros.

A paragraph is simply defined as a block of lines ending in a blank line. It is clear that here the notion of paragraph is not defined linguistically but simply on a computational basis. In this way, a title, which takes only one line, is also considered a paragraph.
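A minimal Python sketch of this computational definition and of the nth-paragraph alignment is given below. The alignment itself was largely done by hand with AWK procedures and macros, so the code only shows what the resulting structure allows: retrieving the corresponding paragraphs in the other two languages by position. The function names and the three short sample texts are invented for the illustration.

```python
def split_paragraphs(text):
    """Return the paragraphs of a text: blocks of lines ending in a blank line."""
    paragraphs, block = [], []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:                       # a blank line closes the current block
            paragraphs.append("\n".join(block))
            block = []
    if block:                             # last block without a closing blank line
        paragraphs.append("\n".join(block))
    return paragraphs


def align(english, french, dutch):
    """Pair the nth paragraph of each language; fail if the counts differ."""
    en, fr, nl = (split_paragraphs(t) for t in (english, french, dutch))
    if not (len(en) == len(fr) == len(nl)):
        raise ValueError("paragraph counts differ: the files are not aligned")
    return list(zip(en, fr, nl))


# Invented sample data: a one-line title counts as a paragraph.
EN = "Title\n\nThe man lies on the bed.\n"
FR = "Titre\n\nL'homme est couché sur le lit.\n"
NL = "Titel\n\nDe man ligt op het bed.\n"

aligned = align(EN, FR, NL)
print(aligned[1])   # the second paragraph in all three languages
```

Raising an error when the paragraph counts differ is a design choice of the sketch: a mismatch signals that a file still needs manual correction before it can be queried by position.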

On the basis of the notion of paragraph, we can define the corpus structure as follows: a corpus consists of a number of subcorpora, each containing a number of corpusfiles. A corpusfile consists of paragraphs. The corpus can then be formalized in the following way, where ":=" means "can be rewritten as" and where "+" means "1 or more":

corpus     :=  subcorpus+
subcorpus  :=  corpusfile+
corpusfile :=  paragraph+

All files are structured according to this format. The Debates subcorpus, however, has some more structured information, which has been used for selecting only those paragraphs which have English, French or Dutch as source language. This selection was necessary, as the Debates subcorpus contains nine possible source languages. The selection was carried out semi-automatically for the Dutch part and transferred automatically to the French and English files.

3. Tagging

We are in the process of tagging the corpus, using two levels of encoding. In the first phase, minimal tags are entered in the source texts. In a second phase, sample sentences are selected and transferred to the TRIPTIC database where the prepositions are encoded extensively.

The minimal tagging consists in delimiting the preposition with "sharps" (#), delimiting the context with braces ({}), and adding some initial lexical information (e.g. lemmatisation) between square brackets ([]), e.g.:

   ... {[on: to lie on] The man lies #on# 
   the bed, his body exposed to the breeze,}
   ...

Note that this is the basic format, where the sample contains only one preposition in a particular context. Encoding can be more complicated, e.g. when the number or the order of prepositions differs between languages. Moreover, the tagging system has to take account of the parallel aligned data. Interlingual markup demands a strict protocol in order to cope with the intricacies of multilingual tagging.

The advantage of minimal tagging is mainly the speed at which text can be encoded and the possibility of viewing the samples in a wider context. An important disadvantage is the impossibility of grouping similar sample types, due to the sequential structure of the material.

A number of AWK procedures have been developed (i) to check the integrity of the source texts and the minimal tagging, and (ii) to transfer the selected samples into the database format (4th Dimension on Macintosh).
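What such procedures do can be sketched in Python for the basic format only (one preposition per sample). This is not the AWK code used for the project. The regular expressions, the function name and the dictionary keys are assumptions made for the illustration, and the real protocol for multilingual samples is more involved.

```python
import re

# {[prep: lexical information] context containing #prep#}
SAMPLE_RE = re.compile(r"\{\[([^:\]]+):\s*([^\]]*)\]\s*([^{}]*)\}")
PREP_RE = re.compile(r"#([^#\s]+)#")


def extract_samples(text):
    """Check the minimal tags and return one record per tagged sample."""
    records = []
    for match in SAMPLE_RE.finditer(text):
        prep = match.group(1).strip()
        lemma = match.group(2).strip()
        context = " ".join(match.group(3).split())   # undo line wrapping
        marked = PREP_RE.findall(context)
        # Integrity check for the basic format: exactly one marked
        # preposition, identical to the one announced in the brackets.
        if len(marked) != 1 or marked[0].lower() != prep.lower():
            raise ValueError("inconsistent tagging in sample: " + context)
        records.append({"PREP": prep, "LEMMA": lemma, "SAMPLE": context})
    return records


tagged = ("... {[on: to lie on] The man lies #on# \n"
          "   the bed, his body exposed to the breeze,}\n   ...")
for record in extract_samples(tagged):
    print(record)
```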

The samples are now being further tagged in the TRIPTIC database. This type of extensive tagging includes semantic and syntactic information. The following example gives a selection of the fields used. The use of indexed fields works as a magnifying glass, clustering samples of similar structure or pattern, thus facilitating a more extensive linguistic analysis.

   PREP:       on
   LEMMA:      to lie on a bed
   SYN:        V PREP NP
   TRAJECTOR:  man
   LANDMARK:   bed
   SAMPLE:     The man lies #on# the bed, 
               his body exposed to the breeze

Although the trilingual parallel computer corpus was originally compiled for the analysis of prepositions and particles in English, French and Dutch, the same corpus can of course be reused for research on other topics in contrastive analysis.

Hans Paulussen

As a translator and a computational linguist, Hans Paulussen has worked in the field of foreign language teaching, automatic grammatical tagging and corpus compilation. He is presently teaching English and Dutch at the University of Namur, and working on a dissertation on the contrastive analysis of prepositions/particles in English, French and Dutch.
