Project manager: Hercules Dalianis.
Other participants: Sumithra Velupillai, Martin Hassel, Martin Rimka, Viggo Kann (KTH), Bart Jongejan (CST), Jussi Karlgren (SICS).
Collaboration partners: Euroling AB, KTH, Algoritmica HB, CST - University of Copenhagen, SICS, University of Oslo, University of Bergen, University of Helsinki and University of Iceland.
Funding: Vinnova, through Euroling AB and Nordic Council
Project period: May 1, 2005 – December 31, 2008 

Project description

The goal of the TvärSök project was to construct a cross-lingual search engine for the Scandinavian languages. The intended users are ordinary Scandinavians who master one of the languages Danish, Norwegian and Swedish but have only passive knowledge of the other two neighboring languages. Such users can read a text in a neighboring language but cannot search for it, since they lack active knowledge of how the relevant concepts are written or spelled in that language.

Hallå Norden is a web site with information on mobility between the Nordic countries in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus, from which we extracted a parallel corpus for each language pair. The corpora were very sparse, containing on average fewer than 80 000 words per language pair. We used the Uplug word alignment system to create the dictionaries. Uplug takes parallel corpora as input, that is, manual translations of a text, and statistically decides the most probable translation of a given word.
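
The statistical idea behind dictionary extraction can be illustrated with a minimal sketch. It is not Uplug's algorithm; it only scores word pairs with the Dice coefficient over sentence-aligned text. The function name dice_dictionary, the min_freq parameter and the sample Swedish/Danish sentences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def dice_dictionary(sentence_pairs, min_freq=1):
    """Map each source word to its best-scoring target word.

    Illustrative only: scores co-occurrence with the Dice coefficient,
    which is an assumption and not how Uplug itself works.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        # Count each word once per sentence pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    best = {}
    for (s, t), both in pair_count.items():
        if src_count[s] < min_freq:
            continue  # skip rare source words, as with a frequency cut-off
        score = 2 * both / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Hypothetical sentence-aligned Swedish/Danish sample.
pairs = [
    ("arbete i danmark", "arbejde i danmark"),
    ("bostad i sverige", "bolig i sverige"),
    ("arbete och bostad", "arbejde og bolig"),
]
dictionary = dice_dictionary(pairs)
print(dictionary["arbete"])  # ('arbejde', 1.0)
```

With a corpus as sparse as the one described here, a frequency threshold such as min_freq matters: word pairs seen only once or twice give unreliable scores.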

One specific problem arises when the parallel corpora are not completely parallel, which one wants to detect. We carried out an experiment using fingerprints to detect genuinely parallel text pairs and obtained 87 percent correct matches.
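
This page does not describe how the fingerprints were computed, so the following is only a sketch under a stated assumption: a fingerprint is taken to be the counts of language-independent tokens (digit runs and punctuation marks), and two texts are compared by cosine similarity. The function names and sample texts are hypothetical and are not taken from the project.

```python
import math
import re
from collections import Counter


def fingerprint(text):
    """Count digit runs and punctuation marks (an assumed fingerprint)."""
    return Counter(re.findall(r"\d+|[^\w\s]", text))


def cosine(a, b):
    """Cosine similarity between two token-count fingerprints."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(source_text, candidates):
    """Return the index of the candidate most similar to source_text."""
    src = fingerprint(source_text)
    scores = [cosine(src, fingerprint(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Hypothetical Swedish text and two Danish candidates.
swedish = "Flytten kostar 1200 kronor, och tar 14 dagar."
danish = [
    "Kontakt: info@example.org!",
    "Flytningen koster 1200 kroner, og tager 14 dage.",
]
print(best_match(swedish, danish))  # 1
```

The appeal of such features is that numbers and punctuation tend to survive translation, so no dictionary is needed to compare texts in different languages.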

Uplug yielded on average 213 new dictionary words (frequency > 3) per language pair, with an average error rate of 16 percent. Language pairs involving Finnish had a higher error rate, 33 percent, whereas the remaining language pairs averaged only 9 percent errors. The high error rate for Finnish is possibly because Finnish belongs to a different language family. Although the corpora were very sparse, the word alignment results for the combinations of Swedish, Danish, Norwegian and Icelandic were surprisingly good compared to other experiments with larger corpora.

To improve precision and recall in an information retrieval setting, one needs to lemmatise both the search query and the indexed documents. We used rule-based lemmatisers as well as machine-learned versions of them. One machine-learning approach handled prefix, infix and suffix changes alike. We obtained a small improvement of 1 percent for Swedish but an improvement of up to 23 percent for Polish in precision and recall.
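
Why the same lemmatiser must be applied to both the query and the index can be shown with a minimal sketch. The three Swedish suffix rules below are hypothetical toy rules, not those of the Euroling SiteSeeker stemmer or CST's lemmatiser, and they cover suffixes only, not prefix or infix changes.

```python
from collections import defaultdict

# Hypothetical toy suffix rules for Swedish nouns, longest suffix first.
SUFFIXES = ("arna", "ar", "en")


def lemmatise(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Build an inverted index from lemma to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[lemmatise(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing the lemma of every query term."""
    hits = [index.get(lemmatise(term), set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


documents = {1: "nya bilar säljes", 2: "hyra bostad"}
index = build_index(documents)
print(search(index, "bilen"))  # {1}
```

Without lemmatisation the query "bilen" would not match the indexed form "bilar"; mapping both to "bil" recovers the document.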

Publications

Hassel, M. and H. Dalianis. 2009. Identification of Parallel Text Pairs Using Fingerprints. In the Proceedings of RANLP'09: Recent Advances in Natural Language Processing, Borovets, Bulgaria, September 14-16, 2009, pdf.

Jongejan, B. and H. Dalianis. 2009. Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In the Proceedings of ACL-2009, the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Singapore, August 2-7, 2009, pp. 145-153, pdf.

Dalianis, H., M. Rimka and V. Kann. 2009. Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages. In the Proceedings of the 17th Nordic Conference on Computational Linguistics, Nodalida 2009, Odense, May 15-16, 2009, pdf.

Velupillai, S., M. Hassel and H. Dalianis 2008. Automatic Dictionary Construction and Identification of Parallel Text Pairs. In: Proceedings of the International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS), September 25-27, Hangzhou, China, pdf.

Velupillai, S. and H. Dalianis 2008. Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages. In the Proceedings of Workshop MMIES-2: Multi-source, Multilingual Information Extraction and Summarization, Held in conjunction with COLING-2008, Manchester, 23 August, 2008, pdf.

Karlgren, J., H. Dalianis and B. Jongejan 2008. Experiments to investigate the connection between case distribution and topical relevance of search terms in an information retrieval setting. In the Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, May 28-30, 2008, pdf.

Dalianis, H., M. Rimka and V. Kann 2007. Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian. Workshop: The Automatic Treatment of Multilinguality in Retrieval, Search and Lexicography, Copenhagen, April 2007, pdf.

Dalianis, H. and B. Jongejan. 2006. Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST's Lemmatiser. In the Proceedings of the International Conference on Language Resources and Evaluation, LREC 2006, May 24-26, Genoa, Italy, pp. 663-666, pdf.