Information Retrieval (IR) deals with finding and presenting information from a collection of documents/data that are relevant to an information need (a query) expressed by a user. Cross Language Information Retrieval (CLIR) is a subfield of IR where queries are posed in a different language than that of the document collection. Computational linguistic tools and resources are essential to accomplish the tasks in CLIR.

 

“CLIR research has been dominated by a very limited number of languages for which computational linguistic tools and resources are available”, Atelach Argaw explains. “In order to have global information sharing, it is important to enable access to information using as many languages as possible.”

Atelach Argaw is from Ethiopia. Her native language, Amharic, is spoken by an estimated 30 million people and is well researched, but it is one of the languages for which only a very limited set of computational linguistic resources is available.

“Most of us look for information on the net using Google, for example. It is easy and we never think about it”, Atelach says. “The amount of information one can find that is written in Amharic is extremely small compared to what is available in English. To enable Amharic speakers to gain access to the vast information repository that is the internet without having active English language skills, for example, requires a CLIR system using the Amharic language with English as a target language.”

In her thesis she lays the basis for cross-language information retrieval for Amharic. Basic tools and resources were created, and the CLIR tasks were accomplished using the created and/or available resources.

For information retrieval to be efficient, the word stems that form the basis of the search must be identified. To do that, the Amharic words had to be transliterated into the Latin alphabet and a stemmer had to be implemented. The automatic translation was based on the machine-readable dictionaries that were available.
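The steps described here (transliteration, stemming, dictionary lookup) can be pictured with a small Python sketch. It is only an illustration and not the system built in the thesis: the character table, the single plural suffix and the one-entry dictionary are toy assumptions, and all function names are hypothetical.

```python
# Toy fragment of an Ethiopic-to-Latin table (illustrative assumption,
# not the transliteration scheme used in the thesis).
TRANSLIT = {"ቤ": "bE", "ት": "t", "ቶ": "to", "ች": "c"}

# Hypothetical suffix list; a real Amharic stemmer handles many more affixes.
SUFFIXES = ["oc"]

# Stand-in for a machine-readable Amharic-English dictionary.
DICTIONARY = {"bEt": ["house"]}


def transliterate(word):
    """Map each Ethiopic character to Latin letters; keep unknown ones."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)


def stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def translate_query(query):
    """Turn an Amharic query into a list of English search terms."""
    terms = []
    for token in query.split():
        stemmed = stem(transliterate(token))
        # Words missing from the dictionary are simply dropped here.
        terms.extend(DICTIONARY.get(stemmed, []))
    return terms


print(translate_query("ቤቶች"))
```

In this sketch the plural form and the singular form reduce to the same stem, so both find the same dictionary entry. That is the reason stemming matters when the dictionary is small.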

“It is a challenge to work on this with small or almost non-existent resources. I had an Amharic-English dictionary with 15,000 words when I started this study,” Atelach explains.

In her work she has evaluated the effect of three parameters, namely, transliteration, word sense discrimination, and term selection based on Part of Speech tags, on the overall IR performance.

The study makes use of techniques from the fields of IR, computational linguistics, machine learning and word space models. Atelach is a student at the Graduate School of Language Technology (Gothenburg University) and says she got a lot of feedback and support especially in the areas related to computational linguistics.

“This is the first attempt at a large-scale experiment in Amharic CLIR, and as such, it leaves a lot of room for improvement. I hope this work will serve as a starting point for upcoming research, and I hope more focus will be put on CLIR research using languages with fewer resources,” Atelach Argaw concludes.

 

Abstract
Information Retrieval (IR) deals with finding and presenting information from a collection of documents/data that are relevant to an information need (a query) expressed by a user.
Cross Language Information Retrieval (CLIR) is a subfield of IR where queries are posed in a different language than that of the document collection. Computational linguistic tools and resources are essential to accomplish the tasks in CLIR and to date, CLIR research is dominated by a very limited number of languages for which such tools and resources are available. In order to facilitate global information sharing, it is important to enable access to information using as many languages as possible. This requires an investigation into the feasibility of CLIR for languages with a limited set of computational linguistic resources.
This dissertation provides an in-depth investigation into a CLIR system for Amharic (against English and French document collections). Amharic is a well-studied language with a rich history and culture, but it has very limited computational linguistic tools and resources. In this investigation, basic tools and resources were created, and each of the CLIR tasks was accomplished using the created/available resources and resource-lenient approaches. Each task was evaluated individually as a stand-alone experiment.
IR experiments were then conducted in order to evaluate the effect of three parameters, namely, transliteration, word sense discrimination, and term selection based on Part of Speech tags, on the overall IR performance. Evaluation was done in-vitro through the IR experiments by individually tuning each of these parameters through a series of benchmarking experiments, geared towards optimizing retrieval precision as well as recall. The results give an insight into the performance of the chosen resource-lenient approaches, the challenges, and their impact on the overall IR performance.
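For readers unfamiliar with the two measures named above, the following Python sketch shows the standard set-based definitions of precision and recall. The document identifiers are made up for illustration, and the function name is hypothetical; the thesis may report other or additional measures.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against division by zero for empty result or judgement sets.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: four documents retrieved, three judged relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision is the share of retrieved documents that are relevant, while recall is the share of relevant documents that were retrieved. Tuning a parameter often improves one at the cost of the other.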
 

The defense
June 13, 1-2 pm
Sal C

Opponent: Associate Professor Douglas Oard, College of Information Studies, University of Maryland

Chairman/supervisor: Docent Lars Asker, DSV, Stockholm University
Examination Committee: Professor Viggo Kann, Department of Numerical Analysis and Computer Science, Stockholm University
Professor Barbara Gawronska, Institutt for fremmedspråk og oversetting, Universitet i Agder, Norway
Professor Henrik Boström, DSV, Stockholm University

About Atelach Argaw
From Ethiopia
Undergraduate level: Chemistry, Physics and Mathematics
Graduate level: Computer Science, Information Science
PhD student at the Graduate School of Language Technology, Gothenburg University, and DSV, Stockholm University