Examples of Language Technology at DSV
Language technology (LT), sometimes referred to as human language technology (HLT) or natural language processing (NLP), is an interdisciplinary research field that draws most of its methods and techniques from computer science and linguistics, but also from fields as diverse as statistics, mathematics, pedagogy and psychology. LT is closely related to computational linguistics (CL) but tends to be more application-oriented. These applications can be found at the core of many information systems as well as systems for computer-mediated communication and learning.
Automatic Text Summarization. With digitally stored information available in abundance and in a myriad of forms, even for many smaller languages, it has become nearly impossible to manually search, sift and choose which information to take in. Instead, this information must by some means be filtered and extracted in order to avoid drowning in it. Furthermore, genres with specialized and rapidly expanding vocabulary due to innovative language use, such as blog text and electronic patient records, require portable and resource-lean summarization methods that can make use of huge amounts of unannotated text for building semantic models. Automatic summarization can be used for varied tasks such as business intelligence, overviews of document collections and e-mail correspondence, headline generation and the production of medical discharge letters.
Computer Assisted Language Learning (CALL) is a broad field, including many different theoretical and technical approaches to language learning. Language technology is playing an increasingly important part in CALL. It can be used for the analysis of learners’ language with applications such as grammar checking, essay scoring, and tutoring systems. Language technology is also used for analysis and generation of native language use for the creation of stimulating learning materials, exercise generation and linguistically oriented games. The analysis of different corpora is very important for the field, and the corpus material can be used for learning per se, or as content in different systems.
Cross-Lingual Information Retrieval. These techniques allow people to write queries in one language and retrieve relevant documents written in another language. People can thus submit search queries in their mother tongue, which can be very helpful for queries about, e.g., medical information. In our experiments, the knowledge base consists of multilingual medical content developed by psychiatrists and psychotherapists from different countries. Users consult the knowledge base by submitting queries in natural language, which are then matched against pre-stored FAQ (Frequently Asked Questions) files consisting of question/answer pairs, where the question part has a template created to match many different variations of the same question. Different combinations of lexicons and search engines are used in the experiments in order to test the quality of both the lexicons and the search engines.
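The matching of a translated query against FAQ templates can be sketched as follows. This is a minimal illustration, not the system used in the experiments: the Swedish-to-English lexicon, the keyword-set representation of a template, and the names `LEXICON`, `FAQ`, `translate_query` and `match_faq` are all assumptions made for the example, and the answers are placeholders.

```python
import re

# Hypothetical bilingual lexicon (Swedish -> English); a real one would be far larger.
LEXICON = {
    "sova": "sleep",
    "sömn": "sleep",
    "problem": "problems",
    "ångest": "anxiety",
}

# Hypothetical FAQ entries: each template is the set of terms a question must contain.
FAQ = [
    {"template": {"sleep", "problems"}, "answer": "FAQ-1 (placeholder answer)"},
    {"template": {"anxiety"}, "answer": "FAQ-2 (placeholder answer)"},
]


def translate_query(query):
    """Translate a query word by word; words missing from the lexicon are kept as-is."""
    tokens = re.findall(r"\w+", query.lower())
    return {LEXICON.get(token, token) for token in tokens}


def match_faq(query):
    """Return the answers of all FAQ entries whose template terms all occur in the query."""
    terms = translate_query(query)
    return [entry["answer"] for entry in FAQ if entry["template"] <= terms]
```

For instance, `match_faq("Jag har problem med att sova")` translates "problem" and "sova" and therefore matches the first template only. Because one template is a set of terms rather than a fixed string, many surface variations of the same question map to the same entry.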
Email and Short Message Answering. While question answering systems answer single-sentence questions, email answering, or short message answering, delivers answers to messages that contain several sentences, just like ordinary emails. The task of email answering can be considered similar to the task of document classification, yet there are differences. An email message may contain several questions, which require several answers. Some statistical features, such as term frequency or the location of relevant words in the document, do not work for email messages. Furthermore, the accuracy of email answering must be higher than that of regular text classification systems (e.g. those based on Support Vector Machines); we would like to have, perhaps, 9 out of 10 messages answered correctly. In order to reach good performance at reasonable cost, our email answering techniques are based on text pattern matching.
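A pattern-based approach to messages containing several questions might look roughly like the sketch below. It is only an illustration under stated assumptions: the regular expressions, the canned answers and the function names `split_sentences` and `answer_message` are hypothetical, and the actual patterns used in the email answering techniques are not described here.

```python
import re

# Hypothetical (pattern, answer) pairs; real patterns would be written per domain.
PATTERNS = [
    (re.compile(r"\b(reset|forgot|forgotten)\b.*\bpassword\b", re.IGNORECASE),
     "ANSWER: how to reset a password"),
    (re.compile(r"\bopening hours\b|\bwhen\b.*\bopen\b", re.IGNORECASE),
     "ANSWER: opening hours"),
]


def split_sentences(message):
    """Split a message into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", message)
    return [part.strip() for part in parts if part.strip()]


def answer_message(message):
    """Collect one answer per matched pattern, in the order the questions appear."""
    answers = []
    for sentence in split_sentences(message):
        for pattern, answer in PATTERNS:
            if pattern.search(sentence) and answer not in answers:
                answers.append(answer)
    return answers
```

Called on "Hi! I forgot my password. Also, when are you open on Fridays? Thanks.", `answer_message` returns both answers, one per question. Matching sentence by sentence is what lets a single message receive several answers, which a single document-level class label would not allow.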
Language Modeling and Lexical Semantics. Distributional patterns in corpora can be exploited to build mathematical models of language use that contain information about the relative meaning of linguistic units, typically words. Methods such as Random Indexing can be applied effectively to a wide range of problems involving natural language, e.g. information retrieval, text classification and text summarization. There are, however, challenges involved in using these models in isolation, such as accounting for semantic compositionality, as well as for negated and speculative information. A creative application of the word space model to less explored domains, such as the clinical domain, can hopefully help realize the potential of language technology to improve health care and facilitate medical research.
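The core idea of Random Indexing can be sketched in a few lines: every word gets a fixed sparse random index vector, and a word's context vector is the sum of the index vectors of the words that occur near it. The code below is a minimal sketch under assumed parameter values (`DIM`, `NONZERO` and the window size are illustrative choices, and the function names are made up for the example); it leaves out weighting schemes and other refinements.

```python
import math
import random
from collections import defaultdict

DIM = 300     # dimensionality of the word space (assumed value)
NONZERO = 8   # number of +1/-1 entries in each index vector (assumed value)


def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary index vector; seeding with the word makes it reproducible."""
    rng = random.Random(word)
    vec = [0] * dim
    for n, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if n % 2 == 0 else -1
    return vec


def build_context_vectors(sentences, window=2, dim=DIM):
    """Add the index vectors of all neighbours within the window to each word."""
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for k, value in enumerate(index_vector(tokens[j], dim)):
                    context[word][k] += value
    return dict(context)


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Given the toy sentences "the doctor treats patients" and "the nurse treats patients", the words "doctor" and "nurse" occur in identical contexts, so their context vectors coincide and their cosine similarity is 1 (up to rounding). Words with unrelated contexts end up with nearly orthogonal vectors, which is what lets the model express relative meaning without any annotated data.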
Natural Language Processing of Health Records. Today, huge numbers of health records are produced within health care. The records contain valuable information, both as structured data and as unstructured free text: person names, addresses, symptoms, diagnoses, medication, relations between diagnoses (comorbidities), drugs, treatments, effects and side effects of drugs, and adverse events. This information can be used to assist the clinician during her daily work when reading and writing records about the patient, for example by providing an overview of the patient record through automatic text summarisation that shows its most relevant parts. It can also help clinical researchers and hospital managers obtain an overview of health care processes, so-called medical business intelligence.
Semantic Information Extraction. For some information needs, accurate, relevant and situation-specific information extraction is crucial. This involves, e.g., distinguishing factual information from speculative or negated information. Creating automated systems that can deal with this task requires linguistic categorization and modeling, which can be used for building rule-based or machine-learning-based tools. Such tools can be incorporated in systems for intelligent information access, such as adverse event detection, decision support or summarization, and used in domains such as health care and business intelligence.
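A rule-based tool for separating factual from negated or speculative mentions could, in its simplest form, look for cue words shortly before the term of interest. The sketch below assumes small hand-picked English cue lists and a fixed look-back window; the cue sets, the window size and the name `classify_mention` are illustrative assumptions, not a description of the tools built in this research.

```python
import re

# Hypothetical cue lists; real systems use much larger, domain-specific lexicons.
NEGATION_CUES = {"no", "not", "without", "denies"}
SPECULATION_CUES = {"possible", "possibly", "suspected", "may", "might"}


def classify_mention(sentence, term, window=4):
    """Label a term as 'negated', 'speculative' or 'factual' from preceding cues.

    Returns None if the term does not occur in the sentence.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    term = term.lower()
    if term not in tokens:
        return None
    idx = tokens.index(term)
    preceding = tokens[max(0, idx - window):idx]
    if any(token in NEGATION_CUES for token in preceding):
        return "negated"
    if any(token in SPECULATION_CUES for token in preceding):
        return "speculative"
    return "factual"
```

With these cue lists, `classify_mention("Patient denies chest pain and fever.", "fever")` yields `"negated"`, while `classify_mention("Possible pneumonia.", "pneumonia")` yields `"speculative"`. A fixed window is a crude stand-in for the linguistic scope of a cue, which is one reason such rules are often combined with, or replaced by, machine learning.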
Web Mining is the application of data and text mining techniques to Web-related data with the aim of discovering structural, usage and content-related patterns. Language technology can be used in, for instance, web content mining for grouping users for personalized marketing, analysing the relevance of web pages and identifying trends in news and the blogosphere. Furthermore, methods developed in forensic linguistics can be used for identifying criminal activities such as grooming in chat rooms and on online forums.
January 5, 2013
Source: DSV Language Technology