Project manager: Hercules Dalianis.
PhD students: Sumithra Velupillai, Maria Skeppstedt.
Additional participants: Martin Hassel, Gunnar Nilsson, Andreas Amsenius.
Collaboration partners: Stockholms läns landsting, Karolinska universitetssjukhuset.
Funding: Vinnova.
Project period: 1 November 2006 – 31 December 2009.

Project description

Huge volumes of text are now produced and stored in digital form. The challenge is to find the information we need and to identify connections within this large volume of data. The objective of the Knowledge Extraction Agent (KEA) project is to discover the links within the extensive information contained in electronic patient records.

The aim of the KEA project is to identify new and hidden relations between patients' symptoms, diagnoses, medication, social situation, age, gender and other factors. The project uses a large database of more than a million patient records that contain both structured and unstructured information, mainly free text.

The Department of Computer and Systems Sciences (DSV) is cooperating with Stockholm County Council and the Karolinska University Hospital on the KEA project.

Stockholm County Council has a large patient record system covering all the clinics and hospitals in the county. DSV has been given access to one million of these records (without social security numbers and names). The records contain large amounts of unstructured text that is practically never reused. By developing programs that automatically structure the information in the records, obvious and hidden connections between the different texts can be identified.

Using computational linguistic methods it is possible to generate new knowledge which can be applied in medical research, for example in the case of diagnoses that are difficult to evaluate.

Anonymising patient records

A fundamental issue for the KEA project involves making the information in the patient records accessible without running the risk of revealing the patients' identities.

The patient records contain a great deal of very sensitive information which must not be disclosed to others. Therefore, the KEA project has developed a preliminary standard for anonymising patient records written in Swedish. The program used has been designed to recognise Swedish names, addresses, telephone numbers, e-mail addresses and other information which could reveal the patient's identity. After the records have been anonymised, they can be made available to a larger group of researchers.
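The kind of recognition described above can be illustrated with a minimal rule-based sketch in Python. This is not the project's program: the name list, the patterns and the function name deidentify are assumptions made for illustration only, and a real system would need far larger name lists, address handling and context-sensitive rules.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = [
    # E-mail addresses such as anna@example.se
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # Swedish-style phone numbers such as 08-123 45 67
    ("PHONE", re.compile(r"\b0\d{1,3}[- ]?\d{2,3}(?: ?\d{2}){1,3}\b")),
]

# Assumed toy list of first names; a real list would be much longer.
FIRST_NAMES = {"Anna", "Erik", "Karin", "Lars"}


def deidentify(text: str) -> str:
    """Replace e-mail addresses, phone numbers and listed names with tags."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    # A listed first name, optionally followed by one capitalised word
    # that is treated as a surname.
    names = "|".join(sorted(FIRST_NAMES))
    name_re = re.compile(rf"\b(?:{names})\b(?:\s+[A-ZÅÄÖ][a-zåäö]+)?")
    return name_re.sub("[NAME]", text)


if __name__ == "__main__":
    note = "Patient Anna Svensson can be reached at 08-123 45 67 or anna@example.se."
    print(deidentify(note))
```

The e-mail and phone patterns are applied before the name pattern so that a name inside an e-mail address is replaced together with the address.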

Planned experiments

Certainty and uncertainty in patient records

The records contain an Assessment field in which medical staff can enter information about the patient's condition and treatment. The Assessment field often contains uncertain or speculative expressions. DSV has annotated 8 000 randomly selected sentences from the patient records. The annotations highlighted a large number of speculative and uncertain remarks. Once the annotations have been analysed in more detail, it will be possible to develop tools that identify expressions of this kind automatically.
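One simple way such a tool could work is to match sentences against a list of cue phrases. The Python sketch below assumes a small, invented cue list in English; the names CUES, find_uncertainty_cues and is_speculative are hypothetical and do not come from the project, whose cue inventory would be derived from the annotated Swedish sentences.

```python
import re
from typing import List

# Assumed, illustrative cue phrases; not the project's lexicon.
CUES = ("possibly", "probably", "suspected", "cannot be ruled out", "unclear")


def find_uncertainty_cues(sentence: str) -> List[str]:
    """Return the cue phrases found in the sentence, in the order of CUES."""
    lowered = sentence.lower()
    return [
        cue for cue in CUES
        if re.search(rf"\b{re.escape(cue)}\b", lowered)
    ]


def is_speculative(sentence: str) -> bool:
    """A sentence counts as speculative if it contains at least one cue."""
    return bool(find_uncertainty_cues(sentence))


if __name__ == "__main__":
    print(find_uncertainty_cues("Suspected pneumonia, infection cannot be ruled out."))
    print(is_speculative("Blood pressure 120/80."))
```

A cue list of this kind is easy to inspect and extend, but it cannot resolve negation or scope; a trained classifier would be one alternative.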

Help in finding the right ICD code

A common problem for medical staff is choosing the right ICD-10 (International Classification of Diseases, 10th revision) code from a list of more than 35 000 codes. One solution is to develop a software program that suggests ICD codes based on the written descriptions of symptoms or the diagnosis.

The KEA project has also developed a preliminary system which links the text in the patient records to ICD-10 diagnosis codes in order to identify the symptoms and terms used in connection with each code. The system can also be used to confirm that the right code has been selected.
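The idea of suggesting codes from free text can be sketched as a ranking by word overlap between a note and the code titles. The Python example below is an assumption-laden illustration, not the project's system: the catalogue ICD_TITLES holds only three codes with English titles, and the function names tokens and suggest_codes are hypothetical.

```python
import re
from typing import List, Set

# Toy catalogue with three ICD-10 codes and English titles (illustration only).
ICD_TITLES = {
    "J18.9": "pneumonia, unspecified",
    "I10": "essential (primary) hypertension",
    "E11": "type 2 diabetes mellitus",
}


def tokens(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def suggest_codes(note: str, top_n: int = 3) -> List[str]:
    """Rank codes by the share of title words that also occur in the note."""
    note_tokens = tokens(note)
    scored = []
    for code, title in ICD_TITLES.items():
        title_tokens = tokens(title)
        overlap = len(note_tokens & title_tokens)
        if overlap:
            scored.append((overlap / len(title_tokens), code))
    # Highest score first; ties are broken alphabetically by code.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:top_n]]


if __name__ == "__main__":
    note = "Patient with known hypertension, now cough and fever, suspected pneumonia"
    print(suggest_codes(note))
```

Plain word overlap ignores synonyms, inflection and negation, so a practical system would need a richer matching step.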

Synonyms in patient records

Many different synonyms are used in the patient records to describe the same symptoms or illnesses. The KEA project has created dynamic lists of these synonyms which can be used to produce new terms and develop guidelines for the existing terminology.

The lists of synonyms can also be used to make the patient records more accessible and understandable for patients. In the patient's version of the record, medical terminology can be replaced with more colloquial terms.
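Replacing medical terminology with colloquial terms can be sketched as a dictionary lookup. The Python example below assumes a tiny hand-written mapping in English; LAY_TERMS and to_lay_language are hypothetical names, and the project's own synonym lists are not reproduced here.

```python
import re

# Assumed toy mapping from clinical terms to colloquial terms.
LAY_TERMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "dyspnoea": "shortness of breath",
}

# Longest terms first, so that multi-word terms are matched before any
# shorter term they might contain.
_TERMS = sorted(LAY_TERMS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in _TERMS) + r")\b",
    re.IGNORECASE,
)


def to_lay_language(text: str) -> str:
    """Replace each known clinical term with its colloquial counterpart."""
    return _PATTERN.sub(lambda m: LAY_TERMS[m.group(1).lower()], text)


if __name__ == "__main__":
    print(to_lay_language("History of hypertension and myocardial infarction."))
```

The replacement is inserted in lower case regardless of the case of the original term, which is a simplification.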

Generating hypotheses

The large volume of information in text form in the patient records is still relatively unexplored. Using text mining it is possible to generate new hypotheses about the connections between different factors which influence medical care and health. Text mining is the process of identifying meaningful and previously unrecognised patterns and connections in unstructured data.

The hypotheses "farmers smoke less than the average" and "women suffer more from osteoporosis than men" have already been generated using the text mining process. Once generated, the hypotheses must be tested and confirmed by means of other types of investigation. The results of generating hypotheses from the text in the patient records can form the starting point for more extensive studies into the connections between different factors, such as gender, age, occupation and illnesses.
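One common way to surface candidate hypotheses of this kind is to count how often two factors occur together and compare that with what would be expected if they were independent (the lift measure). The Python sketch below is an illustration under that assumption, not the method used in the project; the function name rank_associations and the six toy records are invented.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def rank_associations(
    records: Iterable[Set[str]], min_support: int = 2
) -> List[Tuple[str, str, float]]:
    """Return factor pairs with their lift, highest lift first.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the two
    factors co-occur more often than independence would predict.
    """
    records = list(records)
    n = len(records)
    single = Counter()
    pair = Counter()
    for features in records:
        unique = sorted(set(features))
        single.update(unique)
        pair.update(combinations(unique, 2))
    results = []
    for (a, b), count in pair.items():
        if count < min_support:
            continue  # too few joint observations to be worth reporting
        lift = (count / n) / ((single[a] / n) * (single[b] / n))
        results.append((a, b, round(lift, 2)))
    results.sort(key=lambda r: (-r[2], r[0], r[1]))
    return results


if __name__ == "__main__":
    # Invented toy records: each set holds the factors extracted from one record.
    toy_records = [
        {"female", "osteoporosis"},
        {"female", "osteoporosis"},
        {"female"},
        {"male"},
        {"male", "osteoporosis"},
        {"male"},
    ]
    print(rank_associations(toy_records))
```

A high lift only flags a pair as a candidate; as noted above, each hypothesis still has to be tested by other types of investigation.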


Dalianis, H. and S. Velupillai. 2009. De-identifying Swedish Clinical Text - Refinement of a Gold Standard and Experiments with Conditional Random Fields. In the Proceedings of The 3rd International Symposium on Languages in Biology and Medicine, Hyatt Regency, Seogwipo-si, Jeju Island, South Korea, November 8-10, 2009.

Dalianis, H., G.H. Nilsson and S. Velupillai. 2009. Is De-identification of Electronic Health Records Possible? OR Can We Use Health Record Corpora for Research? Panel at Virtual Healthcare Interaction - VHI 09, Association for the Advancement of Artificial Intelligence, AAAI 2009 Fall Symposium Series, Technical Report FS-09-07, Westin Arlington Gateway, Arlington, VA, USA, November 4-7, 2009.

Dalianis, H., M. Hassel and S. Velupillai. 2009. The Stockholm EPR Corpus - Characteristics and Some Initial Findings. In the Proceedings of ISHIMR 2009, Evaluation and implementation of e-health and health information initiatives: international perspectives. 14th International Symposium for Health Information Management Research, Kalmar, Sweden, 14-16 October, 2009.

Velupillai, S. 2009. Swedish Health Data - Information Access and Representation. Licentiate thesis, Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden.

Velupillai, S., H. Dalianis, M. Hassel and G. Nilsson. 2009. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. International Journal of Medical Informatics (2009),

Velupillai, S., H. Dalianis and M. Hassel. 2008. Diagnosing Diagnoses in Swedish Clinical Records. In the Proceedings of The First Conference on Text and Data Mining of Clinical Documents, Karsten, H., B. Back, T. Salakoski, S. Salanterä and H. Suominen (Eds.), Turku, Louhi'08, September 3-4, 2008, pp. 110-112.





KEA participated in Virtual Healthcare Interaction, Arlington, USA


Master thesis proposals

Five master thesis proposals in the domain of human language technology and Electronic Patient Records (EPR)




Associate professor Hercules Dalianis