Knowledge Extraction Agent project
Project duration: 1 November 2006 – 31 December 2009
Huge volumes of text are now produced and stored in digital form. The challenge is to find the information we need and to identify connections within this large volume of data. The objective of the Knowledge Extraction Agent (KEA) project is to discover links within the extensive information contained in electronic patient records.
The Department of Computer and Systems Sciences (DSV) is cooperating with Stockholm County Council and the Karolinska University Hospital on the KEA project.
Stockholm County Council has a large patient record system covering all the clinics and hospitals in the county. DSV has been given access to one million of these records (without social security numbers and names). The records contain large amounts of unstructured text that is practically never reused. Programs that automatically structure the information in the records make it possible to identify both obvious and hidden connections between the different texts.
Using computational linguistic methods it is possible to generate new knowledge which can be applied in medical research, for example in the case of diagnoses that are difficult to evaluate.
Anonymising patient records
A fundamental issue for the KEA project involves making the information in the patient records accessible without running the risk of revealing the patients' identities.
The patient records contain highly sensitive information which must not be disclosed to others. The KEA project has therefore developed a preliminary standard for anonymising patient records written in Swedish. The program used has been designed to recognise Swedish names, addresses, telephone numbers, e-mail addresses and other information which could reveal the patient's identity. Once the records have been anonymised, they can be made available to a larger group of researchers.
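The recognition step described above can be illustrated with a minimal sketch. It assumes simple regular expressions for e-mail addresses and telephone numbers and a tiny name list; the patterns, the name list and the function name anonymise are hypothetical and are not taken from the KEA program, whose actual rules are not described here.

```python
import re

# Hypothetical patterns for illustration only (assumptions, not the KEA rules).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b0\d{1,3}[- ]?\d{2,3}(?: ?\d{2,3}){1,2}\b")

# Tiny example name list (an assumption); a real system would need a large lexicon.
NAMES = {"Anna", "Erik", "Svensson", "Johansson"}
NAME = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in sorted(NAMES)) + r")\b")


def anonymise(text):
    """Replace e-mail addresses, telephone numbers and listed names with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)   # e-mail first, so names inside addresses are not split
    text = PHONE.sub("[PHONE]", text)
    return NAME.sub("[NAME]", text)


print(anonymise("Anna Svensson, tel 08-123 45 67, anna.s@example.se"))
```

A list-and-pattern approach like this misses unlisted names and misspellings, which is one reason any anonymised output would still need to be checked before records are shared.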
Planned experiments
Certainty and uncertainty in patient records
The records contain an Assessment field in which medical staff can enter information about the patient's condition and treatment. This field often contains uncertain or speculative expressions. DSV has annotated 8 000 randomly selected sentences from the patient records. The annotations revealed a large number of speculative and uncertain remarks. Once the annotations have been analysed in more detail, it will be possible to develop tools which can identify expressions of this kind automatically.
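One simple way such a tool could work is to match sentences against a list of cue words. The sketch below assumes a handful of Swedish cue words chosen purely for illustration; the list, the function names and the matching strategy are assumptions and do not come from the project's annotated data.

```python
import re

# Illustrative Swedish cue words (an assumption, not the project's lexicon):
# roughly "probably", "possibly", "possibly", "suspected", "likely".
SPECULATION_CUES = ("troligen", "möjligen", "eventuellt", "misstänkt", "sannolikt")
CUE_PATTERN = re.compile(r"\b(" + "|".join(SPECULATION_CUES) + r")\b", re.IGNORECASE)


def find_cues(sentence):
    """Return the speculation cues found in a sentence, lower-cased, in order."""
    return [cue.lower() for cue in CUE_PATTERN.findall(sentence)]


def is_speculative(sentence):
    """A sentence counts as speculative if it contains at least one cue."""
    return bool(find_cues(sentence))


print(find_cues("Troligen pneumoni, eventuellt astma."))
print(is_speculative("Patienten har feber."))
```

An annotated corpus such as the one described above could be used to check how often a cue list of this kind agrees with human judgement.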
Help in finding the right ICD code
A common problem for medical staff is choosing the right ICD-10 (International Classification of Diseases) code and finding their way through the list of more than 35 000 codes. One solution is to develop a software program which suggests ICD codes based on the written descriptions of symptoms or the diagnosis.
The KEA project has also developed a preliminary system which can link the text in the patient records with the diagnoses in the ICD-10 codes in order to identify the symptoms and terms used in connection with the codes. The system can also confirm that the right code has been selected.
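The idea of suggesting codes from a written description can be sketched as a word-overlap ranking. The example assumes a toy table of three ICD-10 codes with shortened English labels and a Jaccard similarity score; the table, the scoring and the function names are illustrative assumptions and do not describe the preliminary KEA system.

```python
import re

# Toy table: three ICD-10 codes with shortened English labels (illustration only).
ICD_LABELS = {
    "I10": "essential primary hypertension",
    "J18.9": "pneumonia unspecified",
    "J45": "asthma",
}


def tokens(text):
    """Lower-case a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def suggest_codes(description, top_n=3):
    """Rank codes by Jaccard overlap between the description and each label."""
    query = tokens(description)
    scored = []
    for code, label in ICD_LABELS.items():
        label_tokens = tokens(label)
        overlap = len(query & label_tokens)
        if overlap:  # skip codes that share no word with the description
            scored.append((overlap / len(query | label_tokens), code))
    # Highest score first; ties broken alphabetically by code.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:top_n]]


print(suggest_codes("suspected pneumonia with fever"))
```

With a full code list, exact word overlap would be too crude on its own, since staff often use synonyms that do not appear in the official code descriptions.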
Synonyms in patient records
Many different synonyms are used in the patient records to describe the same symptoms or illnesses. The KEA project has created dynamic lists of these synonyms which can be used to produce new terms and develop guidelines for the existing terminology.
The lists of synonyms can also be used to make the patient records more accessible and understandable for patients. In the patient's version of the record, medical terminology can be replaced with more colloquial terms.
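Replacing medical terms with colloquial ones can be sketched as a dictionary lookup. The term list below is a hypothetical English example chosen for readability (the actual records are in Swedish), and the function name to_patient_version is an assumption.

```python
import re

# Hypothetical mapping from medical terms to colloquial terms (illustration only).
LAY_TERMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "dyspnoea": "shortness of breath",
}

# Longest terms first, so multi-word terms are matched before any shorter overlap.
_TERM_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LAY_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def to_patient_version(text):
    """Replace each listed medical term with its colloquial equivalent."""
    return _TERM_PATTERN.sub(lambda m: LAY_TERMS[m.group(0).lower()], text)


print(to_patient_version("History of hypertension and myocardial infarction."))
```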
Generating hypotheses
The large volume of information in text form in the patient records is still relatively unexplored. Using text mining it is possible to generate new hypotheses about the connections between different factors which influence medical care and health. Text mining is the process of identifying meaningful and previously unrecognised patterns and connections in unstructured data.
The hypotheses "farmers smoke less than the average" and "women suffer more from osteoporosis than men" have already been generated using the text mining process. After they have been created, the hypotheses must be tested and confirmed by means of other types of investigations. The results of generating hypotheses from the text in the patient records can form the starting point for more extensive studies into the connections between different factors, such as gender, age, occupation and illnesses.
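A candidate hypothesis of the kind "group X has attribute Y less often than average" can be sketched as a comparison of two rates. The records below are invented toy data in which each record is reduced to a set of tags; the tags, the numbers and the function names are assumptions for illustration and are not project results.

```python
def rate(records, attribute, group=None):
    """Share of records with `attribute`, optionally only among records tagged `group`."""
    selected = [r for r in records if group is None or group in r]
    if not selected:
        return 0.0
    return sum(attribute in r for r in selected) / len(selected)


def candidate_hypothesis(records, group, attribute):
    """Compare the attribute rate within a group against the overall rate."""
    overall = rate(records, attribute)
    within = rate(records, attribute, group)
    if within < overall:
        direction = "less"
    elif within > overall:
        direction = "more"
    else:
        direction = "equally"
    return direction, within, overall


# Invented toy data: each record is a set of tags extracted from its text.
toy_records = [
    {"farmer"}, {"farmer"}, {"farmer", "smoker"},
    {"clerk", "smoker"}, {"clerk", "smoker"}, {"clerk"},
]

direction, within, overall = candidate_hypothesis(toy_records, "farmer", "smoker")
print(f"farmers smoke {direction} often than average ({within:.2f} vs {overall:.2f})")
```

A difference in rates like this only suggests a hypothesis; as noted above, it still has to be tested and confirmed by other types of investigation.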
Publications
Participants
Project Manager: Hercules Dalianis, Associate professor
Other participants:
- Martin Hassel, PhD
- Gunnar Nilsson, Guest professor
- Sumithra Velupillai, PhD student
- Maria Skeppstedt, PhD student
News
KEA participated in Virtual Healthcare Interaction, Arlington, USA
Master's thesis proposals
Five master's thesis proposals in the domain of human language technology and Electronic Patient Records (EPR)
Collaboration partners
Stockholms läns landsting
Karolinska Universitetssjukhuset
Funder: Vinnova
Contact
hercules@dsv.su.se



