
Master thesis proposals

Thesis proposal: Creating a synthetic gold PHI corpus for evaluation and development of de-identification tools.

Electronic patient records (EPRs) contain a large amount of information written in free text. This information is very valuable both for research and for the development of different tools, but it is also very sensitive, since the free-text parts may contain information that could identify the patient. This type of information is commonly referred to as Protected Health Information (PHI). Methods for de-identifying EPRs, by marking and removing any PHI instances, are therefore needed.
We have previously created both a standard for de-identifying patient records and an annotated subset of patient records in Swedish, the Stockholm EPR PHI gold standard, using the PHI classes Age, Date, Date_Part, First_Name, Last_Name, Health_Care_Unit, Location and Phone_Number (Velupillai et al. 2009, Dalianis & Velupillai 2009). A gold standard is used for measuring the performance of information extraction tools, in our case de-identification tools.

The task in this master thesis proposal is, first, to analyze the Stockholm EPR PHI gold standard both manually and automatically in order to detect non-annotated PHIs and, second, to construct a synthetic gold standard containing pseudonymized PHIs. Pseudonymized PHIs are fictive names, phone numbers, locations, health care units, etc. This will be carried out by removing the original PHIs and populating the gold standard with new PHI instances. As a final step, we wish to make the synthetic Stockholm EPR PHI gold standard available to a wider group of researchers.
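As an illustration of the pseudonymization step, the following is a minimal Python sketch, not the method used for the Stockholm EPR PHI gold standard. It assumes that PHI instances are marked with inline tags of the form <Class>text</Class>; the actual annotation format may differ. The surrogate lists, the function name pseudonymize and the example record are invented for illustration.

```python
import random
import re

# Hypothetical surrogate lists; a real study would need much larger,
# carefully chosen Swedish resources.
SURROGATES = {
    "First_Name": ["Karin", "Lars", "Eva"],
    "Last_Name": ["Berg", "Lind", "Holm"],
    "Location": ["Storstad", "Lillby"],
    "Health_Care_Unit": ["Norra kliniken", "Avdelning 99"],
}

# Assumed annotation format: <Class>text</Class>. The real gold standard
# may be stored differently.
TAG = re.compile(r"<(?P<cls>[A-Za-z_]+)>(?P<text>.*?)</(?P=cls)>")


def pseudonymize(text, rng):
    """Replace each annotated PHI instance with a fictive one of the same class."""
    mapping = {}  # keeps replacements consistent within one record

    def swap(match):
        cls, original = match.group("cls"), match.group("text")
        key = (cls, original)
        if key not in mapping:
            if cls in SURROGATES:
                mapping[key] = rng.choice(SURROGATES[cls])
            elif cls == "Phone_Number":
                # Randomize the digits but keep the original layout.
                mapping[key] = re.sub(r"\d", lambda _: str(rng.randint(0, 9)), original)
            else:
                # Classes without a strategy here (e.g. Age, Date) are masked.
                mapping[key] = "[" + cls.upper() + "]"
        return "<{0}>{1}</{0}>".format(cls, mapping[key])

    return TAG.sub(swap, text)


# Invented example record, not taken from any real patient data.
record = ("Patient <First_Name>Anna</First_Name> <Last_Name>Svensson</Last_Name> "
          "called from <Phone_Number>08-123 45 67</Phone_Number>. "
          "<First_Name>Anna</First_Name> feels better.")
result = pseudonymize(record, random.Random(0))
print(result)
```

Keeping the tags around the new instances means the output can still serve as annotated evaluation data, and the per-record mapping ensures that a name mentioned twice is replaced by the same surrogate both times.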

Supervisor: Associate professor Hercules Dalianis, hercules@dsv.su.se

References

Dalianis, H. and S. Velupillai. 2009. De-identifying Swedish Clinical Text - Refinement of a Gold Standard and Experiments with Conditional Random Fields. In the Proceedings of The 3rd International Symposium on Languages in Biology and Medicine, Hyatt Regency, Seogwipo-si, Jeju Island, South Korea, November 8-10, 2009, pp 73-81.

Velupillai, S., H. Dalianis, M. Hassel and G. H. Nilsson. 2009. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. International Journal of Medical Informatics (2009), doi:10.1016/j.ijmedinf.2009.04.005.


Thesis proposal: Co-morbidity in translational research – phenotype co-morbidity coding as a first step.

Background
• Health care is complex, with different care providers, different specialties and different organizational
levels.
• Diagnostic coding and classification is common and fairly well performed.
• Coding and classification of individuals have received little attention. The Adjusted Clinical Group
(ACG) system developed at Johns Hopkins hospital is an exception where individuals are classified
from a co-morbidity perspective, but based on health economic models (1).
• Other models for co-morbidity analysis and clustering are lacking.
Aims
• To develop a model for co-morbidity analysis based on data on co-morbidity from clinical hospital
settings.
• To analyze co-morbidity in clinical hospital settings.
Method
• Cluster of morbidity (co-morbidity) analysis (2) of ICD-10 codes (3) from Karolinska University Hospital
during 2.5 years.
• From a function system perspective (i.e. circulation, digestion, movement etc.) we may use the International
Classification of Functioning (ICF) as a framework to explore frequencies of co-morbidity.
• Categorizing co-morbidity clusters, using different aggregating levels in ICD-10.
• Categorizing by using EPR and free-text analysis clustering methods.
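The ICD-10 aggregation and co-occurrence counting in the Method list could be sketched roughly as follows. This is only a minimal Python illustration of the counting step, not a full clustering model. The function names and the toy patient data are invented, and truncating a code to its first characters is an assumed, simple way of choosing an aggregation level (three characters give the ICD-10 category; one character is only a crude proxy for the chapter).

```python
from collections import Counter
from itertools import combinations


def aggregate(code, level=3):
    """Truncate an ICD-10 code to its first `level` characters (3 = category)."""
    return code.replace(".", "")[:level]


def comorbidity_pairs(patients, level=3):
    """Count how many patients have each unordered pair of aggregated codes."""
    counts = Counter()
    for codes in patients.values():
        # Use a set so that a pair is counted at most once per patient.
        groups = sorted({aggregate(code, level) for code in codes})
        counts.update(combinations(groups, 2))
    return counts


# Invented toy data: patient id -> ICD-10 codes recorded for that patient.
patients = {
    "p1": ["E11.9", "I10", "I25.1"],
    "p2": ["E11.2", "I10"],
    "p3": ["I10", "J45.0"],
}

pairs = comorbidity_pairs(patients)
for pair, n in pairs.most_common(3):
    print(pair, n)
```

The resulting pair counts could serve as edge weights in a disease network or as input to a clustering method; rerunning with a different `level` shows how the picture changes with the aggregation level.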
Results
• A model for co-morbidity presentation and categorization
• Top ten clusters and their possible clinical background, with subgroup analysis from a gender and age
perspective.
• Possible hypotheses about new clinical syndromes, i.e. frequent co-morbidity clusters that are not yet
known. Changes over time could be studied in a dataset based on a longer time span.
Conclusion
• Co-morbidity analysis and clustering using ICD-10 codes in combination with free-text in EPRs is
feasible.
• The following co-morbidity clusters are found and need to be further analysed using translational
research methods, possibly including both a proteomic and a genomic approach.
• Example of output: http://hudine.neu.edu/
Prerequisites
SQL and Java, Perl or another programming language. Knowledge of Linux/Unix.
Supervisors
Associate professor Hercules Dalianis, hercules@dsv.su.se, professor Gunnar Nilsson, gunnar.nilsson@ki.se

References
1) Starfield B, Weiner J, Mumford L, Steinwachs D. Ambulatory care groups: a categorization of diagnoses for research and management. Health Serv Res. 1991;26(1):53-74.
2) Goh K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L (2007) The Human Disease Network, Proc Natl Acad Sci USA 104:8685-8690, http://www.pnas.org/content/104/21/8685.full
3) WHO. International statistical classification of diseases and related health problems 10th revision. Geneva: WHO; 1993.


Thesis proposal: Evaluation of preprocessing Natural Language Processing (NLP) software

Many NLP applications rely on several pre-processing steps, such as stemming or lemmatization, compound splitting, parsing, tagging etc. In Information Retrieval, for instance, such steps have been shown to increase system performance. However, such systems are often developed on a domain-specific text set, and might not show similar results when applied in new domains.
This thesis will investigate and evaluate the performance of one such pre-processing tool, for example a lemmatizer, for one language, on a domain other than the one the system was developed for. For instance, the system could be the CST lemmatizer applied to medical texts.
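One simple way to quantify such a domain effect is to compare lemma accuracy on a text from the original domain with accuracy on a text from the new domain. The sketch below is a minimal Python illustration under stated assumptions: the function name lemma_accuracy and the toy lemma lists are invented, and the predicted lemmas are assumed to come from a separate run of the tool under evaluation (no lemmatizer is called here).

```python
def lemma_accuracy(gold, predicted):
    """Share of tokens whose predicted lemma equals the gold lemma (case-insensitive)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(g.lower() == p.lower() for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented toy data: gold lemmas and tool output for two domains.
news_gold = ["house", "be", "build", "year"]
news_pred = ["house", "be", "build", "year"]
clinical_gold = ["patient", "have", "fracture", "tibia"]
clinical_pred = ["patient", "have", "fractur", "tibia"]

news_acc = lemma_accuracy(news_gold, news_pred)
clinical_acc = lemma_accuracy(clinical_gold, clinical_pred)
print("original domain:", news_acc)
print("new domain:", clinical_acc)
```

A real evaluation would also need a manually checked gold standard for the new domain and an error analysis of the tokens that go wrong, for example domain-specific terms and abbreviations.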

Thesis proposal: Investigation of documentation keyword use in Electronic Patient Records (EPRs): a case study
Electronic Patient Records may be structured in many different ways. For instance, in a Medical Record System used in hospitals in Stockholm, hospital staff enter information under different keywords. The list of keywords is developed by each clinic, and many keywords may, in practice, overlap. Hence, a lot of information is either redundant or difficult to find because the keywords are used differently.
This thesis will analyse the actual use of keywords in a subset of EPRs, identify which keywords might be redundant, and develop a model for a more generic keywording approach. (Knowledge of Swedish is required.)

Thesis proposal: Automatic synonym generation from Electronic Patient Records (EPRs) written in Swedish

EPRs contain a large amount of information written in free‐text form. Moreover, the domain‐specific vocabulary consists of many different synonyms. Creating a dictionary of synonyms would be beneficial for many information retrieval systems designed for this particular domain.
Automatic synonym generation can be performed by calculating distributional similarity in text collections in different ways. One proposed method would be to apply Random Indexing (there is an available Java package developed by Dr. Martin Hassel here at this department). This thesis would apply Random Indexing on a set of EPRs written in Swedish to develop synonym dictionaries, and evaluate the results.
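To give an idea of how Random Indexing works, here is a minimal pure-Python sketch. It is not the Java package mentioned above, and the function names, parameter values and the English toy corpus are assumptions made for illustration. Each word gets a sparse random index vector, each word's context vector is the sum of the index vectors of its neighbours within a window, and words with similar context vectors become synonym candidates.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary vector: `nonzero` positions set to +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=300, nonzero=6, window=2, seed=0):
    """Build context vectors by summing the index vectors of neighbouring words."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(dim, nonzero, rng)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx = context[tok]
                    for k, value in enumerate(index[tokens[j]]):
                        ctx[k] += value
    return dict(context)


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented toy corpus; "pain" and "ache" occur in the same contexts.
sentences = [
    "patient has pain in knee".split(),
    "patient has ache in knee".split(),
    "patient has pain in back".split(),
    "patient has ache in back".split(),
]
vectors = random_indexing(sentences)
print("pain/ache:", cosine(vectors["pain"], vectors["ache"]))
print("pain/knee:", cosine(vectors["pain"], vectors["knee"]))
```

In this toy corpus "pain" and "ache" share exactly the same neighbours, so their context vectors coincide and they rank as each other's closest candidates. On real EPR text the candidates would be ranked by similarity and evaluated against a reference or by manual inspection.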

In cooperation with KTH.