Text mining

A considerable amount of medical expert knowledge can be found in various types of natural language text, e.g. scientific articles and research reports. Using these sources can therefore provide important additional information to a research system.

The tasks of text mining are the following.
(1) Building a corpus (collection of literature) about a given research domain.
(2) Processing individual pieces of text, i.e. the exploration and storing of the occurrences of relevant concepts.
(3) The statistical analysis of the data constructed during the above steps, and discovery of relations and associations among domain concepts.

Once the textual information has been processed, the result can be exploited in several further ways.
(1) The resulting domain model can provide important information in its own right: since it describes the relations of the concepts as they can be extracted from the domain literature, it gives a view of current trends and conceptions.
(2) Since the modeled concepts can be identified with the entities of the “real” model, the results can serve as an initial parameterization for a subsequent learning phase, which allows a normative integration of literature-derived associations into the results of standard examinations.
(3) Comparing the results of the standard and the literature-based examinations can reveal how current findings relate to already documented trends and theories: what they have in common and which possibly new aspects have been explored.

Detailed methodology

(1) Building the corpus.
Elements of the corpus can originate from several sources, such as:
• Research articles and their abstracts
• Concept definitions
• Medical records and reports

The collection and definition of the set of relevant concepts also belongs to this step, as does providing supplementary information (such as different word forms and synonyms) from which it can be determined whether a given concept occurs in a piece of text.
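As an illustration of how such supplementary information can be used, the following is a minimal sketch of dictionary-based concept detection. The concept names, their synonym lists and the helper functions are hypothetical and serve only as an example; a real system would rely on a curated vocabulary and more robust matching.

```python
import re

# Hypothetical concept dictionary: concept name -> surface forms and synonyms.
CONCEPTS = {
    "asthma": ["asthma", "asthmatic", "bronchial asthma"],
    "allergy": ["allergy", "allergies", "allergic"],
}

def compile_patterns(concepts):
    """Build one case-insensitive, whole-word regex per concept."""
    patterns = {}
    for name, forms in concepts.items():
        # Longer forms first, so multi-word synonyms are tried before their parts.
        ordered = sorted(forms, key=len, reverse=True)
        alternatives = "|".join(re.escape(form) for form in ordered)
        patterns[name] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns

def count_occurrences(text, patterns):
    """Return how many times each concept occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in patterns.items()}

patterns = compile_patterns(CONCEPTS)
text = "Bronchial asthma is common in allergic children; asthmatic symptoms vary."
counts = count_occurrences(text, patterns)
print(counts)
```

Counting per concept rather than per surface form means that all synonyms contribute to the same entry, which is what the later relevance calculation needs.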

(2) Calculation of occurrence measures and relevance.
Based on the number of concepts occurring in a piece of text and on the number of text pieces in which a concept occurs, for every concept-text pair a score can be calculated that describes how relevant the given concept is in the given text (cf. the tf-idf score).
The resulting relevance table can be used as the main input data of a purely statistical analysis.
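A relevance table of this kind could be computed, for instance, as in the sketch below. It uses one common tf-idf variant (term frequency normalized by the total number of concept occurrences in the text, multiplied by the logarithm of the inverse document frequency); the exact weighting used in practice may differ, and the function name and the sample counts are assumptions made for illustration.

```python
import math

def tf_idf(counts_per_text):
    """counts_per_text: list of dicts, each mapping concept -> count in one text.
    Returns a list of dicts mapping concept -> tf-idf score for that text."""
    n_texts = len(counts_per_text)
    # Document frequency: number of texts in which each concept occurs at least once.
    df = {}
    for counts in counts_per_text:
        for concept, count in counts.items():
            if count > 0:
                df[concept] = df.get(concept, 0) + 1
    table = []
    for counts in counts_per_text:
        total = sum(counts.values())
        row = {}
        for concept, count in counts.items():
            if count == 0 or total == 0:
                row[concept] = 0.0
                continue
            tf = count / total                      # share of this concept within the text
            idf = math.log(n_texts / df[concept])   # rarer concepts get a higher weight
            row[concept] = tf * idf
        table.append(row)
    return table

# Illustrative concept counts for three texts (made-up numbers).
corpus_counts = [
    {"asthma": 2, "allergy": 1},
    {"asthma": 1, "allergy": 0},
    {"asthma": 3, "allergy": 0},
]
relevance = tf_idf(corpus_counts)
```

With this variant a concept that occurs in every text receives a score of zero everywhere, since it does not help to distinguish one text from another.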

(2a) Shallow parsing.
Since the above method is based purely on concept occurrences, it cannot take into account the relations of concepts within a text (e.g. how they are mentioned together within a sentence); this calls for grammatical processing of the texts.
Such an analysis can directly provide estimates of the association between the two concepts of a concept pair, hence its results can be used in the model-level examination of concept relations.
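The sketch below is not a shallow parser; it only counts sentence-level co-occurrences of concepts, which is a much cruder stand-in that nevertheless shows how concept-concept association estimates can be derived from the position of mentions within a text. The concept forms, the naive sentence splitter and the function names are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical concept forms (assumption for illustration only).
CONCEPT_FORMS = {
    "asthma": ["asthma", "asthmatic"],
    "allergy": ["allergy", "allergic"],
    "eczema": ["eczema"],
}

def concepts_in(sentence):
    """Return the set of concepts mentioned in one sentence."""
    found = set()
    for name, forms in CONCEPT_FORMS.items():
        if any(re.search(r"\b" + re.escape(form) + r"\b", sentence, re.IGNORECASE)
               for form in forms):
            found.add(name)
    return found

def cooccurrence_counts(text):
    """Count, for every concept pair, the sentences that mention both concepts."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = Counter()
    for sentence in sentences:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(concepts_in(sentence)), 2):
            pairs[(a, b)] += 1
    return pairs

text = ("Asthma often accompanies allergy. Eczema is frequent in allergic patients. "
        "Asthmatic children were examined.")
pairs = cooccurrence_counts(text)
```

A grammatical analysis would go further than this, for example by checking whether two concepts are linked through the same verb phrase instead of merely sharing a sentence.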

(2b) Deep parsing
(2c) Manual text annotation

(3) Model-level analysis.
The data provided by the previous steps are now suitable for a model-level Bayesian statistical analysis. This can provide detailed information about the relevance relations of domain concepts, which is valuable in its own right for the examined domain and can also support a further analysis based on real-world measurement data.