Subject and objective of the study
In a narrower sense, data mining refers to the increasingly automated analysis of data sets to gain information about patterns, trends or correlations, among other things. The societal challenges become particularly apparent when data mining is viewed as a process that also includes the task definition upstream of the analysis, the selection and preparation of the data, and the downstream validation and use of the results. This is because, on the one hand, data mining increasingly involves reusing large databases that were originally created in one context to search for patterns in other contexts. This raises questions about access to existing databases and about the possibilities and limits of their further use. On the other hand, the particular potential of structure-recognising data analysis lies in the generalisation and reuse of sufficiently valid results. Depending on the task, the analysis can yield not only information but also decision rules, mathematical-statistical models fitted to the data, or trained algorithms that can be applied to new situations of the same kind in order to at least support decisions. Such data-driven decision (support) systems are seen as having innovation potential in almost all areas of life. At the same time, concerns are expressed about non-transparent procedures and unequal opportunities for exploitation. Fears extend to the end of privacy and the uncontrollability of algorithmic systems.
The study on data mining commissioned by the Committee for Education, Research and Technology Assessment focuses on the following issues:
- the possibilities and limits of access to and use of data sets collected and stored primarily in the context of public tasks,
- the analytical techniques involved in data mining, and
- the legal foundations that define the possibilities and limits of data analysis, partly in general and partly specifically for public tasks.
The areas of medicine and health care will be examined in more detail, as structure-recognising data analysis is often seen as having particular potential in these areas.
The aim of the study is to open up the generic term data mining from different perspectives, to present it in its complexity and to illustrate current possibilities and challenges by means of different application examples, thereby improving the understanding of structure-recognising data analysis methods. TAB Working Report No. 203, which concludes the study, provides a broad and comprehensive basis of information and complements numerous statements made by other institutions and bodies in recent years on digitisation in general and on big data and artificial intelligence in particular.
Contents of the report
Data mining from an analytical and technical perspective
The introductory technical Chapter 2 deals with the data sets required for data mining, including their structures, references and increasingly standardised representations (keywords: interoperability and machine readability), their storage and provision, as well as the associated structure-recognising analysis procedures and their results. Depending on the task, information about similarities or differences between data objects can be obtained and related knowledge extended, rules can be derived, mathematical-statistical models can be created, or algorithmic systems can be trained. These can then be used, for example, to classify new objects or to make forecasts in order to at least support decisions. From a technical point of view, data mining is a process that begins with the definition of the data analysis problem to be solved and regularly requires extensive data preparation. Even though many data analysis processes can be largely automated by means of algorithms, considerable expertise is still required to prepare the data analyses, to assess the validity of the results, and to monitor and control the processes as a whole.
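The process described above (task definition, data preparation, training, validation, application to new objects) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the report: the data are synthetic, the nearest-centroid classifier is chosen only for brevity, and all function names (`make_synthetic`, `clean`, `train`, `classify`, `accuracy`, `run_pipeline`) are hypothetical.

```python
import random
from statistics import mean

# Task definition (assumed for illustration): assign new objects to
# class "A" or "B" on the basis of two numeric features.

def make_synthetic(n_per_class, seed=0):
    """Generate hypothetical labelled records; not real data."""
    rng = random.Random(seed)
    records = []
    for label, centre in (("A", (0.0, 0.0)), ("B", (4.0, 4.0))):
        for _ in range(n_per_class):
            features = [rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)]
            records.append((features, label))
    rng.shuffle(records)
    return records

def clean(records):
    """Data preparation: drop records that contain missing values."""
    return [(f, label) for f, label in records if None not in f]

def train(rows):
    """Training: compute one centroid (mean feature vector) per class."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*feature_rows)]
        for label, feature_rows in by_label.items()
    }

def classify(model, features):
    """Application: assign a new object to the class with the nearest centroid."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, rows):
    """Validation: share of held-out records that are classified correctly."""
    hits = sum(classify(model, f) == label for f, label in rows)
    return hits / len(rows)

def run_pipeline(seed=0):
    records = clean(make_synthetic(50, seed))
    split = int(0.7 * len(records))        # 70 % training, 30 % hold-out
    model = train(records[:split])
    return model, accuracy(model, records[split:])

if __name__ == "__main__":
    model, held_out_accuracy = run_pipeline()
    print("hold-out accuracy:", held_out_accuracy)
    print("new object [3.5, 4.2] classified as:", classify(model, [3.5, 4.2]))
```

Even in this toy setting, the steps that need judgement sit outside the algorithm itself: deciding which records count as usable, how to split the data, and whether the hold-out accuracy is sufficient for the intended use.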
Legal and normative aspects
Chapter 3 deals with fundamental rights of individuals and legally defined rights to data content and data files, their scope and limits, as well as data processing possibilities and related obligations. In particular, it deals with the data provision activities of public institutions and the possibilities of data analysis for tasks in the public interest and for scientific research purposes. Taking the public geoinformation system as an example, which is considered a pioneer in data standardisation and in increasingly open provision for any further use, the chapter outlines the long-term activities for developing the national spatial data infrastructure. Improvements in access to geodata are sometimes counterbalanced by uncertainties in the permissibility and risk assessment of individual analysis projects. The already established data trust structures for the re-use of personal data and the need for prior impact assessments could become more important for data mining activities in general. Reference is made to the first approaches to risk-adapted regulation, which are based on established procedures in medical device law.
Data mining in medicine and healthcare
One focus of Chapter 4 is on medical data: their origins, diversity and protection, and their decentralised, poorly standardised primary storage. The chapter also covers their aggregation in medical registers and data centres, which is complexly regulated, limited and costly. These institutions act as data trustees and make the data available for narrowly defined research purposes after reviewing applications and weighing benefits against risks. Examples are used to show that structure-recognising data analysis methods have been used in medicine for a long time. The resulting algorithmic systems, which at least support treatment-relevant decisions, are covered as software by the law on medical devices. The risk-adjusted quality management system established by this law and the proof of benefit required by the national healthcare system place high demands on product development.
The reimbursement system for medical services is based on individual case accounting using legally defined, machine-readable data records that provide a complete and highly granular picture of treatment events in the national healthcare system. Chapter 5 presents the various institutions of the health care system with their specific tasks, data sets and possibilities for data analysis. For many years, there has been a debate about how this data can be brought together in trust structures and used more intensively for scientific purposes. Further application examples are used to illustrate the efforts, challenges and limitations of data mining approaches in the health care system.
Conclusions and options for action
The societal challenges become visible especially when considering extended data mining processes. They concern the technical and legal possibilities and limits of data re-use as well as the handling of the results of structure-recognising data analysis processes. From this perspective, the term data mining overlaps considerably with the buzzwords big data and artificial intelligence.
Numerous expert councils and commissions, including those of the German Bundestag and the German government, have dealt with this topic. They unanimously recommend accelerating digitisation activities, expanding infrastructures for the further use of data stocks, focusing more strongly on data use, strengthening data analysis know-how, promoting the development of corresponding applications and regulating high-risk applications more strongly. They also recommend striving for greater national or European digital sovereignty, not least in order to ensure high standards of protection and to safeguard fundamental rights. The TAB report supports these recommendations.
By taking an in-depth and comparative look at a number of public applications, the report identifies specific features, strengths and focal points from which further application-related options for action can be derived:
The public geodata sector is exemplary in terms of data provision, and greater attention should be paid to data use.
The spatial data sector is considered a pioneer in the development and use of data standards and in the establishment and expansion of the national spatial data infrastructure. Other sectors could benefit from this experience. This is because the analytical potential of geospatial data increases as more data from other sectors is made available in a standardised and geo-referenced form. There is some legal uncertainty about the risk assessment of high-resolution geospatial analysis. In the future, more emphasis should be placed on the use of high-resolution geodata in particular.
In the medical field, risk-benefit considerations are part of many data mining processes, but interoperability of and access to medical data need to be promoted.
Medical data is considered to be poorly standardised. It is stored in a decentralised manner, is subject to the highest standards of protection and has very limited reusability due to the low interoperability of primary storage systems. Access to these datasets for health-related issues should urgently be promoted. This will require the development and implementation of standards, improved interoperability of primary data management systems, the development of data infrastructures and the revision of access procedures.
The medical and health care sectors have had years of experience with data trust procedures, risk assessment and quality assurance, including for high-risk algorithmic systems, from which other sectors could potentially benefit. However, the certification of such algorithms and their integration into medical care is costly and time-consuming. Certification procedures should be made more efficient.
The scope of the research privilege for data mining needs to be discussed.
Data analysis, including data mining for research purposes, is legally privileged in different ways. The wording in different laws allows for different interpretations. A public debate should be held on the scope of the concept of research in the re-use of existing data.
2022. Büro für Technikfolgen-Abschätzung beim Deutschen Bundestag (TAB). doi:10.5445/IR/1000156886
2022. Büro für Technikfolgen-Abschätzung beim Deutschen Bundestag (TAB). doi:10.5445/IR/1000156297
- The policy brief TAB-Fokus no. 40: Data mining - sociopolitical and legal challenges (in English) will be available in April 2023
In the Bundestag
- Proceeding (report, expert opinion, programme) in the Dokumentations- und Informationssystem für Parlamentsmaterialien (DIP)
Deliberation on the technology assessment of data mining. Plenary debate on the TAB report „Data-Mining – gesellschaftspolitische und rechtliche Herausforderungen“ (article and recording of the 40-minute debate in the German Bundestag on April 21, 2023).
Further reading on the topic
Ferdinand, J.-P.; Kind, S.
2018. Büro für Technikfolgen-Abschätzung beim Deutschen Bundestag (TAB). doi:10.5445/IR/1000133904
Kind, S.; Weide, S.
2017. Büro für Technikfolgen-Abschätzung beim Deutschen Bundestag (TAB). doi:10.5445/IR/1000133902