“Drowning in Data, Dying of Thirst for Knowledge”: this often-used quote describes the main problem of data science, the necessity to draw useful knowledge from data, and simultaneously the main aim of the data engineering field, providing data for analysis. In dedicated application fields, different kinds of data are collected and generated that are to be analysed with data mining methods. In this article, we use the term data mining in its broad interpretation, synonymous with knowledge discovery in databases, which is “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or relationships within a dataset in order to make important decisions” (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Even though recent attention has focused on artificial neural network algorithms, the entire range of data mining methods also includes clustering, classification, regression, association rules and so on.

This article, however, mainly focuses on the data preprocessing part of data science. Data engineering components have to read data from very large data sources in heterogeneous data formats and integrate them into the target data format. In this process, data are validated, cleaned, completed, aggregated, transformed and integrated.

The tools for data engineering tasks have a long tradition in the classical database research field. For more than 50 years, database management systems have been used to store large amounts of structured data. Over time, these systems have been extended and redeveloped along different dimensions: to store data in different data models (besides the relational data model, also the graph data model, streaming data and the JSON data model) and to transform data between these models; to consider the heterogeneity of data; and to treat incompleteness and vagueness of datasets.

In data science applications, an additional requirement comes up: domain experts want to be able to analyse their own data. Under the term democratising of machine learning, the requirement has been raised that the entry barriers for domain experts analysing their own data must be lowered. All the dimensions enumerated above have shaped the data engineering research landscape.

This article introduces a systematic classification of the field. The rest of the article is structured as follows. In Section 2, a classification of data engineering methods is introduced and four generations are defined. Each of these generations represents a very active body of research. In Section 3, a comprehensive outlook on open research questions in all generations is given.

In this article, available data engineering methods for data science applications are classified. The main contribution of the article is a systematic overview of the achievements in this research field so far (First, Second, and Third Generation), the open research questions of the present (mainly in the Third Generation) and the requirements that will have to be met for the future development of the area (Fourth Generation). The term generation does not mean that one generation replaces the other, but that each generation builds on the previous ones.
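As a rough illustration of the preprocessing steps named in the introduction (validation, transformation, completion and aggregation), the following minimal sketch runs them over a handful of records. Everything in it is an assumption made for illustration: the sample records, the field names `sensor`, `temp_c` and `temp_f`, and the choice to fill gaps with the per-sensor mean are hypothetical and are not taken from the article.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records from two heterogeneous sources (illustrative data only).
RAW = [
    {"sensor": "a", "temp_c": "21.5"},
    {"sensor": "a", "temp_c": None},            # missing value
    {"sensor": "b", "temp_f": "71.6"},          # different unit and field name
    {"sensor": "b", "temp_c": "not-a-number"},  # invalid value
    {"sensor": "a", "temp_c": "22.5"},
]


def to_float(value):
    """Validate a raw value: return it as a float, or None if it is unusable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def transform(record):
    """Map a source-specific record to the target schema (sensor, temp_c)."""
    if "temp_f" in record:
        fahrenheit = to_float(record["temp_f"])
        celsius = None if fahrenheit is None else (fahrenheit - 32) * 5 / 9
    else:
        celsius = to_float(record.get("temp_c"))
    return {"sensor": record["sensor"], "temp_c": celsius}


def complete(records):
    """Fill missing temperatures with the mean of the valid readings per sensor.

    Records of a sensor without any valid reading are dropped.
    """
    valid = defaultdict(list)
    for r in records:
        if r["temp_c"] is not None:
            valid[r["sensor"]].append(r["temp_c"])
    completed = []
    for r in records:
        if r["temp_c"] is None:
            if not valid[r["sensor"]]:
                continue
            r = {**r, "temp_c": mean(valid[r["sensor"]])}
        completed.append(r)
    return completed


def aggregate(records):
    """Aggregate to one mean temperature per sensor, rounded to two decimals."""
    groups = defaultdict(list)
    for r in records:
        groups[r["sensor"]].append(r["temp_c"])
    return {sensor: round(mean(values), 2) for sensor, values in groups.items()}


def preprocess(raw):
    """Run the steps in order: validate and transform, complete, aggregate."""
    return aggregate(complete([transform(r) for r in raw]))


if __name__ == "__main__":
    print(preprocess(RAW))
```

Mean imputation is only one of many completion strategies and is used here because it keeps the sketch short; real pipelines would choose a strategy suited to the data.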