In conjunction with the
22nd European Conference on Advances in Databases and Information Systems (ADBIS 2018)
September 2-5, 2018 – Budapest, Hungary
Simona E. Rombo has been Assistant Professor in Computer Science at the Department of Mathematics and Computer Science of the University of Palermo (Palermo, Italy) since September 2012. Her main research activities focus on Bioinformatics, Algorithms and Data Structures, and Data Mining. She has co-authored more than 50 scientific publications in established journals and conferences in these fields. In particular, she has proposed novel approaches for the alignment, querying, and clustering of biological networks; pattern discovery from biological sequences and digital images; compression and classification of digital images; and epigenomics. She has coordinated research groups as the Principal Investigator of both national and international research projects, mainly on big data management in the biological and biomedical contexts, financed by the Italian Ministry of Education, University and Research; the National Institute of High Mathematics F. Severi; and Microsoft Azure. She has contributed with her research to many other national and international research grants. She was the Managing Guest Editor of a special issue of Theoretical Computer Science, is on the Editorial Board of other international journals (e.g., the International Journal of Big Data), and serves as a member of the Program Committee for several international conferences and workshops (e.g., GECCO, EvoBio, WABI). She has been an invited lecturer and/or visiting scientist at several prestigious institutes, such as Oxford University (Oxford, UK), the National Institutes of Health (Bethesda, USA), the Georgia Institute of Technology (Atlanta, USA), and Purdue University (West Lafayette, USA).
Current high-throughput technologies produce large collections of data, such as DNA sequences with additional information and cellular interaction data. Moreover, in the last few years a large amount of functional annotations and genotype-phenotype associations has been collected and stored in public databases. This deluge of heterogeneous and often non-structured data naturally opens new challenges but, at the same time, offers great opportunities to shed light on important and unsolved biological and biomedical issues. We will first provide an introductory overview of the use of metadata in this context, and then focus on two main topics which involve both network and sequence data: (1) a big-data-based framework for functional network data integration, and (2) the construction of epigenomic k-mer dictionaries for the study of nucleosome positioning. Finally, we will discuss the main open issues in the biological and biomedical domain with regard to metadata and data integration in the big data era.
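To give a concrete flavor of the k-mer dictionaries mentioned in the abstract, the sketch below counts all overlapping substrings of length k in a DNA sequence. This is only a minimal illustration of the general technique, not the speaker's actual method; the function name and interface are assumptions for the example.

```python
from collections import Counter

def kmer_dictionary(sequence, k):
    """Build a dictionary mapping each k-mer to its occurrence count.

    A minimal sketch of k-mer counting; real epigenomic pipelines add
    filtering, canonicalization, and scale far beyond this example.
    """
    sequence = sequence.upper()
    # Slide a window of width k over the sequence, one position at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# A sequence of length 10 yields 10 - 3 + 1 = 8 overlapping 3-mers.
counts = kmer_dictionary("ACGTACGTAC", 3)
```

Such a dictionary can then be compared across genomic regions, e.g. to relate k-mer composition to nucleosome occupancy.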
Rihan Hai has been working as a research assistant at the Chair of Computer Science 5 (Information Systems) at RWTH Aachen University since May 2015. Before that she was a technical consultant at SAP and worked on system integration projects. Her research focuses on big data integration systems (e.g., Data Lakes), especially metadata management for heterogeneous data. At RWTH Aachen she has been developing a Data Lake prototype, Constance, which ingests relational, semi-structured, and unstructured data in raw format and performs data integration without data transformation overhead. The technical solutions applied in Constance have been published in several international conferences (e.g., ADBIS, ER, SIGMOD), journals, and book chapters. She also works on different projects covering various applications of Data Lake systems.
Talk Abstract: As a defining challenge of our time, Big Data still poses many research problems, especially the variety of data. The high diversity of data sources often results in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. However, valuable insights are often only available through the combination and integrated analysis of the information in these silos. Data Lake systems have recently been proposed as a solution to this problem. Data Lakes collect data from heterogeneous sources in their original format and provide functions to extract metadata from the sources. As schema information, mappings, and other constraints are not defined explicitly or required initially for a Data Lake, it is important to extract as much metadata as possible from the data sources during the ingestion phase. Metadata management is crucial for data reasoning, query processing, and data quality management. Without any metadata, the Data Lake is hardly usable, as the structure and semantics of the data are not known, which quickly turns a Data Lake into a 'data swamp'. The talk will give an overview of recent works, use cases, and research challenges of metadata management in Data Lakes. In particular, the talk will elaborate on several key challenges in Data Lakes: metadata extraction, schema summarization, and schema mapping over heterogeneous data.
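As an illustration of the kind of ingestion-time metadata extraction the abstract describes, the sketch below infers a simple schema summary (attribute names and observed value types) from semi-structured records. This is a hypothetical example under assumed names, not the Constance system's actual API or algorithm.

```python
def extract_schema_summary(records):
    """Infer a flat attribute -> set-of-type-names summary from dict records.

    A minimal sketch of metadata extraction during ingestion: no schema is
    required up front; instead, structure is observed from the raw data.
    """
    schema = {}
    for rec in records:
        for key, value in rec.items():
            # Record every type seen for each attribute, so heterogeneity
            # across sources surfaces in the extracted metadata.
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

# Two records from different sources disagree on the type of "id"
# and on which attributes are present; the summary captures both.
docs = [
    {"id": 1, "name": "Alice"},
    {"id": "2", "name": "Bob", "age": 30},
]
summary = extract_schema_summary(docs)
```

A summary like this can later support schema matching and query processing over the lake, and flag type conflicts before the data degrades into a swamp.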
At RWTH Aachen University and the Fraunhofer Institute for Applied Information Technology (FIT), we are currently developing a Data Lake system in which metadata management governs the data ingestion and integration process, thereby preventing the Data Lake from turning into a data swamp. The talk will also cover the current design of our Data Lake system, its major components, and the main functions for metadata management.