Publications
International conference articles
2018
- Caneill, Matthieu, and Noël De Palma. “λ-Blocks: Data Processing with Topologies of Blocks.” In 2018 IEEE International Congress on Big Data (BigData Congress), 9–16, 2018. https://doi.org/10.1109/BigDataCongress.2018.00009.
We present and evaluate λ-blocks, a novel framework for writing data processing programs in a descriptive manner. The main idea behind this framework is to separate the semantics of a program from its implementation. For that purpose, we define a data schema able to describe, parameterize, compose, and link together blocks of code, storing a directed graph which represents the data transformations. Alongside this data schema lies an execution engine, able to read such a program, give feedback on potential errors, and finally execute it. In our reference implementation, a computation graph is described in YAML, linking together vertices of Python code blocks defined in separate libraries. The advantages of this approach are manifold: faster, less error-prone programming; reuse of code blocks; computation graph manipulations; mixing of different specialized libraries; and finally middleware for potential front-ends (such as graphical interfaces) and back-ends (other execution engines). The main goal of λ-blocks is to bring complex data processing computations to non-specialists, by providing a simple abstraction over large-scale data processing systems. Our contributions lie in a description of the schema and an analysis of the reference execution engine. For that purpose we describe λ-blocks’ internals and its main abstractions (blocks and topologies), and evaluate the framework’s performance. We measured the framework overhead to be at most 50 ms, a negligible amount compared to the average duration of data processing jobs.
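As a rough illustration of the idea, the sketch below wires a tiny computation graph, described in YAML, to reusable Python blocks. The schema and block names are invented for this example and do not reproduce the actual λ-blocks format; the real engine also topologically sorts and validates the graph, which this sketch skips by assuming blocks are listed in dependency order.

```python
# Illustrative only: a toy schema and engine, not the real lambda-blocks format.
import yaml  # PyYAML; third-party, `pip install pyyaml`

TOPOLOGY = """
blocks:
  - name: numbers
    block: readlist
    args: {data: [3, 1, 4, 1, 5]}
  - name: doubled
    block: map
    args: {fn: double}
    inputs: [numbers]
  - name: output
    block: show
    inputs: [doubled]
"""

FUNCTIONS = {"double": lambda x: 2 * x}

# Registry of reusable code blocks: each takes (args, inputs), returns a value.
REGISTRY = {
    "readlist": lambda args, inputs: list(args["data"]),
    "map":      lambda args, inputs: [FUNCTIONS[args["fn"]](x) for x in inputs[0]],
    "show":     lambda args, inputs: print(inputs[0]),
}

def run(topology_yaml):
    """Execute a graph whose blocks are assumed to be listed in dependency order."""
    results = {}
    for vertex in yaml.safe_load(topology_yaml)["blocks"]:
        inputs = [results[name] for name in vertex.get("inputs", [])]
        results[vertex["name"]] = REGISTRY[vertex["block"]](vertex.get("args", {}), inputs)

run(TOPOLOGY)  # prints [6, 2, 8, 2, 10]
```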
- Caneill, Matthieu, Noël De Palma, Ali Ait-Bachir, Bastien Dine, Rachid Mokhtari, and Yagmur Gizem Cinar. “Online Metrics Prediction in Monitoring Systems.” In 2018 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2018.
Monitoring thousands of machines and services in a datacenter produces a large number of time series points, giving a general idea of the health of a cluster. However, there is a lack of tools to further exploit this data, for instance for prediction purposes. We propose to apply linear regression algorithms to predict the future behavior of monitored systems and anticipate downtimes, giving system administrators the information they need before problems arise. This problem is quite challenging when dealing with a high number of monitoring metrics, given our three main constraints: a low number of false positives (hence the blacklisting of volatile metrics), high availability (due to the nature of monitoring systems), and good scalability. We implemented and evaluated such a system using production metrics from Coservit, a company specialized in infrastructure monitoring. The results we obtained are promising: sub-second latency per predicted metric per CPU core, for the entire end-to-end process. This latency remains constant when scaling the system up to 125 cores on 4 machines dedicated to monitoring predictions, and performance does not degrade over time: over 15 minutes, the system handles more than 100,000 monitoring metrics.
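The core prediction step can be sketched in a few lines: fit a linear trend on a window of metric samples and extrapolate to see whether a threshold will be crossed. The window size, threshold, and horizon below are illustrative, and the paper’s other machinery (volatile-metric blacklisting, high availability, scale-out) is not shown.

```python
# Illustrative sketch: linear-trend extrapolation on one monitoring metric.
import numpy as np

def predict_breach(timestamps, values, threshold, horizon):
    """Fit values ~ a*t + b by least squares and extrapolate `horizon`
    seconds past the last sample; report whether the threshold is crossed."""
    a, b = np.polyfit(timestamps, values, deg=1)
    predicted = a * (timestamps[-1] + horizon) + b
    return predicted, predicted >= threshold

# Hypothetical data: disk usage (%) sampled every 60 s, slowly climbing.
ts = np.arange(0, 600, 60, dtype=float)
usage = 70.0 + 0.01 * ts + np.random.normal(0, 0.2, ts.size)

pred, alert = predict_breach(ts, usage, threshold=90.0, horizon=3600.0)
print(f"predicted usage in 1 h: {pred:.1f}% -> alert={alert}")
```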
2016
- Caneill, Matthieu, Ahmed El Rheddane, Vincent Leroy, and Noël De Palma. “Locality-Aware Routing in Stateful Streaming Applications.” In Proceedings of the 17th International Middleware Conference, 4:1–4:13. Middleware ’16. ACM, 2016. https://doi.org/10.1145/2988336.2988340.
Distributed stream processing engines continuously execute series of operators on data streams. Horizontal scaling is achieved by deploying multiple instances of each operator in order to process data tuples in parallel. As the application is distributed over an increasingly high number of servers, the likelihood that the stream is sent to a different server for each operator increases. This is particularly important for stateful applications that rely on keys to deterministically route messages to a specific instance of an operator. Since the network is a bottleneck for many stream applications, this behavior significantly degrades their performance. Our objective is to improve stream locality for stateful stream processing applications. We propose to analyze traces of the application to uncover correlations between the keys used in successive routing operations. By assigning correlated keys to instances hosted on the same server, we significantly reduce network consumption and increase performance while preserving load balance. Furthermore, this approach is executed online, so that the assignment can automatically adapt to changes in the characteristics of the data. Data migration is handled seamlessly with each routing configuration update. We implemented and evaluated our protocol using Apache Storm, with a real workload consisting of geo-tagged Flickr pictures as well as Twitter publications. Our results show a significant improvement in throughput.
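A much-simplified, offline version of the key-assignment idea: from a trace of (upstream key, downstream key) pairs, place each downstream key on the worker already hosting its most-correlated upstream key, subject to a load cap. The real protocol runs online on Apache Storm and handles data migration; everything below (function names, the capacity model) is illustrative.

```python
# Illustrative sketch: one-shot, offline key assignment from a routing trace.
import zlib
from collections import Counter, defaultdict

def build_routing_table(trace, n_workers, capacity):
    """trace: (upstream_key, downstream_key) pairs seen in successive routing
    steps. Upstream keys stay on their hash-assigned worker; each downstream
    key goes to the worker hosting its most-correlated upstream key, unless
    that worker is already at capacity (a crude stand-in for load balance)."""
    hash_worker = lambda key: zlib.crc32(key.encode()) % n_workers
    votes = defaultdict(Counter)  # downstream key -> Counter of candidate workers
    for up, down in trace:
        votes[down][hash_worker(up)] += 1
    table, load = {}, [0] * n_workers
    for down, counts in votes.items():
        for worker, _ in counts.most_common():  # most correlated worker first
            if load[worker] < capacity:
                break
        else:  # every correlated worker is full: fall back to least loaded
            worker = load.index(min(load))
        table[down] = worker
        load[worker] += 1
    return table

# Hypothetical workload: tweets keyed by user, then re-keyed by hashtag.
trace = [("alice", "#paris"), ("alice", "#paris"), ("bob", "#nyc"),
         ("alice", "#louvre"), ("bob", "#nyc")]
print(build_routing_table(trace, n_workers=2, capacity=2))
```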
2014
- Caneill, Matthieu, and Stefano Zacchiroli. “Debsources: Live and Historical Views on Macro-Level Software Evolution.” In ESEM 2014: 8th International Symposium on Empirical Software Engineering and Measurement. ACM, 2014. https://doi.org/10.1145/2652524.2652528.
Context. Software evolution has been an active field of research in recent years, but studies on macro-level software evolution—i.e., on the evolution of large software collections over many years—are scarce, despite the increasing popularity of intermediate vendors as a way to deliver software to final users. Goal. We want to ease the study of both day-by-day and long-term Free and Open Source Software (FOSS) evolution trends at the macro level, focusing on the Debian distribution as a proxy of relevant FOSS projects. Method. We have built Debsources, a software platform to gather, search, and publish on the Web all the source code of Debian and metrics about it. We have set up a public Debsources instance at http://sources.debian.net, integrated it into the Debian infrastructure to receive live updates of new package releases, and written plugins to compute popular source code metrics. We have injected all current and historical Debian releases into it. Results. The obtained dataset and Web portal provide both long-term views over the past 20 years of FOSS evolution and live insights on what is happening at sub-day granularity. By writing simple plugins (≈ 100 lines of Python each) and adding them to our Debsources instance, we have been able to easily replicate and extend past empirical analyses on metrics as diverse as lines of code, number of packages, and rate of change—and make them perennial. We have obtained slightly different results than our reference study, but confirmed the general trends and updated them in light of 7 extra years of evolution history. Conclusions. Debsources is a flexible platform for monitoring large FOSS collections over long periods of time. Its main instance and dataset are valuable resources for scholars interested in macro-level software evolution.
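To give a flavor of what such a plugin might look like: the sketch below walks an extracted package tree and tallies raw line counts per file extension. The standalone structure is hypothetical and does not follow the actual Debsources plugin API.

```python
# Illustrative sketch; the real Debsources plugin API differs.
import os
import sys
from collections import Counter

def count_lines_by_extension(pkg_dir):
    """Walk an extracted source package and tally raw line counts per file
    extension (a crude stand-in for a proper lines-of-code metric)."""
    counts = Counter()
    for root, _dirs, files in os.walk(pkg_dir):
        for fname in files:
            ext = os.path.splitext(fname)[1] or "<no extension>"
            try:
                with open(os.path.join(root, fname), "rb") as f:
                    counts[ext] += sum(1 for _ in f)
            except OSError:
                pass  # unreadable file: skip it (a real plugin would log this)
    return counts

if __name__ == "__main__":
    for ext, lines in count_lines_by_extension(sys.argv[1]).most_common(10):
        print(f"{ext}\t{lines}")
```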
International journal articles
2016
- Caneill, Matthieu, Daniel M. Germán, and Stefano Zacchiroli. “The Debsources Dataset: Two Decades of Free and Open Source Software.” Empirical Software Engineering, 2016. https://doi.org/10.1007/s10664-016-9461-5.
We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code, as well as metadata about them such as size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package, containing which source code files, was released when), and license information (GPL, BSD, etc.). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study shows how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
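A hedged sketch of how the metadata dump might be queried for a macro-level trend, here total source lines per Debian release. The table and column names are placeholders invented for this example; the actual schema is documented alongside the dataset.

```python
# Illustrative only: table/column names are placeholders, not the real schema.
import psycopg2  # third-party PostgreSQL driver

QUERY = """
SELECT r.name, r.release_date, SUM(m.sloc) AS total_sloc
FROM metrics m JOIN releases r ON m.release_id = r.id
GROUP BY r.name, r.release_date
ORDER BY r.release_date;
"""

def sloc_per_release(dsn="dbname=debsources"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

for name, date, sloc in sloc_per_release():
    print(f"{name:12} {date} {sloc:>15,} lines")
```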
Miscellaneous
2018
- Caneill, Matthieu. “Contributions to Large-Scale Data Processing Systems.” PhD thesis, Univ. Grenoble Alpes, 2018.
This thesis covers the topic of large-scale data processing systems, and more precisely three complementary approaches: the design of a system to predict computer failures through the analysis of monitoring data; the routing of data in a real-time system, looking at correlations between message fields to favor locality; and finally a novel framework to design data transformations using directed graphs of blocks. Through the lens of the Smart Support Center project, we design a scalable architecture to store time series reported by monitoring engines, which constantly check the health of computer systems. We use this data to perform predictions and detect potential problems before they arise. We then dive into routing algorithms for stream processing systems, and develop a layer to route messages more efficiently by avoiding hops between machines. For that purpose, we identify in real time the correlations which appear in the fields of these messages, such as hashtags and their geolocation in the case of tweets. We use these correlations to create routing tables which favor the co-location of actors handling these messages. Finally, we present λ-blocks, a novel programming framework to build data processing jobs without writing code, but rather by creating graphs of blocks of code. The framework is fast and comes with batteries included: block libraries, plugins, and APIs to extend it. It is also able to manipulate computation graphs, for optimization, analysis, verification, or any other purpose.
2010
- Caneill, Matthieu, and Jean-Loup Gilis. “Attacks against the WiFi Protocols WEP and WPA,” 2010.
Wireless networks are today an integral part of the Internet, and are often used by companies and private individuals. Security of information is thus important, and protocols like WEP and WPA can be attacked. We present in this report the existing WLAN protocols, an overview of the most effective attacks against them, and two attacks on WEP that we have found.