Ustream recorded and live streamed all conference talks; we received the HD footage and cut it into per-talk videos, so you can rewatch the talks at any time. But don't forget: the goal of the satRday series is to enable networking among R users, so make sure to attend the next conference instead of waiting for the videos to be uploaded :)
Thanks to our awesome speakers, almost all conference talk and workshop materials, including the slides, are now available. You can also rewatch the whole conference in the archive of the live stream; the HD versions of the talks will be uploaded here soon as well -- we are working on the final cuts right now.
We had a fantastic, although pretty crowded and busy, conference last week! A quick retrospective with a few photos was just published on the R Consortium blog. Further photos will be uploaded here soon as well; until then, please check the pictures posted on Twitter.
Only a few days left until the conference \o/ Please find some updated information below on the venue, weather, local policies etc. and the social events. Some further exciting news: all talks between 10am and 8pm will be live streamed for those who cannot make it in person. That said, we are looking forward to meeting you soon!
We originally expected 150 attendees, but having reached that milestone a few weeks ago, we decided to look for extra funding to let roughly 20% more people sign up. Now we are officially sold out with 180 attendees and cannot issue any further tickets.
We are extremely excited to publish the final list of talks and the preliminary schedule of the conference. As this is a community-driven conference, we are looking forward to your feedback!
Our gold sponsor provided an interesting dataset on the flights to and from BUD between 2007 and 2012, to be used in our first Data Visualization Challenge. Submissions are due by Aug 31 2016, and you can apply with a single plot, a full-blown dashboard or any other visualization project to win valuable prizes. The dataset was last updated on Aug 20.
We are extremely grateful to all our sponsors: their financial support (paying for 3/4 of the overall costs) and commitment were essential to bring this event to life!
The early bird period closed with great success: ~90 percent of the originally planned 150 tickets have been sold.
Although we already have an impressive number of exciting workshop, regular/lightning talk and poster submissions, we are extending the Call for Papers deadline by a week (to July 10 2016) due to the numerous related requests.
The registration form is now open with extremely affordable early-bird tickets until July 15.
The abstract submission form is now open until July 3 -- please submit your proposals on workshops, regular or lightning talks and posters.
We are extremely happy to announce the first two confirmed speakers of the conference: Gabor Csardi and Jeroen Ooms will give the keynote talks on trending and important R topics at the first satRday event.
The satRdays are SQLSaturday-inspired, community-led, one-day, regional and very affordable conferences around the world to support collaboration, networking and innovation within the R community.
The first satRday conference will be held in Budapest, Hungary with support from the founders and organizers of the Budapest Users of R Network and financial support from the R Consortium and below sponsors.
Our main goal with this conference is to
If you want to get in touch with the local organizing committee, please feel free to mail us:
We thank all our generous sponsors for supporting this conference -- their financial help and great commitment to the R community are highly appreciated and were essential to bring this event to life! Please find below the list of our partners per sponsorship level, and we kindly ask you to visit their homepages to get some quick insights into what they are doing and how they use R:
Please find below the most important milestones of the conference based on the preliminary schedule:
|Workshop Submissions Deadline|
|Abstract Submissions Deadline|
|Notification of Acceptance||2016-07-13|
|Early-Bird Registration Deadline||2016-07-15|
|Dashboard Competition Project Submission Deadline||2016-08-31|
To minimize the financial barriers to attending this satRday event, we decided to keep the registration fees as low as possible. We are very happy to announce the fee structure below, which should be affordable even for students and other interested parties paying for the registration on their own:
|Early bird registration (until July 15 2016)||3,000 HUF (<10 EUR)||3,000 HUF (<10 EUR)||6,000 HUF (<20 EUR)|
|Standard registration (until Aug 27 2016)||5,000 HUF (~16 EUR)||5,000 HUF (~16 EUR)||10,000 HUF (~30 EUR)|
|Late and on-site registration||Not available (sold out).|
Registering for the event and purchasing a ticket entitles you to attend a workshop in the morning, all conference talks, lunch at noon and two coffee breaks, with no hidden costs. VAT included. You can pay by PayPal (including easy payment options with credit/debit card and wire transfer), but please get in touch if you need any special assistance with the payment, invoice etc.
Why would you wait any longer? Register for the event today -- the number of available spots is limited! Update: sorry folks, this event is sold out.
We are extremely happy to announce that we will have two fantastic keynote speakers at the conference -- both are very prolific R package developers, with almost 70 packages published between the two of them on CRAN:
Please find below the preliminary list of speakers, which will be updated on a regular basis as we sort out some of the logistics:
There is no separate fee for attending workshops; you just need to register for the conference. We encourage everyone to attend so that we take full advantage of our capacity. The list of planned 1.5-hour workshops/tutorials:
|A systematic approach to data cleaning with R||Mark van der Loo|
There are two data-cleaning related factoids that are quoted often on the web: 80% of the data is unstructured, and about the same fraction of time is spent on data cleaning. Although there is little systematic research to substantiate these claims, most data analysts will agree that getting data 'clean' for analyses is both essential and time-consuming.
In this 90-minute tutorial we develop a systematic view on data cleaning. Using the concept of a 'statistical value chain' as a starting point, we will see that data cleaning has a natural place in a statistical analysis and that data cleaning can be thought of as a two-step process. In the first step one ensures the correct technical representation of a data set (variable type, text encoding, identifiability of a value, etc.). The second step is about ensuring that data meets expectations from domain knowledge (e.g. ages must be positive, an under-aged person cannot have an income from work, etc.).
In this tutorial we will demonstrate, using practical examples, several tools and R packages that can be used to solve both technical and content-related issues. Topics include but are not limited to string processing and approximate text matching, date/time conversion and knowledge-based data validation and correction. We also touch upon how to measure and visualize data quality and the effects of data cleaning on statistics.
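The two-step process above can be sketched in a few lines of base R. The data, reference list and column names below are invented for illustration; the tutorial's own examples will differ:

```r
# Hypothetical raw data with both technical and content-related issues
raw <- data.frame(
  age  = c(" 34", "151", "-2", "28"),                       # numbers stored as text
  city = c("Budapest", "Budapst", "Pecs", "Buda Pest"),     # misspelled city names
  stringsAsFactors = FALSE
)

# Step 1: fix the technical representation (whitespace, variable type)
raw$age <- as.numeric(trimws(raw$age))

# Step 2: enforce domain knowledge (ages must lie in a plausible range)
valid_age <- !is.na(raw$age) & raw$age >= 0 & raw$age <= 120
raw$age[!valid_age] <- NA

# Approximate text matching against a reference list of city names
cities <- c("Budapest", "Pecs", "Szeged")
match_city <- function(x) {
  hit <- agrep(x, cities, max.distance = 0.2, value = TRUE)
  if (length(hit) == 1) hit else NA_character_
}
raw$city <- vapply(raw$city, match_city, character(1), USE.NAMES = FALSE)
```

Base R's `agrep()` does fuzzy matching out of the box; the workshop will cover more capable tools (e.g. dedicated string-distance and validation packages) for the same two steps.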
|Scalable Machine Learning with H2O||Jo-fai (Joe) Chow|
The focus of this tutorial is scalable machine learning using the H2O R package. H2O is an open source, distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java, however, fully-featured APIs are available in R, Python, Scala, REST/JSON, and also through a web interface.
Since H2O's algorithm implementations are distributed, this allows the software to scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of Generalized Linear Models, Gradient Boosting Machines, Random Forest, Deep Neural Nets, dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), anomaly detection methods, among others. The ability to create stacked ensembles, or "Super Learners", from a collection of supervised base learners is provided via the h2oEnsemble R package.
R scripts/notebook with H2O machine learning code examples will be demoed live and made available on GitHub for attendees to follow along on their laptops.
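For a taste of what the session will feel like, a minimal H2O session in R looks roughly like the following. This is only a sketch: it assumes the h2o package and a local Java runtime are installed, uses the built-in iris data rather than the workshop's datasets, and the actual materials will be on GitHub:

```r
library(h2o)

# Start (or connect to) a local H2O cluster; -1 uses all available cores
h2o.init(nthreads = -1)

# Move an R data frame into the H2O cluster
iris_hf <- as.h2o(iris)

# Train a gradient boosting machine on the distributed frame
fit <- h2o.gbm(
  x = 1:4,                   # predictor columns
  y = "Species",             # response column
  training_frame = iris_hf,
  ntrees = 50
)

# Inspect model performance
h2o.performance(fit)
```

The same script runs unchanged whether `h2o.init()` connects to your laptop or to a multi-node cluster, which is the point of the "scalable" in the title.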
|Data manipulation the #rdatatable way||Arun Srinivasan|
The data.table R package provides fast and memory-efficient ways of data manipulation together with a flexible and consistent syntax. It was first released to CRAN in 2006 and has had over 30 releases since then, the current stable release being v1.9.6. Over 240 CRAN and Bioconductor packages now import/depend on data.table. Its StackOverflow tag has attracted >4000 questions from users in many fields, making it one of the top 3 most asked-about R packages. It is the 8th most starred R package on GitHub.
In this tutorial, we will learn data.table by doing, i.e., by looking at commonly occurring data manipulation questions (based on the StackOverflow R tag) and using data.table's syntax and features to solve them. Depending on time availability, we might compare/contrast them to base R and other packages.
Briefly, we will look at problems that cover the following features:
Familiarity with base R is essential; familiarity with SQL is advantageous but not essential. NOTE: it would be extremely advantageous for participants to go through the "Introduction to data.table" vignette completely. The link to the vignettes is provided at the bottom of this page.
R (preferably the latest version, for consistency) and the latest CRAN version of data.table installed.
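To illustrate the kind of syntax the tutorial builds on, here is a small example of the general `dt[i, j, by]` form. The table and column names are made up for illustration:

```r
library(data.table)

# A toy table with made-up flight-delay data
dt <- data.table(
  carrier = c("AA", "AA", "BA", "BA", "LH"),
  dest    = c("BUD", "JFK", "BUD", "LHR", "BUD"),
  delay   = c(10, -5, 3, 22, 0)
)

# dt[i, j, by]: filter rows in i, compute in j, group with by -- one call
avg_delay <- dt[dest == "BUD", .(mean_delay = mean(delay)), by = carrier]

# Add a column by reference with := (no copy of the table is made)
dt[, late := delay > 15]
```

The `:=` operator is one of the features that makes data.table memory-efficient: it modifies the table in place rather than creating a modified copy.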
|Intro to Shiny||Kate Ross-Smith|
An introductory workshop that takes delegates from little or no knowledge of Shiny to being able to create their own app and RStudio add-in.
We will focus on what you need to get started, and how to keep to good coding practice.
You do not need any previous knowledge or experience of Shiny.
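For reference, a complete Shiny app needs only a UI definition and a server function. A minimal sketch (the input/output names here are invented) looks like this:

```r
library(shiny)

# UI: one input control and one output placeholder
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 100, value = 50),
  plotOutput("scatter")
)

# Server: reacts to input changes and fills in the outputs
server <- function(input, output, session) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n))
  })
}

app <- shinyApp(ui, server)
# shiny::runApp(app)   # launches the app in a browser
```

Everything else in the workshop (reactivity, layout, add-ins) builds on this two-part structure.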
|Advanced Shiny dashboard topics||Herman Sontrop|
|Joy of ggplot2||Vincent Warmerdam|
ggplot2 is such a great plotting tool that people consider using R just for the library. In this session we will demonstrate why it is such a great tool for plotting.
We will emphasise the joy of working with ggplot2 by showing how its visualisations have the ability to surprise you. In this session we will start with a very basic analysis and conclude with the power of having a grammar for visualisation instead of a mere plotting library. All code will be available afterwards.
Visualisations have the power to surprise you during your day-to-day work. At the end of the session I will give examples of how proper use of ggplot2 during your analysis makes you a better analyst.
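A small example of what "grammar" means in practice: the same aesthetic mapping is reused across layers and facets. This sketch uses the built-in mtcars data; the session's own examples will differ:

```r
library(ggplot2)

# One mapping, several composable layers: points, a fitted line, facets
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Because each `+` adds an independent layer or specification, you can swap geoms, add facets or change scales without rewriting the rest of the plot -- this is what a plotting *grammar* buys you over one-shot plotting functions.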
Please note that no computers will be provided at the workshops, so we recommend bringing your own computer.
Early registration for the workshop attendees
Registration and coffee break
R packages and the infrastructure supporting them are key to R's great success. In this talk I will share episodes from their history, discuss their present status and show some of the ongoing developments, through my personal experiences.
I will try to convince you that much of R's success can be attributed to two aspects: robust open APIs and continuous integration. They are both vital for the adoption, or even survival, of any software product and historically R was much ahead of its time in both. I will show several examples for R's great APIs and testing infrastructure and also discuss some ideas for improving them.
R infrastructure session
Continuous integration, Docker, openCPU, shiny server, RStudio server, Microsoft R Server, R in Hadoop, your laptop, your grid computer ... There's a lot of infrastructure out there and someone has to configure it. This session whistle-stops through networking, Linux basics, and other key concepts to help get you up to speed because one day, the person configuring the server could be you.
R and Python are your secret poweR tools for data RockstaRdom. Together, these tools equip enterprises to deliver real, robust predictive analytics. We will look at how these PoweR tools can complement each other in practical situations.
This is an intensive, practical session with relatable examples to empower business intelligence professionals to move into the realm of predictive analytics, making the ability to deliver enterprise, actionable analytics accessible to everyone.
R is the most popular statistical programming language, partly because of its ability to handle a large - and still increasing - number of other tasks. The talk will not care about statistical analysis, but about automation: reading from and uploading to FTP and SFTP servers, reading and editing MySQL databases locally or through SSH, and platform-independent development. Tries, catches, and roads to (sub)optimal solutions.
I will present my experience in developing an automated fraud detection system that shares data with the insurance company through an FTP server. After downloading, it identifies customers using basic personal and company information. The next (relatively easy) step is computing different features used for scoring claims, then creating a conditionally formatted, multiple-sheet Excel file with data about the freshest claims, which is sent as an e-mail attachment to the insurance company. All results are uploaded to an FTP server for further processing by an interactive reporting system.
Be prepared for system calls, MySQL query assembling, graph creation, memory management and some data.table magic.
This session will show the usage of the R language in the SQL Server 2016 database system with its T-SQL support. We will cover applying simple to multivariate statistics to typical transactional database data, log data and monitoring data. Are you a data analyst, DBA or developer? Join and learn how to combine both worlds.
R packages session
Data manipulation operations such as subsets, joins, aggregations, updates etc. are all inherently related. By keeping these related operations together, data.table's syntax provides a powerful set of features and enables fast and memory-efficient data analysis. In this talk, I will discuss the philosophy behind data.table's syntax, showcase how its unique features allow for straightforward code, and highlight the new features that were recently implemented.
In most situations a statistical analyst has no or only limited control over the process that generates the raw data to be analyzed. Testing assumptions about raw data, intermediate and final results is therefore an essential part of any statistical analysis. Often, such domain knowledge can be expressed in short statements such as 'age must be non-negative' or 'if two persons live at the same postal code, they live in the same city'.
The 'validate' package makes it easy to formulate such assumptions, confront your data with these assumptions and subsequently to summarize or visualize the results in a transparent and reproducible way. In particular, 'validate' treats knowledge rules as first class citizens which means that they can be manipulated, documented, and stored and retrieved from file.
In this talk I will give an overview of the data validation infrastructure that the package offers, how it might be used, and I will highlight some upcoming developments.
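The workflow described above can be sketched in a few lines; the rules and data here are invented for illustration:

```r
library(validate)

# Domain knowledge expressed as rules -- first-class objects in 'validate'
rules <- validator(
  age >= 0,
  age <= 120,
  if (income > 0) age >= 16
)

# Hypothetical data containing two violations
d <- data.frame(age = c(34, -2, 15), income = c(30000, 0, 12000))

# Confront the data with the rules and summarize pass/fail per rule
cf <- confront(d, rules)
summary(cf)
```

Because the rules are objects rather than ad hoc `if` statements scattered through a script, they can be stored in a file, documented and reused across analyses, which is what makes the validation step reproducible.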
A data manipulation case study using the dplyr syntax, probably using data from the Bechdel Test website about the representation of women in the movie industry.
National Statistical Offices have started setting up web services to provide published information through data APIs. Even though international standards exist, e.g. SDMX, the majority of NSOs create their individual API and few use existing community standards.
nsoAPI is an attempt to create a single package with functions for each provider that convert a custom data format into an R standard time series format ready for analysis or further transformation.
https://www.gitbook.com/read/book/bowerth/opendata-tables lists tables that can be retrieved from SDMX (International Organizations, ABS: Australia, INEGI: Mexico, INSEE: France and ISTAT: Italy, NBB: Belgium), the pxweb package (PXNET2: Finland, SCB: Sweden) and the nsoAPI package (BEA: USA, CBS: the Netherlands, GENESIS: Germany, ONS: UK, SSB: Norway, STATAT: Austria, STATBANK: Denmark).
With the exception of France, large countries tend to set their own standards. The BEA (USA) and ONS (UK) require the user to create an ID that must be submitted with each request. GENESIS (Germany) requires the user to pay 500 Euros per year (250 Euros for academic users) to access the API.
The impact of relevant events on the stock market valuation of companies has been the subject of many studies. An event study is a statistical toolbox that allows one to examine the impact of certain events on firms' stock valuation. Given the rationality of market participants, the prices of securities immediately incorporate any relevant announcements, information and updates.
The idea of the event study is to compare the market valuation of the companies during periods related to an event and other (non-event linked) periods. If the behavior of stocks is significantly different in the event-period, then we conclude that an event produces an impact on the market valuation, otherwise we conclude that there is no effect.
The major stream of research is focused on the insurance industry and catastrophe events, therefore the cross-sectional dependence cannot be neglected. Furthermore, the returns are typically not normally distributed. These points lead to misspecification of the classical parametric tests and require validating the results with more tailored and accurate tests (both parametric and nonparametric).
In order to address all these issues, we developed the package estudy2 (planned to be submitted to CRAN by August 2016). First, estudy2 incorporates all technical aspects of the rate-of-return calculation (the core computation is done in C++ using Rcpp). The package also implements 3 traditional market models: mean-adjusted returns, market-adjusted returns and the single-index market model.
Finally, 6 parametric and 6 nonparametric tests of daily cross-sectional abnormal return have been implemented. In addition, the package contains the tests for cumulative abnormal returns (CAR).
In this talk we demonstrate an example from current research, namely the impact of major catastrophes on insurance firms' market valuation, in order to validate the specification of the tests.
Machine Learning session
In addition to the H2O hands-on workshop in the morning, I will give a brief overview of some common H2O machine learning use cases. I will also talk about some recent H2O developments (e.g. Sparkling Water 2.0, integration with Google’s TensorFlow, our new product Steam etc.).
Search strategies for new subatomic particles often depend on being able to efficiently discriminate between signal and background processes. Particle physics experiments are expensive, the competition between rival experiments is intense, and the stakes are high. This has led to increased interest in advanced statistical methods to extend the discovery reach of experiments. This talk will present a walk-through of the development of a prototype machine learning classifier for differentiating between decays of quarks and gluons at experiments like those at the Large Hadron Collider at CERN. The power to discriminate between these two types of particle would have a huge impact on many searches for new physics at CERN and beyond. I will discuss why I chose to perform this analysis in R and how switching to R has helped my work.
Analyzing time series data is imminent in econometrics, weather forecasting, speech analysis, biosignal processing, and several other disciplines. Time series analysis consists of multiple techniques for data cleaning and pattern recognition. In this talk, I will present hidden Markov modelling (HMM) — a common unsupervised learning method — and show how it can facilitate the discovery of meaningful patterns in time series data. Hidden emotional and cognitive states were explored using the depmixS4 package in R (Visser & Speekenbrink, 2016, doi: 10.18637/jss.v036.i07) in a dataset of multimodal (EEG, GSR, HR) recordings while participants watched emotional videos. Using hidden Markov modelling we were able to identify emotional states and their dynamic transitions. We will discuss the issues of validation, reliability, and limitations of this approach and its implementation.
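As a rough sketch of the approach, here is a two-state Gaussian HMM fitted with depmixS4 on simulated data. The simulated signal stands in for one physiological channel; the study itself used multimodal recordings:

```r
library(depmixS4)

set.seed(1)
# Simulated one-dimensional signal with two hidden regimes
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 3))
d <- data.frame(signal = y)

# Two-state Gaussian hidden Markov model
mod <- depmix(signal ~ 1, data = d, nstates = 2)
fitted_mod <- fit(mod, verbose = FALSE)

# Most likely hidden state sequence
states <- posterior(fitted_mod)$state
```

The recovered state sequence is what gets interpreted as, e.g., emotional states and their transitions in the talk's application.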
With modern advances in computing and the increasing abundance of digital data, it is becoming both feasible and necessary to expand our data analysis methods beyond conventional mathematical modeling and data visualization. Music and other emerging data analysis paradigms present the opportunity to represent high-dimensional data in intuitive and accessible forms. In this talk I will introduce the concept of plotting data as music and demonstrate some approaches to live music synthesis that are available within the R ecosystem.
Political connections can have a profound influence on the success and profitability of firms. However, discovering these connections is difficult, since firms typically try to hide their political ties from the public. We link data on local and parliamentary elections to administrative data about firms to create features that can be indicative of political connections. Applying machine learning algorithms, we build classification models with the goal of identifying the political leaning of Hungarian firms at a large scale.
We use R for all parts of the project, starting from data wrangling through visualization to building and evaluating machine learning models. Packages used include data.table, dplyr, ggplot2 and caret.
R use-cases session
R provides an immense number of packages (which is not surprising, considering how awesome its community is); it supports nearly everything from effective data visualization through web scraping to the weirdest applications, like fetching your favourite XKCD comics. No wonder that the really dedicated users -- independently of the exact domain they work in -- hardly ever need to switch to other environments while doing analytics-related tasks. The talk will cover the personal Swiss Army knife toolset I use for infrastructure and business analytics. It will show:
Learning R is dangerous. It entices us in by presenting an incredibly powerful tool to solve our particular problem; for free! And as we learn how to do that, we uncover more things that make our solution even better. But then we start to look around our organisation or institution and see how it could make everyone's lives better too. And that's the dangerous part; R's got us hooked and we can't give up the belief that everyone else should be using this, right now.

Even though R is free, open source software, there are often barriers to introducing it organisation-wide. These can stem from IT or quality policies, the need for management buy-in, or perceptions about learning the language. This presentation will first discuss how to understand these barriers to entry and the different ways of resolving them. It will then use three projects to show how, by understanding the requirements of the organisation and developing situation-specific roll-out strategies, these barriers can be overcome.

The first example is a large organisation that wanted to quickly (within 6 weeks) show management how Shiny could improve information dissemination. As server policies made a proof of concept difficult to run internally, this project used a cloud-hosted environment for R, Shiny and a source database. The second example concerns two SMEs who required access to a validated version of R, which was provided via the Amazon and Azure marketplaces. The key aspect of these projects is the value to IT departments of being able to distribute a pre-configured machine around the organisation.
FRISS is a Dutch, fast-growing company with a 100% focus on fraud, risk and compliance for non-life insurance companies. FRISS offers insurance companies a fully automated screening platform that enables insurers to make better and faster decisions via quantified risk assessments on persons, companies and objects. Within FRISS, R is used to create products that allow you to visualize and analyze data coming from the FRISS platform. In this session we show various advanced R Shiny applications, including interactive dashboards with many custom output bindings, an interactive, dynamic help system and a modular design, a Neo4j-based network application with a force-based layout engine, and a reports application based on HTML templates.
R is (for some of us at least) easier to use than to learn. It can also be difficult to keep up with new packages and other developments. R User Groups can help with both these challenges, as well as help members develop other skills (presentation, organisation, and networking), yet there is little administrative or other support (templates, suggestions, or policies) available for new and established groups. This lightning talk introduces the idea of a RUG Toolbox: a collection of templates and case studies, contributed by R users and R user groups, collated and made available on GitHub, which can help groups start, plan, develop and join R networks locally and further afield.
Collecting Pokémon data from websites using rvest
I use the software PHREEQC for modeling rock-water interactions. The output of these geochemical models is several types of complex data tables, depending on whether I run thermodynamic mixing, kinetic or reactive transport models. With the increasing number of model runs, automation of figure production and easy visualization of output sensitivity to model parameters became necessary, as well as presenting the reactive transport of water in rock not only with figures but also with GIF animations. For these, I use the R packages dplyr, reshape2, ggplot2, animation and graphics. I present code, the resulting figures and GIF animations of chemical changes in water while it flows through cells of a defined rock, as well as the indicated mineral dissolution and precipitation processes in space and time. I also mention the R packages phreeqc and ReacTran to better present the capability of R in the field of geochemistry.
Data Visualization Challenge with pizza on the side
Our gold sponsor, the BI Consulting team, provides an impressive dataset, coming from the Hungarian Central Statistical Office, on the flights to and from Budapest Ferenc Liszt International Airport between 2007 and 2012 for interested parties to do exploratory data analysis projects in the form of a data visualization competition.
Guidelines for participating in the contest:
The submissions will be reviewed by a committee of industry BI experts and R developers, nominated by the BI Consulting team. Their task will be to filter the applications to a reasonable number of projects to be shown in a live session at the conference, where the attendees will vote for the best submission. The winner(s) of the challenge will receive a certificate, valuable prizes (e.g. a Budapest BI Forum conference ticket) and the honour of being the best damn data visualization wizards of the first satRday conference in the world!
You can download the dataset here (updated: Aug 20 2016). Yes, it's an Excel file with a space in the file name. If you have any questions about the challenge, Bence (email@example.com) will be happy to help.
Answers to some frequently asked questions:
May the ETL, modeling and dataviz gods be with you!
The conference will take place in the Research Centre for Natural Sciences of the Hungarian Academy of Sciences, a modern building in Infopark, Budapest -- located on the bank of the Danube.
The venue features a large room for the plenary talks with up to 250 attendees, which can be split into two smaller rooms for the parallel sessions. Besides this main room, we will have access to up to 6 workshop rooms as well.
Please stay tuned for more information once we have confirmed the spot and had some time to do some more content management on this site.
At least one of the conference organizers considers September his favourite month weather-wise: The average daytime high temperature is a comfortable 24°C, while the average nighttime low temperature drops to around 15°C. Heavy rains are also uncommon in this period. For the most recent forecast, click here.
Parking is free in the neighbourhood of the conference venue, and usually there are plenty of free parking spots on Saturdays. However, Budapest has a dense network of public transport lines (even at night), so you might consider leaving your car behind and taking the tram instead. Sporty visitors may choose BUBI, the public bike-sharing system.
Smoking is banned in all enclosed public places, on public transport, and in all workplaces in Hungary. Regarding the conference venue, there is a small smoking area outside the building (at the main entrance).
Entering the conference venue with any open or visible alcohol container is strictly prohibited. The consumption of alcoholic beverages is forbidden in the conference building except for beverages served as part of the official catering.
This conference is dedicated to providing a harassment-free conference experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of conference participants in any form. Harassment includes offensive verbal comments related to gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, religion, sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, sustained disruption of talks or other events, inappropriate physical contact, and unwelcome sexual attention. Sexual language or imagery is not appropriate for talks, exhibitors’ displays, or social and dining events. Violators may be sanctioned, including being expelled from the conference without a refund, at the complete discretion of the Conference Organizing Committee.
Budapest offers many different accommodation options for backpackers, business travellers, tourists etc.; please feel free to browse hotels, youth hostels, guesthouses and so on, e.g. on booking.com, szallas.hu or Airbnb.
Update: we previously recommended Visit Kollégium for budget accommodation not too far from the conference venue at discounted prices, but they turned out to be extremely unprofessional and a real PITA, as they basically cancelled all our reservations out of the blue, so we suggest everyone look for other options.
You can reach Budapest in a variety of different ways:
Please click on the references above to see the timetables, plan a trip, and check the related Google Maps directions on how to reach the conference building from these locations.
Cabs are called "taxi" in Hungary and have a standard fare of:
You can also use Uber, and public transportation is also pretty good and much more affordable -- e.g. you can take bus 200E from the Airport to Kőbánya-Kispest, then the Metro (underground) to Corvin-negyed and tram 4 or 6 (3 stops that you can easily do on foot as well) for around 3-5 USD overall, usually in less than an hour.
The abstract submission form is now closed (since July 10).
Why should you consider giving a talk?
Please feel free to submit one or more proposals in English at the above URL in the following presentation formats:
Presenters of workshops are advised that each workshop room is equipped with an internet connection (Wi-Fi) and a large LCD screen (HD or FullHD resolution) with an HDMI connector. Presenters must bring their own laptops.
Presenters are advised that each session room is equipped with:
If possible, please bring a copy of your presentation on a USB stick and upload it to the PC in advance of the session. Use your own laptop only if it is really required. An IT/AV technician and conference assistants will be available on-site.