Blog

Recent news, case studies and technology insights from the CodiLime team

Running distributed TensorFlow on Slurm clusters

In this post, we provide an example of how to run a TensorFlow experiment on a Slurm cluster. Since TensorFlow doesn’t yet officially support this task, we developed a simple Python module for automating the configuration. It parses the environment variables set by...

Microsoft Windows Nano mass deployment using VMware

Last year Microsoft released Windows Server 2016, which included Nano Server. Nano is the answer to the market need for a lightweight system. In other words, Nano Server is a very fast and powerful tool for remote administration of Windows Servers. How Nano can be...

Machine learning application in automated reasoning

It all started with mathematics – rigorous thinking, science, technology. Today’s world is maths‑driven. Despite recent advances in deep learning, the way mathematics is done today is still much the same as it was 100 years ago. Isn’t it time for a change?

Region of interest pooling in TensorFlow – example

In the previous post we explained what region of interest pooling (RoI pooling for short) is. In this one, we present an example of applying RoI pooling in TensorFlow. We base it on our custom RoI pooling TensorFlow operation. We also use Neptune as a support...

Training XGBoost with R and Neptune

In this blog post we present the R library for Neptune – the DevOps platform for data scientists. Neptune’s R extension is presented by demonstrating the powerful XGBoost library and a bank marketing dataset (available at the UCI Machine Learning Repository). The goal...

Region of interest pooling explained

In this post, we’re going to say a few words about an interesting neural network layer called region of interest pooling (also known as RoI pooling), the implementation of which we’ve recently open-sourced (you can find it here:...

Machine Learning for Applications in Manufacturing

Modern manufacturing technology is starting to incorporate machine learning throughout the production process. Predictive algorithms are being used to plan machine maintenance adaptively rather than on a fixed schedule. Meanwhile, quality control is becoming more and...

GeoJson Operations in Apache Spark with Seahorse SDK

A few days ago we released Seahorse 1.4, an enhanced version of our machine learning, Big Data manipulation and data visualization product. This release also comes with an SDK – a Scala toolkit for creating new custom operations to be used in Seahorse. As a showcase,...

Scheduling Spark jobs in Seahorse

In the latest Seahorse release we introduced the scheduling of Spark jobs. We will show you how to use it to regularly collect data and send reports generated from that data via email. Let’s say that we have a local weather station and the data...

An internal validation leaderboard in Neptune

Internal validation is a useful tool for comparing the results of experiments performed by team members in any business or research task. It can also be a valuable complement to the public leaderboards attached to machine learning competitions on platforms like Kaggle. In...

Neptune 1.3 with TensorFlow integration and experiments in Docker

We’re happy to announce that a new version of Neptune became available this month. The latest 1.3 release of deepsense.io’s machine learning platform introduces powerful new features and improvements. This release’s key added features are: integration with TensorFlow and running Neptune experiments in Docker containers.

Machine Learning Models Predicting Dangerous Seismic Events

Underground mining poses a number of threats, including fires, methane outbreaks, and seismic tremors and bumps. An automatic system for predicting and alerting against such dangerous events is of utmost importance – and also a great challenge for data scientists and their machine learning models. This was the inspiration for the organizers of the AAIA’16 Data Mining Challenge: Predicting Dangerous Seismic Events in Active Coal Mines.

Playing Atari games using RAM state

In 2013 the DeepMind team invented an algorithm called deep Q-learning. It learns to play Atari 2600 games using only the input from the screen. Following a call by OpenAI, we adapted this method to deal with a situation where the playing agent is given not the screen, but rather the RAM state of the Atari machine. Our work was accepted to the Computer Games Workshop accompanying the IJCAI 2016 conference. This post describes the original DQN method and the changes we made to it. You can re-create our experiments using publicly available code.

Euro 2016 Predictions Using Team Rating Systems

The 2016 UEFA European Championship is about to kick off in a few hours in France, with 24 national teams looking to claim the title. In this post, we’ll explain how to use various football team rating systems to make Euro 2016 predictions.

US Baby Names – Data Visualization

A few days ago we released Seahorse 1.1, an enhanced version of our machine learning, Big Data manipulation and visualization product. Today, we will show you how the new version of Seahorse can be used for data mining and data visualization.

Improve Apache Spark aggregate performance with batching

Seahorse provides users with reports on their data at every step in the workflow. A user can view reports after each operation to review the intermediate results. In our reports we provide users with distributions for columns in the form of a histogram for continuous data, and a pie chart for categorical data.

Should I eat this mushroom?

A few days ago we released Seahorse 1.0, a visual platform for machine learning and Big Data manipulation, available to everyone for free! Today, we show you how to use Seahorse to solve a simple classification problem.

Cooperative data exploration

Living in a world of big data comes with a certain challenge: how to extract value from the ever-growing flow of information that comes our way. There are a lot of great tools that can help us, but they all require a lot of resources. So, how do we ease this demand for CPU and RAM? One way to do it is to share the data we are working on and the results of our computations with others.

Exploration of data from iPhone motion coprocessor (2)

Last week we downloaded data from a fitness tracker (the motion coprocessor in an iPhone) and loaded it into R. Then, with just a few lines of R code, we decomposed the data into a seasonal weekly component and the trend. Today we are going to see how to plot the number of steps...

Exploration of data from iPhone motion coprocessor

During the Christmas break I met my brother-in-law, who is an ultimate gadgeteer (an excellent trait for a brother). He told me that most iPhones have a built-in motion coprocessor and count steps by default. No need to turn anything on, it is working all the...

How to create a new geom for ggplot2

The new version of the ggplot2 package (v 2.0.0) will be available on CRAN in a few days. It has a very nice mechanism for adding new geoms and stats (more about it here). Let's create a new geom, geom_christmas_tree(), that will plot data with the use of Christmas...

Hack the Proton

I’ve prepared a short console-based, data-driven R game named “The Proton Game” or “Hack the Proton” (still cannot decide which name is better). The goal of a player is to play the hacker and infiltrate Slawomir Pietraszko’s account on a Proton server. To do this,...

R vs SAS vs SPSS

Such titles, in many cases, are just introductions to flame wars. But not on this blog. Today we are going to illustrate some subtle differences among three statistical packages: R, SAS and SPSS. Small differences, but sometimes even a very small difference may have large...

multidplyr: first impressions

Two days ago Hadley Wickham tweeted a link to an introduction to his new package, multidplyr. Basically, it’s a tool for taking advantage of many cores in dplyr operations. Let’s see how to play with it. What can you do with multidplyr? As it was described on GitHub...

Understanding Apache Spark’s Execution Model Using SparkListeners

When you execute an action on an RDD, Apache Spark runs a job that in turn triggers tasks using DAGScheduler and TaskScheduler, respectively. These are all low-level details that are often useful to understand when a simple transformation is no longer simple performance-wise and takes ages to complete.

There are a few ways to monitor Spark, and the web UI is the most obvious choice, but you will not regret learning about Scheduler Listeners.

Machine Learning for Greater Fire Scene Safety

The lives of brave firemen are threatened during dangerous emergency missions while they try to save other people and their property. In this post I would like to share my experiences and winning strategy for the AAIA’15 Data Mining Competition: Tagging Firefighter Activities at a Fire Scene, in which I took first place.

Data mining of the votes of Members of Parliament

The 7th term of the Sejm has come to an end. It would be nice to see how the Members of the Polish Parliament voted over these last 4 years! In total they took part in over 6000 votes. Did the representatives of the same clubs vote more similarly to each...

Do cats or dogs live longer?

Some time ago our herd grew by one guinea pig called Hugo. It turns out that the presence of a pet at home is a great pretext for discussing with children the concepts of randomness, distribution functions and distributions in general. And this is how it started:...

Statistician like a shoemaker

Children bring strange homework assignments home from school, such as the question: What is your dad’s job similar to? After several guesses (a cosmonaut, Formula 1 driver, firefighter) it turns out that the work performed by a statistician is very similar to the work...

Multilevel classification, Cohen kappa and Krippendorff alpha

I faced an interesting problem last week. Playing with data from The Cancer Genome Atlas (full genetic and clinical data for thousands of patients), I was building a classifier that predicts the type of cancer based on sets of genetic signatures. In the PANCAN33...

Biplots, correspondence analysis and ggplot2

I was looking for biplots created with the ggplot2 library (because they look good and are customisable). It turns out that there are some nice solutions for PCA (like sinhrks/ggfortify; kassambara/factoextra; vqv/ggbiplot; fawda123/ggord) but I could not find...

Diagnosing diabetic retinopathy with deep learning

What is the difference between these 2 images? The one on the left has no signs of diabetic retinopathy, while the other one has severe signs of it. If you are not a trained clinician, chances are you will find it quite hard to correctly identify the signs of this disease.

Circles and films from FilmWeb

I have recently been working on the visualization of genetic data. In that field, circles generated by the circlize library are a popular method of presentation. Turning a blind eye to the problem of reading information from circles I must say that the possibilities...

Transformations of variables, scales and coordinates in ggplot2

I am working on a short introduction to the Grammar of Graphics and its implementation in the ggplot2 package. The process of systematizing the elements of its syntax reveals various ‘spices’ of ggplot2, and today I will talk about one of them, namely the application of...

useR 2015 and htmlwidgets

I've been wondering if this year's useR conference foreshadowed some gigantic groundbreaking change in the world of R. The previous useR conference was a sort of catalyst for the dplyr package and the %>% operator. The profession (especially from California) had been...

useR2015, statistics education and data analysis

The R program has been developed for years as a tool for learning statistics and data analysis. It is perfectly suited for that purpose and it is employed as a teaching tool at more and more universities and, recently, private companies. The more places the program is used...

Sapkowski, Dukaj and the wikipediatrend package

Recently I tested quite a nice package for R: wikipediatrend (available on CRAN). With just a few lines of code, it can easily download and visualize daily Wikipedia page view statistics. Great package, so we are going to take a closer look. I’ve just finished Season...

archivist 1.5

Archivist is an R package for object management (storing, sharing, searching). I am going to present it at the useR conference next week (hope I can meet some of you in Aalborg). Below you will find the two coolest (imho) features implemented in version 1.5 of archivist....

Shiny, polls and interactive ggplot2

Today we will use ggplot2 to recreate the diagrams presenting support in voting intention polls conducted before presidential elections. The story behind them is interesting, so let’s see it again. Yesterday RStudio released a new version of shiny. Version 0.12 comes...

Data visualization vs. information management

Yesterday I gave a presentation titled ‘Data visualization vs. information management’. The core of the presentation was two examples, which I present below. The punch line comes down to a simple statement: It is not enough to present data graphically; its presentation must...

Mice, post hoc tests and diffograms

I’ve recently worked on an interesting problem. There are two types of mice. We select three animals of each type. We want to examine the effect of a given treatment on nerve cells, more specifically on their dendritic spines (small protrusions located on neurons). From...

The marathon of teams’ data analysis – wrap-up

The first team data analysis marathon took place last Saturday. Almost 60 participants turned up to take part in it (representing various levels of proficiency in the art of data analysis and different regions of Poland – most were from Warsaw but there were also...

The marathon of teams’ data analysis

In just four days’ time we are going to start a marathon of teams' data analysis. This time it’s a local Warsaw event, but next time? It’s up to us! Let us sum up what we know about that event. CodiLime (DeepSense) is a sponsor. Organizers include Smarter Poland and...

Colors of cars

Last week we tried to find out what color the cars with the highest engine power are. It turned out that black and black metallic are the most popular colors among the fastest cars. Yet engine power is not everything. We may still explore the relation between color and brand....

What color car is the fastest?

RECOMB 2015, a conference devoted to computational molecular biology (with emphasis on computational), came to an end yesterday. Many interesting papers were presented, yet this post was inspired by a conversation that I had the pleasure to have during the dinner break....

IMDB + ggvis, a happy couple

Two weeks ago we showed how to scrape data from the IMDB database with the rvest package. Last week we showed a shiny application that compares ratings from two selected groups of users. Today we are going to finish the IMDB trilogy. This time I am going to show...

You should not watch these movies with your wife / girl

Last week’s post showed how to download data on ratings of over 200 television series. The rating was broken down by gender and age of the user. The application presented below allows for selection of any two age/gender groups of users and comparison of their ratings...

R, rvest and web-harvesting

Data harvested from web pages is a source of interesting information. Pulling data used to require quite a lot of resilience and misshapen Perl scripts struggling with messy sources of web pages. Today’s web pages more and more frequently meet the...

Canonical discriminant analyses and HE plots

Last week we wrote about multidimensional linear models. We discussed a case in which a k-dimensional vector of the dependent variables is related to a grouping variable. We look at matrices E and H in order to find out whether there is any relationship (see the...

HE plots

GPS helps drivers avoid traffic jams, yet in more advanced uses it allows for fleet management or remote drone strikes. It is just the same with visualization. Bars and dots can be used to present a set of several means but there are also more advanced uses...

Spark + R = SparkR

Spark is winning more and more hearts. And no wonder: comments from different sources tell us about a significant speed-up (by an order of magnitude) in the analysis of big datasets. A well-developed system for caching objects in memory allows us to avoid torturing hard disks...

Pretty heat maps

Do you know where Kamil Stoch earned most of his points in the 2013/2014 season? Some time ago I came across the pheatmap package (see here) for R, which generates much nicer heat maps than the standard heatmap() function. This is why the package is named...
