One of the most important factors for the success of the IoT is the ability of systems, services and applications to perform data mining. Why is that? Well, I think that one of the key roles of the IoT is to drive smart interactions with users (such as automation and decision support). To do so, systems need to collect information about users and their context (using sensors and web resources), analyse and filter the data appropriately, and then either present the outcome to users or make smart decisions on their behalf.
Before discussing Data Mining and the IoT, let’s start with a short introduction to Data Mining. Data Mining is the process of identifying patterns in (usually) large data sets. To give you an example, consider an activity tracker that you carry all day long. Looking at the data the tracker collects, you see more activity during some evenings and on weekend mornings (because you go running at those times). You identify this pattern by correlating the activity score with time, comparing each value in the data with the others. In effect, you group the activity values into levels (medium, high, etc.) and then note the times at which each group of values occurs. This process is called data clustering. While you can easily figure out such patterns yourself, imagine having hundreds or thousands of data entries, and not just activity values and timestamps but also duration, weather conditions, calories consumed, etc. To deal with such problems, computer science has applied statistical methods and built tools that allow you to perform Data Mining and extract useful information from the data sets. Some of the most important applications of Data Mining are data anomaly detection, data clustering, data classification, feature selection, and time series prediction.
IoT and data anomaly detection
image courtesy: Bayes Server
Anomaly detection can be a great feature for IoT applications. Let’s take the activity tracker example again. This time, assume that you have set a monthly or weekly goal, like losing some weight or reaching a certain activity or calorie-burning level. In addition to monitoring your activity, the system is also able to determine your daily calorie consumption. Let’s assume again that you go jogging every Tuesday and Thursday afternoon as well as on weekend mornings. One Tuesday you skip your jog and your daily activity drops, while your calorie consumption remains the same. This is an anomaly for the system. If your tracking application used data mining techniques, it would be able to remind you the following day to be more active and not to skip your jog on Thursday (or worse, it could tell your friends that you have become lazy: an interesting application for social networks + activity tracking services!).
Again, this is a simple example. But consider that you track your parents’ activities (like how often they leave the house, when they come back, how much time they spend in a room, etc.) through motion sensors, together with their home environment conditions. If one day they go out shopping and for some reason they are late, or if someone spends too much time in the bathroom, an anomaly detection system could alert you automatically.
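To make the jogging example a bit more concrete, here is a minimal sketch of one simple way to flag such an anomaly: compare a new value with the history for the same weekday and check how many standard deviations away it is. The function name, the threshold of 2 and the activity scores are all assumptions made up for illustration; real trackers may use very different methods.

```python
from statistics import mean, stdev


def is_anomaly(history, value, threshold=2.0):
    """Return True if `value` lies more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical activity scores recorded on previous Tuesdays (jogging days).
tuesday_history = [820, 790, 845, 810, 805]

print(is_anomaly(tuesday_history, 800))  # a typical Tuesday
print(is_anomaly(tuesday_history, 310))  # the Tuesday without jogging
```

The same check could be applied to other measurements, such as the time a parent usually spends out shopping, as long as enough history exists to estimate what is normal.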
IoT and data clustering
image courtesy: Bayes Server
As mentioned in the activity tracker example, data clustering refers to grouping data based on specific features and their values. It is the most common form of unsupervised machine learning. It is called unsupervised because, unlike processes such as data classification, you do not need to ‘train’ the system first with labelled data (think of the early voice recognition systems, where you had to train the system by calling out specific words). Data clustering can instead be applied to a new data set without knowing much about it in advance (e.g., what kind of data it contains). The number of clusters is usually given as an input (e.g., the number of activity levels the motion data should be divided into), but there are also algorithms that can determine a suitable number of clusters automatically. Data clustering may not be used directly in IoT applications, but in many cases it can be an intermediate step for identifying patterns in the collected data.
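As an illustration, the sketch below groups activity scores into a given number of levels using k-means, one common clustering algorithm (chosen here for simplicity, not because it is the only option). The scores, the choice of three clusters and the way the starting centroids are picked are assumptions for this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Group one-dimensional values into k clusters.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to values[i].
    """
    ordered = sorted(values)
    # Spread the starting centroids evenly over the sorted data
    # (a simple deterministic choice for this sketch).
    step = max(k - 1, 1)
    centroids = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_centroids == centroids:
            break  # nothing changed, so the clustering has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical hourly activity scores from a tracker.
scores = [5, 8, 12, 40, 45, 50, 90, 95, 7, 42, 88]
centroids, labels = kmeans_1d(scores, k=3)
print(centroids)
print(labels)
```

Here the three clusters can be read as low, medium and high activity, although the algorithm itself knows nothing about those names.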
IoT and data classification
Data classification is used when the collected data can be associated with different classes. Think of the classes as groups that correspond to situations, for instance high or low activity in our example. Once you have some initial activity tracking data and you have already clustered it into high or low activity, you can use tools to build a trained model. Then you can use the model to assign new tracker values to low or high activity. Data classification is not prediction (discussed below). It is the categorisation of new values. In the activity tracking example, the device you carry senses motion using accelerometers and tilt sensors. Data classification can be one of the techniques used by the device vendor to associate the sensor values with steps, stair climbing or sleep status. After clustering the steps and stair climbing into low, medium or high activity, the application can use data classification again to determine your overall activity at the end of the day.
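The train-then-classify idea can be sketched with a nearest-centroid classifier, which is only one of many possible classification techniques and is not claimed to be what any device vendor actually uses. The features (daily steps, floors climbed), the labels and the sample values are assumptions for illustration. Note that the features are not rescaled, so the step count dominates the distance.

```python
from math import dist


def train(samples):
    """Build a model from labelled samples.

    samples: list of (feature_tuple, label).
    Returns a dict mapping each label to the centroid of its samples.
    """
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Return the label whose centroid is closest to `features`."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical labelled days: (daily steps, floors climbed) -> activity level.
training = [
    ((2000, 1), "low"),
    ((3500, 2), "low"),
    ((9000, 8), "high"),
    ((12000, 10), "high"),
]
model = train(training)

print(classify(model, (3000, 3)))
print(classify(model, (10500, 6)))
```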
IoT and feature selection
In data mining, the types of data used for pattern recognition are called ‘features’ (or ‘attributes’). For example, in the scenario with the activity tracker, features can be the activity level, the timestamp, the duration, the calories, etc. Most activity trackers are also able to track sleep levels (awake, asleep) and calculate a sleep quality index. These are all features too. So, let’s say that you collect all this data for some time and you also keep track of your daily sleep quality index. Feature selection is the process that allows you to identify the features that affect your sleep quality index the most. For example, it could be that activity does not affect your sleep much but daily calorie consumption does.
Feature selection is mostly used to reduce the dimensionality of data mining problems. After some initial experimenting, you could apply feature selection, identify the features that affect a specific problem the most, and then perform data classification, time series prediction or anomaly detection more easily (if you can reduce the number of features enough, you could even perform some basic data mining on an Arduino module!).
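One simple way to approach the sleep quality example is to rank each feature by the absolute value of its Pearson correlation with the target. This is a basic filter-style method, sketched here under the assumption that the relationships are roughly linear; the feature names and numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no information
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Sort feature names by |correlation| with the target, strongest first."""
    scores = {
        name: abs(pearson(values, target)) for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical daily records over six days.
sleep_quality = [80, 60, 75, 50, 85, 55]
features = {
    "activity": [5, 6, 4, 5, 6, 4],
    "calories": [1900, 2600, 2100, 2900, 1800, 2700],
}

ranking = rank_features(features, sleep_quality)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With this made-up data, calories rank well above activity, so a later classification or prediction step could keep only the calories feature.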
IoT and time series prediction
image courtesy: Bayes Server
Time series prediction provides exactly what its name says: an estimate of what future data will look like, based on a dataset that has already been collected and analysed. The most popular application of time series prediction is weather forecasting. In our example, time series prediction could work as follows: after collecting activity data for several days, the system can identify your daily activity patterns and associate them with the time and date. So it can predict that on Wednesdays your activity will remain low but on Tuesdays it should be high, and it can alert you if you are falling far behind your weekly target.
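A very basic version of this idea is to forecast a day's activity as the average of what was recorded on the same weekday in the past. The sketch below assumes a purely weekly pattern and uses invented values; real forecasting methods also account for trends, noise and other factors.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekday_profile(history):
    """history: list of (date, activity). Returns {weekday: mean activity}."""
    buckets = defaultdict(list)
    for day, activity in history:
        buckets[day.weekday()].append(activity)
    return {weekday: sum(v) / len(v) for weekday, v in buckets.items()}


def forecast(history, target_day):
    """Predict activity for target_day from the same weekday's past mean."""
    profile = weekday_profile(history)
    if target_day.weekday() in profile:
        return profile[target_day.weekday()]
    # No data for that weekday yet: fall back to the overall mean.
    return sum(activity for _, activity in history) / len(history)


# Hypothetical three weeks of data starting on Monday 2024-01-01.
# Jogging on Tuesday, Thursday and at the weekend gives higher scores.
weekly_pattern = [300, 800, 300, 800, 300, 700, 700]  # Monday..Sunday
weekly_offsets = [0, 20, -20]  # small week-to-week variation
start = date(2024, 1, 1)
history = [
    (start + timedelta(days=7 * week + i), base + weekly_offsets[week])
    for week in range(3)
    for i, base in enumerate(weekly_pattern)
]

print(forecast(history, date(2024, 1, 23)))  # a Tuesday
print(forecast(history, date(2024, 1, 24)))  # a Wednesday
```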
IoT and Big Data Mining
Big Data, in the case of IoT, is not only about lots of data generated continuously by sensors and other IoT-enabled devices. Big Data is also about data heterogeneity: collecting, analysing and correlating data from different sources! So, mining the data is not the biggest challenge here: the technologies and algorithms have existed for quite some time and have been used successfully in economics, meteorology, medicine, etc. For IoT systems to be able to process the data automatically and make proper recommendations or notifications, there have to be ways to combine data efficiently (e.g., activity levels with daily calorie consumption) and to annotate it properly. The latter means labelling the data in such a way that your pattern recognition system can tell that a specific sensor sends activity data while your mobile phone sends food calorie information. Semantics is the answer to this problem. Data standards and interoperability between devices and services are the answer to optimizing data collection and making the data available for processing and data mining.
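To illustrate what annotation buys you, here is a small sketch in which every reading carries labels saying where it came from, what it measures and in which unit, so that data from different devices can be combined per day. The field names and values are hypothetical and do not follow any particular semantic standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    source: str   # hypothetical device identifier
    kind: str     # what is measured, e.g. "activity" or "calories"
    unit: str     # unit of the value
    day: str      # ISO date of the reading
    value: float


def combine_by_day(readings):
    """Sum readings per day and kind: {day: {kind: total}}."""
    combined = defaultdict(dict)
    for reading in readings:
        per_day = combined[reading.day]
        per_day[reading.kind] = per_day.get(reading.kind, 0.0) + reading.value
    return dict(combined)


# Hypothetical annotated readings from two different devices.
readings = [
    Reading("wrist-tracker", "activity", "score", "2024-01-02", 450.0),
    Reading("wrist-tracker", "activity", "score", "2024-01-02", 350.0),
    Reading("phone-food-app", "calories", "kcal", "2024-01-02", 2100.0),
    Reading("phone-food-app", "calories", "kcal", "2024-01-03", 2600.0),
]

daily = combine_by_day(readings)
print(daily["2024-01-02"])
```

Because each reading states its kind, the combining step does not need to know in advance which device produces which type of data.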