While the idea of pattern recognition for the Air Quality Egg is gaining attention and people willing to help, I thought I would use existing feeds on Pachube to see whether any patterns are easily recognizable through standard techniques.
Searching for ‘Air Quality Egg’ feeds on Pachube, I found two that appear to be prototypes and are constantly reporting sensor values:
So initially, I used the Pachube API to retrieve past values of the AirQuality, CO, NO2, temperature and humidity datastreams. The values were taken from various days between 16/04 and 25/04, during various time slots within each day (at 1hr intervals). About 900 datastream entries were collected.
Then I used the WEKA data mining tool to make a rough analysis using K-Means clustering. I used 3 clusters as input (corresponding to potential air quality levels such as good, medium, bad). Even without knowing what the Air Quality datastream actually measures (and what its values mean), clustering appears to have done a good job of identifying the clusters (with respect to Air Quality) and of visualizing the association with the other sensor data (humidity, temperature, NO2 and CO).
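For readers who want to try the same idea outside WEKA, here is a minimal K-Means sketch in plain Python. It is not what WEKA does internally: the farthest-point seeding, the min-max scaling and all function names are my own assumptions for illustration, and the rows are made-up (air_quality, humidity) pairs, not readings from the feeds.

```python
def min_max_scale(rows):
    """Rescale every column to the range 0..1 so no sensor dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start with the first point,
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k=3, max_iter=100):
    """Plain K-Means: alternate assignment and centroid update until labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Made-up (air_quality, humidity) rows for illustration only.
rows = [
    (20, 80), (22, 78), (25, 82),
    (60, 55), (62, 50), (58, 52),
    (95, 30), (98, 28), (93, 33),
]
centroids, labels = kmeans(min_max_scale(rows), k=3)
print(labels)
```

Each entry in `labels` is the cluster index of the corresponding row. The indices themselves carry no meaning; whether a cluster reads as good, medium or bad still has to be interpreted by looking at its centroid.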
The first picture depicts the association between Air Quality (X-axis) and CO (Y-axis). The clusters are clearly separated into what could be interpreted as low, medium and very good air quality. A first assumption from this graph is that CO levels within the collected range (0-25) do not seem to have much effect on the air quality sensor readings.
The second image visualizes the correlation between air quality and humidity. Again the 3 different types of AQ are easily distinguishable, but in this case humidity significantly affects the AQ sensor readings; high humidity goes with low AQ readings.
The NO2 levels in this image also do not seem to affect the AQ much. Temperature also seems to have no impact:
So far, this initial analysis has demonstrated that:
a) Air Quality at the two selected sites can be grouped into 3 distinct clusters
b) Only humidity levels seem to affect the Air Quality sensor readings.
Probably the measured range of NO2 and CO levels is too small to play a significant role in AQ. Hopefully, when Eggs arrive at their owners and users start generating more data, the analysis will show more interesting results.
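One way to put a rough number on observations like these, rather than reading them off a scatter plot, is a Pearson correlation between the AQ readings and each of the other sensors. The sketch below is only an illustration: the `pearson` helper is my own, and the three short lists are made-up values, not data from the feeds.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists.
    Returns 0.0 if either list is constant (no variation to correlate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Made-up readings for illustration only.
humidity = [30, 40, 50, 60, 70, 80]
air_quality = [95, 90, 70, 62, 40, 25]
co = [12, 3, 20, 8, 15, 5]

print("humidity vs AQ:", round(pearson(humidity, air_quality), 2))
print("CO vs AQ:", round(pearson(co, air_quality), 2))
```

A value close to -1 or +1 suggests a strong linear association and a value near 0 suggests little or none. Pearson only captures linear relationships, so it complements the cluster plots rather than replacing them.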
In the meantime I plan to make this an online service where users can enter feed ids and the data will be automatically clustered and visualized. Also, based on the clusters, an initial training model can be built so that the service can assign new feed data to one of the cluster categories.
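The simplest version of such a model I can think of is a nearest-centroid lookup: keep the cluster centres from the clustering run and assign each new reading to the closest one. The sketch below assumes that approach; the centroid values, cluster names and the `nearest_cluster` function are hypothetical placeholders, not output from the WEKA run.

```python
def nearest_cluster(reading, centroids):
    """Return the name of the centroid closest (squared Euclidean distance) to the reading."""
    return min(
        centroids,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(reading, centroids[name])),
    )


# Hypothetical (air_quality, humidity) cluster centres, for illustration only.
centroids = {
    "cluster_0": (22.0, 80.0),
    "cluster_1": (60.0, 52.0),
    "cluster_2": (95.0, 30.0),
}

new_reading = (90, 35)  # a made-up reading from a new feed
print(nearest_cluster(new_reading, centroids))
```

If the clustering was run on scaled data, new readings would need the same scaling before the lookup, otherwise the sensor with the largest numeric range dominates the distance.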
Any volunteers to help with J2EE and the web front end?