
retherford.org

Application Presence

I’ve been thinking a lot about distributed applications lately, in particular those that make use of distributed data stores and XMPP.

What began as an experiment with application log monitoring and retention using log4j and XMPP has grown into an idea for presence at the application or service level.

What if your internet service or application used XMPP to report changes in “mood” or, more specifically, changes in service level? What if every application had an embedded SLA monitoring and reporting component?

Let’s say that you run an internet-scale service such as del.icio.us or Flickr, and every node communicates its service level to a federated XMPP service. Your ops team could subscribe to specific or aggregated service levels, receive notifications when SLA thresholds are crossed, and at the same time have a real-time presence or “pulse” for the service.

You could define the service level, or “mood”, of the application at specific granular levels and use simple filtering of log messages to set and project that “mood”.

Storage service level: 100% (no outage in the last 10 mins.)
User service level: 99.9% (5 dropped in the last 10 mins.)


Overall service level: 99.99% (within advertised SLA range)
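To make the log-filtering idea a bit more concrete, here is a minimal sketch in Python (the experiment above used log4j, so treat this as an illustration and not the original code). Everything named here is hypothetical: `ServiceLevelWindow`, `MoodHandler`, `mood`, and the assumption that any log record at ERROR or above counts as a dropped request. Publishing the result as XMPP presence is left out.

```python
import logging
import time
from collections import deque


class ServiceLevelWindow:
    """Tracks ok/failed events over a sliding time window (default 10 minutes)."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, ok, now=None):
        now = time.time() if now is None else now
        self.events.append((now, ok))

    def level(self, now=None):
        """Return the percentage of ok events inside the window."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return 100.0  # assumption: no traffic means no outage
        ok_count = sum(1 for _, ok in self.events if ok)
        return 100.0 * ok_count / len(self.events)


class MoodHandler(logging.Handler):
    """Log filter: ERROR and above count as failures, everything else as ok."""

    def __init__(self, window):
        super().__init__()
        self.window = window

    def emit(self, record):
        self.window.record(ok=record.levelno < logging.ERROR, now=record.created)


def mood(level, sla_threshold=99.9):
    """Map a service level to a coarse 'mood' string (threshold is an assumption)."""
    return "within SLA" if level >= sla_threshold else "below SLA"
```

Attaching `MoodHandler` to an existing logger would let each node derive its own service level from the log stream it already produces; the string returned by `mood` is what a node might then project as its presence status.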


2 Comments »

  ChrisHardie wrote @ September 4th, 2008 at 6:49 am

I like this idea. While we’re far from having XMPP integration at the app level, we’ve really enjoyed having real-time notifications of changes in service status, triggered from all over our network, as a part of our internal use of IRC channels. As a team, we can identify via the chat who will be looking into a given issue, and can quickly report initial findings, etc. It’s reduced the amount of overhead that goes with maintaining complex systems, which is great. Unfortunately, we haven’t yet gotten to the point where it completely replaces the complementary e-mail messages and pager alerts for the same events, so we still have a ways to go…

  paulr wrote @ September 10th, 2008 at 7:02 am

Chris, that sounds like a good mix of real-time, near real-time and highly reliable notification. I like that the ops team members (who themselves may be distributed) coordinate their response to the notification using the same infrastructure. Two ideas I didn’t mention in the post are 1) log retention/search and 2) two-way communication between the recipient and the app. Of course 2) is harder and requires instrumenting the app to respond to service level requests. An easy example would be changing log levels. A more complicated one might be coordinating the fail-over from one node to another.
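For the easy case of changing log levels, the app-side handler could be sketched roughly as below. This is only an illustration in Python: the `loglevel <NAME>` command format and the `handle_command` function are made up for this example, and receiving the message body over XMPP is assumed to happen elsewhere.

```python
import logging

# Level names the hypothetical command accepts.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def handle_command(body, logger):
    """Apply a text command such as 'loglevel DEBUG' and return a reply string."""
    parts = body.strip().split()
    if len(parts) == 2 and parts[0].lower() == "loglevel":
        name = parts[1].upper()
        if name in LEVELS:
            logger.setLevel(LEVELS[name])
            return "log level set to " + name
        return "unknown log level: " + parts[1]
    return "unrecognized command"
```

The reply string would go back to whoever sent the message, so the same chat session carries both the request and the confirmation.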
