Thursday, June 21, 2012

Who Watches the Watchmen?

                               By Dave Edstrom
                         For the June 20th, 2012 IMTS Insider

"Quis custodiet ipsos custodes?" roughly translates to "Who will guard the guards?" As manufacturing becomes more and more about software, the importance of monitoring your computer networks will increase significantly. By computer networks I am referring to all of your systems that are not specifically manufacturing equipment. Knowing with certainty that your software is up and running well becomes imperative. Typically this is done with monitoring software. That raises some obvious, and some not-so-obvious, questions that are worth exploring.
Back in the late 1980s there was a huge rush of companies to come out with network monitoring programs. Companies realized they were investing significant sums in their servers, PCs, Macs, workstations, routers, bridges and networks, so they needed to monitor and maintain these expensive and important resources. Sun Microsystems was an early leader with a product called SunNet Manager. It was a great product and showed really well with all of its graphics and the ability to drill down, send alerts and keep everyone updated on a real-time basis.
There were two big challenges with network management and these were not technical, but a combination of cultural and business challenges. Those were so prevalent that I would start my presentations off with this statement: “I am going to ask you two questions. The first question you will answer ‘everything,’ and that will be the wrong answer. The second question you will answer ‘I don’t know,’ and that will be the right answer.”
The first question — answered “everything” — was, “What do you want to monitor?” I would then explain that with hundreds of metrics, monitoring everything is not viable. What would I suggest? It depends, but certainly there are important metrics for any system in terms of CPU load, network load, disk drive access, memory usage and types of applications running to name just a few.
While this answer was technically accurate, it really did not address their question. For example, monitoring software would often be used as the foundation for high availability (HA) systems. HA systems are those that typically require "four nines," or 99.99%, or greater uptime. This means total downtime of less than one hour per year (about 53 minutes at exactly 99.99%). However, just because an HA system is up, does that mean that the database is running properly and accepting transactions? Just because a computer is running and all the processes appear when you issue a process status command, it does not mean it is operating properly end to end.
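To make those two points concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production monitor: the function names are hypothetical, and an in-memory SQLite database stands in for whatever database your business actually runs. The first function works out the yearly downtime budget for an uptime target; the second checks that a database really accepts a transaction, rather than just checking that a process is listed.

```python
import sqlite3

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Maximum minutes of downtime per year allowed by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def database_accepts_transactions(conn: sqlite3.Connection) -> bool:
    """End-to-end probe: write a row, read it back, then roll the write back.

    The table name "probe" is an assumption made for this sketch.
    """
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS probe (token TEXT)")
        conn.execute("INSERT INTO probe (token) VALUES (?)", ("ping",))
        row = conn.execute(
            "SELECT token FROM probe WHERE token = ?", ("ping",)
        ).fetchone()
        conn.rollback()  # leave no probe data behind
        return row == ("ping",)
    except sqlite3.Error:
        # Any database error means the system is not working end to end.
        return False
```

Calling downtime_budget_minutes(99.99) gives roughly 52.6 minutes, which is where the "less than one hour per year" figure comes from. The probe returns False as soon as the database stops accepting work, even if the server itself still looks healthy.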
The second question, answered “I don’t know,” was a very tough one: “What do you want to DO when one of these events occurs?” This is the human side of monitoring the monitor. It was tough because the answer requires both technical and business input. For example, your primary server is running very slowly because it is running out of memory. Which processes do you want to kill? Not an easy question on a shared server. You can certainly take a look to see if a specific process is out of control, but what happens if this has just been a slow, gradual increase in load over time and not an obvious out-of-control metric? What if everything is running fine and you just have too many processes for the server? Buying and installing more memory or even buying another server might be an option, but that does not answer the question – what do you do right now? If this happens at 2:30 a.m. on a Saturday, who makes the decision? Who is monitoring the monitor? Is it software, a human, or both? Are all the decisions automated, or is it a workflow that involves humans at a specific point?
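One way to force answers to these questions ahead of time is to write the response plan down as data, so that every event has an agreed action, an owner, and a note on whether software or a human carries it out. The sketch below is only an illustration in Python; the event names, actions and owners are hypothetical placeholders that your own team would replace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    action: str      # what gets done
    owner: str       # who (or what) is responsible for doing it
    automated: bool  # True if software acts without waiting for a human


# Hypothetical plan; the entries are examples, not recommendations.
RESPONSE_PLAN = {
    "memory_low": Response(
        "restart the lowest-priority batch job", "monitoring software", True
    ),
    "database_rejecting_writes": Response(
        "fail over to the standby database", "on-call DBA", False
    ),
}

# Anything the plan does not cover goes to a human.
DEFAULT_RESPONSE = Response("page the on-call engineer", "on-call engineer", False)


def respond_to(event: str) -> Response:
    """Look up the agreed response for an event, falling back to a human."""
    return RESPONSE_PLAN.get(event, DEFAULT_RESPONSE)
```

The value is less in the code than in the conversation it forces: each row has to be agreed on by both the technical and the business side before 2:30 a.m. on a Saturday arrives.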
There are many common threads between monitoring your shop floor and monitoring your computer network. These common threads are why I decided to follow up with this article after my previous discussion of Turner’s Five Laws of Manufacturing. Monitoring your shop floor or your computer network is not enough. You must have a data-driven culture with a champion. Data-driven manufacturing is where decisions are made with data from a variety of systems in a logical fashion with input from all of the stakeholders. As a refresher, here are Turner’s Five Laws of Manufacturing:
  1. We measure what goes into production and what comes out; we have little data on what really happens on the production floor
  2. If anyone says “I know exactly what is happening on my plant floor” – don’t believe them
  3. We don’t gather data because it is hard, and someone has to look at it
  4. No one solution or set of data works for everyone
  5. If you don’t have an avid champion, save your time and money
Monitoring your computer network is much easier than monitoring your shop floor because it is a well-understood science and there are tons of tools out there in terms of open-source and proprietary monitoring software.
It used to be that you had to install the software locally for computer network monitoring. Today, there are numerous cloud monitoring services that will monitor your systems. Very detailed monitoring still typically involves locally installed software. More and more companies are using software in the cloud to substitute or augment software that they previously would load and run locally. Many of these companies in the cloud monitor their own systems and have impressive uptimes.
But what happens when these services are down? How do you know? Are you notified? What actions can you take when a vital cloud service that your business depends on goes down? This is where "quis custodiet ipsos custodes" or "who watches the watchmen" comes into play.
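A common way to watch the watcher is a heartbeat check: the monitoring service records a timestamp at regular intervals, and a separate, independent check raises the alarm if that timestamp gets too old. Here is a minimal sketch in Python. The function name and the 120-second default are assumptions made for illustration, and how the heartbeat is stored and how the alarm is delivered are left out.

```python
import time
from typing import Optional


def monitor_is_silent(
    last_heartbeat: float,
    max_silence_seconds: float = 120.0,
    now: Optional[float] = None,
) -> bool:
    """Return True if the monitor has not reported in for too long.

    last_heartbeat and now are Unix timestamps in seconds. When now is
    omitted, the current time is used.
    """
    if now is None:
        now = time.time()
    return now - last_heartbeat > max_silence_seconds
```

For this to mean anything, the check has to run somewhere that does not fail along with the monitor itself, such as a different machine, network or provider.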
As a company starts to put monitoring in place, it is very helpful to remember The Eight Fallacies of Distributed Computing by Peter Deutsch, James Gosling and others at Sun Microsystems.
  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.
The main point of the above list is to never assume the condition of your network when developing software or monitoring systems – you must know.
Whether you are monitoring your shop floor or computer network, there is no “insert here” canned magic solution that tells you what is important to monitor and what to do when these events occur. What is most important is putting together a team that is led by a champion to evaluate the data, take action, monitor the results of those actions, and continue to adjust. This team should be made up of a variety of disciplines and must meet on a regular basis.
As you introduce monitoring into your shop floor and your network, remember Turner’s Five Laws of Manufacturing and the Eight Fallacies of Distributed Computing. Finally, your goal in monitoring is the ability to answer these five simple monitoring questions:
  1. Who will be our monitoring champion?
  2. Who will be on our monitoring team?
  3. What should we monitor?
  4. What will we do when these events occur?
  5. Who monitors the monitor?
Whatever monitoring software you decide to go with, make sure you can try before you buy. This is critical. The bottom line is that you can’t manage what you can’t monitor. And make sure you know what and who is watching the watchmen - quis custodiet ipsos custodes.