I hope you are all well and recovered from #data14. What a great event that was.
I’m gonna get a bit serious on you now. It’s time to talk monitoring.
For a Tableau service manager (or the manager of any IT service, for that matter), the worst situation that can possibly occur is getting a phone call from your users to tell you that your service is down. At best you’ll look stupid. At worst it will cost you credibility, and it’s a sure-fire way to destroy user confidence in your service.
So how do you avoid this? You could aim to never have any outages. Well, you can forget that, it ain’t gonna happen. You’ll get issues, so get ready for them. What you can do is monitor your service big time. That way you’ll get the heads up and you can answer that phone call with a “yep we know, we’ve just raised an incident ticket and we are on it”. Better still, get to the incident and fix it before users even notice! Remember that effective incident management can actually gain you plus points from your user base and senior management.
The problem with monitoring is that it’s BORING. I should know, I did it for 12 years! But it’s also essential! Get it right and you’ll be making your life a lot easier. Traditionally it doesn’t get a whole lot of investment thrown its way either, as there’s no immediate tangible business benefit.
This is a big subject, so it’s likely to take me more than one post to explain and I’ll doubtless miss some bits out. As always, I’m happy to connect offline and explain. Monitoring falls into these categories:
- Infrastructure monitoring
- Application monitoring
- Performance monitoring
- Capacity monitoring
- User experience monitoring
As the name suggests, infrastructure monitoring is all about monitoring your infrastructure. That’s the hardware, network, peripherals and other components of the platform your Tableau Server application is running on.
Chances are the infrastructure will be owned by an IT team. You’ll need a great relationship with these folks, so if you haven’t got one then start buying them some doughnuts now. From what I can see, Tableau is often brought into organisations by business users and that then antagonises IT, meaning this relationship isn’t always the best. That’s a separate conversation however.
How does infrastructure monitoring work?
Chances are your monitoring team will have decided on an enterprise monitoring tool for the whole organisation. It will probably take the form of a central server, receiving alerts from an agent that is deployed as standard on each server in the estate.
There are plenty of commonly used monitoring tools out there. I’ve got a fondness for ITRS Geneos myself but am not going to go into the relative merits of each tool. You won’t have a choice about which tool is used in your org anyway.
So what happens? Well, the agent will have a set of monitoring “rules” that it adheres to. These will take the form of something like “check the disk space on partition X every Y minutes and trigger an alert if it is more than Z percent full”. That’s all the agent does: it polls the server for process availability, disk space, memory usage etc. on a scheduled frequency and triggers an alert to the central server if the condition is breached. Those parameters should be fully configurable.
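To make that rule concrete, here’s a minimal sketch in Python of what a single disk space rule might look like. This is not how any particular monitoring tool does it. The `DiskRule` class and its field names are made up for illustration, and a real agent would also handle the scheduling and the sending of the alert to the central server.

```python
import shutil
from dataclasses import dataclass


@dataclass
class DiskRule:
    """One hypothetical monitoring rule, mirroring the X / Y / Z wording above."""
    partition: str            # X: path of the partition to check
    interval_minutes: int     # Y: how often the agent polls
    max_percent_full: float   # Z: alert when usage goes above this


def percent_full(used_bytes: int, total_bytes: int) -> float:
    """Return how full the partition is, as a percentage."""
    return 100.0 * used_bytes / total_bytes


def breached(rule: DiskRule, used_bytes: int, total_bytes: int) -> bool:
    """True when usage is strictly above the rule's threshold."""
    return percent_full(used_bytes, total_bytes) > rule.max_percent_full


def poll_once(rule: DiskRule) -> bool:
    """Read real disk usage for the rule's partition and evaluate the rule."""
    usage = shutil.disk_usage(rule.partition)
    return breached(rule, usage.used, usage.total)
```

An agent would call something like `poll_once(DiskRule("/", 5, 90.0))` every five minutes and raise an alert whenever it returns `True`. The 90 percent figure is just an example value, so ask your infra team what theirs actually is.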
The central server will then display the alert on an event console such as this one (pictured). Alerts will be given a criticality such as minor, major or critical. The alert console will be viewed by a support team, usually an offshore Level 1 team that provides an initial triage of the alert. They may then pass it on to a Level 2 team for potential remediation, or they may pass it on to Level 3, the main support team. That’s the usual process in a big organisation.
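As a rough illustration of how a criticality might get assigned, here’s a small Python sketch that maps a disk usage reading onto minor, major or critical. The threshold values and the names `Severity`, `SEVERITY_BANDS` and `classify` are all assumptions of mine for the example. Your monitoring team will have their own bands.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Hypothetical bands, highest first: (lower bound in percent full, severity).
SEVERITY_BANDS = [
    (95.0, Severity.CRITICAL),
    (90.0, Severity.MAJOR),
    (80.0, Severity.MINOR),
]


def classify(percent_full: float) -> Optional[Severity]:
    """Return the severity for a reading, or None if no alert is needed."""
    for lower_bound, severity in SEVERITY_BANDS:
        if percent_full >= lower_bound:
            return severity
    return None
```

With these made-up bands a disk at 92 percent would show up on the console as a major alert, while one at 50 percent wouldn’t show up at all.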
So what’s the issue with that? Well, there’s the time factor for one. It can sometimes take 20 to 30 minutes for an alert to get to the person that matters. That’s obviously not great. Then there’s the sheer volume of alerts. A big organisation can be dealing with tens of thousands of active alerts a week, many of them junk, and that increases the risk of your alert being missed. There are also a lot of break points in the process, and sometimes alerts just go missing due to lost packets, network issues etc. It happens. On the whole the process works though.
Who’s responsible and what for?
Your infra teams are 100% responsible for the monitoring of these components. This encompasses:
- Server availability (ICMP ping)
- CPU usage
- Memory usage
- Disk space (operating system partitions only)
- Network throughput / availability
They’ll tell you not to worry about this. They’ll tell you that any alerts will go to their support teams and they’ll be on it should they detect an issue. My advice: don’t trust anyone. There have been many times where I’ve had an issue and, lo and behold, the monitoring hasn’t been configured properly, or hasn’t even been set up at all. Or there’s been a bad break in the process somewhere. That ain’t cool.
So what should I do?
Take these steps to keep your infra teams on their toes. They’re providing you with a platform, so you are entitled to ask. They might not like it, but stick to your guns. You’ll be the one who gets it in the neck if your Tableau Server goes down.
- Ask for a breakdown of the infra monitoring thresholds – What’s the polling cycle for alerts? What thresholds are being monitored? Who decided them and why?
- Ask for a process flow – What happens when an alert is generated? Where does it go? How long does it take for someone to get on it? How is root cause followed up?
- Ask to have visibility of the infra changes – If there are changes going on to the environment that might affect your server, make sure you get notified. Make sure you attend the appropriate change management meetings so you know what’s going on.
- Ask for a regular report on server performance – There will probably be a tool on the server that logs time series data on server performance. That should be accessible to you as well as them. Chuck the data into Tableau and make it available to your users.
- Understand the infra team SLA – It’s important to realise that you are a customer of the infra teams. Ask them for a Service Catalogue document for the service that they are providing. Understand the SLA that they’re operating to. Don’t be out of order, but if you find they’re not giving you good service then don’t be scared to wave the SLA.
- Ask for a report of successful backups – Just as important as monitoring.
- Ask for the ICMP ping stats – How many packets get lost in communications with your Tableau Server? How many times does it drop off the network?
- Be nice – The infra teams in big orgs have a tough job. They’ll have no money and little resource. Cut them some slack and don’t be a prat if they let you down occasionally. It happens.
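If you do get hold of the ICMP ping stats from the list above, the two questions are easy enough to answer yourself. Here’s a minimal Python sketch, assuming the stats arrive as a simple list of `True` (reply received) and `False` (packet lost) values in time order. That input format, the function names and the idea that three lost pings in a row counts as “dropped off the network” are all my assumptions, not something your infra team’s tool will necessarily give you.

```python
from typing import List


def packet_loss_percent(replies: List[bool]) -> float:
    """Percentage of pings that got no reply (0.0 for an empty list)."""
    if not replies:
        return 0.0
    lost = sum(1 for ok in replies if not ok)
    return 100.0 * lost / len(replies)


def count_dropouts(replies: List[bool], min_consecutive: int = 3) -> int:
    """Count runs of at least min_consecutive lost pings in a row."""
    dropouts = 0
    run = 0  # length of the current run of lost pings
    for ok in replies:
        if ok:
            if run >= min_consecutive:
                dropouts += 1
            run = 0
        else:
            run += 1
    # A run of lost pings at the very end of the data still counts.
    if run >= min_consecutive:
        dropouts += 1
    return dropouts
```

A single lost packet here and there is normal. It’s the long runs that suggest the server really did disappear, which is why the sketch counts them separately.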
Start with that lot. Your users will also love it if you can make this information available to them. Again, it inspires confidence that you know what you’re doing.
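One easy way to share the server performance data is to flatten it into a CSV that Tableau can connect to. Here’s a minimal Python sketch, assuming each sample is a tuple of timestamp, CPU percent, memory percent and disk percent. The column names and the sample layout are hypothetical, so adjust them to whatever your infra team’s logging tool actually exports.

```python
import csv
import io
from datetime import datetime
from typing import Iterable, Tuple

Sample = Tuple[datetime, float, float, float]


def samples_to_csv(samples: Iterable[Sample]) -> str:
    """Turn performance samples into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "cpu_percent", "memory_percent", "disk_percent"])
    for timestamp, cpu, memory, disk in samples:
        # ISO 8601 timestamps are easy for Tableau to parse as date-times.
        writer.writerow([timestamp.isoformat(), cpu, memory, disk])
    return buffer.getvalue()
```

Write the returned text to a file on a shared drive, point a Tableau data source at it, and you’ve got a server health dashboard your users can check for themselves.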
OK that’s it for infrastructure monitoring. Next up I’ll dive into how you monitor your Tableau Server application.