SyslogDecode — a new open source library from Microsoft for processing Syslog messages

(with contributions by Georgi Chkodrov)
GitHub repo: https://github.com/microsoft/SyslogDecode

What is Syslog?

Syslog is a logging protocol. In essence, it is a stream of somewhat structured text messages.
So what’s the big deal, you may ask? It is just a text log, one of many.
Well, for one thing, Syslog is really important because it is ubiquitous: it is almost everywhere.
On Linux and Unix, syslogd (lately rsyslog or syslog-ng) is the most common logger. Syslog is also widely used by networking devices (routers, switches) that don’t have their own UI or log storage; they just send syslog messages over UDP port 514 to a listening server.
Quite likely, a few Syslog messages were produced while the Web page you are reading now was built and delivered to your browser.

Producing syslog messages is easy; that’s what made it so popular. The problems come on the collection side at enterprise scale, when you have a great many devices and apps producing syslog and need to collect the messages at a central syslog server and save them in persistent storage of some kind. And not only store them, but make them searchable, so that you can find the needle in the haystack.

There are syslog servers available on the market, both free and commercial products (see the links at the end of the article). But we did not find them satisfactory for our search requirements at our scale.
This article gives a brief overview of Syslog, the challenges it brings in the enterprise, and the solution we share.

Who we are

Syslog — quick overview

Here is a typical Syslog message:

<14>Mar 3 20:27:49 Corp-Server-X2 snmp#supervisord: WARNING: Invalid mgmt IP 22.111.33.44,2234:23d5:e0:25a3::55

Just by looking at the message, we can guess some of its elements: it contains a timestamp, probably a machine name (host), and a textual warning at the end; there are two IP addresses inside the text. So the structure looks like this:

<PRI> TimeStamp HostName ProcId: MESSAGE

The PRI segment is interesting: the integer inside the angle brackets (14) is an encoded combination of two values, Facility and Severity, computed as Facility × 8 + Severity. What these mean is not important for us now, but it’s worth noting that two distinct and informative values are hidden inside a single integer. Clearly, this makes it quite challenging to search for specific Severity values in a big pile of messages.
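As a small aside, the encoding is easy to undo: the severity is the low three bits of PRI and the facility is the rest. The sketch below shows the decoding in C# as a .NET 6+ top-level program. DecodePri and TryReadPri are hypothetical helper names used for illustration only; they are not part of the SyslogDecode API.

```csharp
using System;
using System.Globalization;

// PRI = Facility * 8 + Severity, so Severity is the low 3 bits and Facility is the rest.
static (int Facility, int Severity) DecodePri(int pri) => (pri >> 3, pri & 0x07);

// Reads the number from a leading "<NNN>" prefix.
// Returns false when the prefix is missing, malformed, or above 191 (23 * 8 + 7).
static bool TryReadPri(string message, out int pri)
{
    pri = 0;
    if (string.IsNullOrEmpty(message) || message[0] != '<') return false;
    int close = message.IndexOf('>');
    // PRI has 1 to 3 digits, so '>' must sit at index 2, 3, or 4.
    if (close < 2 || close > 4) return false;
    return int.TryParse(message.AsSpan(1, close - 1), NumberStyles.None,
                        CultureInfo.InvariantCulture, out pri)
           && pri <= 191;
}

string sample = "<14>Mar 3 20:27:49 Corp-Server-X2 snmp#supervisord: WARNING: Invalid mgmt IP";
if (TryReadPri(sample, out int priValue))
{
    var (facility, severity) = DecodePri(priValue);
    // For the sample above: PRI=14 Facility=1 Severity=6
    Console.WriteLine($"PRI={priValue} Facility={facility} Severity={severity}");
}
```

A query such as "all messages with Severity 3 or lower" needs this arithmetic applied to every stored line unless the two values are extracted up front.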
When sent over the network, Syslog messages are usually transmitted over UDP, port 514. This is the common (standard) arrangement.
And that’s it for the Syslog basics: just form a message with the ‘standard’ parts and send it over UDP.
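To make "send it over UDP" concrete, here is a minimal sketch using the standard .NET UdpClient class. SendSyslog is a hypothetical helper name, and the demo deliberately uses the loopback address with an OS-assigned port instead of port 514, so that it runs without elevated privileges or a real syslog server.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

// Sends one syslog message as a single UDP datagram; returns the number of bytes sent.
static int SendSyslog(string host, int port, string message)
{
    using var client = new UdpClient();
    byte[] payload = Encoding.UTF8.GetBytes(message);
    return client.Send(payload, payload.Length, host, port);
}

// Stand-in "server": a UDP socket bound to loopback on a free port chosen by the OS.
using var receiver = new UdpClient(new IPEndPoint(IPAddress.Loopback, 0));
receiver.Client.ReceiveTimeout = 2000; // do not hang forever if the datagram is lost
int port = ((IPEndPoint)receiver.Client.LocalEndPoint!).Port;

string sent = "<14>Mar 3 20:27:49 Corp-Server-X2 app: hello syslog";
SendSyslog("127.0.0.1", port, sent);

// Receive the datagram and turn it back into text.
var remote = new IPEndPoint(IPAddress.Any, 0);
string received = Encoding.UTF8.GetString(receiver.Receive(ref remote));
Console.WriteLine(received);
```

Note that UDP gives no delivery guarantee: a sender never learns whether the server actually received the message.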

So, what’s the problem?

Syslog is described in two RFC documents:

- RFC-3164: overview of established practices (2001)
- RFC-5424: actual standard (2009)

Interesting fact: RFC-5424, the actual standard, is not backward compatible with the typical patterns described in RFC-3164.

These are long and detailed documents, but unfortunately…
As of today, not many folks, apps, or devices follow the standard, or even pretend to. In our streams, only around 15% of messages follow RFC-5424, the actual standard. About 50% are more or less in line with RFC-3164. The rest is like the Wild Wild West: nobody cares about the rules.
As an example, devices from one big vendor produce messages that are just key-value pairs, without any ‘standard’ columns; they don’t look like Syslog at all. Sometimes this is in fact sabotage of the standard: vendors want you to buy their own tools for handling the syslog messages produced by their devices.

This failure to follow the standard creates big problems for syslog servers: how do you store and search this mess?

Storage and Search — just as text?

This is in fact how many existing solutions work: store plain text, then apply Regex or text functions at query time. However, this does not work at scale. If you have a few apps or devices and a few thousand messages per day, no problem; the only trouble is writing quite complex Regexes. But if you have billions of messages in a pile, queries hang forever.
Free-text search tools do not work either. You do not always have an exact and unique token to search for; analysts need to search for patterns and complex conditions.

SyslogDecode — Parse and index it all

Notice the PayloadType column — it contains the detected ‘kind’ of message structure (which RFC it follows if at all).
The parser extracts the standard elements of the message, and the values end up in dedicated columns in the database table. Once those columns are indexed, the values are efficiently searchable.

Please note that we do not claim to be the first to use this parse-before-save approach. Other Syslog software vendors employ somewhat similar approaches, but their implementations of parsing and storage appear to be different.

Show me the code!

var parser = SyslogMessageParser.CreateDefault();
var rawMsg = new RawSyslogMessage() { Message = "(actual message)" };
ParsedSyslogMessage parsedMsg = parser.Parse(rawMsg);

In real life, for high-volume streams, things are a bit more complicated. SyslogDecode provides higher-order components that let you easily create a parallel pipeline for parsing input messages on multiple threads (40 or more). The pipeline has a fast, non-blocking asynchronous input queue. It merges the parsed records from all processing threads into one output stream. The output stream can be directed to a storage table with a structure matching the ParsedSyslogMessage class.
The library includes a UdpListener component. UDP transport (port 514) is the standard way to send syslog over networks, so this component allows you to easily set up a network listener for syslog messages.
The following diagram shows a complete Syslog processing server built entirely from components in the SyslogDecode library:

The UdpListener receives the raw messages from the network and pushes them into the queue. The queue is a buffer that isolates the input feed from the timings and mechanics of the multiple threads parsing the messages. The messages are taken from the queue in small batches (200 or so) by the threads that actually execute the parsing. The output of all threads is merged into one stream, exposed through the IObservable<ParsedSyslogMessage> interface.
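The queue, batch, and worker arrangement described above can be sketched with standard .NET types. This is not SyslogDecode's actual implementation: ParseStub and StartWorkers are hypothetical names, the parsing is a trivial stand-in, a plain callback replaces the IObservable output, and BlockingCollection (which blocks when full) stands in for the library's non-blocking queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for real parsing: extracts only the PRI number (-1 if absent).
static (int Pri, string Raw) ParseStub(string raw)
{
    int close = raw.IndexOf('>');
    if (raw.Length > 0 && raw[0] == '<' && close > 1
        && int.TryParse(raw.Substring(1, close - 1), out int pri))
        return (pri, raw);
    return (-1, raw);
}

// Starts worker tasks that drain the input queue in small batches and report each parsed record.
static Task[] StartWorkers(BlockingCollection<string> input, int workerCount, int batchSize,
                           Action<(int Pri, string Raw)> onParsed)
{
    return Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
    {
        var batch = new List<string>(batchSize);
        // Wait for one item, then grab whatever else is immediately available, up to batchSize.
        foreach (string first in input.GetConsumingEnumerable())
        {
            batch.Clear();
            batch.Add(first);
            while (batch.Count < batchSize && input.TryTake(out var next)) batch.Add(next);
            foreach (string raw in batch) onParsed(ParseStub(raw));
        }
    })).ToArray();
}

var queue = new BlockingCollection<string>(100_000);      // bounded input buffer
var results = new ConcurrentBag<(int Pri, string Raw)>(); // merged output of all workers
Task[] workers = StartWorkers(queue, 4, 200, results.Add); // 4 workers, batches of up to 200

for (int i = 0; i < 1000; i++)
    queue.Add($"<{i % 192}>Mar 3 20:27:49 host app: message {i}");

queue.CompleteAdding(); // signal end of input so the workers can finish
Task.WaitAll(workers);
Console.WriteLine($"Parsed {results.Count} messages");
```

In a sketch like this, queue.Count plays the role of the stress indicator: a steadily growing count means the workers are not keeping up with the input.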
On decent hardware, with 40 threads, the server can process up to 20–25K syslog messages per second without significant overload. To handle bigger loads, you need a load balancer in front, distributing messages to multiple Syslog servers for processing. The queue size is a good indicator of system stress. If you see the queue growing to 100K+ buffered messages, it is a sign of overload: either increase the number of threads or add more servers. The input queue gives you some leeway, 30 seconds or more, to act (fire up more servers) in case of a sudden spike in input volume.

We provide a demo app that sets up a server like that and then sends a large set of test messages to the UDP port.
Another useful app in the repo is TestLoadApp. It allows you to load-test a syslog server by replaying a so-called pcap file containing a sample of real traffic captured by a tool like Wireshark.

Storage and Querying

Under the hood — challenges in parsing syslog messages

The Syslog parser works in a try/fail/try-other manner: it tries to detect a pattern, recovers when that fails, and tries something else. It also implements extractors for IPv4 and IPv6 values, based purely on the pattern of digit/dot/colon sequences. It finds IP addresses even inside the unstructured text part of the message (like the one shown before) and puts the found values into separate entries in the Data dictionary, where the query engine can retrieve them and match them against a searched value.
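To illustrate the idea of pattern-based IP extraction, here is a minimal sketch. It is an assumption about how such an extractor could look, not the library's code: ExtractIpAddresses is a hypothetical name, candidates are found with regular expressions, and the standard IPAddress.TryParse is swapped in to reject look-alikes such as the timestamp 20:27:49.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Text.RegularExpressions;

// Returns IPv4 addresses first, then IPv6 addresses, each group in order of appearance.
static List<string> ExtractIpAddresses(string text)
{
    var found = new List<string>();

    // IPv4 candidates: exactly four dot-separated groups of 1 to 3 digits.
    foreach (Match m in Regex.Matches(text, @"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])"))
    {
        if (IPAddress.TryParse(m.Value, out var ip) && ip.AddressFamily == AddressFamily.InterNetwork)
            found.Add(m.Value);
    }

    // IPv6 candidates: runs of hex digits containing at least two colons.
    foreach (Match m in Regex.Matches(text, @"(?<![0-9A-Fa-f:.])[0-9A-Fa-f]*(?::[0-9A-Fa-f]*){2,}"))
    {
        if (IPAddress.TryParse(m.Value, out var ip) && ip.AddressFamily == AddressFamily.InterNetworkV6)
            found.Add(m.Value);
    }
    return found;
}

string message = "<14>Mar 3 20:27:49 Corp-Server-X2 snmp#supervisord: WARNING: " +
                 "Invalid mgmt IP 22.111.33.44,2234:23d5:e0:25a3::55";
foreach (string address in ExtractIpAddresses(message))
    Console.WriteLine(address);
```

The sketch is deliberately simple; for example, it does not handle IPv6 addresses with an embedded IPv4 part or a trailing colon.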

Structurally, the Syslog parser contains multiple “variant parsers” for different patterns of syslog messages. These variant parsers are called one by one to attempt to parse the message. A variant parser either parses the message or returns “not my type”, in which case the process continues to the next variant parser until one succeeds. The list of variant parsers is extensible, so you can add your own variant.
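The try-each-variant-in-turn structure can be sketched as an ordered list that is walked until one entry accepts the message. Everything here is illustrative and assumed: the variant names, the simplified regular expressions, and ParseWithVariants are not the library's API, and real variant parsers would be code rather than a single pattern each.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Ordered list of variants; a pattern that does not match means "not my type".
var variants = new List<(string Kind, Regex Pattern)>
{
    // RFC 5424 layout: <PRI>VERSION TIMESTAMP HOSTNAME ...
    ("Rfc5424", new Regex(@"^<\d{1,3}>\d{1,2} \S+ (?<host>\S+) ")),
    // RFC 3164 style: <PRI>Mmm d hh:mm:ss HOSTNAME ...
    ("Rfc3164", new Regex(@"^<\d{1,3}>[A-Z][a-z]{2} {1,2}\d{1,2} \d{2}:\d{2}:\d{2} (?<host>\S+) ")),
    // Bare key=value pairs with no standard header at all
    ("KeyValuePairs", new Regex(@"^\w+=\S+(?: \w+=\S+)*$")),
};

// Tries each variant in order; the first one that matches wins.
(string Kind, string Host) ParseWithVariants(string message)
{
    foreach (var (kind, pattern) in variants)
    {
        Match m = pattern.Match(message);
        if (m.Success) return (kind, m.Groups["host"].Value); // empty when the variant has no host group
    }
    return ("Unknown", "");
}

string[] samples =
{
    "<14>Mar 3 20:27:49 Corp-Server-X2 snmp#supervisord: WARNING: Invalid mgmt IP",
    "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - su root failed",
    "src=10.0.0.1 action=deny",
    "free text with no structure",
};
foreach (string s in samples)
{
    var (kind, host) = ParseWithVariants(s);
    Console.WriteLine($"{kind,-14} host='{host}'");
}
```

Adding your own variant then amounts to inserting one more entry into the list, at the position that reflects its priority.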

There are also Value Extractors. These operate on an already parsed message: they try to identify “interesting” values, such as IP addresses, based on some internal pattern, and put them into a separate slot. The list of extractors is extensible as well.

Links

  1. Free syslog servers: https://www.ittsystems.com/best-free-syslog-server-windows/
  2. Syslog Servers: https://www.comparitech.com/net-admin/best-free-syslog-servers-for-linux-and-windows/

.NET Developer with 20+ years of experience. Open source: Irony (parsing engine), VITA (.NET ORM), NGraphQL (GraphQL engine). @Microsoft, Cloud Security
