SyslogDecode — a new open source library from Microsoft for processing Syslog messages

Roman Ivantsov

May 3, 2021

(with contributions by Georgi Chkodrov)
Github repo: https://github.com/microsoft/SyslogDecode

What is Syslog?

- it is a kind of log …

No, not this kind. Like this one:

Syslog is a logging protocol. Basically, syslog is a stream of text messages, somewhat structured.
So what’s the big deal, you may ask? Just a text log, one of many.
Well, for one thing, Syslog is really important because it is ubiquitous: it is almost everywhere.
On Linux and Unix, syslogd (lately rsyslog or syslog-ng) is the most common logger. Syslog is also widely used by networking devices (routers, switches) that don’t have their own UI or log storage; they just send syslog messages over UDP port 514 to a listening server.
Quite likely, a few Syslog messages were produced while the Web page you are reading now was built and delivered to your browser.

Producing syslog messages is easy, and that’s what made it so popular. The problems come on the collection side at enterprise scale, when you have many, many devices and apps producing syslog, and you need to collect the messages at a central syslog server and save them in persistent storage of some kind. And not only store them, but make them searchable, so that you can find the needle in the haystack.

There are syslog servers available on the market, both free and commercial products (see links at the end of the article). But we did not find them satisfactory for our search requirements at our scale.
This article gives a brief overview of Syslog, the challenges it brings in the enterprise, and the solution we share.

Who we are

My team at Microsoft is in charge of collecting and storing syslog and other logs coming from millions of machines, apps, services and devices in Azure. The inflow rate is 100K+ messages per second for Syslog alone, and it keeps increasing. This results in 7+ billion messages every day.
Our target users are security analysts. When they try to find something that might have happened months ago, we are talking about searching over trillions of messages. The problem is not storage, which is cheap nowadays. The problem is efficient search. At this scale, we need to parse the messages and store them in a structured way, so that values can be indexed in the database.
We had to come up with a solution — and that’s how SyslogDecode came to be. We built it and now share it — as part of Microsoft’s ongoing commitment to open source and sharing code and ideas with the community.

Syslog — quick overview

Each Syslog message is a text, usually with a simple structure. Here is an example:

<14>Mar 3 20:27:49 Corp-Server-X2 snmp#supervisord: WARNING: Invalid mgmt IP 22.111.33.44,2234:23d5:e0:25a3::55

Just by looking at the message, we can guess some elements — it contains a timestamp, probably a machine name (host), and a textual warning at the end; there are 2 IP addresses inside the text. So the structure is like:

<PRI> TimeStamp HostName ProcId: MESSAGE

The PRI segment is interesting: the integer number inside the angle brackets (14) is an encoded combination of 2 values: Facility and Severity. What these are is not important for us now, but it’s worth noting that these 2 distinct and informationally important values are hidden inside a single integer. Clearly this makes it quite challenging to search for specific Severity values in a big pile of messages.
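For the curious, the encoding defined in both RFCs is simple: PRI = Facility × 8 + Severity, so the two values can be recovered with an integer division and a remainder. The snippet below is a minimal sketch of that arithmetic only; the Pri class and TryDecode method are illustrative names, not part of SyslogDecode.

```csharp
using System;
using System.Globalization;

// Illustrative helper (hypothetical name, not the SyslogDecode API).
public static class Pri
{
    // Reads the leading "<NNN>" segment and splits it into facility and severity.
    public static bool TryDecode(string message, out int facility, out int severity)
    {
        facility = -1;
        severity = -1;
        if (string.IsNullOrEmpty(message) || message[0] != '<') return false;

        int close = message.IndexOf('>');
        if (close < 2 || close > 4) return false;            // PRI has 1 to 3 digits

        string digits = message.Substring(1, close - 1);
        if (!int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out int pri))
            return false;                                     // digits only, no sign or spaces
        if (pri > 191) return false;                          // max is 23 * 8 + 7

        facility = pri / 8;                                   // 0..23
        severity = pri % 8;                                   // 0..7 (0 = Emergency, 7 = Debug)
        return true;
    }
}
```

For the sample message above, <14> decodes to facility 1 (user-level messages) and severity 6 (Informational), even though the message text itself says WARNING.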
When sent over the network, Syslog messages are usually transmitted over UDP, port 514. This is the common (standard) arrangement.
And that’s it about Syslog basics — just form a message with ‘standard’ parts and send it over UDP.

So, what’s the problem?

There is no problem for the client (the logger). The Syslog message structure does not seem to be complicated. If you want to start “syslogging” in your app, it is not hard to form a “standard” message and send it to the logger.
Standard? — yes, there are 2 authoritative documents:

- RFC-3164 — overview of established practices (2001)
- RFC-5424 — actual standard (2009)

Interesting fact: RFC-5424, the actual standard, is not backward compatible with the typical patterns described in RFC-3164.

These are long and detailed documents, but unfortunately…
As of today, not many folks, apps or devices follow the standard, or even pretend to. In our streams only around 15% of messages follow RFC-5424, the actual standard. About 50% are more or less in line with RFC-3164. The rest is like the Wild Wild West, where nobody cares about the rules.
As an example, devices from one big vendor produce messages that are just key-value pairs, without any ‘standard’ columns; they don’t look like Syslog at all. Sometimes this is in fact sabotage of the standard: vendors want you to buy their own tools for handling syslog messages produced by their devices.

This widespread disregard for the standard creates a big problem for syslog servers: how to store and search this mess.

Storage and Search — just as text?

One way to go is to save the messages as-is, as plain text. With modern hardware, it is actually not a problem to persist streams of millions of messages per second. For search, filtering and analysis we can try to use regular expressions: we apply a complex Regex to extract a value, and then apply a filter to that value to select the messages we are interested in. Using Regex in queries is supported by most database vendors.

This is in fact how many existing solutions work: store plain text, apply Regex or some text functions at query time. However, this does not work at scale. If you have a few apps/devices and a few thousand messages per day, no problem; the only trouble is writing quite complex Regexes. But if you have billions of messages in a pile, queries hang forever.
Free-text search tools do not work either. You do not always have an exact and unique token to search for; analysts need to search for patterns and complex conditions.

SyslogDecode — Parse and index it all

Our solution is to “parse” a Syslog message upfront using a specialized custom parser, and save it in the database as a well-structured record. This is what the SyslogDecode library does: it contains a flexible, hand-written “parser” that can dissect the Syslog message, extract values like timestamp, host, app-name, IP addresses (IPv4 and IPv6) etc., account for multiple known variations, and produce a strongly-typed ParsedSyslogMessage object.

Notice the PayloadType column — it contains the detected ‘kind’ of message structure (which RFC it follows if at all).
The parser extracts the standard elements of the message, and the values end up in dedicated columns in the database table. The values should be indexed and are therefore efficiently searchable.

Please note that we do not claim to be the first to use this parse-before-save approach. There are other Syslog software vendors that employ somewhat similar approaches, but their implementations of parsing and storage appear to be different.

Show me the code!

Here is the code to create a parser and parse a message:

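// Create a parser with the library defaults
var parser = SyslogMessageParser.CreateDefault();
// Wrap the raw text; replace the placeholder with the actual message
var rawMsg = new RawSyslogMessage() { Message = "(actual message)" };
ParsedSyslogMessage parsedMsg = parser.Parse(rawMsg);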

In real life, for high-volume streams, things are a bit more complicated. SyslogDecode provides higher-order components that let you easily create a parallel pipeline for parsing input messages on multiple threads (like 40 or more). The pipeline has a fast, non-blocking asynchronous input queue. It merges the parsed records from all processing threads into one output stream. The output stream can be directed to a storage table with a structure matching the ParsedSyslogMessage class.
The library includes a UdpListener component. UDP transport (port 514) is a standard way to send syslog over networks, so this component allows you to easily set up a network listener for syslog messages.
The following diagram shows a complete Syslog processing server built entirely from components in the SyslogDecode library:

The UdpListener receives the raw messages from the network and pushes them into the queue. The queue is a buffer that isolates the input feed from the timings and mechanics of multiple threads parsing the messages. The messages are grabbed from the queue in small batches (200 or so) by multiple threads actually executing the parsing. The output of all threads is merged in one stream exposed as the output IObservable<ParsedSyslogMessage> interface.
On decent hardware, with 40 threads, the server can process up to 20–25K syslog messages per second without significant overload. To handle bigger loads you need a load balancer in front, distributing messages to multiple Syslog servers for processing. The queue size is a good indicator of system stress. If you see the queue growing to 100K+ buffered messages, it is a sign of overload: either increase the number of threads or add more servers. The input queue gives you some leeway, 30 seconds or more, to act (fire up more servers) in case of a sudden spike in input volume.
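To make the queue-and-batches mechanics more concrete, here is a minimal, self-contained sketch of the same idea using only standard .NET types. It is not the SyslogDecode pipeline: MiniPipeline, Run and their parameters are made-up names, the "parse" step is whatever function you pass in, and the merged output is simply collected into a list rather than exposed as an IObservable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical illustration of a buffered, multi-threaded parsing stage.
public static class MiniPipeline
{
    public static List<TOut> Run<TIn, TOut>(
        IEnumerable<TIn> inputs, Func<TIn, TOut> parse,
        int workerCount = 4, int batchSize = 200)
    {
        var queue = new BlockingCollection<TIn>();     // input buffer
        var output = new ConcurrentQueue<TOut>();      // merged output of all workers

        Task[] workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            var batch = new List<TIn>(batchSize);
            while (!queue.IsCompleted)
            {
                batch.Clear();
                // Wait briefly for the first item, then drain without waiting.
                if (queue.TryTake(out var first, 50))
                {
                    batch.Add(first);
                    while (batch.Count < batchSize && queue.TryTake(out var next))
                        batch.Add(next);
                }
                foreach (var item in batch)
                    output.Enqueue(parse(item));
            }
        })).ToArray();

        foreach (var item in inputs) queue.Add(item);  // producer side (the listener's role)
        queue.CompleteAdding();                        // signal that no more input is coming
        Task.WaitAll(workers);
        return output.ToList();                        // order is not guaranteed
    }
}
```

In a setup like this, queue.Count plays the role of the stress indicator mentioned above: a count that keeps growing means the workers are not keeping up with the input.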

We provide a demo app that sets up the server like that, and then sends a large set of test messages to the UDP port.
Another useful app in the repo is TestLoadApp. It allows you to load-test a syslog server: it replays a so-called pcap file containing a sample of real traffic captured by a tool like Wireshark.

Storage and Querying

The story of a Syslog implementation would be incomplete without mentioning what can or should be used as long-term storage for parsed messages. You probably should not use plain files or blob storage: how would you query them? Traditional relational databases are not a good choice either; they are not designed for this type of load. We need databases built specifically for analytical/log loads.
We use Azure Data Explorer (aka Kusto), and highly recommend it. We group the syslog messages coming from the parsing pipeline into batches (50K+) and upload them into Data Explorer. Insertion capacity is enormous, and everything is indexed, always. You can use the powerful KQL query language to query the data. Try it if you have not yet. And pricing is quite reasonable.
One very neat feature of Data Explorer is the dynamic column type. It is in fact JSON, containing a dictionary of key-value pairs. The ParsedSyslogMessage (parser output) has a Data field that holds such a dictionary; it contains all detected named values from the syslog message. In fact, some syslog messages contain only key-value pairs, no standard fields.
So the problem comes up — how to store this dictionary? Dynamic column is an ideal fit for this. What is especially nice is that accessing the keys/values inside is directly supported by the KQL query language, without any JSON-extract functions like in other servers. Just like that:
… where Data.SomeKey == "SomeValue"

Under the hood — challenges in parsing syslog messages

Yes, there are challenges. First, the term Parsing is a bit of a misnomer here. Parsing usually means processing text that is expected to follow some formal grammar. Take parsers for programming languages: the language has a grammar and program code should follow it. If it does not, the parser rejects it with a reasonable error message.
For Syslog there is no fixed grammar; there are 2 RFCs suggesting some patterns, and they are not strictly followed. Logging code in apps and devices is not always perfect. But if a Syslog message is malformed, we cannot just reject it with an error like the C# compiler does. We have to proceed and extract what we can. So our parsing is more like a guessing game: find the structure based on a few patterns and try to extract something meaningful.

The Syslog parser works in a try/fail/try-other manner, trying to detect patterns, recovering on failure and trying something else. It also implements extractors for IPv4 and IPv6 values, based purely on patterns of digit/dot/colon sequences. It finds IP addresses even inside the unstructured text part of the message (like the one shown before), and puts the found values into separate entries in the Data dictionary, where the query engine can retrieve them and match them against a searched value.
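To give a feel for this kind of pattern-based extraction, here is a small self-contained sketch. It is not the extractor shipped in SyslogDecode; IpExtractor and its methods are hypothetical names, and the approach (collect runs of hex digits, dots and colons, then validate each candidate) is just one simple way to do it.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Text.RegularExpressions;

// Hypothetical sketch of an IP address extractor for free-form log text.
public static class IpExtractor
{
    // Candidate tokens: maximal runs of hex digits, dots and colons.
    static readonly Regex Candidate = new Regex(@"[0-9A-Fa-f:.]+");

    public static List<string> Extract(string text)
    {
        var found = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            string token = m.Value.TrimEnd('.');      // drop sentence-ending dots
            if (IsIPv4(token) || IsIPv6(token)) found.Add(token);
        }
        return found;
    }

    // Strict dotted-quad check: exactly four decimal parts, each in 0..255.
    static bool IsIPv4(string token)
    {
        string[] parts = token.Split('.');
        if (parts.Length != 4) return false;
        foreach (string part in parts)
        {
            if (part.Length == 0 || part.Length > 3) return false;
            foreach (char c in part)
                if (c < '0' || c > '9') return false;
            if (int.Parse(part) > 255) return false;
        }
        return true;
    }

    // Require at least two colons, then let the framework validate the IPv6 syntax.
    static bool IsIPv6(string token)
    {
        int colons = 0;
        foreach (char c in token)
            if (c == ':') colons++;
        return colons >= 2
            && IPAddress.TryParse(token, out var address)
            && address.AddressFamily == AddressFamily.InterNetworkV6;
    }
}
```

Run against the sample message shown earlier, this picks out 22.111.33.44 and 2234:23d5:e0:25a3::55 and skips the 20:27:49 timestamp. A sketch like this misses some cases, for example an address immediately followed by a colon or a port number.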

Structurally, the Syslog parser contains multiple “variant parsers” for different patterns of syslog messages. These variant parsers are called one by one to attempt to parse the message. A variant parser either parses the message or returns “not my type”, and the process continues to the next variant parser until one succeeds. The list of variant parsers is extensible, so you can add your own variant.
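The “try each variant until one claims the message” idea can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: VariantParser, VariantChain and the three regexes are invented for this example and are far cruder than the real variant parsers in SyslogDecode, which also extract the field values rather than just naming the format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical shape of a variant parser: returns false for "not my type".
public delegate bool VariantParser(string text, out string kind);

public static class VariantChain
{
    // Very rough signatures of three message styles (illustrative only).
    static readonly Regex Rfc5424Start = new Regex(@"^<\d{1,3}>1 ");
    static readonly Regex Rfc3164Start =
        new Regex(@"^<\d{1,3}>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ");
    static readonly Regex KeyValueStart = new Regex(@"^\w+=\S");

    // The list is ordered and can be extended with custom variants.
    public static readonly List<VariantParser> Parsers = new List<VariantParser>
    {
        (string t, out string k) => Claim(Rfc5424Start.IsMatch(t), "Rfc5424", out k),
        (string t, out string k) => Claim(Rfc3164Start.IsMatch(t), "Rfc3164", out k),
        (string t, out string k) => Claim(KeyValueStart.IsMatch(t), "KeyValuePairs", out k),
    };

    static bool Claim(bool matched, string name, out string kind)
    {
        kind = matched ? name : null;
        return matched;
    }

    // Calls the variants one by one; falls back to plain text if none succeeds.
    public static string Detect(string text)
    {
        foreach (var parser in Parsers)
            if (parser(text, out var kind)) return kind;
        return "PlainText";
    }
}
```

Adding your own variant in this sketch is a matter of appending (or inserting) another delegate in VariantChain.Parsers; the order of the list decides which variant gets the first try.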

There are also Value Extractors. These operate on an already parsed message: they try to identify “interesting” values based on some internal pattern, like IP addresses, and put them into a separate slot. The list of extractors is extensible as well.