Scalability and limits #1395
-
Hi Alexander, thanks for dropping us a line! Unless your retention period is very short, 2 TB/day will be uncomfortable on a single Seq node. The current version does not scale out across nodes, as you noted, though we're actively working on this. If your existing infrastructure is using Kibana for metrics-heavy visualization, Seq may not be a great fit - Seq's strength is in diagnostics with structured application logs, where levelling and filtering mean many more use cases come in well under multiple-TB/day ingest rates.

RE your other questions, though - data that hasn't been indexed will still be written to disk; the 85 GB won't be held in RAM only. (Seq separately buffers recent data in RAM to speed up queries until indexing is applied.) Indexing in Seq is applied through signals only (it's not inferred by the query planner), so to speed up your most common queries you'd create signals covering them.

Seq and Kibana have some areas of overlap, but they're quite different products, so mapping one directly across onto the other might feel quite awkward; I'd still be interested to dig in and understand your use case better, if you think there might be some value in exploring it further.
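For illustration only (this sketch is not from the original reply): a signal in Seq is essentially a named, saved filter expression, and applying several signals at once intersects their filters. The property names `Application` and `Elapsed` below are assumptions, not from the thread:

```
Application = 'ingest'
@Level = 'Error' or @Level = 'Fatal'
Elapsed > 1000
```

Each line would be the filter for a separate signal - e.g. one per service, one for errors, one for slow operations - and indexing is then maintained for the events matching those signals.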
-
Thanks for responding. To be clear, we are mostly not using Kibana for metrics, only structured logging. Metrics go into Prometheus, and we only extract a couple of things from logs, which also feed into Prometheus. We are looking to replace Kibana for structured logging - that is, running ad-hoc queries against logs for diagnostic purposes. This means the variables (log fields) can be anything, and the user wants to be able to cross-cut against any dimension at query time and get great performance while doing so. Does this match the primary use case of Seq?

I'm not sure I understand Seq's indexing. Does this mean that when you create a "signal", Seq goes and retroactively indexes all the data matching that signal? Doesn't that mean there will be a delay unless you assiduously pre-index by creating signals, e.g. one per application? (We have a few dozen microservices and many cross-cutting fields like host names, shards, etc.) And how will it perform if you don't have a signal?

For example, let's say we are experiencing an application problem. A typical Kibana query a user might run would be something like:
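(A hypothetical Lucene-style query of the kind described; the field names `app`, `level`, and `host` are illustrative assumptions, not from the thread:)

```
app:"ingest" AND level:"ERROR" AND host:"web-42"
```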
Then the user would typically start filtering out noise (like irrelevant debug messages), tweaking the timeframe, etc. to get at the problem. Often, Kibana's "View surrounding documents" is super useful for getting a window across all log statements, which can then be filtered to eliminate noise. Being able to chart the data more easily is something we'd also like: Kibana is frankly awful at this, and Seq's ability to easily view any query as a chart looks really useful. The same goes for exporting slices of the log.

Edit: Sounds like Seq wouldn't actually be able to deal with the load right now. Do you have a planned release timeline for clustering?
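As a sketch of the charting workflow mentioned above (the `Application` property name is an assumption, not from the thread), a Seq SQL-style query along these lines groups events into hourly buckets, and Seq can render the grouped result as a chart:

```
select count(*)
from stream
where Application = 'ingest' and @Level = 'Error'
group by time(1h)
```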
-
I'm trying to find out how well Seq scales to large data sizes. We are currently logging roughly 2TB/day (about 85 GB per hour) to Kibana. I can't find any documentation on Seq's limitations or expected performance in such scenarios.
The indexing page says:
Does this really mean that in our case it would log 85GB to RAM before flushing?
I would also like to know more about exactly how Seq indexes the data. If I do a search such as `module = 'ingest' and elapsed > 1000`, how does the query planner break down this search? How are the fields indexed? Does Seq use columnar storage?

The documentation also seems to indicate that Seq does not have replication. So it sounds like Seq requires that all data fit on a single node, and queries are never sharded or distributed?