Paying mintterd's Technical Debt

Problem

Many users have reported performance issues with the app, ranging from mildly annoying to making it completely unusable. It's very hard to identify a single root cause, probably because there isn't one: it all depends on a combination of different factors.
The severity of the issues depends primarily on two factors: the number of trusted accounts, and the specs of the computer running the app.
The number of trusted accounts affects the performance because of the way our syncing works.
We periodically try to resolve the peer IDs of those peers we want to sync with, then we attempt to dial the resolved IPs, and then we try to sync with them.
We've made some improvements on this front by introducing concurrency limits for peer routing, along with the backoff delay when retrying failed routing attempts. But because the backoff is not being persisted across app restarts, every time the app starts there's a huge spike in resource consumption until the backoff delay gets long enough to smooth it out.
Even with those measures in place we often end up having more than 1000 open connections at a time when doing peer routing. DHT peer routing is not very efficient in IPFS at the moment, so we may need to find alternative strategies mid-term (like gossiping peer records).
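To illustrate, here's a rough sketch of the kind of per-peer backoff state we could persist across restarts; all the names are hypothetical, and the point is simply that the peers map could be saved to and restored from the database:

    // Hypothetical sketch of per-peer routing backoff. Persisting the
    // peers map across restarts would avoid the startup resource spike.
    package backoff

    import (
        "sync"
        "time"
    )

    type peerState struct {
        Failures    int       // consecutive failed routing attempts
        NextAttempt time.Time // don't try routing this peer again before this time
    }

    type Tracker struct {
        mu    sync.Mutex
        base  time.Duration // initial delay, e.g. 1 minute
        max   time.Duration // cap, e.g. 1 hour
        peers map[string]peerState
    }

    // ShouldAttempt reports whether we're allowed to route this peer now.
    func (t *Tracker) ShouldAttempt(pid string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        return time.Now().After(t.peers[pid].NextAttempt)
    }

    // RecordFailure doubles the delay for the peer, up to the cap.
    func (t *Tracker) RecordFailure(pid string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        s := t.peers[pid]
        s.Failures++
        delay := t.base << (s.Failures - 1) // base * 2^(failures-1)
        if delay <= 0 || delay > t.max {
            delay = t.max // the shift may overflow after many failures; clamp
        }
        s.NextAttempt = time.Now().Add(delay)
        t.peers[pid] = s
    }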
Computer specs are a bit of a mixed bag.
Specs seem to matter less on modern computers with moderate RAM and at least an SSD. It also turns out my idea that HDD computers are a thing of the past (heck, even my parents have an SSD in their pretty old computer) was wrong: even among our very small number of users, some people have HDDs.
We target communities, and apparently the fact that at least one person in any given community will have an HDD is a law of physics now :) This means that an entire community won't adopt Mintter even if only one person can't use the app.
HDDs are very slow compared to SSDs, and our app uses SQLite even for the most basic persistence needs. This is not an issue at all on SSDs, and lots of people advocate for this kind of approach these days, but on HDDs it gets problematic, especially on Windows, where filesystems are notoriously slower. In addition to that, our DB access patterns are not optimized at all.
All of this causes our app to do lots of disk I/O, even after the improvements made during the past weeks (we created DB indexes with better data locality, which lets SQLite use its internal page cache much better, as more relevant data can fit in one page).
Performance issues not only make the app slow for many users, but also make it consume a lot of energy and drain batteries on laptops. This is a big deal for some of our laptop users, and even though this project doesn't directly aim to solve energy consumption problems, we can expect improvements on this front as well. We'd have to measure it after we've taken all the obvious measures to improve performance.
Overall, given the lack of optimization in most of the backend flows, and the amount of technical debt and shortcuts accumulated in the past, it's very hard to improve the performance of the app without a more coherent strategy, which is what this project aims to describe.

Solution

Roughly, the solution can be summarized as:
    Refactor syncing.
    Optimize API calls.
    Keep more data in-memory, and reduce DB access.

Refactor Syncing

Currently, syncing with sites is implemented separately from the usual periodic syncing with trusted accounts. It was a shortcut we took when implementing site groups for the first time, but there's no reason to have two separate but nearly identical syncing flows. We should aim to unify them.
Site syncing also doesn't have any backoff delays. Although this is not an issue in most cases (because sites don't need peer routing), it may become one on constrained devices, and it causes unnecessary energy consumption.
We have lots of dead or broken sites in the database right now, and because we don't have any garbage collection, we keep syncing with those sites unnecessarily.
Users also don't have a way to selectively stop syncing with some of the sites (we sync with all the site groups our node knows about), so having a backoff delay would help smooth out the load.
We could also have a more aggressive backoff delay, so that we get to longer intervals more quickly.
The overall architecture of the syncing scheduler needs improvement.
Instead of spawning a worker goroutine for every single peer we want to sync with, we could limit the number of goroutines to the number of currently online peers (because those are the only ones we can actually sync with).
In addition, a separate dispatcher goroutine would trigger the syncing process for those peers that need it at any given time, according to the designated interval.
The dispatcher would keep the number of worker goroutines in line with the number of online peers, making sure that syncing with one peer doesn't affect syncing with other peers.
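Roughly, the dispatcher/worker shape could look like this (names are hypothetical, and error handling, peer discovery, and per-peer intervals are omitted):

    // Hypothetical sketch of the syncing scheduler: a single dispatcher
    // goroutine, plus one worker goroutine per currently online peer.
    package syncing

    import (
        "context"
        "time"
    )

    type peerID string

    type Scheduler struct {
        Interval time.Duration
        Online   func() []peerID               // returns currently online peers
        SyncPeer func(context.Context, peerID) // performs one sync round with a peer
    }

    // Run triggers syncing for all online peers on every tick. Each peer
    // gets its own worker, so a slow peer can't block the others.
    func (s *Scheduler) Run(ctx context.Context) {
        ticker := time.NewTicker(s.Interval)
        defer ticker.Stop()

        inflight := make(map[peerID]chan struct{})
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // Reap workers that have finished since the last tick.
                for pid, done := range inflight {
                    select {
                    case <-done:
                        delete(inflight, pid)
                    default:
                    }
                }
                for _, pid := range s.Online() {
                    if _, busy := inflight[pid]; busy {
                        continue // still syncing with this peer
                    }
                    done := make(chan struct{})
                    inflight[pid] = done
                    go func(pid peerID, done chan struct{}) {
                        defer close(done)
                        s.SyncPeer(ctx, pid)
                    }(pid, done)
                }
            }
        }
    }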
We should further optimize our use of Peer Routing.
We will try to use IPFS Delegated Routing nodes to reduce our reliance on the DHT, and reduce the amount of resources necessary for managing all the DHT connections. More on that later.

Optimize List API Calls

A few API calls in mintterd used by the desktop app, namely ListPublications and ListDrafts, are implemented very poorly.
Those calls return a lot more data than necessary, don't have any pagination, and read every single change from disk, for every single document, every single time. This is very wasteful.
    We should implement pagination in those requests (NEEDS FRONTEND WORK); a rough sketch follows below.
    We should only return the data necessary for listing (i.e. no content).
    We should avoid reading changes from disk, and index all the necessary information into the DB. I believe we already have everything we need.
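For illustration, cursor-based pagination for these calls could look roughly like this; the field and message names below are hypothetical, not our actual API:

    // Hypothetical shape of a paginated listing call that returns only
    // the data needed for rendering a list, with no document content.
    package documents

    type ListPublicationsRequest struct {
        PageSize  int32  // maximum number of items to return
        PageToken string // opaque cursor from a previous response; empty for the first page
    }

    type PublicationListItem struct {
        ID         string // document ID
        Title      string
        Author     string
        UpdateTime int64 // unix seconds; enough for sorting and display
    }

    type ListPublicationsResponse struct {
        Publications  []PublicationListItem
        NextPageToken string // empty when there are no more pages
    }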

Optimize UpdateDraft Calls

Similarly to the previous point, the UpdateDraft call is implemented very inefficiently: for every single request we read all the dependency changes of the draft, apply them, and only then handle the incoming update.
The debounce interval on the frontend is very short, so this API is called a ton when you edit a document.
We should be able to keep the draft state in memory while it's being edited. Even if we keep saving every change to the DB as we do now, we will at least reduce the amount of reads happening, especially for documents with a long history.
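Something along these lines could work: a small in-memory cache of materialized draft states, loaded from disk only on first access (names are hypothetical, and invalidation on publish or delete is omitted):

    // Hypothetical in-memory cache of draft states, keyed by document ID,
    // so repeated UpdateDraft calls don't re-read the whole change history.
    package documents

    import "sync"

    type draftState struct {
        doc any // the draft materialized from all its dependency changes
    }

    type draftCache struct {
        mu     sync.Mutex
        drafts map[string]*draftState
    }

    // get returns the cached state, calling load only on the first access.
    func (c *draftCache) get(id string, load func(string) (*draftState, error)) (*draftState, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if s, ok := c.drafts[id]; ok {
            return s, nil
        }
        s, err := load(id)
        if err != nil {
            return nil, err
        }
        c.drafts[id] = s
        return s, nil
    }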

Change Compression

We use zstd compression for the blob data on disk. The reference implementation of zstd is written in C, but we're using a pure Go version of it. It works fine, but its default configuration is more suitable for servers than for end-user devices, as it uses more memory than desirable.
There are ways to configure zstd to use less memory (e.g. the folks at Tailscale do it here), so we should try it to see what gains we can get.
I'm not sure if we can change the configuration without recompressing all the data, but even if we need to recompress everything, it's not a big deal.
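For reference, here's roughly how the pure Go zstd package (I'm assuming we use github.com/klauspost/compress/zstd, like Tailscale does) can be configured for lower memory usage; the concrete values are just a starting point we'd need to benchmark on our data:

    // Hypothetical low-memory zstd configuration using the pure Go
    // implementation from github.com/klauspost/compress/zstd.
    package blobs

    import "github.com/klauspost/compress/zstd"

    func newSmallEncoder() (*zstd.Encoder, error) {
        return zstd.NewWriter(nil,
            zstd.WithEncoderConcurrency(1), // one goroutine: less internal buffering
            zstd.WithWindowSize(32*1024),   // small window: smaller buffers
            zstd.WithLowerEncoderMem(true), // trade some speed for less memory
        )
    }

    func newSmallDecoder() (*zstd.Decoder, error) {
        return zstd.NewReader(nil,
            zstd.WithDecoderConcurrency(1),
            zstd.WithDecoderLowmem(true),
        )
    }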

Keep More Data in Memory

As a summary of the previously mentioned points: we should reduce our database usage by keeping rapidly changing data in memory, and when we do need to access the database, we should batch queries whenever possible.
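As a trivial example of what batching means here: fetching metadata for many blobs in one query instead of one query per blob inside a loop (a generic database/sql sketch, not our actual code):

    // Hypothetical sketch: one round trip for N blobs instead of N queries.
    package blobs

    import (
        "database/sql"
        "strings"
    )

    func loadSizes(db *sql.DB, ids []string) (map[string]int64, error) {
        if len(ids) == 0 {
            return map[string]int64{}, nil
        }
        placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
        args := make([]any, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        rows, err := db.Query("SELECT multihash, size FROM blobs WHERE multihash IN ("+placeholders+")", args...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        out := make(map[string]int64, len(ids))
        for rows.Next() {
            var id string
            var size int64
            if err := rows.Scan(&id, &size); err != nil {
                return nil, err
            }
            out[id] = size
        }
        return out, rows.Err()
    }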

Delegated Routing

We rely heavily on the DHT for resolving Peer IDs to IP addresses for those peers that are not sites. The default DHT peer routing is currently not very efficient, but it works surprisingly well and reliably, provided that the peer is actually online and active. But trying to resolve lots of Peer IDs that are "dead" is very expensive due to the number of connections that need to be opened.
The idea of Delegated Routing is to have a "proxy" node that does the DHT queries for you, amortizing the number of connections and round trips across multiple requests. For us it means that our app only needs to talk to a single delegated routing node, instead of opening thousands of connections to the actual DHT server nodes. The information under the hood still lives in the DHT, but delegated routing makes it less resource-intensive to access.
There are publicly available delegated routing nodes run by Protocol Labs, and we should try using them in our app to see whether they bring any benefits in terms of resource usage.
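To make it concrete: resolving a peer through a delegated routing node boils down to a single HTTP request. The sketch below follows my reading of the IPFS HTTP routing spec, so the endpoint path and response shape are assumptions to verify, and the router URL is a placeholder:

    // Hypothetical sketch of delegated peer routing over plain HTTP.
    package routing

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type peerRecord struct {
        ID    string   // peer ID
        Addrs []string // multiaddrs the routing node knows for this peer
    }

    // findPeer asks the delegated routing node to resolve a peer for us:
    // one HTTP connection instead of many connections to DHT server nodes.
    func findPeer(routerURL, peerID string) ([]peerRecord, error) {
        resp, err := http.Get(fmt.Sprintf("%s/routing/v1/peers/%s", routerURL, peerID))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("routing node returned %s", resp.Status)
        }
        var out struct{ Peers []peerRecord }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            return nil, err
        }
        return out.Peers, nil
    }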

Rabbit Holes

Frontend work is necessary for pagination. Not sure at what point we can fit it in.

No Gos

This project is not directly focused on improving energy consumption and battery life on laptops, although we can expect major improvements there anyway.

Progress

This project will take multiple weeks. It's hard to say how many, because it needs some frontend work, and we don't know much about delegated routing yet, so it will take some time to put it in place and see if it helps in any way.
I believe splitting this project into multiple smaller ones would make it harder to follow up on. It might be easier to track the progress in this section of the document, linking this doc from any relevant Hypertuesday document with some notes about what's going to be done in which week.
We can also track what's done right here.
    Optimized SQLite queries that select from or join the blobs table without needing the blob data itself, by creating indices that only store blob metadata. This way SQLite doesn't need to read large pages full of blob data, which was causing a lot of cache misses.
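For illustration, the shape of such an index looks roughly like this (table and column names are simplified, not our exact schema):

    package blobs

    // Hypothetical covering index over blob metadata. Queries that only
    // need metadata are answered entirely from the index pages, without
    // touching the large pages that hold the blob bodies.
    const createBlobsMetaIndex = `
        CREATE INDEX IF NOT EXISTS blobs_metadata
        ON blobs (multihash, codec, size);
    `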