AI War 2:Discussion Of Multiplayer Desyncs And Primary Keys
Q&A Items
The goal here is for other programmers or modders to be able to read some of the discussions that are had, and gain some understanding of this complicated topic from it. A more formal writeup would be great, and is something we will do, but these illustrate various use cases and their particular pitfalls.
Can I Add A Planet Without A GameCommand?
Badger: Oh, re: multiplayer and gamecommands, does the "Create new planet" mechanic need to go in a GameCommand rather than in the stage3 sim code?
Chris: Ah... technically not, although you may want to for multiplayer and single-player purposes. Well... actually maybe yes, even so, for some edge cases.
The "main thread" stages 1, 2, and 3 all happen on all of the clients and the host, so anything you do in there is going to be okay for multiplayer if it always happens the same way. So the simple answer is that creating a planet should be absolutely A-OK during this, and it should happen just fine on all clients.
BUT.
There's no guarantee which of the main thread stage 1s, 2s, and 3s will run in which order. You are guaranteed that all the stage 1s run, then all the stage 2s, then all the stage 3s. But if I have one core and you have 16, my computer will do them all much more slowly and in sequential-ish random-ish order, while yours will do 16ish of them at once in a race. If ANYTHING you do in those threads changes the immediate game state (create a planet, create a unit, alter health of a unit, whatever else) in a way that would affect the outcome of another thread, then you've introduced a desync. The massive disparity in cores will make it more obvious and frequent, but even if we both had 16 cores or 4 cores or 1 core, we'd have threads racing each other and executing in random order.
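To make the ordering problem concrete, here is a minimal sketch in Python (not the game's code). The two jobs and the health numbers are hypothetical stand-ins for "anything that changes immediate game state". Rather than spawning real threads, it simply runs the same two jobs in each possible order, which is what a thread race amounts to from the point of view of the final state.

```python
from itertools import permutations


def heal(state):
    # Hypothetical stage job: restore 50 health, capped at 100.
    state["health"] = min(100, state["health"] + 50)


def damage(state):
    # Hypothetical stage job: remove 30 health.
    state["health"] = state["health"] - 30


def run_in_order(jobs):
    # Every machine starts the frame from the same state.
    state = {"health": 80}
    for job in jobs:
        job(state)
    return state["health"]


# A race between two threads means either order can happen on a given machine.
outcomes = {run_in_order(order) for order in permutations([heal, damage])}
print(sorted(outcomes))  # [70, 100]
```

Same inputs, same jobs, two different results: that difference between machines is the desync. Jobs that only read shared state, or that only write state no other job looks at during that stage, do not have this problem.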
The saving grace is that this game knows that stuff like that will happen, and it can recover from it without too much trouble. But the more things like that we have, the more the network bogs down with correction-code. Running correction-code for an entire planet is going to be potentially quite heavy. Running it for all the various units that might pathfind differently if the planet is added a few ms earlier or later is just going to compound things a bit. This is assuming that all the planets being added aren't in a cul-de-sac, but even if they are there might be some fireteams that want to go over there.
Which actually brings me to single-player oddities that might happen. If you add some planets in code like that, and then populate them via gamecommands (that will be a must to make sure primary key IDs from the units are all identical on the host and all clients in multiplayer), then what now structurally happens is that even in single-player you might have a gap of... let's say up to 400ms where these new planets are absolutely un-owned by anyone, and look empty. So a quick-thinking faction of some sort may start making a beeline for the new empty territory, all sorts of decisions might happen in that period, and then they look very foolish and get into a big scuffle as soon as the new occupants arrive 400ms later while the other faction is still committed to this new destination that is suddenly dangerous.
In short, putting in the planet and all the stuff to go on it as part of a single SET of gamecommands (aka issued during the same 100ms "turn") would be a really good idea. There are still likely to be some edge cases where bits populate in 400ms too late if it's not literally all in one gamecommand in some fashion, but that is probably unavoidable. I guess the optimal thing from a sync standpoint would be one command that creates and populates the planet, though.
Backing up to less common cases that are still problematic, let's suppose we have two factions adding new planets at the same time as some sort of coincidence. Right now only one faction can even do that, so we think of that as impossible. But via mods or future DLC or whatever we might have more factions that can do it all at once. If we did, and both factions are creating planets at once during the same sim frame, then you might wind up with differences between clients and hosts on THAT basis, too. Faction A creates 4 planets named b, c, e, and d, numbered 10, 11, 12, and 13. Faction N creates 1 planet named z, and numbered 14. Except because these are multiple threads, on another computer planet z might be numbered anywhere from 10-14, thus throwing off the numbering of all the other planets.
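The numbering example above can be sketched as follows. This is an illustration in Python, not the game's code; `assign_indexes` and `planet_at` are hypothetical helpers, and the two creation orders are just one possible interleaving on each machine.

```python
def assign_indexes(creation_order, first_index=10):
    # Indexes are handed out in whatever order the creations happen to land.
    return {name: first_index + i for i, name in enumerate(creation_order)}


def planet_at(indexes, number):
    # Reverse lookup: which planet does this machine think "planet N" is?
    return next(name for name, index in indexes.items() if index == number)


# On the host, Faction A's thread finishes before Faction N's thread.
host = assign_indexes(["b", "c", "e", "d", "z"])
# On a client, Faction N's thread slips in after A's first planet.
client = assign_indexes(["b", "z", "c", "e", "d"])

mismatched = sorted(name for name in host if host[name] != client[name])
print(mismatched)             # ['c', 'd', 'e', 'z']
print(planet_at(host, 14))    # z
print(planet_at(client, 14))  # d
```

One planet landing in a different slot shifts every index after it, so a later command about "planet 14" refers to z on the host and d on this client.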
After such a severe mashup, any future gamecommands talking about "planet 14" are going to be incredibly out of sync, and lots of really hard-to-fix data is going to be super duper wrong. The sort of sync-fix code that I'm writing would never account for that sort of data being wrong, because it's going to assume that primary keys -- and by extension planet indexes and faction indexes -- are gospel. So instead of it getting fixed via sync, probably all of the ships will be corrected to match the host, but the planet ownership will be a bit off until you stop and reload the game. I mean, I was planning on having it re-sync planet ownership also, so that part would be fine... but there's all this custom backend code for factions where factions think they own this and that and keep track of "their" planets. There is no generalized way for sync code to see differences there or fix them, so that's the part that would persist in an incorrect way.
So! All of that is to say, anything that creates a new primary key ID, or something similar, without going through a gamecommand and getting a sanctioned ID from the host, is likely to create an irreversible desync that requires saving, loading, and reconnecting. A shot would be not a big deal because those are short-lived and the sync stuff would fix it up. Ships would also be found and fixed... but all that metadata if it is owned by another faction would be wrong for a very long time and very problematic potentially. Wormholes or planets or factions would actually be the worst cases, as other than broken mods or bad-acting dev code there's no way those could be wrong and thus no reason to check these things overly frequently.
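As a rough sketch of what "getting a sanctioned ID from the host" buys you, here is the idea in Python. None of these names come from the game: `CreatePlanetCommand`, `Host`, and `Machine` are hypothetical, and using one counter for both planets and units is a simplification made for the example. The point is only that IDs are chosen in exactly one place and then travel inside the command, and that the command carries the planet's contents along with it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CreatePlanetCommand:
    # Every ID in the command was chosen by the host before it was sent.
    planet_id: int
    name: str
    unit_ids: List[int]


class Host:
    """Hypothetical host-side allocator: the only place new IDs are minted."""

    def __init__(self, next_id: int = 10) -> None:
        self._next_id = next_id

    def _allocate(self) -> int:
        allocated = self._next_id
        self._next_id += 1
        return allocated

    def build_create_planet(self, name: str, unit_count: int) -> CreatePlanetCommand:
        planet_id = self._allocate()
        unit_ids = [self._allocate() for _ in range(unit_count)]
        return CreatePlanetCommand(planet_id, name, unit_ids)


class Machine:
    """Any participant (host or client) applying commands to its local state."""

    def __init__(self) -> None:
        self.planets: Dict[int, dict] = {}

    def apply(self, command: CreatePlanetCommand) -> None:
        # No ID is generated here; the planet arrives already populated.
        self.planets[command.planet_id] = {
            "name": command.name,
            "units": list(command.unit_ids),
        }


host = Host()
commands = [host.build_create_planet("b", 2), host.build_create_planet("z", 1)]

host_machine, client_machine = Machine(), Machine()
for command in commands:
    host_machine.apply(command)
# Applied in a different order purely to stress the point: the IDs still match.
for command in reversed(commands):
    client_machine.apply(command)

print(host_machine.planets == client_machine.planets)  # True
```

Because no machine invents an ID while applying a command, thread timing on any one machine cannot change which number a planet or unit ends up with, and there is no window where the planet exists but is empty.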
So... after a lot of thought, and contrary to my initial thoughts: the end answer is yes, we absolutely need to do this by gamecommand. It can be done by Stage3 if you want, but it needs to only be run on the host (there's a flag for that, or a flag for !client, either way), since we don't want 4 copies of the commands if there are 4 human players connected. BUT if you use Stage3 instead of long term planning, you will actually have a secondary problem with Rand(). If you use Rand() at all in a main thread that is supposed to be identical on all the machines, but some of those Rand() calls are inside hostonly or clientonly code, then the values that further calls to that Rand() return will all be inconsistent on that thread on that frame. All of that should be recoverable by the sync code, but I'd rather not spike the network with that happening.
If you need to use Rand() and only do so on the host, then you can either use Engine_Universal.PermanentRandom (normally a huge no-no, but since it is non-synchronous it actually is fine here), or you can create a new random to use temporarily for just this purpose (kind of a waste, honestly), or you can just use the long-term planning (that may be inconvenient for code organization, or too slow to come around since those are considered long-term actions that may have a few seconds between them). Suffice it to say, if you are in one of the main thread things (stage3 or otherwise) and you're bracketed into hostonly or clientonly code, just be sure not to use Context.Random or its variants or that's a desync.
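The Rand() problem is easy to reproduce outside the game. The sketch below uses Python's standard `random.Random` as a stand-in; it does not use or imitate the game's actual random classes, and `run_frame`, the seed, and the number of draws are all assumptions made for the example. A seeded generator plays the part of the synchronized random that every machine steps through identically, and an unseeded one plays the part of a separate, non-synchronized random.

```python
import random


def run_frame(is_host, use_separate_host_rng):
    # Seeded identically everywhere: stands in for the synchronized random.
    shared = random.Random(1234)
    # Unseeded and local: stands in for a separate, non-synchronized random.
    host_only = random.Random()

    draws = [shared.random()]  # a draw every machine makes
    if is_host:
        rng = host_only if use_separate_host_rng else shared
        rng.random()  # host-only decision, e.g. where to put the new planet
    # Later draws on the same thread, made by every machine.
    draws.extend(shared.random() for _ in range(4))
    return draws


# Host-only code pulls from the shared random: the host's stream is now one step ahead.
buggy_host = run_frame(is_host=True, use_separate_host_rng=False)
buggy_client = run_frame(is_host=False, use_separate_host_rng=False)
print(buggy_host == buggy_client)  # False

# Host-only code pulls from its own random: the shared stream stays in step.
fixed_host = run_frame(is_host=True, use_separate_host_rng=True)
fixed_client = run_frame(is_host=False, use_separate_host_rng=True)
print(fixed_host == fixed_client)  # True
```

The first draw matches in both cases; it is everything after the host-only branch that drifts. That is why the rule is about which random you touch inside hostonly or clientonly code, not about whether you use randomness at all.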
This is so lengthy and so full of the general thought process behind everything that I think I'll put this on the wiki. ;)