Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete IDs of non existent documents in Engine::reindex() #472

Open
Toflar opened this issue Jan 6, 2025 · 2 comments
Open

Delete IDs of non existent documents in Engine::reindex() #472

Toflar opened this issue Jan 6, 2025 · 2 comments

Comments

@Toflar
Copy link
Member

Toflar commented Jan 6, 2025

When you pass identifiers in the ReindexConfig, the ones that do not exist anymore should get deleted.
Something along the lines of this:

Subject: [PATCH] Delete reindex identifiers that do not exist anymore
---
Index: packages/seal/src/Engine.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/packages/seal/src/Engine.php b/packages/seal/src/Engine.php
--- a/packages/seal/src/Engine.php	(revision 4ea343448a96c12010b2902b9c6c124c88555d88)
+++ b/packages/seal/src/Engine.php	(date 1736179896990)
@@ -140,16 +140,23 @@
     ): void {
         /** @var array<string, ReindexProviderInterface[]> $reindexProvidersPerIndex */
         $reindexProvidersPerIndex = [];
+        /** @var array<string, string> $identifiersPerIndex */
+        $identifiersPerIndex = [];
         foreach ($reindexProviders as $reindexProvider) {
             if (!isset($this->schema->indexes[$reindexProvider::getIndex()])) {
                 continue;
             }
 
+            $identifiersPerIndex[$reindexProvider::getIndex()] = $this->schema->indexes[$reindexProvider::getIndex()]->getIdentifierField()->name;
+
             if ($reindexProvider::getIndex() === $reindexConfig->getIndex() || null === $reindexConfig->getIndex()) {
                 $reindexProvidersPerIndex[$reindexProvider::getIndex()][] = $reindexProvider;
             }
         }
 
+        // Track documents that need to be deleted if an identifiers array was given
+        $documentIdsToDelete = array_flip($reindexConfig->getIdentifiers());
+
         foreach ($reindexProvidersPerIndex as $index => $reindexProviders) {
             if ($reindexConfig->shouldDropIndex() && $this->existIndex($index)) {
                 $task = $this->dropIndex($index, ['return_slow_promise_result' => true]);
@@ -164,7 +171,7 @@
             foreach ($reindexProviders as $reindexProvider) {
                 $this->bulk(
                     $index,
-                    (function () use ($index, $reindexProvider, $reindexConfig, $progressCallback) {
+                    (function () use ($index, $reindexProvider, $reindexConfig, $progressCallback, $documentIdsToDelete, $identifiersPerIndex) {
                         $count = 0;
                         $total = $reindexProvider->total();
 
@@ -172,6 +179,9 @@
                         foreach ($reindexProvider->provide($reindexConfig) as $document) {
                             ++$count;
 
+                            // Document still exists, do not delete
+                            unset($documentIdsToDelete[$document[$identifiersPerIndex[$index]]]);
+
                             yield $document;
 
                             if (null !== $progressCallback
@@ -193,5 +203,9 @@
                 );
             }
         }
+
+        if ([] !== $documentIdsToDelete) {
+            $this->bulk('???', [], $documentIdsToDelete, $reindexConfig->getBulkSize());
+        }
     }
 }

As you can see from the ??? in the patch, I was unsure about the index name. There seems to be a conceptual issue in this method because somehow you pass an array of $reindexProviders per index and also the ReindexConfig has an index. How would you know which IDs in the ReindexConfig belong to which index?

@alexander-schranz
Copy link
Member

This is expected behaviour if you want to drop existing documents you need to add --drop to the command. Special if you index only partial part of a global index you don't want to clear none received documents.

Zero downtime reindexing will be part of: #212

@Toflar
Copy link
Member Author

Toflar commented Jan 6, 2025

Hmm...yeah actually I don't need to drop the entire index, just the documents that have not been mentioned. But maybe we'll wait for #212 then :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants