Simple join on int + tstzrange column very slow on only ~1 million rows

I am struggling with the performance of a query that involves a "simple" left join on an int column and a tstzrange column:

SELECT 
     table_1.id_col 
    , table_1.time_range 
    , table_1.other_col_1 
    , table_2.other_col_2 
FROM table_1 
LEFT JOIN table_2 
ON table_1.id_col = table_2.id_col 
AND table_1.time_range = table_2.time_range

This query takes ~80-100 seconds to execute for a final result set of ~1 million rows (table_1 and table_2 are of the same order of magnitude).

This query is part of a more complex CTE query (which actually selects a small subset of these 1 million rows), but I have extracted the part that is the bottleneck.

I have added what I believe is the appropriate index (a GiST index) on the combination of these columns, but judging from the EXPLAIN output I assume it is ignored because I am joining practically all rows.

Is there a way to improve performance? For example, by physically sorting the rows for the sequential scan?

My tables:

CREATE TABLE data.table_1 (
    table_1_id SERIAL NOT NULL, 
    id_col INTEGER NOT NULL, 
    time_range TSTZRANGE NOT NULL, 
    other_col_1 INTEGER, 
    PRIMARY KEY (table_1_id)
); 

CREATE INDEX idx_table_1_id_col ON data.table_1 (id_col); 
CREATE INDEX idx_table_1_time_range ON data.table_1 USING gist (time_range); 
CREATE INDEX idx_table_1_id_col_time_range ON data.table_1 USING gist (id_col, time_range); 

CREATE TABLE data.table_2 (
    table_2_id SERIAL NOT NULL, 
    id_col INTEGER NOT NULL, 
    time_range TSTZRANGE NOT NULL, 
    other_col_2 DOUBLE PRECISION, 
    PRIMARY KEY (table_2_id)
); 

CREATE INDEX idx_table_2_id_col ON data.table_2 (id_col); 
CREATE INDEX idx_table_2_time_range ON data.table_2 USING gist (time_range); 
CREATE INDEX idx_table_2_id_col_time_range ON data.table_2 USING gist (id_col, time_range);
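
A note on the composite indexes above: GiST has no built-in operator class for a plain INTEGER column, so an index such as idx_table_1_id_col_time_range normally requires the btree_gist extension. The question does not show this step, so presumably the extension was already installed; the statement below is only a sketch of that assumed prerequisite.

```sql
-- Assumption: this step is not shown in the question.
-- Without btree_gist, CREATE INDEX ... USING gist (id_col, time_range)
-- fails because INTEGER has no default GiST operator class.
CREATE EXTENSION IF NOT EXISTS btree_gist;
```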

Here is the detailed EXPLAIN output:

[ 
    { 
    "Plan": { 
     "Node Type": "Hash Join", 
     "Join Type": "Left", 
     "Startup Cost": 198185.10, 
     "Total Cost": 4163704.54, 
     "Plan Rows": 73508636, 
     "Plan Width": 20, 
     "Actual Startup Time": 31055.086, 
     "Actual Total Time": 89488.540, 
     "Actual Rows": 1015568, 
     "Actual Loops": 1, 
     "Output": ["table_1.id_col", "table_1.other_col_1", "table_2.other_col_2"], 
     "Hash Cond": "((table_1.id_col = table_2.id_col) AND (table_1.time_range = table_2.time_range))", 
     "Shared Hit Blocks": 165149, 
     "Shared Read Blocks": 632793, 
     "Shared Dirtied Blocks": 0, 
     "Shared Written Blocks": 0, 
     "Local Hit Blocks": 0, 
     "Local Read Blocks": 0, 
     "Local Dirtied Blocks": 0, 
     "Local Written Blocks": 0, 
     "Temp Read Blocks": 38220, 
     "Temp Written Blocks": 37966, 
     "I/O Read Time": 0.000, 
     "I/O Write Time": 0.000, 
     "Plans": [ 
     { 
      "Node Type": "Seq Scan", 
      "Parent Relationship": "Outer", 
      "Relation Name": "table_1", 
      "Schema": "data", 
      "Alias": "table_1", 
      "Startup Cost": 0.00, 
      "Total Cost": 1492907.36, 
      "Plan Rows": 73508636, 
      "Plan Width": 34, 
      "Actual Startup Time": 24827.453, 
      "Actual Total Time": 77143.930, 
      "Actual Rows": 904431, 
      "Actual Loops": 1, 
      "Output": ["table_1.id_col", "table_1.other_col_1", "table_1.time_range"], 
      "Shared Hit Blocks": 165147, 
      "Shared Read Blocks": 592674, 
      "Shared Dirtied Blocks": 0, 
      "Shared Written Blocks": 0, 
      "Local Hit Blocks": 0, 
      "Local Read Blocks": 0, 
      "Local Dirtied Blocks": 0, 
      "Local Written Blocks": 0, 
      "Temp Read Blocks": 0, 
      "Temp Written Blocks": 0, 
      "I/O Read Time": 0.000, 
      "I/O Write Time": 0.000 
     }, 
     { 
      "Node Type": "Hash", 
      "Parent Relationship": "Inner", 
      "Startup Cost": 88292.64, 
      "Total Cost": 88292.64, 
      "Plan Rows": 4817164, 
      "Plan Width": 34, 
      "Actual Startup Time": 6204.927, 
      "Actual Total Time": 6204.927, 
      "Actual Rows": 4817085, 
      "Actual Loops": 1, 
      "Output": ["table_2.other_col_2", "table_2.id_col", "table_2.time_range"], 
      "Hash Buckets": 65536, 
      "Original Hash Buckets": 65536, 
      "Hash Batches": 128, 
      "Original Hash Batches": 128, 
      "Peak Memory Usage": 2930, 
      "Shared Hit Blocks": 2, 
      "Shared Read Blocks": 40119, 
      "Shared Dirtied Blocks": 0, 
      "Shared Written Blocks": 0, 
      "Local Hit Blocks": 0, 
      "Local Read Blocks": 0, 
      "Local Dirtied Blocks": 0, 
      "Local Written Blocks": 0, 
      "Temp Read Blocks": 0, 
      "Temp Written Blocks": 31422, 
      "I/O Read Time": 0.000, 
      "I/O Write Time": 0.000, 
      "Plans": [ 
      { 
       "Node Type": "Seq Scan", 
       "Parent Relationship": "Outer", 
       "Relation Name": "table_2", 
       "Schema": "data", 
       "Alias": "table_2", 
       "Startup Cost": 0.00, 
       "Total Cost": 88292.64, 
       "Plan Rows": 4817164, 
       "Plan Width": 34, 
       "Actual Startup Time": 0.650, 
       "Actual Total Time": 3769.157, 
       "Actual Rows": 4817085, 
       "Actual Loops": 1, 
       "Output": ["table_2.other_col_2", "table_2.id_col", "table_2.time_range"], 
       "Shared Hit Blocks": 2, 
       "Shared Read Blocks": 40119, 
       "Shared Dirtied Blocks": 0, 
       "Shared Written Blocks": 0, 
       "Local Hit Blocks": 0, 
       "Local Read Blocks": 0, 
       "Local Dirtied Blocks": 0, 
       "Local Written Blocks": 0, 
       "Temp Read Blocks": 0, 
       "Temp Written Blocks": 0, 
       "I/O Read Time": 0.000, 
       "I/O Write Time": 0.000 
      } 
      ] 
     } 
     ] 
    }, 
    "Planning Time": 0.350, 
    "Triggers": [ 
    ], 
    "Execution Time": 89689.809 
    } 
]
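
For reference, output of this shape (JSON, with actual timings, per-node output columns, and buffer counters) is what PostgreSQL emits when EXPLAIN is run with the ANALYZE, VERBOSE, and BUFFERS options in JSON format. The exact command was not included in the question, so the following is an assumed reconstruction, not the author's original statement.

```sql
-- Assumed reconstruction of the command behind the plan above.
-- Note that ANALYZE actually executes the query, so this takes as long
-- as the query itself.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT JSON)
SELECT
      table_1.id_col
    , table_1.time_range
    , table_1.other_col_1
    , table_2.other_col_2
FROM data.table_1
LEFT JOIN data.table_2
    ON  table_1.id_col = table_2.id_col
    AND table_1.time_range = table_2.time_range;
```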


2017-08-09 salient

Can't you put the WHERE conditions (I assume you are going to filter these results later) directly into this query?

@LorenzoCatalano, but that is done indirectly via the conditions arising from the CTE. I basically have other tables to which subsets of the above are joined. (If that makes sense) – salient

Looks like a normal join. I can't tell exactly what the plan says, but I see "Plan Rows": 73508636; what does that mean?

Physically sorting the data using CLUSTER reduced the query time to ~5 seconds, which is OK considering that I will additionally select a subset of the rows:

CLUSTER table_1 USING idx_table_1_id_col_time_range; 
CLUSTER table_2 USING idx_table_2_id_col_time_range;
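
A slightly fuller sketch of the same idea, assuming the tables live in the data schema as in the CREATE TABLE statements. The ANALYZE calls and the pg_class check are additions, not part of the original answer. They may be useful because the plan in the question estimated 73,508,636 rows for table_1 while the scan returned 904,431, and it touched roughly 758k blocks to do so. That suggests stale statistics and a lot of dead space, and since CLUSTER rewrites the table, this may account for a good part of the speedup.

```sql
-- Rewrite each table in index order. CLUSTER holds an ACCESS EXCLUSIVE
-- lock on the table while it runs, and it is a one-time operation:
-- rows inserted or updated afterwards are not kept in order.
CLUSTER data.table_1 USING idx_table_1_id_col_time_range;
CLUSTER data.table_2 USING idx_table_2_id_col_time_range;

-- Refresh planner statistics after the rewrite (addition, not in the answer).
ANALYZE data.table_1;
ANALYZE data.table_2;

-- Compare the planner's stored estimates with the real row count.
SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE oid IN ('data.table_1'::regclass, 'data.table_2'::regclass);

SELECT count(*) AS actual_rows FROM data.table_1;
```

If estimated_rows and actual_rows still differ by orders of magnitude after ANALYZE, the "Plan Rows" figures in EXPLAIN will be similarly off.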


2017-08-09 13:59:06 salient
