I have the following problem: optimizing a SQL query to get rid of a spool space problem.
I have three tables: 1) an all-users history table (allcontact) that records the users of every instance ever run, 2) an all-instances table (allinstances) that records every instance ever run, and 3) an active-instances table (activeinstances) where I can identify the currently active instances.
My goal is to get all users that are part of an active instance. The problem is that the all-users table contains 137 billion records, so it is impossible to join it directly in a query.
My best query so far:
SELECT allcontact.users
FROM allcontact
WHERE EXISTS
  ( SELECT 1
    FROM allinstances
    WHERE allinstances.instances = allcontact.instances
      AND EXISTS
        ( SELECT 1
          FROM activeinstances
          WHERE activeinstances.end_date > CURRENT_TIMESTAMP
            AND activeinstances.run_id = allinstances.run_id
            AND activeinstances.run_date = allinstances.run_date
        )
  )
QUALIFY ROW_NUMBER() OVER (PARTITION BY allcontact.users ORDER BY allcontact.users DESC) = 1
Currently it works with the following logic: it checks all runs where end_date is greater than the current date, then takes all instances from the allinstances table where those conditions are met. However, this query ends with a spool space problem. The reason I have to do it this way is that a single run can contain instances that are not present in the activeinstances table, so I have to take all runs based on run_date and run_id and find the matching instances in allinstances. This query gives me correct results, but I am only able to run it if I reduce the number of results, which I cannot do in final production.
I am able to run it if I first create a volatile table with all the active instances and then join that to the allcontact table. However, in the final product where this query has to go, I am not able to create volatile tables.
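For reference, the volatile-table workaround described above would look roughly like this — a sketch only, with column names and the PRIMARY INDEX choice assumed from the queries in this question:

```sql
-- Sketch: materialize the small set of active instance ids first,
-- then semi-join the huge allcontact table against it.
CREATE VOLATILE TABLE active_inst AS
( SELECT DISTINCT allinstances.instances
  FROM allinstances
  JOIN activeinstances
    ON  activeinstances.run_id   = allinstances.run_id
    AND activeinstances.run_date = allinstances.run_date
  WHERE activeinstances.end_date > CURRENT_TIMESTAMP
) WITH DATA
PRIMARY INDEX (instances)
ON COMMIT PRESERVE ROWS;

SELECT DISTINCT allcontact.users
FROM allcontact
JOIN active_inst
  ON active_inst.instances = allcontact.instances;
```

Distributing the volatile table on `instances` lets the final join hash-match on the same column, which is why this variant stays within spool while the single-statement version does not.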
If anyone can suggest how to run this as a single query, I would be grateful.
The environment is IBM Campaign sitting on top of Teradata.
Thanks!
EDIT: added more content.
Primary keys:
allcontact table PK: cntct_id
allinstances table PK: instances
activeinstances table PK: instances
Explain plan:
SELECT allcontact.users
FROM allcontact AS cntct
WHERE EXISTS
  ( SELECT 1
    FROM allinstances
    WHERE allinstances.instances = allcontact.instances
      AND EXISTS
        ( SELECT 1
          FROM activeinstances
          WHERE activeinstances.end_date > CURRENT_TIMESTAMP
            AND activeinstances.run_id = allinstances.run_id
            AND activeinstances.run_date = allinstances.run_date
        )
  )
QUALIFY ROW_NUMBER() OVER (PARTITION BY allcontact.users ORDER BY allcontact.users DESC) = 1;
This query is optimized using type 2 profile cp_rowkey, profileid 10006.
  1) First, we lock ACTIVEINSTANCES for access, we lock ALLCONTACT in
     view allcontact for access, and we lock allinstances for access.
  2) Next, we execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from allinstances by way of an
          all-rows scan with a condition of
          ("allinstances.TRTMNT_TYPE_CODE <> 'I'") into Spool 3
          (all_amps), which is redistributed by the hash code of
          (allinstances.RUN_DATE, allinstances.RUN_ID) to all AMPs.
          The size of Spool 3 is estimated with low confidence to be
          4,612,364 rows (119,921,464 bytes). The estimated time for
          this step is 0.50 seconds.
       2) We do an all-AMPs RETRIEVE step from ACTIVEINSTANCES by way of
          an all-rows scan with a condition of
          ("(CAST((ACTIVEINSTANCES.END_DATE) AS TIMESTAMP(6) WITH TIME
          ZONE)) > TIMESTAMP '2017-08-28 01:55:35.110000+00:00'") into
          Spool 4 (all_amps), which is redistributed by the hash code of
          (ACTIVEINSTANCES.RUN_DATE, ACTIVEINSTANCES.RUN_ID) to all
          AMPs. Then we do a SORT to order Spool 4 by row hash and the
          sort key in spool field1 eliminating duplicate rows. The size
          of Spool 4 is estimated with no confidence to be 132,623 rows
          (4,907,051 bytes). The estimated time for this step is
          0.01 seconds.
       3) We do an all-AMPs RETRIEVE step from ALLCONTACT in view
          allcontact by way of an all-rows scan with no residual
          conditions into Spool 5 (all_amps) fanned out into 17 hash
          join partitions, which is built locally on the AMPs. The
          input table will not be cached in memory, but it is eligible
          for synchronized scanning. The size of Spool 5 is estimated
          with high confidence to be 138,065,479,155 rows
          (3,451,636,978,875 bytes). The estimated time for this step
          is 1 minute and 19 seconds.
  3) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of an
     all-rows scan, which is joined to Spool 4 (Last Use) by way of an
     all-rows scan. Spool 3 and Spool 4 are joined using a single
     partition inclusion hash join, with a join condition of
     ("(TRTMNT_TYPE_CODE NOT IN ('I')) AND ((RUN_DATE = RUN_DATE) AND
     (RUN_ID = RUN_ID))"). The result goes into Spool 7 (all_amps),
     which is redistributed by the hash code of
     (allinstances.INSTANCES) to all AMPs. Then we do a SORT to order
     Spool 7 by the sort key in spool field1 eliminating duplicate rows.
     The size of Spool 7 is estimated with no confidence to be 496,670
     rows (12,416,750 bytes). The estimated time for this step is
     9.84 seconds.
  4) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by way of
     an all-rows scan into Spool 6 (all_amps) fanned out into 17 hash
     join partitions, which is duplicated on all AMPs. The size of
     Spool 6 is estimated with no confidence to be 1,862,512,500 rows
     (46,562,812,500 bytes).
  5) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of an
     all-rows scan, which is joined to Spool 6 (Last Use) by way of an
     all-rows scan. Spool 5 and Spool 6 are joined using a inclusion
     hash join of 17 partitions, with a join condition of
     ("INSTANCES = INSTANCES"). The result goes into Spool 2 (all_amps),
     which is built locally on the AMPs. The size of Spool 2 is
     estimated with no confidence to be 34,652,542,903 rows
     (797,008,486,769 bytes). The estimated time for this step is
     23.71 seconds.
  6) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
     way of an all-rows scan into Spool 12 (Last Use), which is
     redistributed by hash code to all AMPs. The result rows are put
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The size is estimated with no confidence to be 650,694,038 rows
     (24,075,679,406 bytes).
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.
BEGIN RECOMMENDED STATS FOR FINAL PLAN->
  -- "COLLECT STATISTICS COLUMN (RUN_ID, RUN_DATE) ON ACTIVEINSTANCES"
     (High Confidence)
  -- "COLLECT STATISTICS COLUMN (CAST((END_DATE) AS TIMESTAMP(6) WITH
     TIME ZONE)) AS ACTIVEINSTANCES ON ACTIVEINSTANCES"
     (High Confidence)
<- END RECOMMENDED STATS FOR FINAL PLAN
The query that currently works:
SELECT DISTINCT t.users
FROM
  ( -- contact rows restricted to run dates that belong to an active run
    SELECT users, instances
    FROM allcontact
    JOIN
      ( SELECT DISTINCT run_dt
        FROM activeinstances
        WHERE activeinstances.end_date > CAST(CURRENT_TIMESTAMP AS TIMESTAMP)
      ) AS drv
      ON drv.run_dt = allcontact.run_dt
  ) AS t
JOIN
  ( -- all instances belonging to an active run (run_id + run_date)
    SELECT DISTINCT allinstances.instances
    FROM allinstances
    JOIN
      ( SELECT DISTINCT run_date, run_id
        FROM activeinstances
        WHERE activeinstances.end_date > CAST(CURRENT_TIMESTAMP AS TIMESTAMP)
      ) AS activeinstances
      ON  activeinstances.run_id = allinstances.run_id
      AND activeinstances.run_date = allinstances.run_date
  ) AS dt
  ON dt.instances = t.instances
Can you add DDL & PKs/FKs, plus the Explain of the currently working query? – dnoeth
Hi @dnoeth, added more content for this. – puputtiap
Can you add `allinstances.instances = activeinstances.instances` to the innermost EXISTS? What's the datatype of `activeinstances.end_date`, DATE or TIMESTAMP? What's the actual number of rows compared to the estimated 132,623 for `activeinstances.end_date > CURRENT_TIMESTAMP`? Btw, your PKs are probably the Primary Indexes, not the logical primary keys... – dnoeth
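For clarity, the predicate dnoeth asks about would go into the innermost EXISTS of the question's query like this (a sketch of the suggestion only — note the question states a run can contain instances missing from activeinstances, so the results would need verifying):

```sql
SELECT allcontact.users
FROM allcontact
WHERE EXISTS
  ( SELECT 1
    FROM allinstances
    WHERE allinstances.instances = allcontact.instances
      AND EXISTS
        ( SELECT 1
          FROM activeinstances
          WHERE activeinstances.end_date > CURRENT_TIMESTAMP
            AND activeinstances.run_id = allinstances.run_id
            AND activeinstances.run_date = allinstances.run_date
            -- suggested additional join term:
            AND activeinstances.instances = allinstances.instances
        )
  )
QUALIFY ROW_NUMBER() OVER (PARTITION BY allcontact.users ORDER BY allcontact.users DESC) = 1
```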