Je suis en cours d'exécution d'une instance Cassandra avecComment puis-je déboguer pourquoi l'interruption des threads pendant le GC a-t-elle été si longue?
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b34)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b34, mixed mode)
et les options de collecte des ordures suivant
-Xms20000m
-Xmx20000m
-Xmn10000m
-Xss256k
-XX:+UseG1GC
-XX:MaxGCPauseMillis=1000
-XX:GCTimeRatio=97
-XX:MaxTenuringThreshold=2
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+PerfDisableSharedMem
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintAdaptiveSizePolicy
-Xloggc:/dev/shm/cassandra-gc.log
Au cours d'un cycle de collecte des ordures, la machine virtuelle Java enregistre le
2017-08-15T17:09:08.506+0000: 71712.670: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 658505728 bytes, new threshold 2 (max 2)
- age 1: 26818496 bytes, 26818496 total
- age 2: 14940744 bytes, 41759240 total
71712.670: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 222556, predicted base time: 50.67 ms, remaining time: 949.33 ms, target pause time: 1000.00 ms]
71712.670: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1242 regions, survivors: 8 regions, predicted young region time: 12.00 ms]
71712.670: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1242 regions, survivors: 8 regions, old: 0 regions, predicted pause time: 62.68 ms, target pause time: 1000.00 ms]
71712.736: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 1.39 %, threshold: 1.02 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0662411 secs]
[Parallel Time: 57.5 ms, GC Workers: 33]
[GC Worker Start (ms): Min: 71712670.9, Avg: 71712671.2, Max: 71712671.4, Diff: 0.4]
[Ext Root Scanning (ms): Min: 6.4, Avg: 7.0, Max: 10.2, Diff: 3.8, Sum: 232.3]
[Update RS (ms): Min: 34.0, Avg: 37.0, Max: 38.9, Diff: 4.9, Sum: 1221.2]
[Processed Buffers: Min: 27, Avg: 48.5, Max: 66, Diff: 39, Sum: 1599]
[Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 5.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Object Copy (ms): Min: 11.3, Avg: 12.6, Max: 12.9, Diff: 1.7, Sum: 415.0]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.4]
[Termination Attempts: Min: 1, Avg: 2.1, Max: 4, Diff: 3, Sum: 68]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 4.5]
[GC Worker Total (ms): Min: 56.7, Avg: 56.9, Max: 57.3, Diff: 0.7, Sum: 1878.5]
[GC Worker End (ms): Min: 71712728.0, Avg: 71712728.1, Max: 71712728.3, Diff: 0.3]
[Code Root Fixup: 0.5 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 1.1 ms]
[Other: 7.1 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 2.7 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.8 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 1.5 ms]
[Eden: 9936.0M(9936.0M)->0.0B(9920.0M) Survivors: 64.0M->80.0M Heap: 18.2G(19.5G)->8718.4M(19.5G)]
Heap after GC invocations=8876 (full 0):
garbage-first heap total 20480000K, used 8927594K [0x00000002de000000, 0x00000002de804e20, 0x00000007c0000000)
region size 8192K, 10 young (81920K), 10 survivors (81920K)
Metaspace used 44399K, capacity 44978K, committed 45184K, reserved 1089536K
class space used 4591K, capacity 4729K, committed 4760K, reserved 1048576K
}
[Times: user=0.88 sys=0.24, real=0.07 secs]
2017-08-15T17:09:08.573+0000: 71712.737: Total time for which application threads were stopped: 4.1542217 seconds, Stopping threads took: 4.0814458 seconds
suivant
Je attirera l'attention sur la dernière ligne
Temps total pour lequel des fils d'application ont été arrêtés: 4.1542217 secondes, threads d'arrêt a: 4.0814458 secondes
Je ne peux pas dire de l'un des détails de GC imprimés pourquoi arrêt des fils ont si longtemps.
Autres (long ish) pauses ont trivialement faible temps d'arrêt de fil
2017-08-15T17:09:26.646+0000: 71730.810: Total time for which application threads were stopped: 0.1165019 seconds, Stopping threads took: 0.0005363 seconds
2017-08-15T17:10:36.982+0000: 71801.146: Total time for which application threads were stopped: 0.1444364 seconds, Stopping threads took: 0.0006902 seconds
2017-08-15T17:10:42.091+0000: 71806.255: Total time for which application threads were stopped: 0.2288663 seconds, Stopping threads took: 0.0064196 seconds
2017-08-15T17:10:52.601+0000: 71816.765: Total time for which application threads were stopped: 0.2781895 seconds, Stopping threads took: 0.0005429 seconds
2017-08-15T17:10:56.992+0000: 71821.156: Total time for which application threads were stopped: 0.1380046 seconds, Stopping threads took: 0.0006240 seconds
L'application a au moins 1000 threads, mais il fait aucun sens pour moi que de les arrêter prendrait 98% d'un événement GC de durée.
Je suppose d'abord qu'il était le temps qu'il a fallu pour un fil d'atteindre un point de restauration, donc j'ai essayé d'ajouter
-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500
mais les événements enregistrés ne coïncident pas avec les longs événements GC (à moins que les fils sont tous en prenant 499ms pour atteindre leur safepoint ...)
Quoi d'autre peut retarder le temps d'arrêt du fil?
Notez que ces pauses sont présentes si j'utilise CMS ou G1 et si je write the GC logs to disk (SSD) or to memory.
@ the8472 Merci. Ces statistiques ont montré un long temps de 'synchronisation 'qui me fait penser que mon problème est [ceci] (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057003). – Savior