Monolith 2020 02

2020-02-01 I had to upgrade the 5V wiring running from the output of the 12V-to-5V converters to each tier (22 systems per tier), as the wire I had used when I reworked that part was the wrong gauge.

I forgot that I have to use a heavy gauge on the 5-volt line to each tier (roughly 13-14A during full CPU load), so I replaced the lighter wire with 12AWG. I might also have to increase the fuses (all three tiers are fused at 15A on the 5-volt output).

The 12-volt side only draws about 6A into each 5-volt converter. The cooling fins on the converters get pretty warm at full CPU load, but they have been working safely near their capacity for years.

The maximum rated output of each converter is supposed to be 22A, so I should be fine at 13-14A; they have been working fine for more than a few years. I am running close to the fuse rating and a possible fuse failure, but it has been this way since I first built the Monolith (albeit on a different power supply). It is still something to watch after the refurbish and redesign of the power system.
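As a sanity check on the converters, here is the rough power balance, taking my own approximate current figures above at face value (these are estimates, not meter readings):

awk 'BEGIN {
  p_in  = 12 * 6       # ~72 W into each 12V-to-5V converter
  p_out = 5 * 13.5     # ~67.5 W delivered to a tier at full CPU load
  printf "in %.1fW  out %.1fW  dissipated %.1fW  efficiency %.0f%%\n",
         p_in, p_out, p_in - p_out, 100 * p_out / p_in
}'

That works out to around 94% efficiency, probably optimistic given how rough the current figures are, but it does suggest only a handful of watts end up in the cooling fins, which fits with them running warm rather than hot.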

Oddly, the wiring still gets warm at 16-17A, which bothers me somewhat. It isn't hot enough to present any problems (I made sure of that), but I hate to go up to 10AWG wiring when the Chinese voltage converters themselves only use 14AWG on the 5-volt output. Oddly enough that lead isn't getting hot either, so it must be some high-temperature variant, because it doesn't look like it could handle the current.

I also discovered a bad Ethernet cable on node36. I had been cracking passwords all night and blasting the CPUs (CPU temperatures across the cluster were mostly in the 50°C-plus range), so the cable failed under load. The master was complaining about rank 35 communication errors, but the rest of the cluster was still cranking away.
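A quick way to confirm a suspect cable is to look at the interface counters on the node itself; the interface name eth0 below is an assumption for these boards:

ip -s link show eth0                              # RX/TX errors and drops
ethtool eth0 | grep -E 'Speed|Link detected'      # negotiated speed and link state

A marginal cable usually shows up as RX errors climbing, or a link that has negotiated down from gigabit.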

They were all still crunching away despite the master being hung; the CPUs were still busy on the john process even after I rebooted the MPI master. I don't think much work was lost, since the recovery files had current timestamps when everything was halted.
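Getting the session going again should just be a matter of restarting the same MPI job with --restore so it picks up from those recovery files; the process count and hostfile below are placeholders for whatever the session was originally launched with:

mpirun -np 65 -hostfile ~/mpi_hosts ./john --restore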

I replaced the cracked acrylic piece last night, re-drilled all the appropriate mounting points, and re-assembled that section of Tier 1 (nodes 55-66). It looks much nicer without the crack and is of course much stronger without the damaged plate.

The master node (Node 66, aka Master) keeps hanging. I am not sure why and will have to keep investigating. The only odd thing I can see in dmesg is that it is constantly complaining about the IR event FIFO:

root@master:/var/log# dmesg|tail
[19429.291621] rc rc0: IR event FIFO is full!
[19429.295729] rc rc0: IR event FIFO is full!
[19429.299837] rc rc0: IR event FIFO is full!
[19429.303945] rc rc0: IR event FIFO is full!
[19429.308052] rc rc0: IR event FIFO is full!
[19429.312160] rc rc0: IR event FIFO is full!
[19429.316260] rc rc0: IR event FIFO is full!
[19429.320367] rc rc0: IR event FIFO is full!
[19429.324474] rc rc0: IR event FIFO is full!
[19429.328608] rc rc0: IR event FIFO is full!
root@master:/var/log#

The cluster was 100-percent loaded each time the master hung, but I am not sure that is related to what is going on. I will replace the system board if need be; if there is a hardware issue, it is easy enough to just transfer the micro SD card and re-attach the SATA drive.
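To rule the IR receiver driver in or out as a contributor, one option is to blacklist it and see whether the FIFO messages (and the hangs) stop. The module name below is only a guess for this Cubieboard, so it is worth checking lsmod first:

lsmod | grep -i cir                                      # confirm the actual module name
echo "blacklist sunxi-cir" >> /etc/modprobe.d/blacklist-ir.conf
update-initramfs -u                                      # only if the image boots with an initramfs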

To be honest, the only reason I am using these 66 ARM CPUs to crack passwords is that I needed some heavy loads on the cluster to help burn everything in and make sure it is going to be reliable. So far it has exposed a few failures that I needed to take care of. I intend to use it mostly for network simulation, Docker administration, and security studies.

I crack passwords on a GPU cluster nowadays.

node01  14:40:22 up  6:13,  0 users,  load average: 1.73, 1.26, 0.97 cpu: 37°C
node03  14:40:22 up  6:13,  0 users,  load average: 1.28, 1.13, 0.93 cpu: 53°C
node05  14:40:23 up  6:13,  0 users,  load average: 1.46, 1.18, 0.93 cpu: 56°C
node06  14:40:23 up  6:13,  0 users,  load average: 1.13, 1.10, 0.92 cpu: 50°C
node08  14:40:23 up  6:13,  0 users,  load average: 1.15, 1.07, 0.88 cpu: 50°C
node07  14:40:24 up  6:13,  0 users,  load average: 1.28, 1.11, 0.92 cpu: 64°C
node10  14:40:24 up  6:13,  0 users,  load average: 1.25, 1.12, 0.93 cpu: 51°C
node13  14:40:24 up  6:13,  0 users,  load average: 1.22, 1.15, 0.92 cpu: 78°C
node02  14:40:24 up  6:13,  0 users,  load average: 1.59, 1.22, 0.95 cpu: 49°C
node04  14:40:25 up  6:13,  0 users,  load average: 1.44, 1.18, 0.92 cpu: 46°C
node15  14:40:25 up  6:13,  0 users,  load average: 1.09, 1.09, 0.92 cpu: 61°C
node14  14:40:25 up  6:13,  0 users,  load average: 1.20, 1.11, 0.89 cpu: 65°C
node17  14:40:25 up  6:13,  0 users,  load average: 1.29, 1.17, 0.96 cpu: 56°C
node12  14:40:25 up  6:13,  0 users,  load average: 1.19, 1.12, 0.94 cpu: 47°C
node18  14:40:25 up  6:13,  0 users,  load average: 1.39, 1.16, 0.91 cpu: 55°C
node11  14:40:25 up  6:13,  0 users,  load average: 1.31, 1.20, 0.95 cpu: 40°C
node19  14:40:26 up  6:13,  0 users,  load average: 1.39, 1.23, 0.98 cpu: 72°C
node20  14:40:26 up  6:13,  0 users,  load average: 1.18, 1.15, 0.91 cpu: 71°C
node09  14:40:25 up  6:13,  0 users,  load average: 1.20, 1.22, 0.98 cpu: 52°C
node22  14:40:26 up  6:13,  0 users,  load average: 1.16, 1.13, 0.89 cpu: 55°C
node21  14:40:26 up  6:13,  0 users,  load average: 1.21, 1.16, 0.96 cpu: 43°C
node28  14:40:26 up  6:13,  0 users,  load average: 1.46, 1.22, 0.96 cpu: 55°C
node31  14:40:26 up  6:13,  0 users,  load average: 1.29, 1.16, 0.93 cpu: 48°C
node33  14:40:26 up  6:13,  0 users,  load average: 1.34, 1.18, 0.95 cpu: 48°C
node24  14:40:26 up  6:13,  0 users,  load average: 1.16, 1.12, 0.91 cpu: 74°C
node16  14:40:26 up  6:13,  0 users,  load average: 1.42, 1.17, 0.94 cpu: 42°C
node35  14:40:26 up  6:13,  0 users,  load average: 1.36, 1.22, 0.97 cpu: 52°C
node37  14:40:26 up  6:12,  0 users,  load average: 1.26, 1.14, 0.92 cpu: 69°C
node26  14:40:26 up  6:13,  0 users,  load average: 1.48, 1.20, 0.94 cpu: 48°C
node30  14:40:26 up  6:13,  0 users,  load average: 1.18, 1.14, 0.94 cpu: 60°C
node25  14:40:27 up  6:13,  0 users,  load average: 1.12, 1.10, 0.92 cpu: 45°C
node40  14:40:26 up  6:12,  0 users,  load average: 1.16, 1.14, 0.96 cpu: 78°C
node52  14:40:27 up  6:12,  0 users,  load average: 1.32, 1.17, 0.93 cpu: 55°C
node36  14:40:26 up  6:12,  0 users,  load average: 1.23, 1.14, 0.93 cpu: 38°C
node46  14:40:26 up  6:12,  0 users,  load average: 1.17, 1.14, 0.93 cpu: 78°C
node48  14:40:27 up  6:12,  0 users,  load average: 1.31, 1.15, 0.93 cpu: 55°C
node23  14:40:26 up  6:13,  0 users,  load average: 1.36, 1.18, 0.95 cpu: 48°C
node41  14:40:26 up  6:12,  0 users,  load average: 1.42, 1.21, 0.95 cpu: 76°C
node32  14:40:26 up  6:13,  0 users,  load average: 1.29, 1.14, 0.93 cpu: 54°C
node38  14:40:26 up  6:12,  0 users,  load average: 1.25, 1.16, 0.93 cpu: 53°C
node61  14:40:27 up  6:12,  0 users,  load average: 1.50, 1.26, 0.98 cpu: 66°C
node27  14:40:26 up  6:13,  0 users,  load average: 1.16, 1.14, 0.92 cpu: 61°C
node49  14:40:27 up  6:12,  0 users,  load average: 1.46, 1.25, 0.97 cpu: 62°C
node56  14:40:27 up  6:12,  0 users,  load average: 1.42, 1.21, 0.94 cpu: 50°C
node54  14:40:27 up  6:12,  0 users,  load average: 1.55, 1.22, 0.96 cpu: 56°C
node29  14:40:26 up  6:13,  0 users,  load average: 1.20, 1.13, 0.92 cpu: 59°C
node55  14:40:27 up  6:12,  0 users,  load average: 1.41, 1.19, 0.95 cpu: 62°C
node42  14:40:26 up  6:12,  0 users,  load average: 1.35, 1.24, 0.97 cpu: 51°C
node57  14:40:27 up  6:12,  0 users,  load average: 1.30, 1.14, 0.92 cpu: 53°C
node60  14:40:27 up  6:12,  0 users,  load average: 1.34, 1.19, 0.94 cpu: 46°C
node53  14:40:27 up  6:12,  0 users,  load average: 1.40, 1.18, 0.95 cpu: 61°C
node43  14:40:26 up  6:12,  0 users,  load average: 1.42, 1.23, 0.98 cpu: 71°C
node62  14:40:27 up  6:12,  0 users,  load average: 1.52, 1.25, 0.95 cpu: 67°C
node47  14:40:27 up  6:05,  0 users,  load average: 1.54, 1.19, 0.96 cpu: 56°C
node44  14:40:27 up  6:12,  0 users,  load average: 1.12, 1.11, 0.92 cpu: 60°C
node63  14:40:27 up  6:12,  0 users,  load average: 1.80, 1.30, 0.96 cpu: 54°C
node51  14:40:27 up  6:12,  0 users,  load average: 1.42, 1.19, 0.92 cpu: 62°C
node39  14:40:27 up  6:12,  0 users,  load average: 1.21, 1.15, 0.96 cpu: 58°C
node34  14:40:26 up  6:12,  0 users,  load average: 1.33, 1.18, 0.94 cpu: 65°C
node58  14:40:27 up  6:12,  0 users,  load average: 1.51, 1.22, 0.98 cpu: 33°C
node45  14:40:27 up  6:12,  0 users,  load average: 1.20, 1.16, 0.94 cpu: 40°C
node65  14:40:27 up  6:12,  0 users,  load average: 1.42, 1.17, 0.93 cpu: 56°C
node50  14:40:27 up  6:12,  0 users,  load average: 1.38, 1.19, 0.97 cpu: 61°C
node59  14:40:28 up  6:12,  0 users,  load average: 1.36, 1.21, 0.96 cpu: 33°C
node64  14:40:28 up  6:12,  0 users,  load average: 1.67, 1.25, 0.96 cpu: 50°C

As you can see, the heat sinks aren't that effective at keeping CPU temperatures down. There are a few exceptions; I am not sure why.
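For reference, a loop along these lines will produce a listing like the one above; the node names and the thermal-zone path are assumptions about the image on these boards, so adjust to suit:

for n in $(seq -w 1 65); do      # compute nodes only; node66 is the master
  ssh node$n 'printf "%s %s cpu: %d°C\n" "$(hostname)" "$(uptime)" \
    $(( $(cat /sys/class/thermal/thermal_zone0/temp) / 1000 ))' &
done
wait                             # nodes report as they answer, hence the out-of-order output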

This is the hash type of the test password I am feeding to the cluster:

Hash type: sha512crypt, crypt(3) $6$ (min-len 0, max-len 26 [worst case UTF-8] to 79 [ASCII])
Algorithm: SHA512 128/128 NEON 2x

Even though I know the length of the test password, I pretend that it is unknown and let the cluster find it out the hard way.
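For anyone who wants to set up a similar test, a sha512crypt ($6$) hash can be generated on the command line; this assumes OpenSSL 1.1.1 or newer, and the password string is of course just a placeholder, not my actual test password:

openssl passwd -6 'placeholder-password' > test.hash    # writes a $6$... hash john can load
./john --format=sha512crypt test.hash                   # single-node sanity check before going to MPI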

Here is the output of John's --status operation showing each node's performance under JtR. As you can see, I have been running this session off and on for a while.

mpi@master:/master/tmp/john/JohnTheRipper-bleeding-jumbo/run$ ./john --status=len
1 0g 1:13:55:50 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.68p/s 33.68c/s 33.68C/s
2 0g 1:13:55:45 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.65p/s 33.65c/s 33.65C/s
3 0g 1:13:55:50 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.69p/s 33.69c/s 33.69C/s
4 0g 1:13:56:04 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.70p/s 33.70c/s 33.70C/s
5 0g 1:13:55:56 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.70p/s 33.70c/s 33.70C/s
6 0g 1:13:56:56 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.69p/s 33.69c/s 33.69C/s
7 0g 1:13:56:04 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.66p/s 33.66c/s 33.66C/s
8 0g 1:13:57:01 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.74p/s 33.74c/s 33.74C/s
9 0g 1:13:56:12 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.61p/s 33.61c/s 33.61C/s
10 0g 1:13:56:13 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.62p/s 33.62c/s 33.62C/s
11 0g 1:08:42:19 41.00% 2/3 (ETA: 2020-02-03 22:30) 0g/s 38.91p/s 38.91c/s 38.91C/s
12 0g 1:13:55:25 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.60p/s 33.60c/s 33.60C/s
13 0g 1:13:56:15 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.63p/s 33.63c/s 33.63C/s
14 0g 1:13:56:18 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.70p/s 33.70c/s 33.70C/s
15 0g 1:13:56:19 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.71p/s 33.71c/s 33.71C/s
16 0g 1:13:56:19 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.68p/s 33.68c/s 33.68C/s
17 0g 1:13:55:29 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.63p/s 33.63c/s 33.63C/s
18 0g 1:13:56:22 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.70p/s 33.70c/s 33.70C/s
19 0g 1:13:57:20 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.66p/s 33.66c/s 33.66C/s
20 0g 1:13:55:49 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.66p/s 33.66c/s 33.66C/s
21 0g 1:13:56:23 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.69p/s 33.69c/s 33.69C/s
22 0g 1:13:57:32 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.53p/s 33.53c/s 33.53C/s
23 0g 1:13:56:36 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.59p/s 33.59c/s 33.59C/s
24 0g 1:13:55:46 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.63p/s 33.63c/s 33.63C/s
25 0g 1:13:57:29 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.67p/s 33.67c/s 33.67C/s
26 0g 1:13:56:30 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.61p/s 33.61c/s 33.61C/s
27 0g 1:13:56:29 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.68p/s 33.68c/s 33.68C/s
28 0g 1:13:56:45 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.64p/s 33.64c/s 33.64C/s
29 0g 1:13:56:33 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.55p/s 33.55c/s 33.55C/s
30 0g 1:13:55:47 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.61p/s 33.61c/s 33.61C/s
31 0g 1:13:56:42 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.69p/s 33.69c/s 33.69C/s
32 0g 1:13:55:54 41.00% 2/3 (ETA: 2020-02-04 06:01) 0g/s 33.55p/s 33.55c/s 33.55C/s
33 0g 1:13:56:54 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.66p/s 33.66c/s 33.66C/s
34 0g 1:13:56:43 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.68p/s 33.68c/s 33.68C/s
35 0g 1:13:56:56 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.65p/s 33.65c/s 33.65C/s
36 0g 1:11:19:38 39.00% 2/3 (ETA: 2020-02-04 06:42) 0g/s 33.12p/s 33.12c/s 33.12C/s
37 0g 1:13:56:53 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.65p/s 33.65c/s 33.65C/s
38 0g 1:13:56:48 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.69p/s 33.69c/s 33.69C/s
39 0g 1:13:56:15 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.67p/s 33.67c/s 33.67C/s
40 0g 1:13:56:11 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.62p/s 33.62c/s 33.62C/s
41 0g 1:13:56:59 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.71p/s 33.71c/s 33.71C/s
42 0g 1:13:58:01 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.65p/s 33.65c/s 33.65C/s
43 0g 1:13:57:57 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.64p/s 33.64c/s 33.64C/s
44 0g 1:13:56:16 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.63p/s 33.63c/s 33.63C/s
45 0g 1:13:57:59 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.72p/s 33.72c/s 33.72C/s
46 0g 1:13:57:08 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.69p/s 33.69c/s 33.69C/s
47 0g 1:13:57:04 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.67p/s 33.67c/s 33.67C/s
48 0g 1:13:57:06 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.62p/s 33.62c/s 33.62C/s
49 0g 1:13:57:24 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.75p/s 33.75c/s 33.75C/s
50 0g 1:13:58:12 41.00% 2/3 (ETA: 2020-02-04 06:05) 0g/s 33.65p/s 33.65c/s 33.65C/s
51 0g 1:13:56:42 41.00% 2/3 (ETA: 2020-02-04 06:02) 0g/s 33.69p/s 33.69c/s 33.69C/s
52 0g 1:13:57:10 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.65p/s 33.65c/s 33.65C/s
53 0g 1:13:57:04 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.70p/s 33.70c/s 33.70C/s
54 0g 1:13:57:21 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.66p/s 33.66c/s 33.66C/s
55 0g 1:13:58:17 41.00% 2/3 (ETA: 2020-02-04 06:05) 0g/s 33.63p/s 33.63c/s 33.63C/s
56 0g 1:13:57:23 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.61p/s 33.61c/s 33.61C/s
57 0g 1:13:57:31 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.71p/s 33.71c/s 33.71C/s
58 0g 1:13:57:23 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.70p/s 33.70c/s 33.70C/s
59 0g 1:13:58:38 41.00% 2/3 (ETA: 2020-02-04 06:05) 0g/s 33.63p/s 33.63c/s 33.63C/s
60 0g 1:13:57:27 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.73p/s 33.73c/s 33.73C/s
61 0g 1:13:57:36 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.73p/s 33.73c/s 33.73C/s
62 0g 1:13:57:37 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.62p/s 33.62c/s 33.62C/s
63 0g 1:13:56:50 41.00% 2/3 (ETA: 2020-02-04 06:03) 0g/s 33.69p/s 33.69c/s 33.69C/s
64 0g 1:13:57:44 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.73p/s 33.73c/s 33.73C/s
65 0g 1:13:57:35 41.00% 2/3 (ETA: 2020-02-04 06:04) 0g/s 33.75p/s 33.75c/s 33.75C/s

2020-02-02 I removed the SSD from the master system to see if it affects the problem of the master system "freezing". If the freezing does stop, I will try powering the drive from the main 5V supply (as opposed to the Cubieboard's on-board power) and see if that solves the issue.

If the master system gets through 24 hours without freezing, I will assume I had a power problem.
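To put a timestamp on the next freeze without babysitting it, I can log the master's uptime from another box once a minute; the host name and log path here are just placeholders:

while true; do
  ts=$(date '+%F %T')
  if out=$(ssh -o ConnectTimeout=5 master uptime 2>/dev/null); then
    echo "$ts $out" >> ~/master-watch.log
  else
    echo "$ts NO RESPONSE" >> ~/master-watch.log
  fi
  sleep 60
done

The last good timestamp in the log then brackets when the hang happened.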

Today I cleaned up the new wiring harness a bit so that it is better organized.