Monolith 2020 01

2020-01-16 I am going to disassemble the entire cluster and work on some power cables that have intermittent connections. While I am doing that I am going to attach heat sinks to the Allwinner A10 ARM cpu on each board. I purchased a large quantity of 19 mm x 19 mm x 5 mm heat sinks and some thermal glue/paste to facilitate that upgrade. I installed (3) of the new heat sinks onto spare boards today; one of the heat sinks is oriented differently (the one with the console cable attached). That was a mistake.

CubieBoard1-Heatsinks-01.jpg

While I am disassembling the entire cluster I will have to replace Node 59, which is running at about 70C when idle.

In addition to all this work, I will also be re-implementing the power system, probably using a different power supply. I am currently using a power supply/charger but it doesn't seem to be producing enough current; the demand when cpu load is highest is only 20A, yet it seems to be falling behind.

Monolith-Power-01.jpg

As you can see the power system is a mess.

I located some Meanwell 30A 12V power supplies I am going to use. The only downside is charging the battery, as the power supply doesn't produce an optimal voltage to charge the SLA battery. The charger needs to produce about 13.2V and the new power supply will not produce more than 12.5V.

I am only concerned with using the battery to handle the initial load during power up and during power failures. This may work out fine, we'll see.

2020-01-18 I have begun disassembly of the Monolith. I removed the power supply/charger and the 35AH SLA battery, then disconnected the external wiring. I removed Tier 1, Nodes 1-11, disassembled the tier, and attached the heat sinks to each computer. I will wait about 12 hours and then reassemble the tier.

Rather than label each micro-sd card or each computer, I ejected all the micro-sd cards. Later I will load each one into a card reader on a workstation and identify it before I insert it where it belongs. If I keep it to 11 micro-sd cards at a time it's not a huge problem.
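For anyone doing the same thing, here is a minimal sketch of how a card could be identified from the workstation. It assumes the partition that gets mounted is the card's Linux root filesystem and that it carries the node name in etc/hostname; the mount point is made up, so adjust to taste:

    # Hypothetical helper: read the hostname stored on a mounted micro-sd card
    # so the card can go back into the node it came from. The mount point and
    # the assumption that the card's rootfs is what gets mounted are mine.
    from pathlib import Path

    MOUNT_POINT = Path("/mnt/sdcard")   # wherever the workstation mounts the card

    def identify_card(mount_point=MOUNT_POINT):
        hostname_file = mount_point / "etc" / "hostname"
        if hostname_file.exists():
            return hostname_file.read_text().strip()
        return "unknown (no etc/hostname on this card)"

    if __name__ == "__main__":
        print(identify_card())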

Before I do that I will rework the wiring for Tier 1 as well. Each tier power cable has (22) nodes attached, so Tier 1 consists of Nodes 1-11 and Nodes 56-66. Node 59 is defective in that the cpu is running very hot (70C) at idle.

Later today I re-assembled Tier 1, Nodes 1-11, and re-installed it into the Monolith. I have also just removed Tier 1, Nodes 56-66. One of the acrylic plates has cracked, so I will try to fabricate a new one while installing the new heat sinks onto the cpu boards.

I can't seem to find any of my acrylic stock to fabricate replacements, so I decided to order 12 pieces of cut-to-size 15 cm x 12 cm x 2 mm clear acrylic from Tap Plastics (tapplastics.com); with shipping it was about $30 US. I will only replace the broken panels for now but may eventually replace all 12. I don't know how long it will take to get the new pieces as I live in an unusual location and sometimes it is difficult to get stuff.

I disassembled Tier 1, Nodes 56-66, and installed new heat sinks on each cpu board. I have saved each tier's micro-sd cards in small plastic bags, each containing 11 micro-sd cards. In the worst case I will have to search through 11 each time, which is better than having to sort through all 66 of them.

The "master" node shown above is the only system to have a hard drive attached and a console cable attached it is pictured (above) sitting in the middle of all the other systems.

2020-01-20 I removed and disassembled Tier 2, Nodes 45-55, and then attached the new heat sinks. Later I will re-assemble the tier and set it aside. I also have to work on the Tier 2 power wiring harness, since that set was having intermittent issues.

Later I did reassemble Tier 2, Nodes 45-55. Next I will remove the power wiring harness for Tier 2 and start to rework it, checking all the soldered connections. Since I am waiting for the cut-to-size acrylic pieces I ordered, I will not be mounting the tiers back into the Monolith frame until the pieces arrive and I have drilled the proper holes to mount them.

I reworked the Tier 2 power harness; I am not sure I resolved the intermittent issues, as everything I checked seemed to be fine.

I also reworked the 12v and 5v power supply wiring. I added a 25A circuit breaker to the main 12v line that is fed from both the backup battery and the power supply, and soldered everything, including new terminating connectors.

2020-01-21 I connected all the power supply wiring, soldered up the Tier 2 power harness, and tested it with one tier of system boards; everything seems fine. I did not attach the backup battery yet. I am still working out how to do that in a way that allows me to use a heavier gauge connection (10 AWG) from the 12v side to the battery terminals. The small terminals on the 12v power supply won't really support a large connector, and for me the real pain in the neck is the ground connection between the battery and the 12v power supply. I was surprised to see that the chassis of the power supply is not electrically connected to the DC ground, which was depressing as I had hoped to attach the battery ground there.

I removed and disassembled Tier 2, Nodes 12-22, and attached the new heat sinks to each system board. Then I reassembled Tier 2, Nodes 12-22, but am still waiting for the replacement acrylic pieces I ordered from Tap Plastics. I want to offer kudos to this company: they responded to my plea to use USPS rather than FedEx or UPS, as I have trouble getting deliveries from those carriers. They paid special attention to my order and manually shipped it via USPS, and I am very grateful for that care.

2020-01-22 I removed and disassembled Tier 3, Nodes 23-33, and attached the new heat sinks. I also re-worked the Tier 1 power harness.

I haven't mentioned this before, but every one of the threaded rods I used to assemble the tiers had damaged threads on the end where I cut it to length. Last year I was fortunate to locate a miniature set of dies, so I have been repairing the threads on every rod as I go along. I am using Dubro #4-40 (SAE) rods, so the die is also #4-40, in case you were wondering. You can see the set of taps and dies above.

Later I removed the final tiers and installed the remaining heat sinks; I will work on the Tier 3 wiring harness tomorrow. As you can see the rack is now empty, all tiers have been reassembled, and I am awaiting the custom-cut acrylic pieces. I can't wait to get the entire cluster re-assembled and cranking back up to help me test network security, do some docker work, and crack passwords using John the Ripper.

2020-01-23 I reworked the Tier 3 5v wiring harness and tested it. At this point I am simply going to have to wait for the acrylic pieces to arrive. I connected the 5v wiring harnesses to all of the tiers and tested power. Everything is working; all systems are powering up.

I checked the power consumption per tier (22 nodes) and it's running around 4.5A per tier, or about 13.5A for all three tiers on the 5-volt side. I checked total power consumption on the 12v side into the power supply and it's about 6A. None of the systems have a micro-sd card installed, nor are they connected to ethernet. By default each system boots into Android, so I assume the cpus are totally idle at this point.

The reason I am concerned about the current draw is that I am using a different power supply, 12V 30A, and I am depending on it really being able to handle up to 30A: when the cluster cpus were 100% busy I was seeing at least 20-21A being drawn on the 12-volt side. I am now satisfied that the power-up surge is going to be fine with this new supply, but I won't know for sure whether it can handle 20-21 amps until the cluster is back in full operation.
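Running the numbers as a sanity check (this is just back-of-the-envelope arithmetic on the figures above; the efficiency estimate assumes the 12v input feeds only the 5v converters):

    # Rough power budget from the measured figures in this entry.
    tiers = 3                        # 66 nodes wired as 3 harnesses of 22 nodes
    idle_5v_per_tier_a = 4.5         # measured per-tier draw on the 5v side
    idle_12v_in_a = 6.0              # measured input draw on the 12v side
    full_load_12v_a = 21.0           # worst case previously seen at 100% cpu
    supply_rating_a = 30.0           # rating of the new 12V supply

    idle_5v_w = tiers * idle_5v_per_tier_a * 5      # ~67.5 W delivered at 5v
    idle_12v_w = idle_12v_in_a * 12                 # ~72 W drawn at 12v
    conversion_eff = idle_5v_w / idle_12v_w         # ~94% implied converter efficiency
    headroom_a = supply_rating_a - full_load_12v_a  # ~9 A margin at full load

    print(f"idle: {idle_5v_w:.1f} W out / {idle_12v_w:.1f} W in "
          f"(~{conversion_eff:.0%} implied efficiency)")
    print(f"full-load margin on the 30A supply: {headroom_a:.0f} A")

On paper that leaves roughly 9A of headroom at full load, but as I said, I won't know for sure until the cluster is running flat out.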

As I mentioned earlier, I have not yet re-attached the 12-volt SLA battery. I am still working out where to attach the battery wiring so that I can use 10-AWG cables.

2020-01-25 I had a very busy day working on the cluster. There is a lot to cover for today, including some observations about how lame the combination of the Cubieboard's dynamic mac address and Ubuntu 18.04's Network Manager is. Excruciating.

First, because the mac address and the uuid changed on the network interface, Network Manager won't initialize the interface. You have to mount the filesystem and comment out the uuid and mac-address values in "Wired Connection 1" (use 1 for simplicity, nuke anything else…), which will allow the interface to initialize once.

Log in and edit /etc/NetworkManager/system-connections/Wired Connection 1, updating the uuid and mac-address for the new system board: run nmcli c show and get the uuid from there, and get the mac address from the output of ifconfig eth0. This is very straightforward to accomplish but tedious.
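For what it's worth, here is a minimal sketch of how that same edit could be scripted on a node. It is only an illustration and assumes the connection really is named "Wired Connection 1", the interface is eth0, and the script runs as root; adjust as needed:

    # Sketch: look up the uuid NetworkManager reports for the connection, read
    # the interface's current mac address, and write both back into the keyfile
    # (replacing the lines that were commented out earlier).
    import re
    import subprocess

    CONN_NAME = "Wired Connection 1"   # connection/keyfile name assumed from above
    KEYFILE = "/etc/NetworkManager/system-connections/" + CONN_NAME
    IFACE = "eth0"

    def current_mac(iface):
        with open("/sys/class/net/%s/address" % iface) as f:
            return f.read().strip()

    def connection_uuid(name):
        out = subprocess.check_output(
            ["nmcli", "-t", "-f", "NAME,UUID", "connection", "show"],
            universal_newlines=True)
        for line in out.splitlines():
            conn, _, uuid = line.partition(":")
            if conn == name:
                return uuid
        raise SystemExit("connection %r not found" % name)

    def patch_keyfile(path, uuid, mac):
        with open(path) as f:
            text = f.read()
        # Rewrite the (possibly commented-out) uuid= and mac-address= lines.
        text = re.sub(r"(?m)^#?\s*uuid=.*$", "uuid=" + uuid, text)
        text = re.sub(r"(?m)^#?\s*mac-address=.*$", "mac-address=" + mac.upper(), text)
        with open(path, "w") as f:
            f.write(text)

    if __name__ == "__main__":
        patch_keyfile(KEYFILE, connection_uuid(CONN_NAME), current_mac(IFACE))
        print("keyfile updated; reboot and check the static IP comes up")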

This nonsense with Network Manager was an unintended consequence of rebuilding the tiers with the cpu boards reinstalled in no particular order. Curses to Network Manager. I never suspected this surprise was waiting for me and it added a lot of extra work. Very Lame!

Once you update "Wired Connection 1" with the current values for mac-address and uuid, reboot and ensure it comes up with the right static IP address (as configured for each system.)

This procedure would probably also have to be performed if you were moving your system hard drive to another server. I think it's more appropriate for a server to manually hardcode everything and skip Network Manager altogether, but you should probably use configuration management if you are dealing with more than a few hosts. I have to do this manually for 66 nodes; so far I've completed about 20 of them.

The last stages of rebuilding were the hardest for me. There was a lot of drilling and measuring and clamping, and I did really well until the very last tier (Nodes 56-66), when I broke the corner of one of the new acrylic plates I had just obtained. Then I realized that my failure to order spare pieces of acrylic meant I had to place a new order to get a replacement (I glued it for now, but it looks ugly…).

I have a few images from today's work; you can see the re-assembly in its various stages.

I ordered replacement acrylic, so that problem should be solved in a week or so. I have nodes 24-66 connected to the network and am slowly reconfiguring Network Manager (read above). Needless to say it is more than tedious, but there is no easy way to automate re-configuring the uuid and the mac-address since they both changed.

I had to repair the AC power cord; the wires broke internally where the solder had hardened the wire braid. I ended up having to use crimp-on connectors and was not able to solder the power wiring because of that problem.

I completed assembly of the entire cluster but managed to crack an important acrylic plate, and must now wait again for parts to arrive. I also finished reconfiguration of about 15 nodes before ending my work day.

2020-01-26 I completed reconfiguration of nodes 34-66 and ran stress tests on that part of the cluster. Everything works well. I am now reconfiguring nodes 1-33, which will take all day. Here are some pictures of the current state of the re-assembly.
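For anyone curious what a stress test can look like, here is a minimal cpu burn-in sketch: one busy worker per core for a fixed time so the current draw can be watched at 100% cpu. The duration is arbitrary and this is only an illustration, not the exact tool I used:

    # Minimal cpu burn-in: one busy worker per core for a fixed duration.
    import multiprocessing as mp
    import time

    DURATION = 600   # seconds of full load (arbitrary)

    def burn(stop_at):
        x = 0
        while time.time() < stop_at:
            x = (x * 1103515245 + 12345) % 2147483648   # keep the alu busy

    if __name__ == "__main__":
        stop_at = time.time() + DURATION
        workers = [mp.Process(target=burn, args=(stop_at,)) for _ in range(mp.cpu_count())]
        for w in workers:
            w.start()
        for w in workers:
            w.join()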