Bitmain Antminer S9K ASIC Maintenance Manual - Page 12

Common fault types of the S9K S9SE computing board:
1. Cooling fin falls off, shifts, or deforms
The cooling fins on the back of the PCB, behind the computing board chips, must not shift or touch one another before power-on, especially fins that sit in
different voltage domains. Contact between cooling fins in different voltage domains can short-circuit points that are at different voltages.
Also make sure that every cooling fin on the computing board conducts heat well and is firmly fixed.
When replacing or reinstalling a cooling fin, clean the residual thermally conductive adhesive off both the fin and the chip, then apply a fresh coat. The
residue can be cleaned with absolute alcohol.
2. Impedance imbalance across voltage domains
When the impedance of a voltage domain deviates from the normal value, there is an open or short circuit in that domain. A chip is the most likely cause.
Each voltage domain contains three chips, and usually only one of them is faulty when the fault occurs.
To find the faulty chip, measure the impedance to ground at the test points of each chip and compare the readings to find the abnormal point.
If there is a short circuit, first remove the cooling fins from the chips in that voltage domain, then check whether solder is bridging the chip pins.
If no short-circuit point is visible, locate it with the resistance method or the current cut-off method.
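The comparison of ground impedance readings can be sketched in Python. This is only an illustration, not a Bitmain tool: the function name `find_abnormal_points`, the 20% tolerance, and the sample readings are all assumptions, and the readings themselves still have to be taken by hand with a multimeter.

```python
from statistics import median

def find_abnormal_points(readings, tolerance=0.2):
    """Return the test points whose ground impedance deviates from the median.

    readings: dict mapping a test point label to its impedance in ohms.
    tolerance: allowed relative deviation (0.2 = 20%, an assumed value).
    """
    typical = median(readings.values())
    return [
        label
        for label, ohms in readings.items()
        if abs(ohms - typical) > tolerance * typical
    ]

# Hypothetical readings for the three chips of one voltage domain.
domain = {"U4": 410.0, "U5": 12.0, "U6": 405.0}
print(find_abnormal_points(domain))  # prints ['U5']
```

Using the median as the reference keeps a single shorted chip from skewing the "normal" value, which fits the case described above where only one of the three chips is faulty.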
3. Voltage imbalance across voltage domains
When the voltage of a voltage domain is too high or too low, there is usually an abnormal IO signal in that domain or an adjacent one, which makes the next
voltage domain work abnormally and unbalances the voltage. The abnormal point can usually be found by measuring the signal and voltage at each test
point; in some cases the impedance of each test point has to be compared as well.
Note that the CLK and NRST signals are the two most likely causes of a voltage imbalance.
4. Lack of chips
Lack of chips means that the test box detects fewer than all 60 chips. The chip that is actually lost (undetected) is often not at the position the test box
displays, so the abnormal chip has to be located accurately through testing.
The RI cutoff method can be used to locate the abnormal chip: ground the RI signal of one chip and see how many chips the test box reports. For example, after
the RI output of the 50th chip is grounded, the test box should display 50 detected chips if all the chips before it are normal.
If fewer than 50 chips are detected, the abnormality is before the 50th chip; if 50 chips are detected, the abnormal chip is after the 50th chip.
Repeat this bisection (dichotomy) to narrow down where the abnormal chip is located.
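The bisection can be written out as a short Python sketch. The function name `locate_abnormal_chip` and the callable `detected_when_grounded` are hypothetical: the callable stands for the manual step of grounding the RI output of chip n, running the test box, and reading the detected count. The sketch assumes there is one abnormal chip and that every chip before it is detected normally.

```python
def locate_abnormal_chip(detected_when_grounded, total_chips=60):
    """Bisect for the abnormal chip with the RI cutoff method.

    detected_when_grounded(n) must return the number of chips the test box
    reports after the RI output of chip n has been grounded.
    """
    low, high = 1, total_chips  # the abnormal chip lies in [low, high]
    while low < high:
        mid = (low + high) // 2
        if detected_when_grounded(mid) == mid:
            low = mid + 1  # chips 1..mid are fine, so the fault is after mid
        else:
            high = mid     # the fault is at or before mid
    return low

# Simulation only: pretend chip 37 is abnormal, so grounding RI at or
# after chip 37 never yields the full expected count.
FAULTY = 37

def simulated(n):
    return n if n < FAULTY else 0

print(locate_abnormal_chip(simulated))  # prints 37
```

With 60 chips this needs at most six measurements, instead of checking the chips one by one.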
5. Broken chain
A broken chain resembles lack of chips, but not every undetected chip is abnormal: one abnormal chip makes all the chips after it unreachable.
For example, a chip may work itself but fail to forward information from other chips; the signal chain then ends abruptly at that chip and a large part of it
is lost. This is called a broken chain.
The test box shows where the chain breaks. For example, it may detect only 30 chips. The test box does not run unless it detects the preset number of
chips, so it only displays how many chips were detected. Using the displayed number "30", find the problem by measuring the voltage and impedance at
each test point before and after the 30th chip.
6. No running
No running means that the test box cannot detect any chip information from the computing board and displays "NO hash board". This is the most
common fault, and it has a wide range of possible causes.
1) Abnormal voltage in a voltage domain; find it by measuring the voltage in each voltage domain.
2) An abnormal chip; find it by measuring the signal at each test point:
CLK signal: 0.9V. The signal is passed from chip U1 through to chip U60. The current version has only two crystal oscillators: the Y1 clock is passed from
the 1st chip to the 30th chip, and the Y2 clock from the 31st chip to the 60th chip. Trace an abnormal CLKO signal along the direction of signal
transmission.
CO signal: 1.8V. The signal is passed through chips U1, U2, ..., U60. When bisection shows an abnormality at some point, check forward from that
point.
RI signal: 1.8V. The signal is returned through chips U60, ..., U2, U1; use the direction of the signal to narrow down the cause of the fault. When an S9K S9SE
computing board does not run, this signal has the highest priority, so check it first.
BO signal: 0V. A chip pulls this signal low when it detects a normal RI return signal; otherwise the signal is at high level.
NRST signal: 1.8V. After the computing board is powered and the IO signal is connected, the signal is passed through U1, U2, ..., U60 to the last chip.
3) LDO 0.8V and 1.8V abnormality repair
The normal ground impedance of the LDO 0.8V IC output is 50-100 Ω, and the normal ground impedance of the LDO 1.8V IC output is
0.9 kΩ.
Each computing board has six 1.8V LDOs and twelve 0.8V LDOs. For example, in domain 1 (U1-U10), the 1.8V supply is LDO U61, the 0.8V
supply of U1-U5 is U117, and the 0.8V supply of U6-U10 is U79. Because the LDOs operate in series, a short circuit from an LDO to ground
can be traced with the bisection method (dichotomy): start at the middle chip, remove chips one by one, and replace the chip that turns out to
be faulty.
4) Single-board Pattern NG repair
Check the serial port print log (log information). The nonce response rate of every single chip and of the whole computing board must reach 98%;
if the nonce response rate is lower than 98%, the test reports Pattern NG. Based on the serial port print log, replace the chip with the lowest
nonce response rate first.
5) Whole-machine J: 4 repair
1: The temperature sensing chip position is not stored, so J: 4 is reported. Test the board once with the single-board test jig, which writes the
temperature sensing information into the EEPROM chip.
2: The single-board jig configuration file is wrong (the chips or the BIN level of the computing board do not match the jig configuration file),
so the whole machine reports J: 4.
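The 98% criterion from the Pattern NG item above can be illustrated with a short Python sketch. The function name `pattern_check`, the input layout, and the sample counts are assumptions; the format of the serial port log is not given here, so the counts are assumed to have been read out of the log beforehand.

```python
def pattern_check(nonce_counts, threshold=0.98):
    """Evaluate per-chip and whole-board nonce response rates.

    nonce_counts: dict mapping a chip label to (returned, expected) counts.
    Returns (passed, board_rate, failing), where failing lists the chips
    below the threshold, lowest response rate first.
    """
    rates = {
        chip: returned / expected
        for chip, (returned, expected) in nonce_counts.items()
    }
    total_returned = sum(returned for returned, _ in nonce_counts.values())
    total_expected = sum(expected for _, expected in nonce_counts.values())
    board_rate = total_returned / total_expected
    failing = sorted(
        (chip for chip, rate in rates.items() if rate < threshold),
        key=rates.get,
    )
    passed = board_rate >= threshold and not failing
    return passed, board_rate, failing

# Hypothetical counts for three chips.
counts = {"U1": (912, 912), "U2": (850, 912), "U3": (905, 912)}
passed, board_rate, failing = pattern_check(counts)
print(passed, failing)  # prints False ['U2']
```

The first entry of `failing` is the chip to replace first, matching the rule of starting with the lowest response rate.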
V. Fault Type