HP SSDs fail after 32768 hours of operation due to critical bug

Discussion in 'Technical' started by sec_monkey, Nov 28, 2019.


  1. sec_monkey

    sec_monkey SM Security Administrator

    HP SSDs fail after 32,768 hours of operation due to critical bug

    HPE Support document - HPE Support Center

    Hardware Platforms Affected: HPE Synergy 480 Gen9 Compute Module, HPE Synergy 660 Gen9 Compute Module, HPE 400GB 12G SAS Mixed Use-3 SFF 2.5-in SC 3yr Wty MO0400JFFCF Solid State Drive, HPE 800GB 12G SAS Mixed Use-3 SFF 2.5-in SC 3yr Wty MO0800JFFCH Solid State Drive, HPE 1.6TB 12G SAS Mixed Use-3 SFF 2.5-in SC 3yr Wty MO1600JFFCK Solid State Drive, HPE 3.2TB 12G SAS Mixed Use-3 SFF 2.5-in SC 3yr Wty MO3200JFFCL Solid State Drive, HPE 480GB 12G SAS Read Intensive-3 SFF 2.5-in SC 3yr Wty VO0480JFDGT Solid State Drive, HPE 960GB 12G SAS Read Intensive-3 SFF 2.5-in SC 3yr Wty VO0960JFDGU Solid State Drive, HPE 3.84TB 12G SAS Read Intensive-3 SFF 2.5-in SC 3yr Wty VO3840JFDHA Solid State Drive, HPE Synergy 620 Gen9 Compute Module, HPE Synergy 680 Gen9 Compute Module, HPE ProLiant XL270d Gen9 Server, HPE D6020 Disk Enclosure, HPE StoreVirtual 3000 Storage, HPE D8000 Disk Enclosures, HPE ProLiant SL230s Gen8 Server, HPE ProLiant BL460c Gen8 Server Blade, HPE ProLiant BL465c Gen8 Server Blade, HPE ProLiant DL160 Gen8 Server, HPE ProLiant BL420c Gen8 Server Blade, HPE ProLiant DL320e Gen8 Server, HPE ProLiant WS460c Gen8 Graphics Server Blade, HPE ProLiant BL660c Gen8 Server Blade, HPE ProLiant DL560 Gen8 Server, HPE D6000 Disk Enclosure, HPE StoreEasy 1000 Storage, HPE D2220sb Storage Blade, HPE ProLiant SL210t Gen8 Server, HPE StoreVirtual 4335 Hybrid Storage, HPE ProLiant DL580 Gen8 Server, HPE D3000 Disk Enclosures, HPE ProLiant DL160 Gen9 Server, HPE ProLiant DL180 Gen9 Server, HPE ProLiant DL360 Gen9 Server, HPE ProLiant BL460c Gen9 Server Blade, HPE ProLiant DL380 Gen9 Server, HPE ProLiant ML350 Gen9 Server, HPE ProLiant XL230a Gen9 Server, HPE ProLiant DL388 Gen9 Server, HPE ProLiant DL120 Gen9 Server, HPE ProLiant WS460c Gen9 Graphics Server Blade, HPE ProLiant DL580 Gen9 Server, HPE ProLiant BL660c Gen9 Server Blade, HPE ProLiant DL560 Gen9 Server, HPE Apollo 4200 Gen9 Server, HPE Apollo 4500 System, HPE ProLiant XL450 Gen9 Server


    @melbo

    @Lancer

    @VisuTrac
     
  2. VisuTrac

    VisuTrac Your mother wears combat boots Site Supporter+++

    2^15?? hmm, sounds like they were keeping one of the powers in reserve. Not a bug but rather a feature ;)
     
  3. sec_monkey

    sec_monkey SM Security Administrator

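    It is actually a signed 16-bit integer, which has 2^16 possible values:

    -32768 to 32767

    The largest positive value is 2^15 - 1 = 32767, so 32768 hours is the first count that no longer fits.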
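    A minimal Python sketch of that wraparound, assuming (as the post suggests; HPE has not published the firmware source) that the hour counter is stored as a two's-complement signed 16-bit value. The helper name `to_int16` is hypothetical.

```python
def to_int16(value: int) -> int:
    """Reinterpret the low 16 bits of value as a two's-complement signed integer."""
    value &= 0xFFFF  # keep only 16 bits, as a 16-bit register would
    # If the sign bit (bit 15) is set, the stored pattern represents a negative number.
    return value - 0x10000 if value & 0x8000 else value


# Hypothetical power-on hour readings around the reported failure point.
for hours in (32766, 32767, 32768, 32769):
    print(f"{hours} -> {to_int16(hours)}")
```

    One hour past 32767 the stored value flips to -32768. How the real firmware reacts to that impossible reading is not described in the thread.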
     
  4. oil pan 4

    oil pan 4 Monkey+++

    Does turning it off and back on at around 30,000 hours fix it?
     
  5. BTPost

    BTPost Stumpy Old Fart,Deadman Walking, Snow Monkey Moderator

    Not likely... It is a bug in the drive's firmware...
     
  6. sec_monkey

    sec_monkey SM Security Administrator

    Negative. The only known fixes are to upgrade the firmware and pray it does not break things, or to replace the affected drives with new drives from another vendor. In the meantime, the only workaround is to make frequent backups of the data on the affected drives, and those backups have to be stored on reliable drives that are not affected by this firmware bug. Better yet, use external drives from other vendors.

    Also, RAID will not help in this case: several users have reported that all of the SSDs in their RAID arrays failed within 15 minutes of each other.
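    Why RAID does not help here can be sketched in Python. The assumptions are mine, not from the thread: every drive in the array was powered on within minutes of the others, each drive fails when its own power-on count reaches 32,768 hours, and the rebuild time is a made-up number. The drive names, readings and helper functions are hypothetical, and the survival check is deliberately crude.

```python
from dataclasses import dataclass

HOURS_LIMIT = 32768  # reported power-on hour count at which the drives fail


@dataclass
class Drive:
    name: str
    power_on_hours: float  # hypothetical current reading


def failure_spread_hours(drives: list[Drive]) -> float:
    """Hours between the first and the last drive reaching the limit."""
    remaining = [HOURS_LIMIT - d.power_on_hours for d in drives]
    return max(remaining) - min(remaining)


def array_survives(drives: list[Drive], redundancy: int, rebuild_hours: float) -> bool:
    """Crude check: the array survives only if no more than `redundancy`
    drives hit the limit inside any single rebuild window."""
    deadlines = sorted(HOURS_LIMIT - d.power_on_hours for d in drives)
    for i, start in enumerate(deadlines):
        failed_in_window = sum(1 for t in deadlines[i:] if t - start < rebuild_hours)
        if failed_in_window > redundancy:
            return False
    return True


# Four drives installed the same day, powered on a few minutes apart.
raid5 = [Drive(f"bay{i}", 30000 + i * 0.05) for i in range(4)]
print(round(failure_spread_hours(raid5) * 60), "minutes between first and last failure")
print("RAID 5 survives:", array_survives(raid5, redundancy=1, rebuild_hours=12))
```

    With the drives a few minutes apart, the whole array hits the limit well inside one rebuild window, so a single-parity array has no chance to recover.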
     
  7. Lancer

    Lancer TANSTAFL! Site Supporter+++

    Almost amusing, almost. We don't use HP drives anywhere except internal servers. We did have a support bulletin go out when HP released this feature. We've seen a few customers lose every single drive in a RAID set, though. In each case the drives were all originals that had been powered up together when the boxes were installed. Mostly it just took out the OS, since the data resided elsewhere on FC- or NFS-connected arrays.
     
  8. ghrit

    ghrit Bad company Administrator Founding Member

    So the failure, by design or accident, falls at around 4 years of continuous operation, emphasis on continuous. Outside of uses like servers or process control, it would seem that intermittent use extends the calendar service life by however much off time is added to the on time.
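    A quick Python sketch of that arithmetic, assuming the counter advances only while the drive is powered and that the drive is on for the same number of hours every day. The function name is hypothetical, and leap days are ignored.

```python
HOURS_LIMIT = 32768  # reported power-on hours at failure
DAYS_PER_YEAR = 365  # ignores leap days; fine for a rough estimate


def calendar_years_to_limit(hours_on_per_day: float) -> float:
    """Calendar years until the counter reaches HOURS_LIMIT at a fixed daily on-time."""
    if not 0 < hours_on_per_day <= 24:
        raise ValueError("hours_on_per_day must be in (0, 24]")
    return HOURS_LIMIT / (hours_on_per_day * DAYS_PER_YEAR)


for hours in (24, 12, 8):
    years = calendar_years_to_limit(hours)
    print(f"{hours:>2} h/day: {years:.2f} years")
```

    At 24 hours a day this gives about 3.74 years, roughly 3 years and 9 months. Halving the daily on-time doubles the calendar time, which is the point about intermittent use.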
     
  9. 3M-TA3

    3M-TA3 Cold Wet Monkey

    Worked in the manufacturing side of the industry for a while... I only buy Samsung and Micron/Crucial SSDs.

    Don't buy anything made by Hynix, ever. If you are lucky, the chips started out life as Micron or Samsung and were then re-cased and branded Hynix (formerly Hyundai). Yes, they did that frequently, because their QC sucks and at times it wound up being cheaper to do this than to make their own with insane failure rates. This crap shack is the second-largest RAM manufacturer and the third-largest semiconductor manufacturer. Much of the cut-rate RAM you see is in reality Hynix.

    Only buy SSD and RAM from brands that make their own chips.
     
  10. DarkLight

    DarkLight Live Long and Prosper - On Hiatus

    In the case of SSDs, if they have power, the clock is ticking regardless of workload. Also, going by the list in the original post, these are server-only SAS drives. While you can use server drives in a PC, doing so requires a SAS controller, which is expensive overkill for a desktop PC and may not even be available for a laptop.
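    Because the counter tracks powered-on time rather than workload, the failure date of an always-on drive can be projected from its power-on hours alone. A minimal Python sketch, assuming each drive stays powered 24 hours a day; the bay names and readings are made up, and reading the real power-on hours from the drive's health reporting is outside this sketch.

```python
from datetime import datetime, timedelta

HOURS_LIMIT = 32768  # reported failure point, in power-on hours


def projected_failure(power_on_hours: int, now: datetime) -> datetime:
    """Projected failure time for a drive that stays powered 24 hours a day.
    Workload does not matter here: only powered-on time advances the counter."""
    remaining = max(HOURS_LIMIT - power_on_hours, 0)  # already past the limit -> now
    return now + timedelta(hours=remaining)


# Hypothetical readings: bay name -> current power-on hours.
readings = {"bay1": 31000, "bay2": 32700, "bay3": 12000}
now = datetime(2019, 11, 28)
for bay, hours in sorted(readings.items()):
    when = projected_failure(hours, now)
    flag = "URGENT" if when - now < timedelta(days=30) else "ok"
    print(f"{bay}: {hours} h -> {when:%Y-%m-%d} {flag}")
```

    The 30-day threshold for flagging a drive is arbitrary; pick whatever lead time a firmware update or replacement realistically needs.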

    Most non-enterprise external enclosures don’t support SAS, and I haven’t ever found a USB-to-SAS adapter, although I haven’t looked in a couple of years.

    The thing to remember about SAS vs SATA is that you can plug SATA drives into a SAS controller and they will likely work (depending on the controller, driver and OS), but you CANNOT physically plug a SAS drive into a SATA controller, much less make it work.


    An additional concern is if you are a 100% HPE shop in the data center. Many HPE arrays use SAS SSDs for both cache and primary storage, so you could potentially lose all the online/near-line data on both the servers and the SAN/NAS/NFS storage if it is backed by those same SSDs. Worse, if your backups go to a VTL (virtual tape library) that uses those drives, and everything was installed at roughly the same time, you could literally lose the servers, the SAN and all of your backups within hours or even minutes of each other, with zero ability to recover.

    Our monkey-down event was rough and a wake-up call, but if you hit that trifecta, you cut everyone their last paycheck and start selling off hardware, because you are out of business.
     
    Last edited: Nov 29, 2019
  11. sec_monkey

    sec_monkey SM Security Administrator

    @ghrit they ain't gonna make it to 4 years. They fail at about 3 years and 9 months (32,768 hours / 8,760 hours per year ≈ 3.74 years).

    @Lancer @3M-TA3 @DarkLight yep yep. Some shops and even government entities are or were entirely HP or HPE, so it is quite possible that nearly all of their hardware is HP or HPE and could fail at or nearly at the same time, taking everything out.

    this is a yuuuuuuuuge deal if yer affected.

    Also, another batch of firmware is going to be released in December, and the affected list might be longer than initially thought. We will find out eventually.
     
  12. VisuTrac

    VisuTrac Your mother wears combat boots Site Supporter+++

    Planned obsolescence of both the hardware and the data. Wonder how many of these drives are being used in the cloud. Should be a good test of disaster recovery plans.
     
  13. BTPost

    BTPost Stumpy Old Fart,Deadman Walking, Snow Monkey Moderator

    Hey SEC... We aren't using HP Drives in the Monkey Server are we????
     
  14. sec_monkey

    sec_monkey SM Security Administrator

    @BTPost

    nope.

    not HP or HPE.

    Different brand. The current drives require special handling.

    Dunno what the new server has or will have.

    Methinks the OEMs that made the HPE drives have not been officially revealed in the announcement.

    @3M-TA3, you said it was probably Hynix/Hyundai? Is that right, 3M?
     
  15. 3M-TA3

    3M-TA3 Cold Wet Monkey

    No, the story reminded me of how Hynix is garbage, and they have a great deal of the market share in semiconductors, including SSDs and RAM. I thought it was worth mentioning even though it was off topic for the thread. I would, however, be completely unshocked to find out that they made the SSDs.

    That market share, BTW, is due to our tax dollars. Politics requires that S. Korea gets a TON of US aid to keep them as an ally, so we make sure the Korean banking system is solvent. The banking system is controlled by the South Korean government, which wants to ensure a strong economy by keeping big Korean corporations from failing. Hynix is so big the government won't let them fail, so there is no real need for them to be profitable.

    In short, when Hynix needs money, the Korean-controlled banks provide it without question, knowing that Uncle Sam will just write them a bigger check. Who needs QC when money is infinite and essentially provided by the US taxpayer? In fact, when I was in the biz, they were selling far below their manufacturing cost just to gain and maintain market share.
     
  16. sec_monkey

    sec_monkey SM Security Administrator

    Hynix is now part of the SK group and is now known as SK Hynix.

    Their chips can be found in Apple products, HP consumer products, HPE products, IBM, ASUS and Dell systems, and many other devices, including some versions of the Raspberry Pi.

    They definitely do biz with HP and HPE but I cannot say who the OEMs of the SSDs with the bad firmware were.

    That is disturbing; however, I will say better them than China.
     
    Last edited: Nov 29, 2019