
11/4/2025 - Critical Server Outage - TrueNAS Boot Failure - Service Restored - Temporary Solution Implemented - DT <6hrs

Earlier today I dealt with a critical server failure that resulted in a complete loss of service. The TrueNAS SCALE server shut down unexpectedly, and on restart the system was unable to boot due to boot pool corruption. All media streaming and automation services were unavailable while emergency recovery procedures were underway.

Issue Summary

  • Date: November 4, 2025

  • Duration: 9:46am EST - 3:45pm EST (~6 hours)

  • Affected Services: All services (Plex, media automation, file sharing)

  • Root Cause: TrueNAS boot pool failure following unexpected system shutdown compounded by storage pool at 100% capacity

  • Status: Services Restored - Temporary solution in place

Reported outage at: 9:46am EST
Services restored at: 3:45pm EST
Current status: Online - All services operational

Issue identified: The TrueNAS SCALE system powered off unexpectedly after approximately 2 months of stable operation. On restart, the system reported "Cannot import 'boot-pool': no such pool available", preventing the server from booting. Investigation revealed that the primary data pool (a 218TB enterprise media pool) had reached 100% capacity, causing cascading failures including boot pool corruption and VM suspension. The full pool blocked critical system operations and caused I/O errors that prevented normal recovery procedures.

Resolution Implemented:

Emergency storage expansion - Added 60TB of additional storage (3 x 20TB drives in RAIDZ1 configuration) providing 40TB usable capacity, bringing pool utilization down to 85.7%. This provided sufficient headroom for system operations to resume.
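For reference, the usable-capacity math behind that expansion, as a quick sketch. RAIDZ1 gives up one drive's worth of capacity to parity, and the ~240TB prior-capacity figure below is back-solved from the 85.7% quoted above rather than stated anywhere, so treat both as illustrative:

```python
def raidz1_usable_tb(num_drives: int, drive_size_tb: float) -> float:
    """RAIDZ1 stores one drive's worth of parity, so usable
    capacity is (n - 1) drives. Ignores ZFS metadata overhead."""
    return (num_drives - 1) * drive_size_tb

def utilization_pct(used_tb: float, old_usable_tb: float,
                    added_usable_tb: float) -> float:
    """Pool utilization (%) after new usable capacity is added,
    assuming the amount of stored data stays the same."""
    return 100 * used_tb / (old_usable_tb + added_usable_tb)

added = raidz1_usable_tb(3, 20)          # 3 x 20TB drives in RAIDZ1
print(added)                             # -> 40 (usable TB from 60TB raw)

# Assumed prior capacity of ~240TB, effectively full:
print(round(utilization_pct(240, 240, added), 1))  # -> 85.7
```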

Boot pool recovery - Successfully imported boot pool and restored system boot capability after storage pressure was relieved.

VM restoration - Restarted the VM and expanded its virtual disk from 210TB to 230TB, then extended the filesystem partitions to use the new capacity, bringing media storage utilization down from 96% to a more sustainable level.
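The arithmetic behind that drop in utilization, as a sketch. It assumes the amount of stored data stayed fixed while the disk grew; real figures will differ slightly with filesystem overhead:

```python
# ~201.6TB used at 96% of the original 210TB virtual disk
used_tb = 0.96 * 210

# Same data on the expanded 230TB disk
new_utilization = 100 * used_tb / 230
print(round(new_utilization, 1))  # -> 87.7
```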

System verification - Confirmed all services online: Plex media server, automation services (*arr stack), and file sharing restored to full operation.

Temporary Solution Notice: The current fix provides immediate service restoration but is a temporary measure. A permanent solution is scheduled for implementation within the next few days.

Permanent Solution (Scheduled):

  • Additional 60TB of storage arriving in 2-3 days

  • Long-term capacity headroom for future growth

  • Monitoring alerts to prevent future capacity issues

  • Review and optimization of storage allocation strategies

ETA for Permanent Fix: 2-3 days pending hardware delivery
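One way the planned capacity alerting could work, as a minimal sketch. It parses the tab-separated output of `zpool list -H -o name,capacity`; the 80% threshold and the pool names in the example are illustrative, not taken from this server:

```python
ALERT_THRESHOLD = 80  # percent; illustrative, tune to taste

def pools_over_threshold(zpool_list_output: str,
                         threshold: int = ALERT_THRESHOLD):
    """Parse `zpool list -H -o name,capacity` output and return
    (pool, capacity%) pairs at or above the threshold."""
    alerts = []
    for line in zpool_list_output.strip().splitlines():
        name, capacity = line.split("\t")
        pct = int(capacity.rstrip("%"))
        if pct >= threshold:
            alerts.append((name, pct))
    return alerts

# Example command output, not real data from this server:
sample = "boot-pool\t12%\ntank\t86%\n"
print(pools_over_threshold(sample))  # -> [('tank', 86)]
```

In practice a cron job would feed this the live command output and send an email or push notification when the list is non-empty.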

All services are now fully operational. Thank you for your patience during this critical maintenance window.
