11/4/2025 - Critical Server Outage - TrueNAS Boot Failure - Service Restored - Temporary Solution Implemented - Downtime <6 Hours
- Atom 5ive
- Nov 4, 2025
- 2 min read
Earlier today I dealt with a critical server failure that resulted in a complete loss of service. The TrueNAS SCALE server shut down unexpectedly and, on restart, failed to boot due to boot pool corruption. All media streaming and automation services were unavailable while emergency recovery procedures were underway.
Issue Summary
Date: November 4, 2025
Duration: 9:46am EST - 3:45pm EST (~6 hours)
Affected Services: All services (Plex, media automation, file sharing)
Root Cause: TrueNAS boot pool failure following unexpected system shutdown compounded by storage pool at 100% capacity
Status: Services Restored - Temporary solution in place
Reported outage at: 9:46am EST
Services restored at: 3:45pm EST
Current status: Online - All services operational
Issue identified: The TrueNAS SCALE system powered off unexpectedly after roughly two months of stable operation. On restart, it reported "Cannot import 'boot-pool': no such pool available" and refused to boot. Investigation revealed that the primary data pool (218TB enterprise media server) had reached 100% capacity, triggering cascading failures including boot pool corruption and VM suspension. With no free space, critical system operations failed and I/O errors blocked normal recovery procedures.
Resolution Implemented:
Emergency storage expansion - Added 60TB of additional storage (3 x 20TB drives in RAIDZ1 configuration) providing 40TB usable capacity, bringing pool utilization down to 85.7%. This provided sufficient headroom for system operations to resume.
Boot pool recovery - Successfully imported boot pool and restored system boot capability after storage pressure was relieved.
VM restoration - Restarted the VM and expanded its virtual disk from 210TB to 230TB, then extended the filesystem partitions to use the new capacity, bringing media storage down from 96% utilization to a more sustainable level.
System verification - Confirmed all services online: Plex media server, automation services (*arr stack), and file sharing restored to full operation.
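A quick sketch of the capacity math behind the steps above (these are raw drive figures; ZFS-formatted usable space runs a bit lower after metadata and padding overhead, which likely accounts for the reported 85.7% versus the raw ~84.5% computed here):

```python
# Capacity math for the emergency expansion, using raw (unformatted) figures.

def raidz1_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """RAIDZ1 dedicates one drive's worth of space to parity."""
    return (drive_count - 1) * drive_size_tb

def utilization_pct(used_tb: float, total_tb: float) -> float:
    """Pool utilization as a percentage, rounded to one decimal."""
    return round(100 * used_tb / total_tb, 1)

new_usable = raidz1_usable_tb(3, 20.0)   # 3 x 20TB drives in RAIDZ1
print(new_usable)                        # -> 40.0 TB usable

# Pool held 218TB at 100%; adding ~40TB of raw headroom:
print(utilization_pct(218, 218 + new_usable))  # -> 84.5 (raw figure)
```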
Temporary Solution Notice: The current fix provides immediate service restoration but is a temporary measure. A permanent solution is scheduled for implementation within the next few days.
Permanent Solution (Scheduled):
Additional 60TB of storage arriving in 2-3 days
Will provide long-term capacity headroom
Implement monitoring alerts to prevent future capacity issues
Review and optimize storage allocation strategies
ETA for Permanent Fix: 2-3 days pending hardware delivery
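The planned monitoring alerts could be as simple as a scheduled script parsing `zpool list` output. Below is a minimal sketch under stated assumptions: the pool names, thresholds, and notification hook are all illustrative, not the final design.

```python
# Capacity-alert sketch (pool names, thresholds, and the notification
# hook are illustrative assumptions). In production the input would
# come from: zpool list -Hp -o name,capacity

WARN_PCT = 80   # alert well before ZFS performance degrades near full
CRIT_PCT = 90

def parse_zpool_list(output: str) -> dict[str, int]:
    """Parse tab-separated `zpool list -Hp -o name,capacity` output."""
    pools = {}
    for line in output.strip().splitlines():
        name, cap = line.split("\t")
        pools[name] = int(cap)
    return pools

def check(pools: dict[str, int]) -> list[str]:
    """Return alert messages for pools over the thresholds."""
    alerts = []
    for name, pct in pools.items():
        if pct >= CRIT_PCT:
            alerts.append(f"CRITICAL: pool '{name}' at {pct}% capacity")
        elif pct >= WARN_PCT:
            alerts.append(f"WARNING: pool '{name}' at {pct}% capacity")
    return alerts

# Sample input resembling the pre-expansion state:
sample = "boot-pool\t34\ntank\t96\n"
for alert in check(parse_zpool_list(sample)):
    print(alert)  # replace print with an email/webhook notification
```

Run on a schedule (cron or a TrueNAS cron job), this would have flagged the pool long before it hit 100%.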
All services are now fully operational. Thank you for your patience during this critical maintenance window.