RESOLVED Gliffy Online System Outage

**Please scroll down to the comments section for additional updates.**

3/21/16

Gliffy Online is currently experiencing a system failure. We are working as quickly as possible to restore access to our system. We will update this message as we learn more, and will also post updates at the top of https://www.gliffy.com/.

Thank you for your patience, and we apologize for the inconvenience this causes. No data has been lost in this outage.

We will update this post with an ETA and additional information as they become available.

3/21/16 8:30am PST

We are still working on the issue. Unfortunately, Support cannot access your diagrams during the outage, so all updates and ETA information will be posted here. Thanks for your patience.

3/21/16 11:00am PST

We have located the issue and are actively working to resolve it. We appreciate everyone's continued patience and we will post a confirmed ETA once we have one. 

3/21/16 1:00pm PST

We have further pinpointed the issue and are close to a resolution. We hope to be able to give an ETA in the next update.

3/21/16 4:00pm PST

Hi everyone, we're truly sorry for the inconvenience this is causing, and we have all hands on deck trying to correct the problem. Rest assured, this is our top priority and we are doing everything possible to get everyone access to their diagrams. We are hoping to have this resolved in the next several hours. We appreciate the continued support and patience you've all expressed.

If you would like automatic notifications when this article is updated, click the "subscribe" link in the upper right corner of this article (you must be logged in to Zendesk). Thanks again.

18 Comments

  • Katy Kelly

    3/21/16 5:30pm PST

    We hope to have this resolved tonight. Unfortunately, we cannot provide an exact ETA at this time, as several variables are involved. Keep checking back for updates. Thanks again for your continued patience.

  • Katy Kelly

    3/21/16 8:00pm PST

    We discovered an issue in one of our backup systems last Thursday night (03/17). Maintenance was scheduled to resolve the issue over the weekend. While working to resolve it, an administrator accidentally deleted the production database.

    The good news is that we have copies of our database that are replicated daily, up until the exact point in time when the database was deleted. We are working hard to retrieve all of your data.

    We have been in the process of restoring our database. Due to its sheer size, a restore can take days to complete. While that is running in the background, we’ve been attempting different tactics in parallel to restore the database and your data more quickly. If one of the current attempts is successful, we could be online as early as tomorrow morning, Pacific Standard Time (PST). However, we will not know for certain until it has been completed.

    We feel like we have failed you, our customers, and you expected better from us. After the restoration is completed, we will be taking a hard look at our processes and procedures to understand how and why this happened in the first place, and if there are other issues to be resolved as well. It is important to us that we meet the needs of our customers and ensure there isn’t a recurrence of this issue or others that could hinder your productivity.

    Please stay tuned for further updates.

  • Chris K

    3/22/16 3:19am PST

    We have 3 parallel and redundant restore processes running to increase our chances of successfully bringing up the system today. We are investigating starting a 4th restore process in case the first 3 fail for some reason. We believe one of the 3 restore processes will be complete in the next several hours. Once one of the restore processes is complete, our engineering team will have additional work to do to ensure data integrity.

    I'm not able to provide an ETA for complete system restoration, since we won't know whether a restore process has succeeded until it finishes. I can say that it's unlikely we'll be up and running in less than 3 hours from now.

    Chris Kohlhardt, CEO

     

  • Chris K

    3/22/16 5:32am PST

    The three restore processes are each between 70% and 80% complete. Again, once one of the restore processes is complete, our engineering team will have additional work to do to ensure data integrity and get the system running.

  • Chris K

    3/22/16 7:24am PST

    Unfortunately one of the restore processes failed because it used significantly more disk space than we anticipated. The other restore processes have been configured with more disk space to reduce the chance of this problem happening again. 

    We have added another restore process and we again have 3 restore processes running in parallel.

    We estimate that one of the restore processes will complete in 12 hours; if it succeeds, additional work will be required to bring it fully online.

    A second restore process could take as long as 24-36 hours to complete.

    A third restore process was just started and we do not have an estimate for completion at this time.

    We are actively looking into other options that will bring the system online more quickly. 

    It goes without saying that we are very sorry for the impact this has had on our customers. We will report more as we have more to share. 

    Chris Kohlhardt, CEO

  • Eric Chiang

    3/22/16 12:00pm PST

    Our restore processes are running smoothly.

    I wanted to give everyone a little more information about the 3 restore processes that Chris has discussed.

    The first restore process is running significantly faster than the others because we have provisioned much more powerful hardware to accelerate it. However, this hardware exists outside of our production facility. Rest assured that we have maintained the same strict security measures that we have in place for our production environment. We anticipate this restore will complete in about 8 hours. After that, we will need to prepare it and move it into our production facility, which I estimate will take about 4 hours.

    The second restore process is running directly in our production facility. We anticipate it will take more than 24 hours to complete. That is outside our target timeframe for restoring availability to our customers, but we are letting it continue to run as a backup plan.

    Our third restore process is a backup to the first restore process and will only be needed if the first one fails for whatever reason.

    In summary, we’re hoping to have our systems back up and running in the early hours of tomorrow morning.

    Our engineers are working around the clock along with our hosting provider to get our systems up and running as soon as possible. We apologize for the inconvenience and appreciate your patience through this process.

    Thanks,

    Eric Chiang

    Head of Engineering

  • Eric Chiang

    3/22/16 3:30pm PST

    Our restore progress continues to be on track and we're feeling confident that we'll be able to meet our previously proposed timeframes. 

    We're currently working with our hosting provider to provision more space to accommodate our restored database. This will most likely be the long pole in our process, but we're hoping they will come through for us before the end of the day today.

    Thanks,

    Eric Chiang

    Head of Engineering

  • Eric Chiang

    3/22/16 5:00pm PST

    Some good news to share: our hosting provider was able to provision the storage we needed within the timeframe we requested. Our recovery processes are also running smoothly.

    Things continue to be on track!

    Stay tuned.

    Thanks,

    Eric Chiang

    Head of Engineering

  • Eric Chiang

    3/22/16 9:00pm PST

    Our recovery process has completed, and we're beginning to transfer our restored database to our production facility. This process should take 4+ hours to complete.

    We'll post another update when this has finished.

    Thanks,

    Eric Chiang

    Head of Engineering

  • Eric Chiang

    3/23/16 12:30am PST

    The data transfer is taking longer than expected. At this rate we're expecting it to complete closer to 11am.

  • Eric Chiang

    3/23/16 8:30am PST

    The data transfer is going at the previously estimated rate. An update will be provided when that completes.

     

  • Eric Chiang

    3/23/16 11:00am PST

    We've restored our database backup from Saturday night into our production environment. We are now attempting to recover the remaining data, from Saturday night to Sunday night, by replaying the binary logs from the local system. This will recover all of your data up to the point of our outage.

    We'll make sure our replication process is running again before starting up our application and restoring access to your diagrams.

  • Eric Chiang

    3/23/16 2:15pm PST

    We're finishing up testing of our Saturday-to-Sunday data restore and will be applying it to our production data very soon. We'll try to provide an update within the next hour.

  • Eric Chiang

    3/23/16 3:30pm PST

    We're crossing our t's and dotting our i's. We're preparing our secondary database servers to set up replication. Another update soon.

     

  • Eric Chiang

    3/23/16 5:30pm PST

    Our secondary database servers have been seeded with data. We're working on final configurations on the databases now.

  • Eric Chiang

    3/23/16 8:00pm PST

    All databases are now catching up on Sunday's data. Once they have caught up, we'll start our application servers.

  • Eric Chiang

    3/23/16 9:00pm PST

    We are doing an internal smoke test with our application servers active. As long as no major issues surface, we'll be providing access soon.

  • Eric Chiang

    3/23/16 9:30pm PST

    It's been a long few days. There have been great words of support from our fantastic users. We're happy to report that Gliffy is back up and running!

    All data has been restored. We've made a full recovery up to the very point of our initial disaster.

    If you have any trouble or find any issues, please contact support@gliffy.com and we'll be ready to assist.