Offgrid internet-in-a-box project - Part four
This is part of a series I am writing about a fictitious scenario where I am moving offgrid for an entire year and want to bring some tech with me. Part one is the rules of the game and the hardware I chose. Part two is which distro. Part three is prepping software for offline installs.
Each of the previous posts in this series covered a broad range of topics. This post is specifically about saving webpages and entire websites, then accessing those sites completely offline. The tools I’m using to do this are zimit, SingleFile, and Kiwix.
Other uses
I have been writing these posts about a made-up scenario where I’m moving to a cabin that has limited power and no internet so I can build something fun.
However, given the current political climate in the US (where I am based) and other countries, this information can be used to prepare for potential information access disruption. I am not addressing that in these posts, but I trust you know how to use your imagination to extrapolate this guide into other uses.
Backing up single webpages
For years I’ve been testing various bookmarking systems that save an offline version of a webpage, building a much smaller, personal version of Archive.org. As a digital hoarder archivist, I feel it is my responsibility to save the things I enjoy on the internet. If you don’t save it, you’ll lose it.
Using Archivebox (good)
Currently, I run a private instance of Archivebox and I think it is an important project. Archivebox allows you to save complete webpages in various formats and then makes them searchable via a web dashboard.
Nothing against the Archivebox project. I love it and use it in my homelab. But it is complicated to set up and use. It is also heavy to run. Last, I don’t like how it organizes data. For example, I set up a local instance and tested it on my laptop. The page it saved was located in:
/mnt/storage/archivebox/archive/1738624915.169958/github.com/ArchiveBox/ArchiveBox/wiki
The biggest pain point is the ID-number directory. It is not how I want to find the pages I saved, and it makes managing the files harder.
For this project and for my daily use, Archivebox is not the right tool. I want something that is just a click away, saves an entire webpage, and just drops a file into a folder I can click to open.
SingleFile (best)
SingleFile is a browser extension that saves a webpage as a single HTML file, while also having many powerful tools built in. Not only can you save a complete webpage, but you can annotate it, upload it to cloud storage, and automate snapshots. No other tools are needed beyond the extension.
I recommend using SingleFile with Firefox or one of the Firefox derivatives, rather than a Chromium-based browser. Since Chrome introduced Manifest V3, there are limitations for SingleFile you may encounter. The project has been working diligently to make saving pages on Chrome better, but in my opinion, just use Firefox.
There are many ways to use SingleFile. Here is how I use it:
- Set file format to “self-extracting ZIP (universal)” to keep file sizes smaller.
- Open the “save as” dialog to confirm file name.
- Remove date and time from file name.
- Set the destination of the file to my WebDAV server, which is then synced to my computers and phone via Syncthing.
- Uncheck scripts, videos, and audio in the blocked resources list. I want the full page and all of its contents. This can make the saved page large, but I want all the info. For context, I saved this XDA post for setting up Debian on Android in Termux, which includes many pictures and a video. Without scripts and videos, the page size is ~5MB. With everything, the page is ~84MB. You choose what is best for you.
Now when a page is saved, it is synced to all my computers and can be opened with any web browser. It is also simple to find as every page is in a single directory. I can then use the OS search tool or fzf to find any of my archives.
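If you prefer the terminal, that lookup can be a tiny helper. This is a sketch rather than my actual setup: the `find_archive` name and folder path are hypothetical, and it assumes saved pages end in `.html`:

```shell
# find_archive: case-insensitive filename search over a folder of
# saved SingleFile pages (hypothetical helper; adjust the path).
find_archive() {
  find "$1" -type f -name '*.html' | grep -i -- "$2"
}

# Example: find_archive ~/Archives/singlefile termux
```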
For the offgrid and offline project, all I need to do is transfer this folder over to the storage drive I’m taking with me and then I have access to every page I ever saved. If you are starting from scratch, you can batch save webpages.
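Batch saving can also be scripted with single-file-cli, the command-line companion to the extension. A minimal sketch, assuming one URL per line in a text file; `batch_save` is my own hypothetical name, and `SAVE_CMD` exists only so you can swap in extra flags or do a dry run:

```shell
# batch_save: read URLs (one per line) from the file given as $1 and
# save each one with single-file-cli. Override SAVE_CMD to add flags
# or to dry-run (e.g. SAVE_CMD=echo).
batch_save() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue                    # skip blank lines
    ${SAVE_CMD:-single-file} "$url" || echo "failed: $url" >&2
  done < "$1"
}

# Example: batch_save urls.txt
```

Add a short sleep inside the loop if you are saving many pages from the same site.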
Saving whole sites
Since this scenario has me offline for an entire year, there are some sites I want a complete copy of so I can reference them constantly. Examples are cooking/recipe blogs or fanfiction sites. This is where the combination of zimit and Kiwix is perfect for this build.
zimit
Using zimit you can crawl and scrape an entire site and then access it offline with a zim reader, which in this case is Kiwix. There are many options with zimit and I encourage you to peek at their GitHub to see what you can do with it.
Here is an example using Docker to run zimit on my site:
docker run -v /path/to/storage/zimit:/output --shm-size=2g ghcr.io/openzim/zimit zimit --url https://blog.ctms.me --name dom_blog --workers 2 --waitUntil domcontentloaded
In this command I’m telling it where to save the files, the shm-size for the browser used in Docker, the URL to save, how many workers, and to wait until the content is loaded.
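To avoid retyping all of that per site, the command can be wrapped in a small function. A sketch, reusing the same storage path as above; `zimit_site` is my own hypothetical name, and the `DOCKER` variable is there only so the wrapper can be dry-run:

```shell
# zimit_site: crawl one site into a zim file with a single worker,
# so we do not hammer the target server. Override DOCKER (e.g.
# DOCKER=echo) to print the command instead of running it.
zimit_site() {
  "${DOCKER:-docker}" run -v /path/to/storage/zimit:/output --shm-size=2g \
    ghcr.io/openzim/zimit zimit \
    --url "$1" --name "$2" \
    --workers 1 --waitUntil domcontentloaded
}

# Example: zimit_site https://blog.ctms.me dom_blog
```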
I love this tool and use it occasionally to save an entire website into my archive. A couple things to remember:
- It is a snapshot in time. There is no way to crawl a site and only update changes. You’ll need to crawl it again.
- Please be mindful of the site you are scraping. Don’t send dozens of workers and DDoS the site. Please don’t scrape it continuously, driving up their hosting fees. Use sparingly and respectfully.
Accessing the saved sites (and more)
Kiwix is an awesome tool. Kiwix not only allows you to view your custom zim sites, but also lets you download sites from its own library, including an entire copy of Wikipedia. Kiwix offers a server, desktop and mobile apps, and a progressive web app for viewing these files.
In my offline build, I wanted 4-5 custom websites (not listed here), plus several from the Kiwix library. I need reference material, such as Wikipedia, to browse and learn from while offline. I also want things like a medical dictionary, gardening info, and survivalist data in case I need it. Last, I want info on the various computing devices I have around this cabin, such as StackExchange, AskUbuntu, and even iFixIt.
Here are the Kiwix library zim files I grabbed:
- Wikipedia (with images)
- Android StackExchange
- Arch Linux Wiki
- AskUbuntu
- Gardening StackExchange
- iFixIt
- MDWiki
- ServerFault
- StackOverflow
- SuperUser
- Termux Wiki
- Unix StackExchange
- Wiktionary
- Mankier
- Ready.gov
- Canadian prepper
- Urban prepper
- Military medicine
- zimgit: Food preparation, knots, post disaster, and water
IMPORTANT: Which version of Kiwix you use is critical to viewing your archives. For any site you save with zimit, you need a version of Kiwix that supports service workers so it can display https sites. Pages that come from an https endpoint need to be served over https.
This means if you are running the Kiwix server via Docker, you need to use a reverse proxy with an SSL cert. To get around this, I recommend using the official Kiwix PWA, which will display any zim file on the PC, and using the official Kiwix app on Android. Using Chrome on the desktop, you can create a standalone desktop file from the PWA for a “near native” experience. Just remember not to clear your cache when offline as you’ll lose the PWA.
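As an illustration of that reverse-proxy setup, here is a hedged docker-compose sketch pairing kiwix-serve with Caddy, which handles TLS certificates automatically. The paths and port are placeholders, and you should verify the kiwix-serve image’s entrypoint and flags against its own docs before relying on this:

```yaml
# Hypothetical compose file: kiwix-serve behind a TLS-terminating proxy
# so service workers can run. Paths/ports are examples only.
services:
  kiwix:
    image: ghcr.io/kiwix/kiwix-serve
    command: ["--port", "8080", "*.zim"]
    volumes:
      - /path/to/zim:/data
  caddy:
    image: caddy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
```

The Caddyfile would then contain something like `kiwix.example.com { reverse_proxy kiwix:8080 }`, with a domain you control pointed at the host.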
Conclusion
This guide was for downloading and archiving web pages and sites to view them offline in a situation where I am “planning” to be offline for an entire year.
However, as I mentioned above, we can all see how access to information may become more difficult over the next four years and beyond. Using tools like SingleFile, zimit, and Kiwix is important for all of us to create our own archives of the parts of the web we enjoy or information that needs to be kept and shared.
I want more people to become archivists. I wish more people would create their own little archives and keep them saved in their homes. It is on us to save what matters. Anyone who has spent time online knows the web isn’t forever.
- - - - -
Did you like this post? Give it an upvote by clicking on the arrows below! Sending me an upvote is like you and I giving each other a high five.
🙏 😎
Thank you for reading! If you would like to comment on this post you can start a conversation on the Fediverse. Message me on Mastodon at @cinimodev@masto.ctms.me. Or, you may email me at blog.discourse904@8alias.com. This is an intentionally masked email address that will be forwarded to the correct inbox. If you enjoy the random stuff I write here, post to Mastodon, or watch on YouTube, and are feeling generous, I am open to tips on Ko-fi.