@namnnumbr

namnnumbr · 1 year ago

Look into beautiful soup (bs4) for parsing html web pages. Once you have that cleaned, you should be able to do any kind of NLP modeling over it.
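
As a rough idea of what "cleaned" means here, a minimal sketch follows. It uses the standard library's `html.parser` as a stand-in so it runs without installing anything; bs4 does the same job more conveniently and copes better with messy markup. The `TextExtractor` class, the `html_to_text` helper, and the sample posting are all made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical job posting, made up for illustration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Data Engineer</h1><p>Build <b>ETL</b> pipelines.</p></body></html>"
)
print(html_to_text(sample))
```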

Both Huggingface and SpaCy have pretty good tutorials/walkthroughs for tokenizing your data, doing entity extraction, etc.

That said, it would help to figure out what you want to do with your project. For example, you could try to identify keywords relevant to job titles, or you could use similarity metrics to recommend new roles based on the current one.
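
To make the similarity idea concrete, here is a minimal sketch using bag-of-words cosine similarity with only the standard library. This is an assumption about what "similarity metrics" could look like, not the only option; in practice you would probably drop stopwords and use TF-IDF or embeddings from one of the libraries above. The role names and descriptions are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(current, candidates, top_n=2):
    """Return the titles whose descriptions are most similar to `current`."""
    cur = Counter(tokenize(current))
    scored = [
        (cosine_similarity(cur, Counter(tokenize(desc))), title)
        for title, desc in candidates.items()
    ]
    scored.sort(reverse=True)
    return [title for _score, title in scored[:top_n]]


# Hypothetical data, made up for illustration.
roles = {
    "Data Engineer": "build python etl pipelines and sql data warehouses",
    "ML Engineer": "train and deploy machine learning models in python",
    "Graphic Designer": "design logos and brand illustrations",
}
current = "python developer writing sql and etl jobs"
print(recommend(current, roles))
```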

namnnumbr · 1 year ago

https://yetch.store/products/every-day-goal-calendar For a physical/digital device. … way more expensive than I remember it being (especially given it’s single-user), but it’s a pretty good incentive to keep up with less-fun habits

namnnumbr · 1 year ago

Authy is lovely in that it just works, but it is hellacious to migrate off of if you change your mind.

I also don’t love that Authy is owned by Twilio, a communications/marketing service company.

namnnumbr · 1 year ago

A password manager can be considered critical infrastructure; beyond privacy and uptime/access considerations, you should also consider what happens if you lose all of your data - Do you have backups? Are the backups 3-2-1 redundant? Do you have a ready-to-go docker compose to get yourself up and running locally in a pinch?
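
If it helps to make the 3-2-1 question less abstract, here is a small sketch that checks an inventory of copies against the rule (3 copies of the data, on 2 different kinds of media, with 1 copy offsite). The `BackupCopy` type and the example inventory are hypothetical; this only checks a list you maintain by hand, it does not inspect real backups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str   # e.g. "ssd", "external-hdd", "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media, with >= 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


# Hypothetical inventory for a self-hosted vault.
copies = [
    BackupCopy("live server volume", "ssd", offsite=False),
    BackupCopy("nightly dump on NAS", "external-hdd", offsite=False),
    BackupCopy("encrypted export in object storage", "cloud", offsite=True),
]
print(satisfies_3_2_1(copies))
```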

I self-hosted bitwarden (vaultwarden) for several years and it became evident to me that it was important enough to use the hosted service - especially as I was already paying Bitwarden to support their open source business.

namnnumbr · edit-2 1 year ago

https://www.zimaboard.com/

recent blog from hacker news

I can’t personally attest to the “easy to use self hosting OS” since I immediately installed Ubuntu (soon to be Debian) but the hardware is good and the preinstalled OS should let you get a feel for things.

namnnumbr · 1 year ago

I think I get what you’re after now. I’ll have to think on this further - interesting problem!

namnnumbr · 1 year ago

IMO there is a difference between adding “knowledge” and adding “facts”. You can fine tune in domain knowledge but it will be prone to hallucination. To ground the instructions, you’d need to introduce RAG for fact lookup; possibly with a summarization step if you want to bring in large bodies of facts.

namnnumbr · 1 year ago

I don’t think fine tuning works the way you think it does; one does not generally fine tune to “add facts”. This might be useful: https://nextword.substack.com/p/rag-vs-finetuning-llms-what-to-use

I’d advocate for using the RAG pattern to do the lookups for the new facts. If needed, you can fine tune the model on top to output for your specific domain or format.
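
A very rough sketch of the RAG pattern, to show where the facts come from: look up the relevant facts at question time and put them in the prompt, rather than hoping they were trained into the weights. Keyword overlap stands in for a real retriever here (normally embeddings plus a vector store), there is no model call, and the facts, the "Zephyr" product, and the function names are all invented.

```python
import re

# Hypothetical fact store, made up for illustration.
FACTS = [
    "The Zephyr API rate limit is 100 requests per minute.",
    "Zephyr support tickets are answered within two business days.",
    "The office cafeteria closes at 3pm on Fridays.",
]


def tokenize(text):
    """Lowercase word set used for crude keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, facts, k=2):
    """Return up to k facts sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(facts, key=lambda f: len(q & tokenize(f)), reverse=True)
    return [f for f in ranked[:k] if q & tokenize(f)]


def build_prompt(question, facts, k=2):
    """Assemble a grounded prompt; the result would be sent to the LLM."""
    context = "\n".join(f"- {f}" for f in retrieve(question, facts, k))
    return (
        "Answer using only the facts below. If they are not enough, say so.\n\n"
        f"Facts:\n{context}\n\nQuestion: {question}"
    )


print(build_prompt("What is the Zephyr API rate limit?", FACTS))
```

Fine tuning would then sit on top of this, shaping how the model answers for your domain or format, while the retrieved facts stay editable without retraining.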

namnnumbr · 1 year ago

That’s fair; I guess it depends on what your threat model is — kind of like how using a vpn can just expose you to your vpn service while ostensibly protecting you from your service provider.

To me, the improved search results from kagi and the disconnect between search and ad-and-tracking companies are worth it. But that may not be a fit for anyone else.

namnnumbr · 1 year ago

I strongly advocate for Kagi. Yes, it’s paid search, but it means that there is no tracking or ad revenue concerns obfuscating the search results.

namnnumbr · 1 year ago

IIRC, the biggest issue with TrueNAS SCALE + Docker is that they really run the containers on a ‘hidden’ kubernetes cluster and obfuscate the standard docker and docker-compose way of doing things behind a gui with limited customization and poor field descriptions.
I found it much easier to spin up a VM on SCALE and run docker through that, although then you have to deal with multilayer networking.

… To be fair, this was when SCALE was still in beta, so it has possibly improved since then.

namnnumbr · 1 year ago

It’s not just every tech company, it’s every company. And it’s terrifying - it’s like giving people who don’t know how to ride a bike a 1000hp motorcycle! The industry does not have guardrails in place, and the public attitude of “chatGPT can do it”, without any thought to checking the output, is horrifying.

namnnumbr · edit-2 1 year ago

python packaging authority and pytest have pretty good resources on standard repo structure; poetry is a new-kid-on-the-block tool to get started developing packages quickly (i.e., standard repo config, handles dependency environments / works as build tool)

re: formatting & style – others have mentioned black; I also recommend ruff to lint/standardize your code to many accepted best practices.

import is kind of a clusterf, because you can have absolute imports (import packagename) and relative imports (from . import x, or from .x.y import z), and importing a project-in-development can depend on the IDE you use (i.e., vscode and pycharm are generally smart enough to figure out that a src/ dir in the workspace should be importable, but not always). Running pip install -e . from the project root installs your project-in-dev in “editable” mode and makes it available for import. The modules docs may help here.
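
Here is a small self-contained sketch of the two import styles. It writes a throwaway package (the name `demo_pkg` is made up) into a temp directory and puts that directory on `sys.path`, which is only a stand-in for what an editable install gives you, namely that the project directory becomes importable.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_and_import():
    """Create a tiny package on disk and import it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "helpers.py").write_text("def greet():\n    return 'hello'\n")
        # Relative imports: resolved against the package this file lives in.
        (pkg / "__init__.py").write_text(
            "from . import helpers\nfrom .helpers import greet\n"
        )

        sys.path.insert(0, tmp)  # stand-in for making the project importable
        importlib.invalidate_caches()
        try:
            # Absolute imports: resolved by searching sys.path.
            demo_pkg = importlib.import_module("demo_pkg")
            helpers = importlib.import_module("demo_pkg.helpers")
            return demo_pkg.greet(), helpers.greet()
        finally:
            sys.path.remove(tmp)


via_relative, via_absolute = build_and_import()
print(via_relative, via_absolute)
```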

Package management/locking is a (relatively) rapidly evolving part of the python ecosystem. Because Python can be so dependent on the packages installed in the environment, simply managing the python version (like you would with pyenv) is insufficient, and it is recommended to create pseudo-hermetic virtual environments per project (venv, virtualenv, poetry, or conda help with this). I can’t help with pyenv (I use conda); this might be helpful. I think you would use pyenv to manage the python version and then venv or virtualenv to manage the installed packages. Personally, I would first get used to managing virtual environments with venv and then deal with pyenv later if you decide you need multiple python versions.
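
For the venv part, a minimal sketch using the standard library's `venv` module from Python, which does roughly what `python -m venv .venv` does on the command line. The `.venv` directory name and the `create_project_env` helper are just conventions chosen for this example, and `with_pip=False` is only there to keep the demo fast and offline; for real use you would normally want pip inside the environment.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir):
    """Create a per-project virtual environment in <project_dir>/.venv."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False skips bootstrapping pip (assumption: demo only).
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as project:
    env = create_project_env(project)
    # pyvenv.cfg marks the directory as a virtual environment.
    has_cfg = (env / "pyvenv.cfg").is_file()
    print(has_cfg)
```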

namnnumbr · 1 year ago

I’d recommend OPNsense over PFsense due to multiple shady moves by netgate (the parent company of pfsense), including moving to closed-source:

pfsense is falsely open-source: https://news.ycombinator.com/item?id=26476030
pfsense botched/rushed their wireguard implementation: https://forum.endeavouros.com/t/migration-from-pfsense-to-opnsense-drama-about-wireguard/12798
pfsense squatted on competitor domain and used underhanded/defamatory practices: https://opnsense.org/opnsense-com/

If you don’t mind the drama, both PFsense and OPNsense are perfectly competent router OSes.

Regarding hardware:

OPNsense also sells rack-mountable server hosts.
OP may not actually need a rack-mounted server – I have several machines just sitting on a 2u rack-mounted shelf. My opnsense install runs on a cheap protectli box, and there’s enough room for a handful of raspberry pis and their power bricks on the shelf next to it.