Although spam may not be costing much in resources for now (is it? I don’t know), it can ruin the experience for legitimate users. These spam accounts are filled with ads for call girls, shops selling services (cakes, laundry, thrift stores, clinics, duplicated users, you get it), betting sites and, worst of all, stolen bank data. It feels horrible when you’re looking for users to follow (people with interests similar to yours) and you see stuff like this.

Here are a few examples:

https://codeberg.org/roofingrecoveryflorida1

https://codeberg.org/acnetreatment14

https://codeberg.org/tanuoberoy55

At the time of writing, there are about 94,520 accounts (4,726 pages * 20 per page). On pages that contain spam, I found about six to eight spam accounts out of the 20 listed, but the distribution is not consistent: these clusters show up every four to five pages.
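The arithmetic behind those numbers, as a rough sketch. The variable names are mine, and the "if every page looked like this" figures are only an upper bound, since the clusters do not appear on every page:

```python
PAGES = 4726      # number of user-list pages at the time of writing
PER_PAGE = 20     # accounts listed per page

total_accounts = PAGES * PER_PAGE

# Share of spam on a page that contains a cluster: 6 to 8 out of 20.
low_share = 6 / PER_PAGE
high_share = 8 / PER_PAGE

# Upper bound only: assumes EVERY page looks like a clustered page,
# which is not what I observed.
low_bound = total_accounts * 6 // PER_PAGE
high_bound = total_accounts * 8 // PER_PAGE

print(f"total accounts: {total_accounts}")
print(f"spam share on clustered pages: {low_share:.0%} to {high_share:.0%}")
print(f"if every page were clustered: {low_bound} to {high_bound} accounts")
```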

To understand how bad the situation is: if you sort users by newest, spam accounts show up all the way from page 1 (Feb 19, 2024) to page 4300 (March 24, 2021). I did not check beyond page 4300, but there could be spam accounts there too. I did not check every page, because it is impossible to read them all. Instead, I jumped between pages in multiples of 100 and checked whether the nearby pages (+/-10) had spam. I suspect that almost one-third of the accounts were created by bots and now lie dormant.
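For anyone who wants to repeat the spot check, here is a minimal sketch of the page-jumping approach. The function name and parameters are hypothetical, and it lists every page inside each +/-10 window, which is more than I actually read by hand:

```python
def pages_to_check(last_center=4300, step=100, window=10, max_page=4726):
    """Return the sorted page numbers reached by jumping in multiples of
    `step` and looking at `window` pages on either side of each stop."""
    pages = set()
    for center in range(step, last_center + 1, step):
        for page in range(center - window, center + window + 1):
            # Skip page numbers that fall outside the user list.
            if 1 <= page <= max_page:
                pages.add(page)
    return sorted(pages)


sample = pages_to_check()
print(f"{len(sample)} candidate pages, from {sample[0]} to {sample[-1]}")
```

This only tells you which pages to look at. Deciding whether an account on a page is spam is still manual.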

I am not an expert on dealing with spam, but here are a few ideas. Spam account usernames follow the pattern `<company_name><number>`. That alone is not a good way to block users, since legitimate usernames also contain numbers. However, these accounts also want to advertise services, so they fill the user description with a description of the service. Then there are also the images and the URL.
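To illustrate the username pattern, here is a small sketch. The regular expression is my own guess at what `<company_name><number>` means (letters followed by trailing digits), not a rule taken from any existing spam filter:

```python
import re

# Assumption: "<company_name><number>" means a name made of letters
# (optionally with '-', '_' or '.') followed by one or more digits.
NAME_PLUS_NUMBER = re.compile(r"[A-Za-z][A-Za-z_.-]*\d+")


def matches_name_plus_number(username: str) -> bool:
    """True if the whole username looks like <name><number>."""
    return NAME_PLUS_NUMBER.fullmatch(username) is not None


for name in ["roofingrecoveryflorida1", "acnetreatment14", "tanuoberoy55", "dev42", "alice"]:
    print(name, matches_name_plus_number(name))
```

Note that a made-up but perfectly plausible username like `dev42` matches too, which is exactly why this can only be one weak signal among several.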

When it comes to escort spam accounts, the website URL matches the account name, and most of them are from India, pointing to Gurgaon/Gurugram. For example, the account Sonia Mittal is apparently “Indian”, but has a stolen profile picture of some random white girl in a choker, and lists its website as soniamittal.in (spoiler: the model is Angelina Danilova, apparently a Russian celebrity). So, in this case, what they are doing is identity theft. And I’m pretty sure there’s some API available for detecting this?

I think there’s enough data to determine whether a given account is spam, right? A majority of these accounts have no repositories, by the way, so that could also be a criterion for removing spam accounts. A few do have repositories, but those are filled with stolen data. And because there’s a word limit on the profile description, simple, performant keyword extraction models could be used here, right?
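Putting the signals together, something like the following could be a starting point. This is only a sketch under my own assumptions: the `Profile` fields, the keyword list and the scoring are all hypothetical, plain keyword matching stands in for a real keyword extraction model, and none of it reflects how the platform actually stores or moderates accounts.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list based on the spam categories I saw.
SPAM_KEYWORDS = {"escort", "betting", "casino", "laundry", "clinic", "roofing"}


@dataclass
class Profile:
    # Hypothetical fields, not the platform's real data model.
    username: str
    description: str = ""
    website: str = ""
    repo_count: int = 0


def spam_score(profile: Profile) -> int:
    """Add one point per weak signal; no single signal is decisive."""
    words = set(re.findall(r"[a-z]+", profile.description.lower()))
    score = 0
    if profile.repo_count == 0:       # no repositories
        score += 1
    if words & SPAM_KEYWORDS:         # advertising keywords in the bio
        score += 1
    if profile.website:               # links out to a website
        score += 1
    if re.fullmatch(r"[A-Za-z][A-Za-z_.-]*\d+", profile.username):
        score += 1                    # <company_name><number> username
    return score


# Both profiles below are made up for illustration.
suspect = Profile("cakeshop7", "Best laundry and clinic services", "https://example.com")
regular = Profile("alice", "Rust developer", repo_count=3)
print(spam_score(suspect), spam_score(regular))
```

A high score should probably only queue the account for human review rather than delete it automatically, since new legitimate users also start with zero repositories.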