To preface, I’m currently rewriting a personal webapp to use MySQL instead of storing everything in hundreds of JSON files. I’m currently in the testing phase of generating tables with the data from the JSON files, destroying the tables, adding more columns and data, repeat, all to make sure everything is working as intended.
My issue is that occasionally I’ll create too many columns and then I get an error saying something about the row being too large? I’ve also noticed that if I change the parameters of what data is allowed to go in the column, I can generate more columns. I know there is some relationship between number of columns, the data that can go in a column, data size, and row size but I don’t know what’s going on. I’d appreciate it if someone could broadly go over how row length(?) can affect number of columns.
Thank you
The answer is complicated, but TL;DR, MySQL has a hard coded limit of 65,535 bytes (that’s 2^16 - 1, or 64k) per row. This excludes column types ‘text’ and ‘blob,’ which your storage engine will store separately. Each type of column you have requires a certain amount of storage space. For instance, this is why on a ‘varchar’ column you have to specify how many characters it can hold. Longer fields, more data required. Add all this up – probably plus some overhead or other quirks I don’t know off the top of my head – and if the result is more than 64k per row, you’re toast.
MySQL supports multiple types of storage engines, and they can be different for each table. In general, InnoDB is the best one to use for most purposes (I’m sure I’ll catch flak for this opinion) but each one technically has subtle pros and cons. InnoDB also has a column limit of 1017 columns per table, regardless of their composition. Exceed this, you are also toast.
However. If your database design is such that you are running up against row storage limits, you’re probably doing it wrong. You should reconsider how you’re storing data and whether everything in whatever your app considers a single logical ‘record’ actually all has to be in the same table. (It probably doesn’t.) So what are you doing that requires creating more than 1017 or greater than 64k of data per row?
There is a complete, but mildly baffling per usual, explanation of MySQL’s column count and row size limits here: https://dev.mysql.com/doc/refman/8.0/en/column-count-limit.html
For instance, this is why on a ‘varchar’ column you have to specify how many characters it can hold.
That’s likely what I’m doing wrong. I have roughly 220 columns and 95% of them were set to varchar(255) not understanding the size restrictions. A few were set to text(1000) as well (but these don’t count?). I definitely don’t need 255 characters worth of data in all fields
This excludes column types ‘text’ and ‘blob,’ which your storage engine will store separately.
are there any storage limits when using text or blob? I assume that the lookups are slower?
So what are you doing that requires creating more than 1017 or greater than 64k of data per row?
64 kilobytes, correct?
And I suppose it’s a glorified project book generator. 220ish variables per project and once it’s all entered into the webapp, I can generate a few hundred page PDF. Each variable of info isn’t very big, most are one word variables, a few are like 3 sentences; each row worth of info fits in a max 15KB JSON file.
At the moment, I was just putting all the data of each JSON file in one table. As I progress, variables like contacts will be put in a separate contact table and I’ll reference them (once I figure out how to do that lol). Based on the way the work is split up to collect the data on each project, I could logically split the table into 3 but I was hoping to avoid that just to reduce complexity and have everything, minus contacts, easily viewable on one table
220 columns in one table? Yeah, you definitely need to do some research and learn how to design a schema that isn’t based on incredibly wide tables.
If you had to ball park it, what’s the max number of columns you would use per table?
Right now I’m considering splitting the table into 4. 3 tables per person/job and 1 for contacts.
Or, within each job there are 5ish main topics. So I could have 2 tables. One table for contacts and one table 15 columns wide but I’d store JSON in the cells. The data between each row isn’t related to another row except by contacts.
Is one option more correct than the other?
In a perfect world, I’d say 20 columns per table max, but shoot under that if possible. This isn’t always feasible, and I’ve definitely had some fat tables in some legacy apps I’ve worked on. But 220 is just unmanageable, especially if you’re doing a select * against that table ever in your app.
Thank you for the info, I appreciate it
If your variables are inconsistent in number but relatively consistent in format, and can be uniquely identified, it is probably a better idea to have a table that’s all variables – one per row. Your table structure would be something like
id
as INT or LONGINT,variable_name
as VARCHAR(x), andvariable_value
as TEXT. When you look up a record, you SELECT * FROMvariables
WHEREid
=whatever and parse the results. Note that in this case, the table will not have a primary key. You’ll be able to have more than one row with theid
of whatever, which matches the ID of your document. You can keep whatever metadata about the document in another table, which will hopefully be short.Having everything stored in text fields will not necessarily make lookups slow, but it may make retrieval of the data in them slow if they contain a lot of data and there are an awful lot of them. Especially if you retrieve it all the time when maybe you don’t have to. It will also make your app temporarily contain a lot of data in memory while it’s holding the result of the SQL call.
In SQL, finding a row (the seek or lookup) is a very different procedure to returning the data within it once found. The amount of time and CPU cycles it takes to find a given row can be quite high, especially if your tables are not efficiently designed and do not have suitable indexes, or you have to use a complex query to narrow it down. Once found, returning the data should take the same amount of time no matter what, dependent only on how much of it there is. Also, if your returned result is huge that will make it super slow if your data has to be piped over an external connection. If your app and the database live on the same machine the data transfer from database to app can be pretty fast even if the result is huge. If they’re on separate machines and that data has to be squeezed through a network connection, though, that’s going to be painful.
Consider that:
SELECT * FROM
table
WHEREid
=1and
SELECT * FROM
table
WHEREid
REGEXP(‘#^1$#’)Will result in very different lookup times despite superficially accomplishing the same thing. The second one is going to take longer. If your table is long, it will take a lot longer. You could even cause it to exceed the query time limit of your connection if you’re not careful. The first is just finding an ID by value, on a column that should (hopefully) be indexed. The second is using a regular expression to match the digit “1” as a string, which must be compared against every single value in column
id
one at a time in a full scan of every row in the table. Full table scans are slow and expensive, and you should avoid them whenever possible.
Just spitballing here, but it sounds more like you need a nosql database instead of a relational one. Those are designed to ingest store and search things like json objects.
I’ve played around with MongoDB before. Do you know if nosql can still have relationships? Meaning, about the only data that will be reused per row would be contact info. In my app’s current state with everything in JSON files, I have the same contacts rewritten 10s of times. When contact info changes which happens often enough, I have to search through all the files and update the contacts in each place. I really want there to be 1 contact table that can be referenced from.
You would create a contacts object, with a unique reference id number.
Then all your other records would point to that unique reference id number, and would not hold any contacts data.
Changes to the customer would be done in the contacts object.
Queries could be done that join both groups of data.