Recently, a group published an interactive, JavaScript-based website for graphically exploring data broker companies. This is just one of many groups doing similar research work in different fields. I applaud the cause, but I take issue with the format.

By that I mean an organization or group that frequently needs to provide structured data; in turn, developers might want that data in order to build apps on top of it.

Interactive websites seem flaky to me, since nobody guarantees they will still be there two years from now. It seems only natural to me that groups doing important work would do their communities a great service if they also served the data through a RESTful or GraphQL API, depending on its complexity.
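To make this concrete, here is a minimal sketch of such a read-only RESTful endpoint in Python/Flask. The file name brokers.json and the record fields are hypothetical placeholders, not anything the actual site exposes.

```python
# Minimal sketch of a read-only RESTful API over an investigative dataset.
# Assumes a hypothetical flat file "brokers.json" containing a list of records.
import json

from flask import Flask, abort, jsonify

app = Flask(__name__)

with open("brokers.json") as f:
    BROKERS = json.load(f)  # e.g. [{"id": "acme", "name": "Acme Data Inc."}, ...]

@app.route("/brokers")
def list_brokers():
    # Return the whole collection; a real deployment would paginate.
    return jsonify(BROKERS)

@app.route("/brokers/<broker_id>")
def get_broker(broker_id):
    for record in BROKERS:
        if record.get("id") == broker_id:
            return jsonify(record)
    abort(404)

if __name__ == "__main__":
    app.run()
```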

But even then, if the group stops serving the API, or is coerced into stopping, or access to the API is blocked, this great service is discontinued. So obviously the raw data must be shared as well for this to work.

Lately I have been thinking about these edge cases. Journalists or activists doing this type of work may lack the technical sophistication to structure the data in useful ways. They probably do the journalistic work and then have a developer, either hired or part of the group, make the important backend decisions, including how to structure the raw data.

Regarding retention of the data in case the group disbands or goes away, there are existing solutions like torrenting the datasets or pinning them on IPFS. Both methods can keep the data online indefinitely, but what about content integrity and versioning? The group would still need a static webpage or something similar to publish the hashes, and IPFS is by design not well suited to versioning.
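For the integrity and versioning part, one low-tech option is to publish a per-release hash manifest alongside the torrent/IPFS links, so any mirror can be verified. A rough sketch; the directory layout and version string here are made up:

```python
# Sketch: build a per-release manifest of SHA-256 hashes for a dataset directory,
# which can then be published and mirrored alongside torrent/IPFS links.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(data_dir: str, version: str) -> dict:
    root = Path(data_dir)
    return {
        "version": version,  # e.g. a date-based release tag
        "files": {
            str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()
        },
    }

if __name__ == "__main__":
    # "dataset/" and the version string are placeholders.
    manifest = build_manifest("dataset", "2024-06-01")
    Path("MANIFEST-2024-06-01.json").write_text(json.dumps(manifest, indent=2))
```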

There are no clear-cut guidelines on how to go about this, or at least no documented handful of good approaches that a current or future group could rely on to deliver this type of work.

Another idea that popped into my head is that the ecosystems of repositories and package managers are very mature in all major distributions. Structured data could be uploaded to distro repositories (including F-Droid and the like), just like any other software with underlying data structures. Hashing and versioning would then be handled natively by existing package managers. But the question remains: what data structure is best suited for this kind of relational data, and what kind of API should be exposed to the user?
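To sketch what that could look like for the consumer, a data package could install its files under a conventional path together with a version file, and a developer would read it as below. The install path and file names are a hypothetical convention of mine, not an existing standard.

```python
# Sketch: how a consumer might load a dataset shipped as a distro package.
# The install path and file names below are a hypothetical convention.
import json
from pathlib import Path

DATA_ROOT = Path("/usr/share/databroker-dataset")  # hypothetical install location

def load_dataset() -> tuple[str, list[dict]]:
    version = (DATA_ROOT / "VERSION").read_text().strip()
    records = json.loads((DATA_ROOT / "brokers.json").read_text())
    return version, records

if __name__ == "__main__":
    version, records = load_dataset()
    print(f"dataset version {version}, {len(records)} records")
```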

So, if you feel like it, I would like to hear your thoughts on:

  1. Skills and preparations required by investigative teams to publish structured data to the world.
  2. Assessment of the torrenting and IPFS solutions to ensure recovery of the data in perpetuity.
  3. Assessment of the RESTful or GraphQL format to disseminate investigative data.
  4. Assessment of using established package managers and repositories to disseminate investigative data.
  5. Ideas on what should be eventually exposed to the user, who can be assumed to be a developer as well.
  6. Further comments.

I would be glad to get some feedback on these thoughts.

  • OneMeaningManyNamesOP · 3 days ago

    I think Arise is something I had seen before, and at the time it motivated these thoughts. It is a Bash-based static site generator that, according to its docs, is built with the philosophy of minimizing language requirements and other dependencies.

    I would argue that a solution like this is better than heavily nested JSON files, or a cascade of OrderedDicts in Python, or even a db.sqlite that would require the user to parse or query the data somehow (see the sketch below). In fact, a user could retrieve the static site from their own distro package manager and run it in Bash with minimal dependencies.
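    To illustrate the alternative I am arguing against: with a db.sqlite the user first has to learn the schema and write a query before seeing anything. The table and column names here are invented for the example.

    ```python
    # Sketch of what consuming a db.sqlite dump looks like for the end user:
    # they need to know the (hypothetical) schema before they can read anything.
    import sqlite3

    conn = sqlite3.connect("db.sqlite")
    conn.row_factory = sqlite3.Row

    # "entries" and its columns are invented for the example.
    for row in conn.execute("SELECT name, category FROM entries ORDER BY name"):
        print(row["name"], "-", row["category"])

    conn.close()
    ```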

    I haven’t tested this solution yet, but it looks very promising and close to what I originally had in mind.

    • OneMeaningManyNamesOP · 3 days ago

      As an innocuous example of sharing data with pure Bash and Arise, a fan group has preserved the Trigedasleng dictionary, the fictional language from the science-fiction/young-adult show The 100, after another fan site was taken down. They use a GitHub repo as the data backend and Arise as a static-site generator for GitHub Pages. All their data are stored in many version-controlled JSON files instead of a database, which, according to the authors, democratizes the process of forking and adding data to the repository.
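      The nice property of that layout is that contributors only ever touch small JSON files, and a consumer script can merge them trivially. A rough sketch of such a consumer; the directory name and per-file structure are my assumptions, not their actual schema.

      ```python
      # Sketch: merge a directory of small, version-controlled JSON files into one list,
      # the way a consumer of such a repo might. Paths and field layout are assumed.
      import json
      from pathlib import Path

      def load_entries(repo_dir: str) -> list[dict]:
          entries = []
          for path in sorted(Path(repo_dir, "data").glob("*.json")):
              with path.open() as f:
                  entries.append(json.load(f))  # one dictionary entry per file
          return entries

      if __name__ == "__main__":
          entries = load_entries("trigedasleng-repo")  # hypothetical local checkout
          print(f"loaded {len(entries)} entries")
      ```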