DSpace Statistics is fundamentally flawed (and Google Analytics is not the answer)
While it has many flaws, DSpace’s biggest strength has always been its out-of-the-box usability; install and configure it, and all the parts are in place to store, describe, index, search and view your archived material. However, this kind of simplicity comes at the cost of flexibility: you cannot adopt the latest technologies because you are stuck with what DSpace is bundled with. You are also locked into the way the DSpace developers have implemented certain technologies, which may result in major issues further down the track if the core maintainers do not update them in a timely manner.
Discovery and Statistics
One of the biggest flaws of DSpace is the bundling of Solr with the rest of the DSpace ecosystem. This tight coupling has resulted in the inability to upgrade Solr past the version bundled with DSpace.
Solr is a powerful full-text search index, which allows users to query data using natural language and returns a list of results which most closely match the search terms specified. In fact, it functions much like Google and Bing, so users are familiar with Solr even if they have never used it before.
When DSpace was developed, Solr was baked directly into the code-base and was branded as the technology “Discovery”. It was decided statistics would also be captured to Solr and thus became “Solr Statistics”.
Both technologies benefit from Solr’s powerful index and search capabilities but are hamstrung by how tightly they are integrated with the core DSpace code. However, this is a minor inconvenience compared to the dire problem introduced by Solr Statistics…
The fatal flaw
The bundling of Solr with DSpace is inconvenient but it still achieves the goal of powerful indexing and search using modern search engine practices in a single package. However, DSpace’s Solr Statistics suffers from a much greater problem which could result in the complete loss of all your usage data. Why?
Because DSpace uses Solr not just as an index but also as a permanent storage system. Yet Solr was never designed to permanently store data. In fact, the ability to purge and rebuild the index quickly and easily is a core strength of the Solr software.
There ARE ways to rebuild the Solr Statistics index using DSpace log files but what happens if you don’t have access to these log files? Also, DSpace logs take up a lot of disk space; what happens if you rotate the logs or delete them when storage gets low?
Unlike most other software, DSpace does not store statistical information in its database. This means that if your Solr index becomes corrupt or you can no longer access it, and you don’t have the DSpace logs available, your statistics are gone forever!
A real-world problem
We recently experienced this with a migration for a customer who had previously been hosting with another provider. During migration, the provider refused to release any information except for a basic dump of the data and files from DSpace; we could not obtain a copy of the Solr Statistics index or even the DSpace logs. This refusal resulted in the loss of years of statistics, information which the customer relied upon to deliver a quality service to their end-users.
The Solution
Back in 2016 I introduced KnowledgeArc’s analytics platform which uses Matomo (previously Piwik), the industry-leading, open source statistics project. Data is stored in a MySQL database, making it permanent and easily recoverable. By separating the statistics from the core DSpace system, users are not reliant on a single, flawed solution.
But what about Google Analytics? Wouldn’t GA make a great drop-in replacement for Solr Statistics? Google is a third-party entity, which comes with its own problems and limitations. You are beholden to an organization which can turn off your statistics at any time for any reason and whose business depends on monetizing the data it collects. GA also lacks some features which Matomo provides out-of-the-box.
Matomo is open source, and you own the data. You can easily back up the Matomo database and restore your statistics no matter the catastrophe. Matomo also captures download statistics by default, something GA needs to be manually configured to do.
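As a rough illustration of how simple such a backup can be, here is a minimal Python sketch that builds a timestamped mysqldump command for a Matomo database. The database name, user, host and backup directory are assumptions made for the example, not values from any particular installation, and the function name is hypothetical. The sketch only constructs and prints the command; it assumes the password is supplied through a MySQL option file rather than on the command line.

```python
import datetime
import shlex


def build_matomo_backup_command(database, user, host="localhost",
                                backup_dir="/var/backups/matomo", now=None):
    """Return a mysqldump argument list that writes a timestamped dump.

    All names and paths here are illustrative assumptions; adjust them
    to match your own Matomo installation.
    """
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    outfile = f"{backup_dir}/{database}-{stamp}.sql"
    return [
        "mysqldump",
        "--single-transaction",      # consistent snapshot for InnoDB tables
        f"--host={host}",
        f"--user={user}",
        f"--result-file={outfile}",  # write the dump straight to a file
        database,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run or scheduled.
    cmd = build_matomo_backup_command("matomo", "matomo_backup")
    print(shlex.join(cmd))
```

The printed command could be run from a scheduled job, and the resulting .sql file can be loaded back into MySQL to restore the statistics.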
Other options such as Plum Analytics are available, but these services require a (sometimes sizeable) financial outlay. Matomo is freely available and is provided by KnowledgeArc, fully set up, for a reasonable price.
If you want to take back control of your analytics, you must use Matomo.