Before we started working on this, we were working on an idea around data pipelines. It didn’t take off, so we had to pivot mid-batch. We had less than a month to do user interviews, build our product, get it approved, and launch it. To start with, we knew we wanted to stay in the data space. We spent the first week talking to over 30 people at different companies. In those conversations, we noticed that sales, marketing, and operations teams constantly need to ask developers to export data from the database for them, often by submitting Jira tickets, and they frequently have to submit follow-up requests because they forgot to include a dimension needed for their analysis. This is an inefficient yet surprisingly common workflow at companies today. Perhaps even more surprising is that data scientists at big companies like Facebook, which have invested heavily in data infrastructure and analytics tools, often use spreadsheets as a step in their data analysis workflow. While they can pull data for analysis themselves, they too export data to CSV and then open it in a spreadsheet. This means that their data doesn’t update automatically and the whole process has to be repeated each time.
At first, we were considering building our own BI tool to solve this problem. However, during our conversations we noticed that people feel new-tool fatigue, especially when their companies rotate through different tools that are used for the same purpose. For instance, we’ve heard of companies going from Tableau to Looker and back to Tableau, or from Kibana to Sisense to Looker. Sometimes it feels like companies are paying a lot of money for tools that people aren’t really using, just because it’s the thing everybody “needs to have”. Each tool comes with its own structure, data modeling, and steep learning curve. It's overwhelming. At some point people realize that whatever tool they learn will likely change once a new VP gets hired and wants to do things in a new way. In anticipation of that, they default to the one tool they know how to use and that likely won’t be replaced soon: spreadsheets. Spreadsheets also happen to be an elegant solution for simple calculations, quick pivots, and high-level data exploration. They are elegant in a way that no $50k-a-year enterprise visualization tool can be - not because those tools can’t do these things, but because people don’t know how to do them quickly in those tools. It’s also hard to change workflows for people who live in spreadsheets. Spreadsheets are pretty good for ad-hoc analysis and for the summaries that go into presentations and reports.
So instead of another BI tool, we built a Google Sheets add-on that connects to a database and lets users search through tables, filter and order the data, and then load it directly into Sheets. In the future, we’ll let users schedule data refreshes for any saved query, so all their calculations and pivot tables update with the latest data from the database. We’ve built our first integration with Postgres and are testing connectors for MySQL, MongoDB, and public datasets. We plan to add integrations with more data sources, including data warehouses such as Snowflake and BigQuery.
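To make the "pick a table, filter, order, load" flow concrete, here is a minimal sketch of how such selections could be turned into a parameterized, SELECT-only query. It is an illustration under assumptions, not the add-on's actual code: the function name `build_select_query`, the `(column, operator, value)` filter shape, the default row limit, and the `%s` placeholder style (as used by common Postgres drivers) are all hypothetical.

```python
import re
from typing import Any, List, Optional, Sequence, Tuple

# Hypothetical helper names; nothing here is taken from the actual product.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def quote_ident(name: str) -> str:
    """Allow only simple identifiers and wrap them in double quotes."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def build_select_query(
    table: str,
    columns: Sequence[str],
    filters: Sequence[Tuple[str, str, Any]] = (),
    order_by: Optional[str] = None,
    descending: bool = False,
    limit: int = 1000,
) -> Tuple[str, List[Any]]:
    """Return (sql, params) for a read-only query; values never enter the SQL text."""
    cols = ", ".join(quote_ident(c) for c in columns) if columns else "*"
    sql = f"SELECT {cols} FROM {quote_ident(table)}"
    params: List[Any] = []

    clauses = []
    for column, op, value in filters:
        if op not in _ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        clauses.append(f"{quote_ident(column)} {op} %s")
        params.append(value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)

    if order_by:
        sql += f" ORDER BY {quote_ident(order_by)}"
        sql += " DESC" if descending else " ASC"

    # Always cap the result size; spreadsheets cannot hold unbounded results.
    sql += " LIMIT %s"
    params.append(int(limit))
    return sql, params


sql, params = build_select_query(
    "orders", ["id", "total"], [("total", ">", 100)],
    order_by="total", descending=True, limit=50,
)
```

Keeping identifiers on an allow-list and passing values as parameters means user input from the sidebar never gets concatenated into the SQL string. Note that double-quoted identifiers are case-sensitive in Postgres, so a real implementation would need to match the names exactly as they appear in the schema.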
We implemented it so that it only gets read access to the database, and we don’t store connection credentials or replicate data in our own database. We are focused on data analysis at the moment, but a few customers have requested features for writing back to the database as well. We’re not sure yet how to do this safely with spreadsheets, or whether we should do it at all. We’ve heard suggestions such as database write rollbacks or an intermediate queue of writes that table owners can approve.
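As a rough illustration of the approval-queue suggestion, the sketch below stages each write and only applies it once the table's owner approves. This is purely hypothetical and not something that exists in the product: the names `PendingWrite`, `WriteQueue`, `submit`, and `review`, the one-owner-per-table mapping, and the `apply_fn` callback standing in for the real database write are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class PendingWrite:
    table: str
    row: Dict[str, Any]
    requested_by: str
    status: str = "pending"  # pending -> approved | rejected


class WriteQueue:
    """Holds proposed writes until the owner of the target table reviews them."""

    def __init__(
        self,
        owners: Dict[str, str],
        apply_fn: Callable[[str, Dict[str, Any]], None],
    ) -> None:
        self._owners = owners    # assumed mapping: table name -> owner
        self._apply = apply_fn   # stand-in for the actual database write
        self._items: List[PendingWrite] = []

    def submit(self, table: str, row: Dict[str, Any], requested_by: str) -> int:
        """Stage a write and return a ticket number; nothing is written yet."""
        if table not in self._owners:
            raise KeyError(f"no owner registered for table {table!r}")
        self._items.append(PendingWrite(table, dict(row), requested_by))
        return len(self._items) - 1

    def review(self, ticket: int, reviewer: str, approve: bool) -> str:
        """Apply or discard a staged write; only the table owner may decide."""
        item = self._items[ticket]
        if item.status != "pending":
            raise ValueError(f"ticket {ticket} was already {item.status}")
        if self._owners[item.table] != reviewer:
            raise PermissionError(f"{reviewer!r} does not own {item.table!r}")
        if approve:
            self._apply(item.table, item.row)
            item.status = "approved"
        else:
            item.status = "rejected"
        return item.status


applied = []
queue = WriteQueue({"orders": "alice"}, lambda table, row: applied.append((table, row)))
ticket = queue.submit("orders", {"id": 1, "total": 20}, requested_by="bob")
queue.review(ticket, reviewer="alice", approve=True)
```

The point of the design is that the spreadsheet side never holds write credentials; it can only propose changes, and the write path runs separately after a human decision.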
Other challenges include: (1) Building data connectors is hard because each data source spits out data differently, which means we have to parse each one differently to make it accessible through a uniform interface. (2) We make connections to databases and perform all queries using AWS Lambda functions (thank you, serverless!). Some queries take minutes to execute, but AWS drops the connection after 29 seconds. (3) It is hard and, as HN often points out, risky to build a product that depends on Google. We're also running into Google Sheets' row limits and are still figuring out a way to work around them for large databases. We’ve been bouncing around ideas for building a back end that performs the heavy computations while displaying only a sampled subset of the data in the spreadsheet. We think it's worth it to plunge ahead despite all these issues, though, because having a spreadsheet interface to their data is very much what our users want. We plan to charge users a monthly fee for use of our product, and we'll have an HN discount on our basic plan.
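To illustrate the connector challenge in (1), here is a minimal sketch of normalizing two differently shaped results, SQL-style tuples plus column names and nested documents like those from a document store, into one header-plus-rows grid that a spreadsheet can display. The function names (`flatten`, `rows_from_sql`, `to_grid`), the dotted-key convention for nested fields, and the use of an empty string for missing values are assumptions made for the example, not a description of the real connectors.

```python
from typing import Any, Dict, Iterable, List, Sequence


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into a single level with dotted keys, e.g. user.name."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def rows_from_sql(
    column_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> List[Dict[str, Any]]:
    """Pair each SQL result tuple with its column names."""
    return [dict(zip(column_names, row)) for row in rows]


def to_grid(records: Iterable[Dict[str, Any]]) -> List[List[Any]]:
    """Build [header, row, row, ...]; columns appear in first-seen order."""
    flat_records = [flatten(record) for record in records]
    header: List[str] = []
    for record in flat_records:
        for key in record:
            if key not in header:
                header.append(key)
    grid: List[List[Any]] = [header]
    for record in flat_records:
        # Records that lack a column get an empty cell.
        grid.append([record.get(column, "") for column in header])
    return grid


sql_grid = to_grid(rows_from_sql(["id", "name"], [(1, "a"), (2, "b")]))
doc_grid = to_grid([{"id": 1, "user": {"name": "a"}}, {"id": 2, "tags": "x"}])
```

Each connector then only has to produce a list of dictionaries, and everything downstream (filtering, loading into the sheet) can work against the same grid shape.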
We’d love to hear about your data analysis workflows, the tools you use, any problems you’ve had getting data for analysis, and of course, your thoughts and experiences on this use of spreadsheets! Questions and specific integration requests are also welcome, and if you would like to be a beta user, feel free to email us at founders@castodia.com.