Danswer (YC W24) – Open-source AI search and chat over private data

Hey HN! Chris and Yuhong here from Danswer (https://github.com/danswer-ai/danswer). We’re building an open source and self-hostable ChatGPT-style system that can access your team’s unique knowledge by connecting to 25 of the most common workplace tools (Slack, Google Drive, Jira, etc.). You ask questions in natural language and get back answers based on your team’s documents. Where relevant, answers are backed by citations and links to the exact documents used to generate them.

Quick Demo: https://youtu.be/hqSouur2FXw

Originally, Danswer was a side project motivated by a challenge we experienced at work. We noticed that as teams scale, finding the right information becomes more and more challenging. I recall being on call and helping a customer recover from a mission-critical failure, but the error was related to an obscure legacy feature I had never used. For most projects, a simple question to ChatGPT would have solved it, but in that moment ChatGPT was completely clueless without additional context (which I also couldn’t find).

We believe that within a few years, every org will be using team-specific knowledge assistants. We also understand that teams don’t want to tell us their secrets and not every team has the budget for yet another SaaS solution, so we open-sourced the project. It is just a set of containers that can be deployed on any cloud or on-premise. All of the data is processed and persisted on that same instance. Some teams have even opted to self-host open-source LLMs to truly airgap the system.

I also want to share a bit about the actual design of the system (https://docs.danswer.dev/system_overview). If you have questions about any parts of the flow such as the model choice, hyperparameters, prompting, etc. we’re happy to go into more depth in the comments.

The system revolves around a custom Retrieval Augmented Generation (RAG) pipeline we’ve built. During indexing time (we pull documents from connected sources every 10 minutes), documents are chunked and indexed into hybrid keyword+vector indices (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).
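As a rough illustration of the chunking step (not Danswer's actual code), here is a minimal Python sketch that splits a document into overlapping word windows before indexing. The names `Chunk` and `chunk_document`, and the `chunk_size` and `overlap` defaults, are assumptions made up for this example; a real pipeline would more likely count tokens than whitespace-separated words.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    """One indexable piece of a source document (hypothetical structure)."""
    doc_id: str
    chunk_index: int
    text: str


def chunk_document(doc_id: str, text: str,
                   chunk_size: int = 50, overlap: int = 10) -> list[Chunk]:
    """Split text into windows of `chunk_size` words that share `overlap` words.

    The overlap keeps a sentence that straddles a boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(Chunk(doc_id, len(chunks), " ".join(window)))
        # Stop once this window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each resulting chunk would then be written to both the keyword index and the vector index, so either retrieval path can surface it.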

For the vector index (which gives the system the flexibility to understand natural language queries), we use state-of-the-art prefix-aware embedding models trained with contrastive loss. Optionally, the system can be configured to go over each doc in multiple passes of different granularity to capture wide context as well as fine details. We also supplement the vector search with a keyword-based BM25 index + N-Grams so that the system performs well even in low-data domains. Additionally, we’ve added learning from feedback and time-based decay. See our custom ranking function (https://github.com/danswer-ai/danswer/blob/main/backend/dans... – this flexibility is why we love Vespa as a Vector DB).
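To make the idea of a combined ranking function concrete, here is a small Python sketch that blends a vector similarity and a keyword score, then applies a recency decay and a feedback multiplier. It is an assumption-laden toy, not the Vespa rank profile linked above: the function name `hybrid_score`, the weight `alpha`, the one-year half-life, and the decay floor are all hypothetical, and both input scores are assumed to be normalized to the range 0 to 1 already.

```python
import math


def hybrid_score(vector_sim: float, keyword_score: float, age_days: float,
                 feedback_boost: float = 1.0, alpha: float = 0.5,
                 half_life_days: float = 365.0,
                 decay_floor: float = 0.5) -> float:
    """Blend two retrieval signals, then scale by recency and feedback.

    alpha          -- weight on the vector signal (1 - alpha goes to keywords)
    half_life_days -- document age at which the recency factor halves
    decay_floor    -- lower bound so old but relevant docs are not buried
    feedback_boost -- above 1.0 for upvoted docs, below 1.0 for downvoted
    """
    base = alpha * vector_sim + (1.0 - alpha) * keyword_score
    decay = max(decay_floor, 0.5 ** (age_days / half_life_days))
    return base * decay * feedback_boost


if __name__ == "__main__":
    fresh = hybrid_score(0.8, 0.4, age_days=0)
    stale = hybrid_score(0.8, 0.4, age_days=365)
    print(math.isclose(fresh, 0.6), math.isclose(stale, 0.3))
```

The floor on the decay term is one possible design choice: without it, a very old document that is the only correct answer could never outrank a mediocre recent one.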

At query time, we preprocess the query with query-augmentation, contextual-rephrasing, as well as standard techniques like removing stopwords and lemmatization. Once the top documents are retrieved, we ask a smaller LLM to decide which of the chunks are “useful for answering the query” (this is something we haven’t seen much of elsewhere, but our tests have shown it to be one of the biggest drivers for both precision and recall). Finally, the most relevant passages are passed to the LLM along with the user query and chat history to produce the final answer. We post-process by checking guardrails and extracting citations to link the user to relevant documents. (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)
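Two of the query-time steps above can be sketched in a few lines of Python: filtering retrieved chunks through a relevance judge, and pulling citations out of the generated answer. This is only an illustration under stated assumptions. The judge is passed in as a plain callable (in the real system it would be a call to the smaller LLM), the `[1]`, `[2]` citation format is assumed rather than taken from Danswer's prompts, and the names `filter_useful_chunks` and `extract_citations` are hypothetical.

```python
from __future__ import annotations

import re
from typing import Callable


def filter_useful_chunks(query: str, chunks: list[str],
                         is_useful: Callable[[str, str], bool]) -> list[str]:
    """Keep only the chunks the judge marks as useful for answering the query.

    `is_useful(query, chunk)` stands in for a yes/no call to a small LLM.
    """
    return [chunk for chunk in chunks if is_useful(query, chunk)]


def extract_citations(answer: str, sources: list[str]) -> list[str]:
    """Map numeric markers such as [2] in the answer back to source documents.

    Markers are 1-based. Out-of-range markers are ignored and each source is
    returned once, in order of first mention.
    """
    cited: list[str] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        index = int(match.group(1)) - 1
        if 0 <= index < len(sources) and sources[index] not in cited:
            cited.append(sources[index])
    return cited


def keyword_judge(query: str, chunk: str) -> bool:
    """Toy stand-in for the LLM judge: true if any query word is in the chunk."""
    return any(word in chunk.lower() for word in query.lower().split())
```

Passing the judge in as a callable keeps the sketch testable without a model; swapping `keyword_judge` for a function that prompts an LLM would not change the surrounding flow.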

The Vector and Keyword indices are both stored locally and the NLP models run on the same instance (we’ve chosen ones that can run without GPU). The only exception is that the default Generative model is OpenAI’s GPT, though this can also be swapped out (https://docs.danswer.dev/gen_ai_configs/overview).

We’ve seen teams use Danswer on problems like: improving turnaround times for support by reducing the time taken to find relevant documentation; helping sales teams get customer context instantly by combing through calls and notes; reducing lost engineering time from answering cross-team questions and from building duplicate features due to an inability to surface old tickets or code merges; helping on-calls resolve critical issues faster by providing the complete history of an error in one place; and self-serve onboarding for new members who don’t know where to find information.

If you’d like to play around with things locally, check out the quickstart guide here: https://docs.danswer.dev/quickstart. If you already have Docker, you should be able to get things up and running in <15 minutes. And for folks who want a zero-effort way of trying it out or don’t want to self-host, please visit our Cloud: https://www.danswer.ai/
