Nice writeup. I’m curious why you went with chromadb rather than pgvector. I haven’t built a RAG system myself, but I’ve always understood the initial doc parsing to be a major challenge on its own, so kudos there!
I also thought it was customary to store a pointer to the source in the same row as the vector (i.e. vector + doc path + page #/paragraph, etc.), or to just store the original text chunk (though based on your disk requirements, it doesn’t sound like that would have been feasible).
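To make the "pointer in the same row" idea concrete, here is a rough sketch of what I have in mind. It uses sqlite3 from the Python standard library purely for illustration (it is not pgvector or chromadb), and the table name, column names, and helper functions are all made up for the example. The search is a brute-force cosine scan, which a real vector store would replace with an index.

```python
import math
import sqlite3
from array import array


def pack(vec):
    """Serialize a list of floats to bytes (float32) for a BLOB column."""
    return array("f", vec).tobytes()


def unpack(blob):
    """Inverse of pack(): bytes back to a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical schema: the vector and its source pointer share one row.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chunks (
        id        INTEGER PRIMARY KEY,
        embedding BLOB NOT NULL,
        doc_path  TEXT NOT NULL,
        page      INTEGER,
        paragraph INTEGER
    )
    """
)


def add_chunk(conn, vec, doc_path, page, paragraph):
    """Store an embedding together with where its text came from."""
    conn.execute(
        "INSERT INTO chunks (embedding, doc_path, page, paragraph) "
        "VALUES (?, ?, ?, ?)",
        (pack(vec), doc_path, page, paragraph),
    )


def nearest_source(conn, query_vec):
    """Return (doc_path, page, paragraph) of the most similar chunk, or None."""
    rows = conn.execute(
        "SELECT embedding, doc_path, page, paragraph FROM chunks"
    ).fetchall()
    if not rows:
        return None
    best = max(rows, key=lambda row: cosine(unpack(row[0]), query_vec))
    return best[1:]
```

The point is just that a hit comes back with its doc path and page/paragraph attached, so the original text can be re-read from the source file instead of being stored a second time.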
Glad you’re having good results! Maybe you’ve inspired me to finally try out a similar setup myself!