Kora: Synthetic Healthcare Data Generation
Team: Leonce (Senior Data Modeler, HCA & Sony Music), Karthik (Data/ML Engineer, Michigan Tech), Henry (ML/NLP researcher, Vanderbilt), and Varun. Tools: Python, SQL, AWS, GCP, Airflow, PySpark, RAG.
Project Description
Kora aims to help address the reproducibility crisis in science. In healthcare research, authors can usually share the code behind their papers, but privacy and regulatory restrictions limit what data they can share, so other researchers cannot rerun the analysis. We built an application that takes a published paper and, optionally, the metadata describing the dataset the paper uses, and applies semantic understanding of that text to generate a synthetic sample that approximates the distribution of the real data. We then use the DBtwin API to scale this sample up so that other researchers can attempt to reproduce the findings. The problem is personal for one team member: from his research at Vanderbilt, he has papers whose findings he would like others to validate, but he can no longer access the underlying data. The application is written in Python with the Streamlit framework and deployed to Hugging Face.
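To make the idea of "synthetic data that approximates the distribution" concrete, here is a minimal sketch of the sampling step only. It is not Kora's actual implementation and it does not call the DBtwin API. It assumes the paper (or its metadata) has already been reduced to per-column summary statistics, and it makes two simplifying assumptions: numeric columns are roughly normal, and columns are independent of one another. The names `NumericSpec`, `CategoricalSpec`, and `sample_rows`, and all the numbers in the example schema, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class NumericSpec:
    """Summary statistics for a numeric column, as a paper's tables might report them."""
    mean: float
    std: float
    low: float   # plausible minimum, used to clip samples
    high: float  # plausible maximum, used to clip samples


@dataclass
class CategoricalSpec:
    """Reported share of each category, e.g. {"F": 0.55, "M": 0.45}."""
    proportions: dict


def sample_rows(schema, n_rows, seed=0):
    """Draw n_rows synthetic records, sampling each column independently.

    Assumption: columns are independent and numeric columns are normal.
    Real datasets have correlations that this sketch ignores.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in schema.items():
            if isinstance(spec, NumericSpec):
                value = rng.gauss(spec.mean, spec.std)
                # Clip to the plausible range so no impossible values appear.
                row[name] = min(max(value, spec.low), spec.high)
            else:
                labels = list(spec.proportions)
                weights = list(spec.proportions.values())
                row[name] = rng.choices(labels, weights=weights, k=1)[0]
        rows.append(row)
    return rows


# Hypothetical schema: these statistics are invented for illustration only.
schema = {
    "age": NumericSpec(mean=58.0, std=12.0, low=18.0, high=95.0),
    "sex": CategoricalSpec({"F": 0.55, "M": 0.45}),
}
rows = sample_rows(schema, n_rows=1000, seed=42)
```

A small sample like `rows` plays the role of the seed data described above; a separate tool would then be responsible for scaling it up. Fixing the random seed matters here because the whole point is that another researcher can regenerate the same sample.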
Team
Products & Tools
Additional Links
Link to Product Demo