Personal Projects & Open Source
I'm passionate about writing clean and efficient code and like to give back to the community via open source libraries.
ArchGraph
ArchGraph is a tool that generates an interactive dependency graph of your codebase from a plain-text description of its modules, submodules, and units. The graph shows the layered structure of the codebase: each submodule is drawn as a box, small port symbols indicate how many incoming and outgoing dependencies it has, and layer violations are highlighted in red. You can find an example visualization of a made-up e-commerce codebase here (click around!).
Since ArchGraph operates on plain text, it's language-agnostic and works just as well on design docs as on actual code, which makes it a great fit for AI-assisted development, where you plan more and code less.
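To give a feel for the input, here is a made-up sketch of such a plain-text description; the syntax is purely hypothetical and the actual ArchGraph format may look different:

```
# hypothetical module description (actual ArchGraph syntax may differ)
module orders
    submodule checkout
        unit cart -> payments.gateway, inventory.stock
module payments
    submodule gateway
module inventory
    submodule stock
```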
Nutri-SCode
Nutri-SCode is a static code analysis tool that ranks functions by their computational density: the ratio of meaningful work (arithmetic, decisions, logic) to structural volume (number of statements). Just as some foods are all empty calories, code full of boilerplate takes up space but contributes little value. Nutri-SCode surfaces these functions so you can ask whether they need to exist at all. It complements structural tools like ArchGraph and DePyTree by answering a different question: not where the code is badly structured, but where the code is doing less than its volume implies. A demo comparing several popular Python packages can be found here.
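To illustrate the idea (a rough sketch only, not necessarily the metric Nutri-SCode itself computes), you could score Python functions by the share of "work" nodes in their AST:

```python
import ast, inspect

WORK_NODES = (ast.BinOp, ast.Compare, ast.BoolOp, ast.Call)  # "meaningful work"

def computational_density(func):
    """Rough proxy: ratio of work nodes to statements in the function."""
    tree = ast.parse(inspect.getsource(func))
    work = sum(isinstance(n, WORK_NODES) for n in ast.walk(tree))
    stmts = sum(isinstance(n, ast.stmt) for n in ast.walk(tree))
    return work / max(stmts, 1)

def dense(xs):
    return sum(x * x for x in xs) / len(xs)

def boilerplate(x):
    result = x
    return result

print(computational_density(dense) > computational_density(boilerplate))  # True
```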
DePyTree
DePyTree is a tool for analyzing the internal dependencies of Python packages, both via actual imports (as described in the book Sustainable Software Architecture by Carola Lilienthal) and via shared git commits (as described in the book Your Code as a Crime Scene by Adam Tornhill). A demo displaying the dependency graph of the Python FastAPI framework can be found here.
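The import-based half of such an analysis can be sketched in a few lines of Python (illustrative only, not DePyTree's actual code):

```python
import ast
from pathlib import Path

def internal_imports(package_dir, package_name):
    """Map each module to the set of package-internal modules it imports."""
    deps = {}
    for path in Path(package_dir).rglob("*.py"):
        module = ".".join(path.relative_to(package_dir).with_suffix("").parts)
        tree = ast.parse(path.read_text(encoding="utf-8"))
        imported = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(a.name for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        # keep only the imports that point back into the package itself
        deps[module] = {m for m in imported if m.split(".")[0] == package_name}
    return deps

# e.g. internal_imports("fastapi", "fastapi")  (hypothetical local checkout)
```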
PubVis
PubVis is a web app meant to help scientists with their literature research. Instead of searching for a specific topic, you can explore the landscape of published research visually, and papers similar in content to an article of interest are just a click away. A demo of the app is running here (with PubMed articles about different cancer types) and here (with arXiv articles about machine learning). Further details on the implementation can be found in the corresponding paper.
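Under the hood, "similar papers" boils down to nearest neighbors in a text-feature space; a minimal scikit-learn sketch of the idea (not PubVis's actual implementation) could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "deep learning for image classification",
    "convolutional neural networks for vision tasks",
    "survival analysis of breast cancer patients",
]
X = TfidfVectorizer().fit_transform(abstracts)
sims = cosine_similarity(X)

# the most similar paper to the first one (excluding itself)
best = sims[0].argsort()[-2]
print(abstracts[best])  # -> the CNN paper
```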
Classify Me! Why?
To make the decisions of machine learning algorithms more transparent, we can use Layer-wise Relevance Propagation (LRP) to visualize the features that influenced a classification decision. The Classify Me! Why? web app gives an interactive example of what this can look like for a text classification task using scikit-learn [code].
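For a linear classifier (and ignoring the bias term), LRP conveniently reduces to attributing to each input feature its weight times its value; here is a minimal sketch of that special case with made-up data (not the app's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot, boring acting",
         "wonderful and moving", "dull and boring"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# for a linear model, the LRP relevances are just weight * feature value
doc = vec.transform(["boring but wonderful acting"])
relevance = doc.toarray()[0] * clf.coef_[0]
for word, r in zip(vec.get_feature_names_out(), relevance):
    if r != 0:
        print(f"{word:>10}  {r:+.3f}")  # positive = pushed towards "positive"
```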
autofeat
autofeat [code] is a Python library providing linear regression and classification models that automatically engineer and then select non-linear features, which can significantly improve the prediction performance of the model. This is especially helpful if you have a small dataset and/or want to interpret your model to see how each input feature influences the prediction of the target. Further information can be found in the paper or my talk at the PyCon & PyData 2019 conference in Berlin.
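A minimal usage example, following the scikit-learn-style API from the autofeat README (parameters may differ between versions):

```python
import numpy as np
from autofeat import AutoFeatRegressor

# toy data with a hidden non-linear relationship
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, (200, 2))
y = 3 * np.log(X[:, 0]) + X[:, 1] ** 2

# engineer non-linear feature candidates, then select the useful ones
model = AutoFeatRegressor(feateng_steps=2)
X_new = model.fit_transform(X, y)  # DataFrame with original + engineered features
y_pred = model.predict(X)
```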
evolvemb
evolvemb is a small Python library for creating continuously evolving word embeddings to examine word usage changes over time. Check out the paper for more details!
nlputils
nlputils is a Python library for analyzing text documents: it transforms texts into TF-IDF features, compares documents with various similarity measures, classifies them with a k-nearest-neighbors classifier, and visualizes them with t-SNE. Check out the Jupyter notebook with examples!
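The same pipeline can be sketched with scikit-learn building blocks (illustrative only; nlputils's own API differs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

docs = ["the cat sat on the mat", "dogs are loyal pets",
        "cats and dogs make great pets", "the stock market fell today"]
labels = ["animals", "animals", "animals", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# classify new documents with k-nearest neighbors on cosine distances
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X, labels)
print(knn.predict(vec.transform(["my cat chased a dog"])))  # ['animals']

# project the documents into 2D for visualization
coords = TSNE(n_components=2, perplexity=2, random_state=1).fit_transform(X.toarray())
```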
textcatvis
textcatvis is a Python library with tools for the exploratory analysis of text datasets. It can help you better understand a collection of texts by identifying the words most relevant to each class or cluster and visualizing them in word clouds. Some examples can be found in the corresponding paper (short or long version).
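The core idea can be sketched as follows (illustrative only, not textcatvis's actual API): score words by how much more TF-IDF weight they carry in one class than in the rest, and feed those scores into a word cloud:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud  # pip install wordcloud

docs = ["goal scored in the final minute", "the team won the match",
        "parliament passed the new law", "the senate debated the bill"]
labels = np.array(["sports", "sports", "politics", "politics"])

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()
words = vec.get_feature_names_out()

for cls in ["sports", "politics"]:
    # how much more weight a word carries in this class than in the others
    scores = X[labels == cls].mean(0) - X[labels != cls].mean(0)
    freqs = {w: s for w, s in zip(words, scores) if s > 0}
    WordCloud().generate_from_frequencies(freqs).to_file(f"{cls}.png")
```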
Similarity Encoder (SimEc) and Context Encoder (ConEc)
SimEc is a neural network architecture for learning low-dimensional representations of data points by projecting high-dimensional input data into an embedding space where some given pairwise similarities between the data points are approximated linearly. For further details, have a look at the corresponding paper, my PhD thesis, or this Jupyter notebook with some examples.
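In its simplest form, the idea can be sketched with a linear embedding layer followed by a second linear layer whose output approximates a given similarity matrix S; here is a tiny PyTorch sketch of that setup (not the reference implementation):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 20)  # high-dimensional input data
S = X @ X.T               # given pairwise similarities (here: a linear kernel)

embed = torch.nn.Linear(20, 2, bias=False)    # projects inputs to a 2D embedding
decode = torch.nn.Linear(2, 100, bias=False)  # linearly approximates similarity rows
opt = torch.optim.Adam(list(embed.parameters()) + list(decode.parameters()), lr=0.01)

for _ in range(2000):
    Y = embed(X)                          # low-dimensional embeddings
    loss = ((decode(Y) - S) ** 2).mean()  # match the given similarities
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # small loss => the embedding preserves the given similarities
```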
ConEc is a variant of SimEc for learning word embeddings. It is a simple but powerful extension of the continuous bag-of-words (CBOW) word2vec model trained with negative sampling and can be used to easily generate embeddings for out-of-vocabulary words and better representations for words with multiple meanings. Further details are described in the corresponding paper.
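The key trick for out-of-vocabulary words can be illustrated in a few lines (a toy sketch, not the actual ConEc code): the embedding of an unseen word is computed from the embeddings of the words in its context:

```python
import numpy as np

# pretend these vectors came from a trained CBOW word2vec model
emb = {"deep": np.array([0.9, 0.1]),
       "learning": np.array([0.8, 0.2]),
       "model": np.array([0.7, 0.3])}

def oov_embedding(context_words):
    """ConEc-style OOV embedding: average of the known context word vectors."""
    vecs = [emb[w] for w in context_words if w in emb]
    return np.mean(vecs, axis=0)

# "transformer" never occurred in training, but its context words did
print(oov_embedding(["deep", "learning", "model"]))
```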