Profile
- Built and served machine learning products in various domains (Audio & Speech, Vision, NLP, Recommendation Systems, Tabular, LLM applications) at many startups.
- Kaggle 2x Expert. Highest competition rank: top 0.1%.
- Alternative Military Service Status : discharged (2020/11/27 ~ 2023/09/26)
Links
Email | kozistr@gmail.com |
Github | https://github.com/kozistr |
Kaggle | https://www.kaggle.com/kozistr |
LinkedIn | https://www.linkedin.com/in/kozistr |
Challenges & Awards
Machine Learning
- Kaggle Challenges :: Competition Expert
BirdCLEF 2023 - solo, top 2% (24 / 1189), Private 0.73641 - solution (2023.)
Google - Isolated Sign Language Recognition - solo, top 5% (63 / 1165), Private 0.8377 (2023.)
RSNA Screening Mammography Breast Cancer Detection - solo, top 1% (16 / 1687), Private 0.49 - solution (2023.)
G2Net Detecting Continuous Gravitational Waves - solo, top 2% (22 / 936), Private 0.771 - solution (2023.)
American Express - Default Prediction - solo, top 3% (135 / 4875), Private 0.80758 - solution (2022.)
Google Brain - Ventilator Pressure Prediction - team, top 1% (20 / 2605), Private 0.1171 - solution (2021.)
SIIM-FISABIO-RSNA COVID-19 Detection - solo, top 4% (47 / 1305), Private 0.612 - solution (2021.)
Shopee - Price Match Guarantee - solo, top 7% (166 / 2426), Private 0.725 (2021.)
Cornell Birdcall Identification - team, top 2% (24 / 1395), Private 0.631 - Towards Data Science (2020.)
ALASKA2 Image Steganalysis - solo, top 9% (93 / 1095), Private 0.917 (2020.)
Tweet Sentiment Extraction - solo, top 4% (84 / 2227), Private 0.71796 (2020.)
Flower Classification with TPUs - solo, top 4% (27 / 848), Private 0.98734 (2020.)
Kaggle Bengali.AI Handwritten Grapheme Classification - solo, top 4% (67 / 2059), Private 0.9372 (2020.)
Kaggle Kannada MNIST Challenge - solo, top 3% (28 / 1214), Private 0.99100 (2019.)
- NAVER NLP Challenge :: NAVER NLP Challenge 2018
Final - Semantic Role Labeling (SRL) 6th place - oral presentation
- A.I R&D Challenge :: A.I R&D Challenge 2018
Final - Fake or Real Detection - as Digital Forensic Team
- NAVER A.I Hackathon :: NAVER A.I Hackathon 2018
- TF-KR Challenge :: Facebook TF-KR MNIST Challenge
TF-KR MNIST Challenge - Top 9, 3rd prize, ACC 0.9964
Hacking
- Boot2Root CTF 2018 :: 2nd place (Demon + alpha)
- Harekaze CTF 2017 :: 3rd place (SeoulWesterns)
- WhiteHat League 1 (2017) :: 2nd place (Demon)
- Awarded by the Korea Information Technology Research Institute (한국정보기술연구원); received a $3,000 prize.
Work Experience
- 2023 - 2024 : Joined Sionic AI. Built enterprise-grade LLM applications.
- 2021 - 2023 : Joined Viva Republica (Toss). Developed many products like BNPL, CSS, OCR, NPS, CDP, and in-house products.
- 2020 - 2021 : Joined Watcha. Developed Watcha recommendation system, and contributed to other products like WatchaPedia, in-house applications.
- 2019 - 2020 : Joined Rainist (Banksalad). Developed a transaction classifier service to analyze the categories with low latency, high accuracy, and in real-time.
- 2019 : Joined VoyagerX. Developed a speaker diarization product that automatically recognizes the contents of the meeting.
- ~ 2019 : Offensive security work. Mainly researched and studied Linux kernel exploitation and reverse engineering.
Company
Machine Learning Engineer, Sionic AI, (2023.10.23 ~ 2024.07.05)
- Search engine and LLM applications based on RAG for B2B products.
- Developed an advanced, cost-efficient RAG algorithm that accurately handles multi-turn conversations and large numbers of large documents.
- Developed multi/cross-lingual text embedding and re-ranker models, which perform well in Korean.
- Developed and maintained backend services such as backends for business logic, model inference engines, and VectorDB.
- Worked as full-time (early start-up member).
Data Scientist, Toss core, (2021.12.06 ~ 2023.09.27)
- Developed the TPS (Toss Profile Service) product.
- Various models to boost Loan Comparison products.
- Developed a CSS model using only non-financial data; it outperformed the previous method by about 4%p (on the primary metric).
- Developed models to predict loan approval and interest rate.
- Various CSS models for the CB (Credit Bureau).
- Developed a more accurate & robust CSS model that mainly targets thin-filers; it outperformed the previous method by about 15%.
- Developed a model that predicts consumer proposal status.
- Developed a transaction classifier with finance-relevant categories, used in feature engineering to boost the performance of the CSS model.
- Classified the category of user reviews for the NPS (Net Promoter Score) product.
- Developed the RESTful API server to infer the deep learning model for the batch job.
- Saved analysis time and labor of the NPS team.
- OCR model to break captchas for the automation product.
- Developed the lightweight models (text detector & captcha classifier) for real-time inference (about 1000 TPS for a batch transaction, 80 ~ 100 TPS for a single sample on the CPU) and built the RESTful API server to serve the models in real time on the CPU.
- In the A/B test, Google Vision OCR vs. the new captcha model:
- Accuracy (top1) : improved by 50%p (45% to 95%)
- Latency (p95) : reduced by about 80x (about 1000ms to 12ms)
- Revenue : reduced cost by about $7,000+ / year
- It also contributed to shortening the funnel and increasing user conversion.
- User consumption forecasting model for the *CDP Product.
- Developed the Transformer based sequential model that predicts what the users will consume in the next month.
- Built an efficient pipeline to process and train lots of tabular data.
- CSS model for BNPL (Buy Now Pay Later) service.
- Developed the CSS model (default prediction), mainly targeted to the thin-filer. The new model achieved the targeted default rate of about 1%.
- Developed the explainer to describe which factors affect the rejection.
- Transaction category classification model to boost the advertisement.
- Developed the ads category classifier that increases revenue in a roundabout way.
- Internal product: Slack bots that summarize long threads.
- Helped people understand the context quickly with minimum effort.
- Summarized the weekly mail using ChatGPT with prompt engineering.
- Worked as full-time.
% *CDP : Customer Data Platform. Lots of user segments generated by the ML models.
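The A/B figures quoted for the captcha model mix percentage points (%p) with ratios, so a quick check of the arithmetic may help. The sketch below only recomputes the numbers stated above (45% to 95% accuracy, about 1000ms to 12ms p95 latency); the helper names are hypothetical and not part of any Toss codebase.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points (%p)."""
    return after_pct - before_pct


def relative_gain(before: float, after: float) -> float:
    """Relative change, expressed as a fraction of the starting value."""
    return (after - before) / before


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times smaller the new latency is."""
    return before_ms / after_ms


acc_pp = percentage_point_gain(45.0, 95.0)  # 50.0 %p, the figure quoted above
acc_rel = relative_gain(45.0, 95.0)         # roughly 1.11, i.e. about +111% relative
latency_x = speedup(1000.0, 12.0)           # roughly 83, consistent with "about 80x"
print(f"{acc_pp:.0f}%p, +{acc_rel:.0%} relative, {latency_x:.0f}x faster")
```

Reporting the accuracy change in %p avoids the ambiguity of a relative figure, which here would be more than twice as large.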
Machine Learning Researcher, Watcha, (2020.06.22 ~ 2021.12.03)
- Watcha recommendation system to offer a better user experience and increase paid conversion.
- Developed an advanced training recipe & architecture to improve training stability and performance. Also worked on post-processing to recommend unseen content to users. In the A/B test, the new model boosted the Click Ratio by about 1.01%+.
- Developed a network that captures the user's active time, with augmentations that bring training stability and a performance gain. In the A/B/C test, the new model beat Div2Vec in the online metrics while achieving comparable performance with the previous model (A: Div2Vec, B: the previous model, C: the new model).
- *Viewing Days (mean): improved 1.012%+
- *Viewing Minutes (median): improved 1.015%+
- Developed a sequential recommendation architecture to recommend what content to watch next. It outperformed previous SOTA architectures such as BERT4Rec. In the A/B test, the new model outperformed the previous algorithm on the following metrics (A: previous algorithm, B: the new model).
- Paid Conversion : improved 1.39%p+
- *Viewing Days (mean): improved 0.25%p+
- *Viewing Minutes (median): improved 4.10%p+
- Click Ratio : improved 4.30%p+
- Play Ratio : improved 2.32%p+
- Face recognition architecture to find actors from the poster & still-cut images for the Watcha Pedia product.
- Developed the pipeline to identify & recognize actor faces in the images with face detection & identification deep learning models (similarity-based searching).
- Built a daily job that runs on the CPU and optimized CPU-intensive operations to run fast.
- Internal product to predict expected users' view-time of the content.
- Before the content is imported, the model gives an insight into the valuation of the content, like expected view-time affecting the cost of the content.
- Internal product to help designers' work.
- Developed the image super-resolution model to upscale the image more accurately and faster (e.g., waifu).
- Music recommendation system for Watcha Music (prototype).
- Worked as full-time.
% *Viewing Days : how many days users are active on an app each month.
% *Viewing Minutes : how many minutes the user watched the content.
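The "similarity-based searching" mentioned for the actor face pipeline can be illustrated with a small sketch. It assumes each face has already been turned into an embedding vector by some identification model; the gallery contents, the `identify` helper, and the 0.8 threshold are all hypothetical and are not taken from the Watcha system.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(
    query: List[float],
    gallery: Dict[str, List[float]],
    threshold: float = 0.8,  # hypothetical cut-off; would be tuned on validation data
) -> Tuple[Optional[str], float]:
    """Return the best-matching gallery name, or None if nothing is similar enough."""
    best_name: Optional[str] = None
    best_score = -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 3-dimensional embeddings; a real face model outputs far more dimensions.
gallery = {"actor_a": [1.0, 0.0, 0.0], "actor_b": [0.0, 1.0, 0.0]}
match, score = identify([0.9, 0.1, 0.0], gallery)
unknown, _ = identify([0.5, 0.5, 0.7], gallery)
```

Here `match` resolves to `"actor_a"`, while the second query falls below the threshold and is treated as an unknown face. A linear scan like this is only reasonable for small galleries; larger ones would typically use an approximate nearest-neighbor index.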
Machine Learning Engineer, Rainist, (2019.11.11 ~ 2020.06.19)
- Transaction category classification application to identify the category for the convenience of user experience.
- Developed the lightweight transaction category classification model. In the A/B test, the new model achieved a 25 ~ 30%p+ *Accuracy improvement.
- Developed the backends (e.g., model serving, business logic microservices) in Python.
- Used an inference-oriented framework (ONNX) to achieve stable, low latency.
- Achieved a target latency of about 7 ~ 10 TPS (p50) while handling 1M transactions / day (1 transaction = 100 samples).
- CSS model to forecast the possibility of loan overdue.
- Worked as full-time.
% *Accuracy : how many users don't update their transactions' category.
Machine Learning Engineer, VoyagerX, (2019.01.07 ~ 2019.10.04)
- Proceedings, a deep learning application which automatically recognizes speakers & speech (speaker diarization).
- Developed the backend to diarize the conversation.
- Developed the lightweight speaker verification model (served on AWS Lambda).
- Developed on/offline speaker diarization based on clustering & E2E methods.
- Hair Salon, a project to naturally swap the user's hair with the style they want.
- Developed a hair/face image segmentation model to identify hair & face accurately.
- Developed an image in-painting model to remove hair.
- Developed an I2I translation model to change the hairstyle.
- Worked as an intern.
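As a rough illustration of the clustering-based online diarization mentioned above, the sketch below assigns each incoming segment embedding to the closest running speaker centroid, or opens a new speaker when nothing is similar enough. This is a generic greedy scheme written under stated assumptions, not the VoyagerX implementation; the function name, the 0.7 threshold, and the toy embeddings are hypothetical.

```python
import math
from typing import List


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diarize_online(segments: List[List[float]], threshold: float = 0.7) -> List[int]:
    """Assign a speaker index to each segment embedding, in arrival order.

    A segment joins the most similar existing speaker when the cosine similarity
    to that speaker's running centroid reaches `threshold` (a hypothetical value);
    otherwise it starts a new speaker.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in segments:
        scores = [_cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=-1)
        if best >= 0 and scores[best] >= threshold:
            # Update the running mean of the matched speaker's centroid.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
        else:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels


# Toy 2-dimensional embeddings for four consecutive speech segments.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = diarize_online(segments)
```

With these toy inputs the segments alternate between two speakers. An offline variant could instead cluster all segment embeddings at once (for example, agglomerative clustering), which usually gives more stable labels at the cost of latency.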
Penetration Tester, ELCID, (2016.07 ~ 2016.08)
- Penetrated the network firewall and anti-virus products.
- Worked as part-time.
Outsourcing
- Developed a Korean university course information web parser (about 40 universities), twice. (2017.07, 2018.03)
- Developed an AWS CloudTrail log analyzer. (2019.09 ~ 2019.10)
Lab
HPC Lab, KoreaTech, Undergraduate Researcher, (2018.09 ~ 2018.12)
- Wrote a paper about an improved TextCNN model to predict movie ratings.
Publications
Paper
[1] Kim et al, CNN Architecture Predicting Movie Rating, 2020. 01.
- Wrote about a CNN architecture that applies a channel-attention method (SE Module) to the TextCNN model, bringing a performance gain on the task while generally keeping its latency.
- Handling un-normalized text with various convolution kernel sizes and spatial dropout
- Selected as one of the highlight papers for the first half of 2020
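For readers unfamiliar with the SE (squeeze-and-excitation) channel attention named above, here is a minimal, dependency-free sketch of the idea applied to a (channels x positions) feature map, such as the output of a 1-D convolution over text. It is not the paper's implementation: biases are omitted, the weights are hypothetical toy values, and a real model would use a deep learning framework instead of Python lists.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def squeeze_excite(features: Matrix, w_reduce: Matrix, w_expand: Matrix) -> Matrix:
    """Apply SE channel attention to a (channels x positions) feature map."""
    # Squeeze: summarize each channel by its mean over all positions.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: a bottleneck MLP (ReLU, then sigmoid) yields one gate per channel.
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_expand, hidden)]
    # Scale: reweight every position of a channel by that channel's gate.
    return [[gate * v for v in ch] for gate, ch in zip(gates, features)]


# Toy example: 2 channels x 3 positions, reduced to a single hidden unit.
features = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
w_reduce = [[0.5, 0.5]]     # hypothetical weights, shape (1, 2)
w_expand = [[1.0], [-1.0]]  # hypothetical weights, shape (2, 1)
out = squeeze_excite(features, w_reduce, w_expand)
```

The output keeps the input shape, which is why such a module can be dropped into an existing TextCNN with little extra cost: it adds only two small matrix-vector products per feature map.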
Conferences/Workshops
[1] kozistr_team, presentation, NAVER NLP Challenge 2018 SRL Task
- SRL Task, tackled w/o any domain knowledge. Presented the trials & errors during the competition
Journals
[1] zer0day, Windows Anti-Debugging Techniques (CodeEngn 2016) Sep. 2016. PDF
- Wrote about lots of anti-reversing / debugging (A to Z) techniques available for Windows executable binaries
Posts
[1] kozistr (as a part of team Dragonsong), Towards Data Science
- Wrote about an audio classifier with deep learning, based on the Kaggle challenge in which we participated
Personal Projects
Machine/Deep Learning
Generative Models
- GANs-tensorflow :: Lots of GAN :: Generative Adversarial Networks
- ACGAN-tensorflow :: Auxiliary Classifier GAN in tensorflow :: code
- StarGAN-tensorflow :: Unified GAN for multi-domain :: code
- LAPGAN-tensorflow :: Laplacian Pyramid GAN in tensorflow :: code
- BEGAN-tensorflow :: Boundary Equilibrium in tensorflow :: code
- DCGAN-tensorflow :: Deep Convolutional GAN in tensorflow :: code
- SRGAN-tensorflow :: Super-Resolution GAN in tensorflow :: code
- WGAN-GP-tensorflow :: Wasserstein GAN w/ gradient penalty in tensorflow :: code
- ... lots of GANs (over 20) :)
Super Resolution
- Single Image Super Resolution :: Single Image Super-Resolution (SISR)
I2I Translation
- Improved Content Disentanglement :: tuned version of 'Content Disentanglement' in pytorch :: code
Style Transfer
- Image-Style-Transfer :: Image Neural Style Transfer
- style-transfer-tensorflow :: Image Style-Transfer in tensorflow :: code
Text Classification/Generation
Speech Synthesis
- Tacotron-tensorflow :: Text To Speech (TTS)
- tacotron-tensorflow :: lots of TTS models in tensorflow :: code
Optimizer
- pytorch-optimizer :: optimizer & lr scheduler collections in PyTorch
- pytorch_optimizer :: a collection of optimizers & lr schedulers in PyTorch. The algorithms are re-implemented from the original papers with speed & memory tweaks and plug-ins. It also includes useful and practical optimization ideas. :: code
- AdaBound :: Optimizer that trains as fast as Adam and as good as SGD
- AdaBound-tensorflow :: AdaBound Optimizer implementation in tensorflow :: code
- RAdam :: On The Variance Of The Adaptive Learning Rate And Beyond in tensorflow
- RAdam-tensorflow :: RAdam Optimizer implementation in tensorflow :: code
R.L
- RosettaStone :: Hearthstone simulator using C++ with some reinforcement learning :: code
Open Source Contributions
- syzkaller :: New Generation of Linux Kernel Fuzzer :: #575
- simpletransformers :: Transformers made simple w/ training, evaluating, and prediction possible w/ one line each. :: #290
- pytorch-image-models :: PyTorch image models, scripts, pretrained weights :: #1058, #1069
- deit :: DeiT: Data-efficient Image Transformers :: #140, #147, #148
- MADGRAD :: MADGRAD Optimization Method :: #11
- tensorflow-image-models :: TensorFlow Image Models (tfimm) is a collection of image models with pretrained weights, obtained by porting architectures from timm to TensorFlow :: #61
- onnx2tf :: Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf) :: #259
- dadaptation :: D-Adaptation for SGD, Adam and AdaGrad :: #21
- python-mastery :: Advanced Python Mastery :: #14
- text-embedding-inference :: A blazing fast inference solution for text embeddings model :: #62, #285, #343, #360, #361, #441
- langchain-ai :: Build context-aware reasoning applications :: #18839, #20057
- qdrant :: Qdrant - High-performance, massive-scale Vector Database for the next generation of AI :: #3982
- bfb :: high-load benchmarking tool :: #37
- qdrant-web-ui :: Self-hosted web UI for Qdrant :: #191
Plug-Ins
IDA-pro plug-in - Golang ELF binary (x86, x86-64), RTTI parser
- Recovers stripped symbols & information and patches byte-code so that the binary can be decompiled with Hex-Rays
Security, Hacking
CTFs, Conferences
- POC 2016 Conference Staff
- HackingCamp 15 CTF Staff, Challenge Maker
- CodeGate 2017 OpenCTF Staff, Challenge Maker
- HackingCamp 16 CTF Staff, Challenge Maker
- POX 2017 CTF Staff, Challenge Maker
- KID 2017 CTF Staff, Challenge Maker
- Belluminar 2017 CTF Staff
- HackingCamp 17 CTF Staff, Challenge Maker
- HackingCamp 18 CTF Staff, Challenge Maker
Teams
Hacking Team, Fl4y. Since 2017.07 ~
Hacking Team, Demon by POC. Since 2014.02 ~ 2018.08
Education
BS in Computer Engineering from KUT
Presentations
2018
[2] Artificial Intelligence ZeroToAll, Apr 2018.
[1] Machine Learning ZeroToAll, Mar 2018.
2015
[1] Polymorphic Virus VS AV Detection, Oct 2015.
2014
[1] Network Sniffing & Detection, Oct 2014.