NLP in diagnostic texts from nephropathology
Maximilian Legnar / NLP in diagnostic texts from nephropathology
Commit bc5855c7, authored Feb 18, 2025 by max-laptop
updated topic-modeling-analysis.py
parent d3fb7ac6
Showing 2 changed files with 14 additions and 6 deletions
database_preparation/preprocess.py (+4, -0)
topic_modeling/topic-modeling-analysis.py (+10, -6)
database_preparation/preprocess.py
...
...
@@ -17,6 +17,10 @@ import pandas as pd
'''
# installed: nltk, Hanta, tqdm, numpy
todo: add custom preprocessing for short diagnosis texts:
- replace: [('\n', ' '), ('DMGS', 'DM GS'), ('FGFSGS', 'FG FSGS'), ('-', ' ')]
- remove: ['(schner Fall)', 'mit', 'bei', 'nach', 'wohl', 'und']
-
'''
########## define enums ##########
...
...
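The todo note in the docstring above lists replace and remove rules but no implementation. A minimal sketch of how such rules might be applied is given here. The rule lists are copied from the note as written; the function name `clean_short_diagnosis`, the whole-word matching, and the final whitespace collapse are assumptions for illustration and are not taken from the repository.

```python
import re

# Rules copied verbatim from the todo note in preprocess.py.
REPLACEMENTS = [('\n', ' '), ('DMGS', 'DM GS'), ('FGFSGS', 'FG FSGS'), ('-', ' ')]
REMOVALS = ['(schner Fall)', 'mit', 'bei', 'nach', 'wohl', 'und']


def clean_short_diagnosis(text: str) -> str:
    """Hypothetical helper: apply the replace rules in order, then drop the listed items."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    for phrase in REMOVALS:
        # Assumption: remove only whole words/phrases, so 'mit' inside a
        # longer word such as 'Vermittlung' is left untouched.
        pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
        text = re.sub(pattern, ' ', text)
    # Collapse the whitespace left behind by the removals.
    return ' '.join(text.split())
```

Plain `str.replace` would also strip `mit` or `bei` from the middle of longer words, which is why the sketch matches on word boundaries; whether that is the intended behaviour is not stated in the note.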
topic_modeling/topic-modeling-analysis.py
...
...
@@ -3,21 +3,25 @@ import joblib
import topicwizard
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
import pandas as pd
'''
installation:
pip install topic-wizard
'''
# params:
path2corpus = "data/bow_short_diag/bow_short_diag.df.pkl"
path2corpus = "data/bow_diag_clustering/bow_diag.df.pkl"
if __name__ == '__main__':
    vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")
    vectorizer = CountVectorizer(min_df=5, max_df=0.8)
    model = NMF(n_components=10)
    topic_pipeline = make_pipeline(vectorizer, model)
    corpus_df = pd.read_pickle(path2corpus)
    corpus_dict = {case_id: corpus_df.loc[corpus_df['case_id'] == case_id, 'preprocessed_text'].values[0]
                   for case_id in corpus_df['case_id']}
    corpus = [' '.join(report) for report in corpus_df['preprocessed_text'].tolist()]
    topic_pipeline.fit(corpus)
...
...
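The vectorizer in the diff above is configured with `min_df=5` and `max_df=0.8`. As a rough illustration of what these two bounds mean for the vocabulary, the following pure-Python sketch reproduces the filtering on already tokenized reports. The function name `filter_vocabulary` is hypothetical, and the sketch only mimics the documented behaviour (an integer `min_df` is an absolute document count, a float `max_df` is a fraction of all documents); it is not scikit-learn's implementation and does no tokenization of its own.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=5, max_df=0.8):
    """Hypothetical helper: keep terms whose document frequency lies within the bounds.

    min_df is read as an absolute number of documents and max_df as a
    fraction of all documents, matching the int/float values in the diff.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document, not once per occurrence.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_count)


# Usage with made-up tokens: 'niere' occurs in every report, 'selten' in only two.
docs = [['niere', 'fsgs']] * 6 + [['niere', 'selten']] * 2 + [['niere']] * 2
vocabulary = filter_vocabulary(docs)
```

With these defaults, terms that appear in nearly every report and terms that appear in fewer than five reports are both dropped before the NMF step sees the data.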